linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements
@ 2021-02-22 16:50 Alex Williamson
  2021-02-22 16:50 ` [RFC PATCH 01/10] vfio: Create vfio_fs_type with inode per device Alex Williamson
                   ` (10 more replies)
  0 siblings, 11 replies; 36+ messages in thread
From: Alex Williamson @ 2021-02-22 16:50 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

This is a re-implementation of [1] following suggestions and code from
Jason Gunthorpe.  It is lightly tested but appears functional and
throws no lockdep warnings.  In this series we greatly simplify
zapping of vmas mapping device memory by using unmap_mapping_range(),
we create a protocol for looking up a vfio_device from a vma, and we
provide an interface to get a device reference from that vma.  Using
that reference, the caller can register a notifier for the device to
trigger on events such as device release.  The notifier is only
enabled here for vfio-pci, but both the vma policy and the notifier
trigger should be trivial to add to any vfio bus driver after the RFC.

Does this look more like the direction we should go?

Note that, as in the last series, we still do not drop DMA mappings
when device memory is disabled, since doing so would likely break
userspace in some instances, we have no IOMMU interfaces to modify
protection bits, and it's not clear an IOMMU fault is strictly better
than the bus error.  Thanks,

Alex

[1]https://lore.kernel.org/kvm/161315658638.7320.9686203003395567745.stgit@gimli.home/T/#m64859ccd7d92f39a924759c7423f2dcf7d367c84
---

Alex Williamson (10):
      vfio: Create vfio_fs_type with inode per device
      vfio: Update vfio_add_group_dev() API
      vfio: Export unmap_mapping_range() wrapper
      vfio/pci: Use vfio_device_unmap_mapping_range()
      vfio: Create a vfio_device from vma lookup
      vfio: Add a device notifier interface
      vfio/pci: Notify on device release
      vfio/type1: Refactor pfn_list clearing
      vfio/type1: Pass iommu and dma objects through to vaddr_get_pfn
      vfio/type1: Register device notifier


 drivers/vfio/Kconfig                         |    1 
 drivers/vfio/fsl-mc/vfio_fsl_mc.c            |    6 -
 drivers/vfio/mdev/vfio_mdev.c                |    5 -
 drivers/vfio/pci/vfio_pci.c                  |  223 ++++----------------------
 drivers/vfio/pci/vfio_pci_private.h          |    3 
 drivers/vfio/platform/vfio_platform_common.c |    7 +
 drivers/vfio/vfio.c                          |  143 +++++++++++++++--
 drivers/vfio/vfio_iommu_type1.c              |  211 ++++++++++++++++++++-----
 include/linux/vfio.h                         |   19 ++
 9 files changed, 368 insertions(+), 250 deletions(-)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH 01/10] vfio: Create vfio_fs_type with inode per device
  2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
@ 2021-02-22 16:50 ` Alex Williamson
  2021-02-26  5:38   ` Christoph Hellwig
  2021-02-22 16:50 ` [RFC PATCH 02/10] vfio: Update vfio_add_group_dev() API Alex Williamson
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2021-02-22 16:50 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

By linking all the device fds we provide to userspace to an
address space through a new pseudo fs, we can use tools like
unmap_mapping_range() to zap all vmas associated with a device.
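The effect can be modeled in userspace: every fd/vma of a device
shares one mapping registry (the analogue of the inode's
address_space that filep->f_mapping is pointed at), so a single zap
call drops every mapping at once.  The names below (mapping_registry,
vma_map, zap_all, demo) are illustrative stand-ins, not kernel API:

```c
#include <assert.h>

/* One registry per device, shared by every fd/vma of that device --
 * the analogue of filep->f_mapping = device->inode->i_mapping. */
#define MAX_VMAS 8

struct mapping_registry {
	int mapped[MAX_VMAS];	/* 1 = vma currently has live PTEs */
	int nvmas;
};

static int vma_map(struct mapping_registry *r)
{
	int id = r->nvmas++;

	r->mapped[id] = 1;	/* fault populates the mapping */
	return id;
}

/* Analogue of unmap_mapping_range(mapping, 0, 0, true): because all
 * vmas are linked to the shared registry, one call zaps them all. */
static void zap_all(struct mapping_registry *r)
{
	for (int i = 0; i < r->nvmas; i++)
		r->mapped[i] = 0;
}

static int demo(void)
{
	struct mapping_registry r = { 0 };
	int a = vma_map(&r), b = vma_map(&r);

	if (!(r.mapped[a] && r.mapped[b]))
		return -1;
	zap_all(&r);
	if (r.mapped[a] || r.mapped[b])
		return -2;
	return 0;
}
```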

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio.c |   54 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 38779e6fd80c..464caef97aff 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -32,11 +32,18 @@
 #include <linux/vfio.h>
 #include <linux/wait.h>
 #include <linux/sched/signal.h>
+#include <linux/pseudo_fs.h>
+#include <linux/mount.h>
 
 #define DRIVER_VERSION	"0.3"
 #define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
 #define DRIVER_DESC	"VFIO - User Level meta-driver"
 
+#define VFIO_MAGIC 0x5646494f /* "VFIO" */
+
+static int vfio_fs_cnt;
+static struct vfsmount *vfio_fs_mnt;
+
 static struct vfio {
 	struct class			*class;
 	struct list_head		iommu_drivers_list;
@@ -97,6 +104,7 @@ struct vfio_device {
 	struct vfio_group		*group;
 	struct list_head		group_next;
 	void				*device_data;
+	struct inode			*inode;
 };
 
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -529,6 +537,34 @@ static struct vfio_group *vfio_group_get_from_dev(struct device *dev)
 	return group;
 }
 
+static int vfio_fs_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, VFIO_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type vfio_fs_type = {
+	.name = "vfio",
+	.owner = THIS_MODULE,
+	.init_fs_context = vfio_fs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static struct inode *vfio_fs_inode_new(void)
+{
+	struct inode *inode;
+	int ret;
+
+	ret = simple_pin_fs(&vfio_fs_type, &vfio_fs_mnt, &vfio_fs_cnt);
+	if (ret)
+		return ERR_PTR(ret);
+
+	inode = alloc_anon_inode(vfio_fs_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		simple_release_fs(&vfio_fs_mnt, &vfio_fs_cnt);
+
+	return inode;
+}
+
 /**
  * Device objects - create, release, get, put, search
  */
@@ -539,11 +575,19 @@ struct vfio_device *vfio_group_create_device(struct vfio_group *group,
 					     void *device_data)
 {
 	struct vfio_device *device;
+	struct inode *inode;
 
 	device = kzalloc(sizeof(*device), GFP_KERNEL);
 	if (!device)
 		return ERR_PTR(-ENOMEM);
 
+	inode = vfio_fs_inode_new();
+	if (IS_ERR(inode)) {
+		kfree(device);
+		return (struct vfio_device *)inode;
+	}
+	device->inode = inode;
+
 	kref_init(&device->kref);
 	device->dev = dev;
 	device->group = group;
@@ -574,6 +618,9 @@ static void vfio_device_release(struct kref *kref)
 
 	dev_set_drvdata(device->dev, NULL);
 
+	iput(device->inode);
+	simple_release_fs(&vfio_fs_mnt, &vfio_fs_cnt);
+
 	kfree(device);
 
 	/* vfio_del_group_dev may be waiting for this device */
@@ -1488,6 +1535,13 @@ static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
 	 */
 	filep->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
 
+	/*
+	 * Use the pseudo fs inode on the device to link all mmaps
+	 * to the same address space, allowing us to unmap all vmas
+	 * associated to this device using unmap_mapping_range().
+	 */
+	filep->f_mapping = device->inode->i_mapping;
+
 	atomic_inc(&group->container_users);
 
 	fd_install(ret, filep);



* [RFC PATCH 02/10] vfio: Update vfio_add_group_dev() API
  2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
  2021-02-22 16:50 ` [RFC PATCH 01/10] vfio: Create vfio_fs_type with inode per device Alex Williamson
@ 2021-02-22 16:50 ` Alex Williamson
  2021-02-22 17:01   ` Jason Gunthorpe
  2021-02-22 16:50 ` [RFC PATCH 03/10] vfio: Export unmap_mapping_range() wrapper Alex Williamson
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2021-02-22 16:50 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

Rather than an errno, return a pointer to the opaque vfio_device
to allow the bus driver to call into vfio-core without additional
lookups and references.  Note that bus drivers are still required
to use vfio_del_group_dev() to tear down the vfio_device.
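For readers unfamiliar with the convention this change relies on, the
kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers encode a small negative
errno in a pointer value that can never be a valid address, which is
what lets vfio_add_group_dev() return either a vfio_device pointer or
an error in one return value.  A standalone userspace sketch (the
helpers and add_group_dev() are modeled here, not the real kernel
definitions):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace model of the kernel's ERR_PTR convention: errno values
 * (-1..-4095) alias the top page of the address space, which is never
 * a valid pointer, so one value can carry a pointer or an error. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for vfio_add_group_dev(): returns a device
 * pointer on success or an encoded errno on failure. */
struct vfio_device { int id; };
static struct vfio_device dev0 = { .id = 0 };

static struct vfio_device *add_group_dev(int fail)
{
	if (fail)
		return ERR_PTR(-EBUSY);
	return &dev0;
}

/* The caller pattern adopted by the bus drivers in this patch. */
static int probe(int fail)
{
	struct vfio_device *device = add_group_dev(fail);

	if (IS_ERR(device))
		return PTR_ERR(device);
	return 0;
}
```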

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/fsl-mc/vfio_fsl_mc.c            |    6 ++++--
 drivers/vfio/mdev/vfio_mdev.c                |    5 ++++-
 drivers/vfio/pci/vfio_pci.c                  |    7 +++++--
 drivers/vfio/platform/vfio_platform_common.c |    7 +++++--
 drivers/vfio/vfio.c                          |   23 ++++++++++-------------
 include/linux/vfio.h                         |    6 +++---
 6 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/drivers/vfio/fsl-mc/vfio_fsl_mc.c b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
index f27e25112c40..a4c2d0b9cd51 100644
--- a/drivers/vfio/fsl-mc/vfio_fsl_mc.c
+++ b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
@@ -592,6 +592,7 @@ static int vfio_fsl_mc_probe(struct fsl_mc_device *mc_dev)
 	struct iommu_group *group;
 	struct vfio_fsl_mc_device *vdev;
 	struct device *dev = &mc_dev->dev;
+	struct vfio_device *device;
 	int ret;
 
 	group = vfio_iommu_group_get(dev);
@@ -608,8 +609,9 @@ static int vfio_fsl_mc_probe(struct fsl_mc_device *mc_dev)
 
 	vdev->mc_dev = mc_dev;
 
-	ret = vfio_add_group_dev(dev, &vfio_fsl_mc_ops, vdev);
-	if (ret) {
+	device = vfio_add_group_dev(dev, &vfio_fsl_mc_ops, vdev);
+	if (IS_ERR(device)) {
+		ret = PTR_ERR(device);
 		dev_err(dev, "VFIO_FSL_MC: Failed to add to vfio group\n");
 		goto out_group_put;
 	}
diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
index b52eea128549..32901b265864 100644
--- a/drivers/vfio/mdev/vfio_mdev.c
+++ b/drivers/vfio/mdev/vfio_mdev.c
@@ -124,8 +124,11 @@ static const struct vfio_device_ops vfio_mdev_dev_ops = {
 static int vfio_mdev_probe(struct device *dev)
 {
 	struct mdev_device *mdev = to_mdev_device(dev);
+	struct vfio_device *device;
 
-	return vfio_add_group_dev(dev, &vfio_mdev_dev_ops, mdev);
+	device = vfio_add_group_dev(dev, &vfio_mdev_dev_ops, mdev);
+
+	return IS_ERR(device) ? PTR_ERR(device) : 0;
 }
 
 static void vfio_mdev_remove(struct device *dev)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b44578..f0a1d05f0137 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1926,6 +1926,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct vfio_pci_device *vdev;
 	struct iommu_group *group;
+	struct vfio_device *device;
 	int ret;
 
 	if (vfio_pci_is_denylisted(pdev))
@@ -1968,9 +1969,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	INIT_LIST_HEAD(&vdev->vma_list);
 	init_rwsem(&vdev->memory_lock);
 
-	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
-	if (ret)
+	device = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	if (IS_ERR(device)) {
+		ret = PTR_ERR(device);
 		goto out_free;
+	}
 
 	ret = vfio_pci_reflck_attach(vdev);
 	if (ret)
diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c
index fb4b385191f2..ff41fe0b758e 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -657,6 +657,7 @@ int vfio_platform_probe_common(struct vfio_platform_device *vdev,
 			       struct device *dev)
 {
 	struct iommu_group *group;
+	struct vfio_device *device;
 	int ret;
 
 	if (!vdev)
@@ -685,9 +686,11 @@ int vfio_platform_probe_common(struct vfio_platform_device *vdev,
 		goto put_reset;
 	}
 
-	ret = vfio_add_group_dev(dev, &vfio_platform_ops, vdev);
-	if (ret)
+	device = vfio_add_group_dev(dev, &vfio_platform_ops, vdev);
+	if (IS_ERR(device)) {
+		ret = PTR_ERR(device);
 		goto put_iommu;
+	}
 
 	mutex_init(&vdev->igate);
 
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 464caef97aff..067cd843961c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -848,8 +848,9 @@ static int vfio_iommu_group_notifier(struct notifier_block *nb,
 /**
  * VFIO driver API
  */
-int vfio_add_group_dev(struct device *dev,
-		       const struct vfio_device_ops *ops, void *device_data)
+struct vfio_device *vfio_add_group_dev(struct device *dev,
+				       const struct vfio_device_ops *ops,
+				       void *device_data)
 {
 	struct iommu_group *iommu_group;
 	struct vfio_group *group;
@@ -857,14 +858,14 @@ int vfio_add_group_dev(struct device *dev,
 
 	iommu_group = iommu_group_get(dev);
 	if (!iommu_group)
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);
 
 	group = vfio_group_get_from_iommu(iommu_group);
 	if (!group) {
 		group = vfio_create_group(iommu_group);
 		if (IS_ERR(group)) {
 			iommu_group_put(iommu_group);
-			return PTR_ERR(group);
+			return (struct vfio_device *)group;
 		}
 	} else {
 		/*
@@ -880,23 +881,19 @@ int vfio_add_group_dev(struct device *dev,
 			 iommu_group_id(iommu_group));
 		vfio_device_put(device);
 		vfio_group_put(group);
-		return -EBUSY;
+		return ERR_PTR(-EBUSY);
 	}
 
 	device = vfio_group_create_device(group, dev, ops, device_data);
-	if (IS_ERR(device)) {
-		vfio_group_put(group);
-		return PTR_ERR(device);
-	}
 
 	/*
-	 * Drop all but the vfio_device reference.  The vfio_device holds
-	 * a reference to the vfio_group, which holds a reference to the
-	 * iommu_group.
+	 * Drop all but the vfio_device reference.  The vfio_device, if
+	 * !IS_ERR() holds a reference to the vfio_group, which holds a
+	 * reference to the iommu_group.
 	 */
 	vfio_group_put(group);
 
-	return 0;
+	return device;
 }
 EXPORT_SYMBOL_GPL(vfio_add_group_dev);
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b7e18bde5aa8..b784463000d4 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -48,9 +48,9 @@ struct vfio_device_ops {
 extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
 extern void vfio_iommu_group_put(struct iommu_group *group, struct device *dev);
 
-extern int vfio_add_group_dev(struct device *dev,
-			      const struct vfio_device_ops *ops,
-			      void *device_data);
+extern struct vfio_device *vfio_add_group_dev(struct device *dev,
+					const struct vfio_device_ops *ops,
+					void *device_data);
 
 extern void *vfio_del_group_dev(struct device *dev);
 extern struct vfio_device *vfio_device_get_from_dev(struct device *dev);



* [RFC PATCH 03/10] vfio: Export unmap_mapping_range() wrapper
  2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
  2021-02-22 16:50 ` [RFC PATCH 01/10] vfio: Create vfio_fs_type with inode per device Alex Williamson
  2021-02-22 16:50 ` [RFC PATCH 02/10] vfio: Update vfio_add_group_dev() API Alex Williamson
@ 2021-02-22 16:50 ` Alex Williamson
  2021-02-22 16:51 ` [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range() Alex Williamson
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2021-02-22 16:50 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

Allow bus drivers to use the vfio pseudo fs mapping to zap all mmaps
across an offset range of their device file.
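Patch 04 will call this wrapper with a (start, len) pair covering all
BAR regions.  vfio-pci encodes the region index in the high bits of
the device file offset (VFIO_PCI_OFFSET_SHIFT is 40 in
vfio_pci_private.h), so each region owns a fixed offset window and
the range is plain arithmetic.  A standalone restatement of that
arithmetic, not the kernel code itself:

```c
#include <assert.h>
#include <stdint.h>

/* vfio-pci packs the region index into bits 40 and up of the device
 * file offset; each region owns a 2^40-byte offset window. */
#define VFIO_PCI_OFFSET_SHIFT 40
#define VFIO_PCI_INDEX_TO_OFFSET(index) \
	((uint64_t)(index) << VFIO_PCI_OFFSET_SHIFT)

enum {
	VFIO_PCI_BAR0_REGION_INDEX = 0,
	VFIO_PCI_ROM_REGION_INDEX  = 6,	/* first index past BAR5 */
};

/* The range handed to vfio_device_unmap_mapping_range() in patch 04:
 * from the start of BAR0's window to the end of BAR5's window. */
static uint64_t zap_start(void)
{
	return VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
}

static uint64_t zap_len(void)
{
	return VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
	       VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
}
```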

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio.c  |    7 +++++++
 include/linux/vfio.h |    2 ++
 2 files changed, 9 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 067cd843961c..da212425ab30 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -565,6 +565,13 @@ static struct inode *vfio_fs_inode_new(void)
 	return inode;
 }
 
+void vfio_device_unmap_mapping_range(struct vfio_device *device,
+				     loff_t start, loff_t len)
+{
+	unmap_mapping_range(device->inode->i_mapping, start, len, true);
+}
+EXPORT_SYMBOL_GPL(vfio_device_unmap_mapping_range);
+
 /**
  * Device objects - create, release, get, put, search
  */
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b784463000d4..f435dfca15eb 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -56,6 +56,8 @@ extern void *vfio_del_group_dev(struct device *dev);
 extern struct vfio_device *vfio_device_get_from_dev(struct device *dev);
 extern void vfio_device_put(struct vfio_device *device);
 extern void *vfio_device_data(struct vfio_device *device);
+extern void vfio_device_unmap_mapping_range(struct vfio_device *device,
+					    loff_t start, loff_t len);
 
 /* events for the backend driver notify callback */
 enum vfio_iommu_notify_type {



* [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range()
  2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
                   ` (2 preceding siblings ...)
  2021-02-22 16:50 ` [RFC PATCH 03/10] vfio: Export unmap_mapping_range() wrapper Alex Williamson
@ 2021-02-22 16:51 ` Alex Williamson
  2021-02-22 17:22   ` Jason Gunthorpe
  2021-02-22 16:51 ` [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup Alex Williamson
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2021-02-22 16:51 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

With the vfio device fd tied to the address space of the pseudo fs
inode, the core mm tracks all vmas that might be mmap'ing device
BARs for us, which removes our vma_list and all of the complicated
lock ordering necessary to manually zap each related vma.
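The resulting fault path is small enough to model directly: take the
memory lock, SIGBUS if memory decode is disabled, otherwise populate
the whole vma and return NOPAGE; zapping simply clears the mapping so
the next access faults again.  A userspace sketch of that state
machine (model_vdev, fault, zap and demo are illustrative names):

```c
#include <assert.h>

/* Outcomes modeled after vm_fault_t in the new vfio_pci_mmap_fault(). */
enum fault_ret { FAULT_NOPAGE, FAULT_SIGBUS };

struct model_vdev {
	int memory_enabled;	/* PCI_COMMAND memory decode state */
	int vma_populated;	/* stand-in for the vma's PTEs */
};

/* Sketch of the simplified fault handler: with vma tracking delegated
 * to the address_space, only the memory-enable check remains -- no
 * vma_list bookkeeping, no vma_lock. */
static enum fault_ret fault(struct model_vdev *vdev)
{
	/* down_read(&vdev->memory_lock) would bracket this check */
	if (!vdev->memory_enabled)
		return FAULT_SIGBUS;
	vdev->vma_populated = 1;	/* io_remap_pfn_range() */
	return FAULT_NOPAGE;
}

static void zap(struct model_vdev *vdev)
{
	vdev->vma_populated = 0;	/* unmap_mapping_range() */
}

static int demo(void)
{
	struct model_vdev v = { .memory_enabled = 1 };

	if (fault(&v) != FAULT_NOPAGE || !v.vma_populated)
		return -1;
	zap(&v);			/* e.g. before a bus reset */
	v.memory_enabled = 0;
	if (fault(&v) != FAULT_SIGBUS)	/* next access faults hard */
		return -2;
	return 0;
}
```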

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c         |  217 ++++-------------------------------
 drivers/vfio/pci/vfio_pci_private.h |    3 
 2 files changed, 28 insertions(+), 192 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index f0a1d05f0137..115f10f7b096 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -225,7 +225,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
 static void vfio_pci_disable(struct vfio_pci_device *vdev);
-static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data);
+static int vfio_pci_mem_trylock_and_zap_cb(struct pci_dev *pdev, void *data);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -1168,7 +1168,7 @@ static long vfio_pci_ioctl(void *device_data,
 		struct vfio_pci_group_info info;
 		struct vfio_devices devs = { .cur_index = 0 };
 		bool slot = false;
-		int i, group_idx, mem_idx = 0, count = 0, ret = 0;
+		int i, group_idx, count = 0, ret = 0;
 
 		minsz = offsetofend(struct vfio_pci_hot_reset, count);
 
@@ -1268,32 +1268,16 @@ static long vfio_pci_ioctl(void *device_data,
 		}
 
 		/*
-		 * We need to get memory_lock for each device, but devices
-		 * can share mmap_lock, therefore we need to zap and hold
-		 * the vma_lock for each device, and only then get each
-		 * memory_lock.
+		 * Try to get the memory_lock write lock for all devices and
+		 * zap all BAR mmaps.
 		 */
 		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-					    vfio_pci_try_zap_and_vma_lock_cb,
+					    vfio_pci_mem_trylock_and_zap_cb,
 					    &devs, slot);
-		if (ret)
-			goto hot_reset_release;
-
-		for (; mem_idx < devs.cur_index; mem_idx++) {
-			struct vfio_pci_device *tmp;
-
-			tmp = vfio_device_data(devs.devices[mem_idx]);
-
-			ret = down_write_trylock(&tmp->memory_lock);
-			if (!ret) {
-				ret = -EBUSY;
-				goto hot_reset_release;
-			}
-			mutex_unlock(&tmp->vma_lock);
-		}
 
 		/* User has access, do the reset */
-		ret = pci_reset_bus(vdev->pdev);
+		if (!ret)
+			ret = pci_reset_bus(vdev->pdev);
 
 hot_reset_release:
 		for (i = 0; i < devs.cur_index; i++) {
@@ -1303,10 +1287,7 @@ static long vfio_pci_ioctl(void *device_data,
 			device = devs.devices[i];
 			tmp = vfio_device_data(device);
 
-			if (i < mem_idx)
-				up_write(&tmp->memory_lock);
-			else
-				mutex_unlock(&tmp->vma_lock);
+			up_write(&tmp->memory_lock);
 			vfio_device_put(device);
 		}
 		kfree(devs.devices);
@@ -1452,100 +1433,18 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
-/* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
-static int vfio_pci_zap_and_vma_lock(struct vfio_pci_device *vdev, bool try)
+static void vfio_pci_zap_bars(struct vfio_pci_device *vdev)
 {
-	struct vfio_pci_mmap_vma *mmap_vma, *tmp;
-
-	/*
-	 * Lock ordering:
-	 * vma_lock is nested under mmap_lock for vm_ops callback paths.
-	 * The memory_lock semaphore is used by both code paths calling
-	 * into this function to zap vmas and the vm_ops.fault callback
-	 * to protect the memory enable state of the device.
-	 *
-	 * When zapping vmas we need to maintain the mmap_lock => vma_lock
-	 * ordering, which requires using vma_lock to walk vma_list to
-	 * acquire an mm, then dropping vma_lock to get the mmap_lock and
-	 * reacquiring vma_lock.  This logic is derived from similar
-	 * requirements in uverbs_user_mmap_disassociate().
-	 *
-	 * mmap_lock must always be the top-level lock when it is taken.
-	 * Therefore we can only hold the memory_lock write lock when
-	 * vma_list is empty, as we'd need to take mmap_lock to clear
-	 * entries.  vma_list can only be guaranteed empty when holding
-	 * vma_lock, thus memory_lock is nested under vma_lock.
-	 *
-	 * This enables the vm_ops.fault callback to acquire vma_lock,
-	 * followed by memory_lock read lock, while already holding
-	 * mmap_lock without risk of deadlock.
-	 */
-	while (1) {
-		struct mm_struct *mm = NULL;
-
-		if (try) {
-			if (!mutex_trylock(&vdev->vma_lock))
-				return 0;
-		} else {
-			mutex_lock(&vdev->vma_lock);
-		}
-		while (!list_empty(&vdev->vma_list)) {
-			mmap_vma = list_first_entry(&vdev->vma_list,
-						    struct vfio_pci_mmap_vma,
-						    vma_next);
-			mm = mmap_vma->vma->vm_mm;
-			if (mmget_not_zero(mm))
-				break;
-
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-			mm = NULL;
-		}
-		if (!mm)
-			return 1;
-		mutex_unlock(&vdev->vma_lock);
-
-		if (try) {
-			if (!mmap_read_trylock(mm)) {
-				mmput(mm);
-				return 0;
-			}
-		} else {
-			mmap_read_lock(mm);
-		}
-		if (try) {
-			if (!mutex_trylock(&vdev->vma_lock)) {
-				mmap_read_unlock(mm);
-				mmput(mm);
-				return 0;
-			}
-		} else {
-			mutex_lock(&vdev->vma_lock);
-		}
-		list_for_each_entry_safe(mmap_vma, tmp,
-					 &vdev->vma_list, vma_next) {
-			struct vm_area_struct *vma = mmap_vma->vma;
-
-			if (vma->vm_mm != mm)
-				continue;
-
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-
-			zap_vma_ptes(vma, vma->vm_start,
-				     vma->vm_end - vma->vm_start);
-		}
-		mutex_unlock(&vdev->vma_lock);
-		mmap_read_unlock(mm);
-		mmput(mm);
-	}
+	vfio_device_unmap_mapping_range(vdev->device,
+			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
+			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
+			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));
 }
 
 void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device *vdev)
 {
-	vfio_pci_zap_and_vma_lock(vdev, false);
 	down_write(&vdev->memory_lock);
-	mutex_unlock(&vdev->vma_lock);
+	vfio_pci_zap_bars(vdev);
 }
 
 u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_device *vdev)
@@ -1567,82 +1466,25 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev, u16 cmd)
 	up_write(&vdev->memory_lock);
 }
 
-/* Caller holds vma_lock */
-static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
-			      struct vm_area_struct *vma)
-{
-	struct vfio_pci_mmap_vma *mmap_vma;
-
-	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
-	if (!mmap_vma)
-		return -ENOMEM;
-
-	mmap_vma->vma = vma;
-	list_add(&mmap_vma->vma_next, &vdev->vma_list);
-
-	return 0;
-}
-
-/*
- * Zap mmaps on open so that we can fault them in on access and therefore
- * our vma_list only tracks mappings accessed since last zap.
- */
-static void vfio_pci_mmap_open(struct vm_area_struct *vma)
-{
-	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-}
-
-static void vfio_pci_mmap_close(struct vm_area_struct *vma)
-{
-	struct vfio_pci_device *vdev = vma->vm_private_data;
-	struct vfio_pci_mmap_vma *mmap_vma;
-
-	mutex_lock(&vdev->vma_lock);
-	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
-		if (mmap_vma->vma == vma) {
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-			break;
-		}
-	}
-	mutex_unlock(&vdev->vma_lock);
-}
-
 static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_device *vdev = vma->vm_private_data;
-	vm_fault_t ret = VM_FAULT_NOPAGE;
+	vm_fault_t ret = VM_FAULT_SIGBUS;
 
-	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
 
-	if (!__vfio_pci_memory_enabled(vdev)) {
-		ret = VM_FAULT_SIGBUS;
-		mutex_unlock(&vdev->vma_lock);
-		goto up_out;
-	}
-
-	if (__vfio_pci_add_vma(vdev, vma)) {
-		ret = VM_FAULT_OOM;
-		mutex_unlock(&vdev->vma_lock);
-		goto up_out;
-	}
-
-	mutex_unlock(&vdev->vma_lock);
+	if (__vfio_pci_memory_enabled(vdev) &&
+	    !io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+				vma->vm_end - vma->vm_start, vma->vm_page_prot))
+		ret = VM_FAULT_NOPAGE;
 
-	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
-		ret = VM_FAULT_SIGBUS;
-
-up_out:
 	up_read(&vdev->memory_lock);
+
 	return ret;
 }
 
 static const struct vm_operations_struct vfio_pci_mmap_ops = {
-	.open = vfio_pci_mmap_open,
-	.close = vfio_pci_mmap_close,
 	.fault = vfio_pci_mmap_fault,
 };
 
@@ -1926,7 +1768,6 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct vfio_pci_device *vdev;
 	struct iommu_group *group;
-	struct vfio_device *device;
 	int ret;
 
 	if (vfio_pci_is_denylisted(pdev))
@@ -1965,13 +1806,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	mutex_init(&vdev->ioeventfds_lock);
 	INIT_LIST_HEAD(&vdev->dummy_resources_list);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
-	mutex_init(&vdev->vma_lock);
-	INIT_LIST_HEAD(&vdev->vma_list);
 	init_rwsem(&vdev->memory_lock);
 
-	device = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
-	if (IS_ERR(device)) {
-		ret = PTR_ERR(device);
+	vdev->device = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	if (IS_ERR(vdev->device)) {
+		ret = PTR_ERR(vdev->device);
 		goto out_free;
 	}
 
@@ -2253,7 +2092,7 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 	return 0;
 }
 
-static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
+static int vfio_pci_mem_trylock_and_zap_cb(struct pci_dev *pdev, void *data)
 {
 	struct vfio_devices *devs = data;
 	struct vfio_device *device;
@@ -2273,15 +2112,13 @@ static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
 
 	vdev = vfio_device_data(device);
 
-	/*
-	 * Locking multiple devices is prone to deadlock, runaway and
-	 * unwind if we hit contention.
-	 */
-	if (!vfio_pci_zap_and_vma_lock(vdev, true)) {
+	if (!down_write_trylock(&vdev->memory_lock)) {
 		vfio_device_put(device);
 		return -EBUSY;
 	}
 
+	vfio_pci_zap_bars(vdev);
+
 	devs->devices[devs->cur_index++] = device;
 	return 0;
 }
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 9cd1882a05af..ba37f4eeefd0 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -101,6 +101,7 @@ struct vfio_pci_mmap_vma {
 
 struct vfio_pci_device {
 	struct pci_dev		*pdev;
+	struct vfio_device	*device;
 	void __iomem		*barmap[PCI_STD_NUM_BARS];
 	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
 	u8			*pci_config_map;
@@ -139,8 +140,6 @@ struct vfio_pci_device {
 	struct list_head	ioeventfds_list;
 	struct vfio_pci_vf_token	*vf_token;
 	struct notifier_block	nb;
-	struct mutex		vma_lock;
-	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
 };
 



* [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup
  2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
                   ` (3 preceding siblings ...)
  2021-02-22 16:51 ` [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range() Alex Williamson
@ 2021-02-22 16:51 ` Alex Williamson
  2021-02-22 17:29   ` Jason Gunthorpe
  2021-02-22 16:51 ` [RFC PATCH 06/10] vfio: Add a device notifier interface Alex Williamson
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2021-02-22 16:51 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

Introduce a vfio bus driver policy whereby using the exported
vfio_device_vma_open() as the vm_ops.open for a vma indicates that
vm_private_data is the struct vfio_device pointer associated with
the vma.  This allows a direct vma-to-device lookup.

Since we operate on an active, open vma to the device, we can
directly increment the vfio_device reference.
Implemented only for vfio-pci for now.
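The trick here is using the address of an exported no-op function as
a type tag: pointer-comparing vm_ops->open tells the lookup whether
vm_private_data can be trusted to be a vfio_device.  A userspace
sketch of the pattern, with simplified stand-ins for the mm structs
(get_from_vma and demo are illustrative names):

```c
#include <assert.h>
#include <stddef.h>

struct vfio_device { int refs; };

struct vm_operations { void (*open)(void *vma); };
struct vm_area {
	const struct vm_operations *vm_ops;
	void *vm_private_data;
};

/* The exported no-op open serves purely as a type tag: its address
 * identifies vmas whose vm_private_data is a struct vfio_device. */
static void vfio_device_vma_open(void *vma) { (void)vma; }

static const struct vm_operations vfio_ops = {
	.open = vfio_device_vma_open,
};
static const struct vm_operations other_ops = { .open = NULL };

/* Sketch of vfio_device_get_from_vma(): pointer-compare the open op
 * before trusting vm_private_data, then take a reference. */
static struct vfio_device *get_from_vma(struct vm_area *vma)
{
	struct vfio_device *device;

	if (vma->vm_ops->open != vfio_device_vma_open)
		return NULL;	/* kernel code returns ERR_PTR(-ENODEV) */
	device = vma->vm_private_data;
	device->refs++;
	return device;
}

static int demo(void)
{
	struct vfio_device dev = { .refs = 1 };
	struct vm_area good = { &vfio_ops, &dev };
	struct vm_area bad = { &other_ops, &dev };

	if (get_from_vma(&bad) != NULL)
		return -1;
	if (get_from_vma(&good) != &dev || dev.refs != 2)
		return -2;
	return 0;
}
```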

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c |    6 ++++--
 drivers/vfio/vfio.c         |   24 ++++++++++++++++++++++++
 include/linux/vfio.h        |    2 ++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 115f10f7b096..f9529bac6c97 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1469,7 +1469,8 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev, u16 cmd)
 static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct vfio_pci_device *vdev = vma->vm_private_data;
+	struct vfio_device *device = vma->vm_private_data;
+	struct vfio_pci_device *vdev = vfio_device_data(device);
 	vm_fault_t ret = VM_FAULT_SIGBUS;
 
 	down_read(&vdev->memory_lock);
@@ -1485,6 +1486,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 }
 
 static const struct vm_operations_struct vfio_pci_mmap_ops = {
+	.open = vfio_device_vma_open,
 	.fault = vfio_pci_mmap_fault,
 };
 
@@ -1542,7 +1544,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 		}
 	}
 
-	vma->vm_private_data = vdev;
+	vma->vm_private_data = vdev->device;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
 
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index da212425ab30..399c42b77fbb 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -572,6 +572,15 @@ void vfio_device_unmap_mapping_range(struct vfio_device *device,
 }
 EXPORT_SYMBOL_GPL(vfio_device_unmap_mapping_range);
 
+/*
+ * A VFIO bus driver using this open callback will provide a
+ * struct vfio_device pointer in the vm_private_data field.
+ */
+void vfio_device_vma_open(struct vm_area_struct *vma)
+{
+}
+EXPORT_SYMBOL_GPL(vfio_device_vma_open);
+
 /**
  * Device objects - create, release, get, put, search
  */
@@ -927,6 +936,21 @@ struct vfio_device *vfio_device_get_from_dev(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(vfio_device_get_from_dev);
 
+struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
+{
+	struct vfio_device *device;
+
+	if (vma->vm_ops->open != vfio_device_vma_open)
+		return ERR_PTR(-ENODEV);
+
+	device = vma->vm_private_data;
+
+	vfio_device_get(device);
+
+	return device;
+}
+EXPORT_SYMBOL_GPL(vfio_device_get_from_vma);
+
 static struct vfio_device *vfio_device_get_from_name(struct vfio_group *group,
 						     char *buf)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index f435dfca15eb..188c2f3feed9 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -58,6 +58,8 @@ extern void vfio_device_put(struct vfio_device *device);
 extern void *vfio_device_data(struct vfio_device *device);
 extern void vfio_device_unmap_mapping_range(struct vfio_device *device,
 					    loff_t start, loff_t len);
+extern void vfio_device_vma_open(struct vm_area_struct *vma);
+extern struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma);
 
 /* events for the backend driver notify callback */
 enum vfio_iommu_notify_type {


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH 06/10] vfio: Add a device notifier interface
  2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
                   ` (4 preceding siblings ...)
  2021-02-22 16:51 ` [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup Alex Williamson
@ 2021-02-22 16:51 ` Alex Williamson
  2021-02-22 16:51 ` [RFC PATCH 07/10] vfio/pci: Notify on device release Alex Williamson
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2021-02-22 16:51 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

Using a vfio device, a notifier block can be registered to receive
select device events.  Notifiers can only be registered for contained
devices, i.e. those available through a user context.  Registering a
notifier increments the reference on that container context; notifiers
must therefore respond to the release event, at minimum, by removing
themselves asynchronously.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/Kconfig |    1 +
 drivers/vfio/vfio.c  |   35 +++++++++++++++++++++++++++++++++++
 include/linux/vfio.h |    9 +++++++++
 3 files changed, 45 insertions(+)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 5533df91b257..8ac1601b681b 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -23,6 +23,7 @@ menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
+	select SRCU
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/driver-api/vfio.rst for more details.
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 399c42b77fbb..1a1b46215ac4 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -105,6 +105,7 @@ struct vfio_device {
 	struct list_head		group_next;
 	void				*device_data;
 	struct inode			*inode;
+	struct srcu_notifier_head	notifier;
 };
 
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -610,6 +611,7 @@ struct vfio_device *vfio_group_create_device(struct vfio_group *group,
 	device->ops = ops;
 	device->device_data = device_data;
 	dev_set_drvdata(dev, device);
+	srcu_init_notifier_head(&device->notifier);
 
 	/* No need to get group_lock, caller has group reference */
 	vfio_group_get(group);
@@ -1778,6 +1780,39 @@ static const struct file_operations vfio_device_fops = {
 	.mmap		= vfio_device_fops_mmap,
 };
 
+int vfio_device_register_notifier(struct vfio_device *device,
+				  struct notifier_block *nb)
+{
+	int ret;
+
+	/* Container ref persists until unregister on success */
+	ret =  vfio_group_add_container_user(device->group);
+	if (ret)
+		return ret;
+
+	ret = srcu_notifier_chain_register(&device->notifier, nb);
+	if (ret)
+		vfio_group_try_dissolve_container(device->group);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_device_register_notifier);
+
+void vfio_device_unregister_notifier(struct vfio_device *device,
+				    struct notifier_block *nb)
+{
+	if (!srcu_notifier_chain_unregister(&device->notifier, nb))
+		vfio_group_try_dissolve_container(device->group);
+}
+EXPORT_SYMBOL_GPL(vfio_device_unregister_notifier);
+
+int vfio_device_notifier_call(struct vfio_device *device,
+			      enum vfio_device_notify_type event)
+{
+	return srcu_notifier_call_chain(&device->notifier, event, NULL);
+}
+EXPORT_SYMBOL_GPL(vfio_device_notifier_call);
+
 /**
  * External user API, exported by symbols to be linked dynamically.
  *
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 188c2f3feed9..8217cd4ea53d 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -60,6 +60,15 @@ extern void vfio_device_unmap_mapping_range(struct vfio_device *device,
 					    loff_t start, loff_t len);
 extern void vfio_device_vma_open(struct vm_area_struct *vma);
 extern struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma);
+extern int vfio_device_register_notifier(struct vfio_device *device,
+					 struct notifier_block *nb);
+extern void vfio_device_unregister_notifier(struct vfio_device *device,
+					    struct notifier_block *nb);
+enum vfio_device_notify_type {
+	VFIO_DEVICE_RELEASE = 0,
+};
+int vfio_device_notifier_call(struct vfio_device *device,
+			      enum vfio_device_notify_type event);
 
 /* events for the backend driver notify callback */
 enum vfio_iommu_notify_type {



* [RFC PATCH 07/10] vfio/pci: Notify on device release
  2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
                   ` (5 preceding siblings ...)
  2021-02-22 16:51 ` [RFC PATCH 06/10] vfio: Add a device notifier interface Alex Williamson
@ 2021-02-22 16:51 ` Alex Williamson
  2021-02-22 16:52 ` [RFC PATCH 08/10] vfio/type1: Refactor pfn_list clearing Alex Williamson
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2021-02-22 16:51 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

Trigger a release notifier call when the open reference count drops to zero.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index f9529bac6c97..fb8307430e24 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -560,6 +560,7 @@ static void vfio_pci_release(void *device_data)
 	mutex_lock(&vdev->reflck->lock);
 
 	if (!(--vdev->refcnt)) {
+		vfio_device_notifier_call(vdev->device, VFIO_DEVICE_RELEASE);
 		vfio_pci_vf_token_user_add(vdev, -1);
 		vfio_spapr_pci_eeh_release(vdev->pdev);
 		vfio_pci_disable(vdev);



* [RFC PATCH 08/10] vfio/type1: Refactor pfn_list clearing
  2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
                   ` (6 preceding siblings ...)
  2021-02-22 16:51 ` [RFC PATCH 07/10] vfio/pci: Notify on device release Alex Williamson
@ 2021-02-22 16:52 ` Alex Williamson
  2021-02-22 16:52 ` [RFC PATCH 09/10] vfio/type1: Pass iommu and dma objects through to vaddr_get_pfn Alex Williamson
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2021-02-22 16:52 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

Pull code out to a function for re-use.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c |   57 +++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 23 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b3df383d7028..5099b3c9dce0 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -484,6 +484,39 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 	return ret;
 }
 
+/* Return 1 if iommu->lock dropped and notified, 0 if done */
+static int unmap_dma_pfn_list(struct vfio_iommu *iommu, struct vfio_dma *dma,
+			      struct vfio_dma **dma_last, int *retries)
+{
+	if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
+		struct vfio_iommu_type1_dma_unmap nb_unmap;
+
+		if (*dma_last == dma) {
+			BUG_ON(++(*retries) > 10);
+		} else {
+			*dma_last = dma;
+			*retries = 0;
+		}
+
+		nb_unmap.iova = dma->iova;
+		nb_unmap.size = dma->size;
+
+		/*
+		 * Notify anyone (mdev vendor drivers) to invalidate and
+		 * unmap iovas within the range we're about to unmap.
+		 * Vendor drivers MUST unpin pages in response to an
+		 * invalidation.
+		 */
+		mutex_unlock(&iommu->lock);
+		blocking_notifier_call_chain(&iommu->notifier,
+					     VFIO_IOMMU_NOTIFY_DMA_UNMAP,
+					     &nb_unmap);
+		return 1;
+	}
+
+	return 0;
+}
+
 static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 			 int prot, unsigned long *pfn)
 {
@@ -1307,29 +1340,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 			continue;
 		}
 
-		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
-			struct vfio_iommu_type1_dma_unmap nb_unmap;
-
-			if (dma_last == dma) {
-				BUG_ON(++retries > 10);
-			} else {
-				dma_last = dma;
-				retries = 0;
-			}
-
-			nb_unmap.iova = dma->iova;
-			nb_unmap.size = dma->size;
-
-			/*
-			 * Notify anyone (mdev vendor drivers) to invalidate and
-			 * unmap iovas within the range we're about to unmap.
-			 * Vendor drivers MUST unpin pages in response to an
-			 * invalidation.
-			 */
-			mutex_unlock(&iommu->lock);
-			blocking_notifier_call_chain(&iommu->notifier,
-						    VFIO_IOMMU_NOTIFY_DMA_UNMAP,
-						    &nb_unmap);
+		if (unmap_dma_pfn_list(iommu, dma, &dma_last, &retries)) {
 			mutex_lock(&iommu->lock);
 			goto again;
 		}



* [RFC PATCH 09/10] vfio/type1: Pass iommu and dma objects through to vaddr_get_pfn
  2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
                   ` (7 preceding siblings ...)
  2021-02-22 16:52 ` [RFC PATCH 08/10] vfio/type1: Refactor pfn_list clearing Alex Williamson
@ 2021-02-22 16:52 ` Alex Williamson
  2021-02-22 16:52 ` [RFC PATCH 10/10] vfio/type1: Register device notifier Alex Williamson
  2021-02-22 18:00 ` [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Jason Gunthorpe
  10 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2021-02-22 16:52 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

We'll need these to track vfio device mappings.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c |   30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5099b3c9dce0..b34ee4b96a4a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -517,15 +517,16 @@ static int unmap_dma_pfn_list(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	return 0;
 }
 
-static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
-			 int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct vfio_iommu *iommu, struct vfio_dma *dma,
+			 struct mm_struct *mm, unsigned long vaddr,
+			 unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
 	unsigned int flags = 0;
 	int ret;
 
-	if (prot & IOMMU_WRITE)
+	if (dma->prot & IOMMU_WRITE)
 		flags |= FOLL_WRITE;
 
 	mmap_read_lock(mm);
@@ -543,7 +544,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
-		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
+		ret = follow_fault_pfn(vma, mm, vaddr, pfn,
+				       dma->prot & IOMMU_WRITE);
 		if (ret == -EAGAIN)
 			goto retry;
 
@@ -615,7 +617,8 @@ static int vfio_wait_all_valid(struct vfio_iommu *iommu)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
+static long vfio_pin_pages_remote(struct vfio_iommu *iommu,
+				  struct vfio_dma *dma, unsigned long vaddr,
 				  long npage, unsigned long *pfn_base,
 				  unsigned long limit)
 {
@@ -628,7 +631,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	if (!current->mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base);
+	ret = vaddr_get_pfn(iommu, dma, current->mm, vaddr, pfn_base);
 	if (ret)
 		return ret;
 
@@ -655,7 +658,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	/* Lock all the consecutive pages from pfn_base */
 	for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
 	     pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
-		ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn);
+		ret = vaddr_get_pfn(iommu, dma, current->mm, vaddr, &pfn);
 		if (ret)
 			break;
 
@@ -715,7 +718,8 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 	return unlocked;
 }
 
-static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
+static int vfio_pin_page_external(struct vfio_iommu *iommu,
+				  struct vfio_dma *dma, unsigned long vaddr,
 				  unsigned long *pfn_base, bool do_accounting)
 {
 	struct mm_struct *mm;
@@ -725,7 +729,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 	if (!mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);
+	ret = vaddr_get_pfn(iommu, dma, mm, vaddr, pfn_base);
 	if (!ret && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
 		ret = vfio_lock_acct(dma, 1, true);
 		if (ret) {
@@ -833,8 +837,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 		}
 
 		remote_vaddr = dma->vaddr + (iova - dma->iova);
-		ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
-					     do_accounting);
+		ret = vfio_pin_page_external(iommu, dma, remote_vaddr,
+					     &phys_pfn[i], do_accounting);
 		if (ret)
 			goto pin_unwind;
 
@@ -1404,7 +1408,7 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 
 	while (size) {
 		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
+		npage = vfio_pin_pages_remote(iommu, dma, vaddr + dma->size,
 					      size >> PAGE_SHIFT, &pfn, limit);
 		if (npage <= 0) {
 			WARN_ON(!npage);
@@ -1660,7 +1664,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 				size_t n = dma->iova + dma->size - iova;
 				long npage;
 
-				npage = vfio_pin_pages_remote(dma, vaddr,
+				npage = vfio_pin_pages_remote(iommu, dma, vaddr,
 							      n >> PAGE_SHIFT,
 							      &pfn, limit);
 				if (npage <= 0) {



* [RFC PATCH 10/10] vfio/type1: Register device notifier
  2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
                   ` (8 preceding siblings ...)
  2021-02-22 16:52 ` [RFC PATCH 09/10] vfio/type1: Pass iommu and dma objects through to vaddr_get_pfn Alex Williamson
@ 2021-02-22 16:52 ` Alex Williamson
  2021-02-22 17:55   ` Jason Gunthorpe
  2021-02-22 18:00 ` [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Jason Gunthorpe
  10 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2021-02-22 16:52 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

Introduce a new default strict MMIO mapping mode where the vma for
a VM_PFNMAP mapping must be backed by a vfio device.  This allows
holding a reference to the device and registering a notifier for the
device, which additionally keeps the device in an IOMMU context for
the extent of the DMA mapping.  On notification of device release,
automatically drop the DMA mappings for it.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c |  124 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b34ee4b96a4a..2a16257bd5b6 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -61,6 +61,11 @@ module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
 MODULE_PARM_DESC(dma_entry_limit,
 		 "Maximum number of user DMA mappings per container (65535).");
 
+static bool strict_mmio_maps = true;
+module_param_named(strict_mmio_maps, strict_mmio_maps, bool, 0644);
+MODULE_PARM_DESC(strict_mmio_maps,
+		 "Restrict to safe DMA mappings of device memory (true).");
+
 struct vfio_iommu {
 	struct list_head	domain_list;
 	struct list_head	iova_list;
@@ -88,6 +93,14 @@ struct vfio_domain {
 	bool			fgsp;		/* Fine-grained super pages */
 };
 
+/* Req separate object for async removal from notifier vs dropping vfio_dma */
+struct pfnmap_obj {
+	struct notifier_block	nb;
+	struct work_struct	work;
+	struct vfio_iommu	*iommu;
+	struct vfio_device	*device;
+};
+
 struct vfio_dma {
 	struct rb_node		node;
 	dma_addr_t		iova;		/* Device address */
@@ -100,6 +113,7 @@ struct vfio_dma {
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
+	struct pfnmap_obj	*pfnmap;
 };
 
 struct vfio_group {
@@ -517,6 +531,68 @@ static int unmap_dma_pfn_list(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	return 0;
 }
 
+static void unregister_device_bg(struct work_struct *work)
+{
+	struct pfnmap_obj *pfnmap = container_of(work, struct pfnmap_obj, work);
+
+	vfio_device_unregister_notifier(pfnmap->device, &pfnmap->nb);
+	vfio_device_put(pfnmap->device);
+	kfree(pfnmap);
+}
+
+/*
+ * The pfnmap object can outlive the dma mapping referencing it, but it holds
+ * a container reference ensuring the iommu exists.  Find the dma, if one exists.
+ */
+static struct vfio_dma *pfnmap_find_dma(struct pfnmap_obj *pfnmap)
+{
+	struct rb_node *n;
+
+	for (n = rb_first(&pfnmap->iommu->dma_list); n; n = rb_next(n)) {
+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+		if (dma->pfnmap == pfnmap)
+			return dma;
+	}
+
+	return NULL;
+}
+
+static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma);
+
+static int vfio_device_nb_cb(struct notifier_block *nb,
+			     unsigned long action, void *unused)
+{
+	struct pfnmap_obj *pfnmap = container_of(nb, struct pfnmap_obj, nb);
+
+	switch (action) {
+	case VFIO_DEVICE_RELEASE:
+	{
+		struct vfio_dma *dma, *dma_last = NULL;
+		int retries = 0;
+again:
+		mutex_lock(&pfnmap->iommu->lock);
+		dma = pfnmap_find_dma(pfnmap);
+		if (dma) {
+			if (unmap_dma_pfn_list(pfnmap->iommu, dma,
+					       &dma_last, &retries))
+				goto again;
+
+			dma->pfnmap = NULL;
+			vfio_remove_dma(pfnmap->iommu, dma);
+		}
+		mutex_unlock(&pfnmap->iommu->lock);
+
+		/* Cannot unregister notifier from callback chain */
+		INIT_WORK(&pfnmap->work, unregister_device_bg);
+		schedule_work(&pfnmap->work);
+		break;
+	}
+	}
+
+	return NOTIFY_OK;
+}
+
 static int vaddr_get_pfn(struct vfio_iommu *iommu, struct vfio_dma *dma,
 			 struct mm_struct *mm, unsigned long vaddr,
 			 unsigned long *pfn)
@@ -549,8 +625,48 @@ static int vaddr_get_pfn(struct vfio_iommu *iommu, struct vfio_dma *dma,
 		if (ret == -EAGAIN)
 			goto retry;
 
-		if (!ret && !is_invalid_reserved_pfn(*pfn))
+		if (!ret && !is_invalid_reserved_pfn(*pfn)) {
 			ret = -EFAULT;
+			goto done;
+		}
+
+		if (!dma->pfnmap) {
+			struct pfnmap_obj *pfnmap;
+			struct vfio_device *device;
+
+			pfnmap = kzalloc(sizeof(*pfnmap), GFP_KERNEL);
+			if (!pfnmap) {
+				ret = -ENOMEM;
+				goto done;
+			}
+
+			pfnmap->iommu = iommu;
+			pfnmap->nb.notifier_call = vfio_device_nb_cb;
+
+			device = vfio_device_get_from_vma(vma);
+			if (IS_ERR(device)) {
+				kfree(pfnmap);
+				if (strict_mmio_maps)
+					ret = PTR_ERR(device);
+
+				goto done;
+			}
+
+			pfnmap->device = device;
+			ret = vfio_device_register_notifier(device,
+							    &pfnmap->nb);
+			if (ret) {
+				vfio_device_put(device);
+				kfree(pfnmap);
+				if (!strict_mmio_maps)
+					ret = 0;
+
+				goto done;
+			}
+
+			dma->pfnmap = pfnmap;
+		}
+
 	}
 done:
 	mmap_read_unlock(mm);
@@ -1097,6 +1213,12 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
+	if (dma->pfnmap) {
+		vfio_device_unregister_notifier(dma->pfnmap->device,
+						&dma->pfnmap->nb);
+		vfio_device_put(dma->pfnmap->device);
+		kfree(dma->pfnmap);
+	}
 	vfio_unmap_unpin(iommu, dma, true);
 	vfio_unlink_dma(iommu, dma);
 	put_task_struct(dma->task);



* Re: [RFC PATCH 02/10] vfio: Update vfio_add_group_dev() API
  2021-02-22 16:50 ` [RFC PATCH 02/10] vfio: Update vfio_add_group_dev() API Alex Williamson
@ 2021-02-22 17:01   ` Jason Gunthorpe
  0 siblings, 0 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-22 17:01 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Mon, Feb 22, 2021 at 09:50:47AM -0700, Alex Williamson wrote:
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 464caef97aff..067cd843961c 100644
> +++ b/drivers/vfio/vfio.c
> @@ -848,8 +848,9 @@ static int vfio_iommu_group_notifier(struct notifier_block *nb,
>  /**
>   * VFIO driver API
>   */
> -int vfio_add_group_dev(struct device *dev,
> -		       const struct vfio_device_ops *ops, void *device_data)
> +struct vfio_device *vfio_add_group_dev(struct device *dev,
> +				       const struct vfio_device_ops *ops,
> +				       void *device_data)
>  {
>  	struct iommu_group *iommu_group;
>  	struct vfio_group *group;
> @@ -857,14 +858,14 @@ int vfio_add_group_dev(struct device *dev,
>  
>  	iommu_group = iommu_group_get(dev);
>  	if (!iommu_group)
> -		return -EINVAL;
> +		return ERR_PTR(-EINVAL);
>  
>  	group = vfio_group_get_from_iommu(iommu_group);
>  	if (!group) {
>  		group = vfio_create_group(iommu_group);
>  		if (IS_ERR(group)) {
>  			iommu_group_put(iommu_group);
> -			return PTR_ERR(group);
> +			return (struct vfio_device *)group;

Use ERR_CAST() here

Also, I wrote a small series last week that goes further than this;
I made 'struct vfio_device *' the universal handle to refer to the
device, instead of using 'void *' or 'struct device *' as a surrogate.

It is interesting that you hit on the same issue as a blocker to this
series.  So far I've found quite a few other things that are out of
sorts because of this.

Cheers,
Jason


* Re: [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range()
  2021-02-22 16:51 ` [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range() Alex Williamson
@ 2021-02-22 17:22   ` Jason Gunthorpe
  2021-02-24 21:55     ` Alex Williamson
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-22 17:22 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Mon, Feb 22, 2021 at 09:51:13AM -0700, Alex Williamson wrote:

> +	vfio_device_unmap_mapping_range(vdev->device,
> +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
> +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
> +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));

Isn't this the same as invalidating everything? I see in
vfio_pci_mmap():

	if (index >= VFIO_PCI_ROM_REGION_INDEX)
		return -EINVAL;

> @@ -2273,15 +2112,13 @@ static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
>  
>  	vdev = vfio_device_data(device);
>  
> -	/*
> -	 * Locking multiple devices is prone to deadlock, runaway and
> -	 * unwind if we hit contention.
> -	 */
> -	if (!vfio_pci_zap_and_vma_lock(vdev, true)) {
> +	if (!down_write_trylock(&vdev->memory_lock)) {
>  		vfio_device_put(device);
>  		return -EBUSY;
>  	}

And this is only done as part of VFIO_DEVICE_PCI_HOT_RESET?

It looks like VFIO_DEVICE_PCI_HOT_RESET affects the entire slot?

How about putting the inode on the reflck structure, which is also
per-slot, and then a single unmap_mapping_range() will take care of
everything, no need to iterate over things in the driver core.

Note the vm_pgoff space doesn't have any special meaning; it is fine
for two struct vfio_pci_devices to share the same address space while
using incompatible, overlapping pg_offs

> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 9cd1882a05af..ba37f4eeefd0 100644
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -101,6 +101,7 @@ struct vfio_pci_mmap_vma {
>  
>  struct vfio_pci_device {
>  	struct pci_dev		*pdev;
> +	struct vfio_device	*device;

Ah, I did this too, but I didn't use a pointer :)

All the places trying to call vfio_device_put() when they really want
a vfio_pci_device * become simpler now. Eg struct vfio_devices wants
to have an array of vfio_pci_device, and get_pf_vdev() only needs to
return one pointer.

Jason


* Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup
  2021-02-22 16:51 ` [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup Alex Williamson
@ 2021-02-22 17:29   ` Jason Gunthorpe
  2021-02-24 21:55     ` Alex Williamson
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-22 17:29 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Mon, Feb 22, 2021 at 09:51:25AM -0700, Alex Williamson wrote:

> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index da212425ab30..399c42b77fbb 100644
> +++ b/drivers/vfio/vfio.c
> @@ -572,6 +572,15 @@ void vfio_device_unmap_mapping_range(struct vfio_device *device,
>  }
>  EXPORT_SYMBOL_GPL(vfio_device_unmap_mapping_range);
>  
> +/*
> + * A VFIO bus driver using this open callback will provide a
> + * struct vfio_device pointer in the vm_private_data field.

The vfio_device pointer should be stored in the struct file

> +struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
> +{
> +	struct vfio_device *device;
> +
> +	if (vma->vm_ops->open != vfio_device_vma_open)
> +		return ERR_PTR(-ENODEV);
> +

Having looked at VFIO alot more closely last week, this is even more
trivial - VFIO only creates mmaps of memory we want to invalidate, so
this is just very simple:

struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
{
       if (!vma->vm_file || vma->vm_file->f_op != &vfio_device_fops)
	   return ERR_PTR(-ENODEV);
       return vma->vm_file->private_data;
}

The only use of the special ops would be if there are multiple types
of mmap's going on, but for this narrow use case those would be safely
distinguished by the vm_pgoff instead

> +extern void vfio_device_vma_open(struct vm_area_struct *vma);
> +extern struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma);

No externs on function prototypes in new code please, we've been
slowly deleting them..

Jason


* Re: [RFC PATCH 10/10] vfio/type1: Register device notifier
  2021-02-22 16:52 ` [RFC PATCH 10/10] vfio/type1: Register device notifier Alex Williamson
@ 2021-02-22 17:55   ` Jason Gunthorpe
  2021-02-24 21:55     ` Alex Williamson
  2021-02-26  5:47     ` Christoph Hellwig
  0 siblings, 2 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-22 17:55 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Mon, Feb 22, 2021 at 09:52:32AM -0700, Alex Williamson wrote:
> Introduce a new default strict MMIO mapping mode where the vma for
> a VM_PFNMAP mapping must be backed by a vfio device.  This allows
> holding a reference to the device and registering a notifier for the
> device, which additionally keeps the device in an IOMMU context for
> the extent of the DMA mapping.  On notification of device release,
> automatically drop the DMA mappings for it.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>  drivers/vfio/vfio_iommu_type1.c |  124 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 123 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index b34ee4b96a4a..2a16257bd5b6 100644
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -61,6 +61,11 @@ module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
>  MODULE_PARM_DESC(dma_entry_limit,
>  		 "Maximum number of user DMA mappings per container (65535).");
>  
> +static bool strict_mmio_maps = true;
> +module_param_named(strict_mmio_maps, strict_mmio_maps, bool, 0644);
> +MODULE_PARM_DESC(strict_mmio_maps,
> +		 "Restrict to safe DMA mappings of device memory (true).");

I think this should be a kconfig, historically we've required kconfig
to opt-in to unsafe things that could violate kernel security. Someone
building a secure boot trusted kernel system should not have an
options for userspace to just turn off protections.

> +/* Req separate object for async removal from notifier vs dropping vfio_dma */
> +struct pfnmap_obj {
> +	struct notifier_block	nb;
> +	struct work_struct	work;
> +	struct vfio_iommu	*iommu;
> +	struct vfio_device	*device;
> +};

So this is basically the dmabuf, I think it would be simple enough to
go in here and change it down the road if someone had interest.

> +static void unregister_device_bg(struct work_struct *work)
> +{
> +	struct pfnmap_obj *pfnmap = container_of(work, struct pfnmap_obj, work);
> +
> +	vfio_device_unregister_notifier(pfnmap->device, &pfnmap->nb);
> +	vfio_device_put(pfnmap->device);

The device_put keeps the device from becoming unregistered, but what
happens during the hot reset case? Is this what the cover letter
was talking about? CPU access is revoked but P2P is still possible?

> +static int vfio_device_nb_cb(struct notifier_block *nb,
> +			     unsigned long action, void *unused)
> +{
> +	struct pfnmap_obj *pfnmap = container_of(nb, struct pfnmap_obj, nb);
> +
> +	switch (action) {
> +	case VFIO_DEVICE_RELEASE:
> +	{
> +		struct vfio_dma *dma, *dma_last = NULL;
> +		int retries = 0;
> +again:
> +		mutex_lock(&pfnmap->iommu->lock);
> +		dma = pfnmap_find_dma(pfnmap);

Feels a bit strange that the vfio_dma isn't linked to the pfnmap_obj
instead of searching the entire list?

> @@ -549,8 +625,48 @@ static int vaddr_get_pfn(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  		if (ret == -EAGAIN)
>  			goto retry;

I'd prefer this were written a bit differently; I would like it very
much if this didn't mis-use follow_pte() by returning the pfn outside
the lock.

vaddr_get_bar_pfn(..)
{
        vma = find_vma_intersection(mm, vaddr, vaddr + 1);
	if (!vma)
           return -ENOENT;
        if ((vma->vm_flags & VM_DENYWRITE) && (prot & PROT_WRITE)) // Check me
           return -EFAULT;
        device = vfio_device_get_from_vma(vma);
	if (!device)
           return -ENOENT;

	/*
         * Now do the same as vfio_pci_mmap_fault() - the vm_pgoff must
	 * be the physical pfn when using this mechanism. Delete follow_pte() entirely.
         */
        pfn = (vaddr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff
	
        /* de-dup device and record that we are using device's pages in the
	   pfnmap */
        ...
}

This would be significantly better if it could do whole ranges instead
of page at a time.

Jason


* Re: [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements
  2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
                   ` (9 preceding siblings ...)
  2021-02-22 16:52 ` [RFC PATCH 10/10] vfio/type1: Register device notifier Alex Williamson
@ 2021-02-22 18:00 ` Jason Gunthorpe
  10 siblings, 0 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-22 18:00 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Mon, Feb 22, 2021 at 09:50:22AM -0700, Alex Williamson wrote:
> This is a re-implementation of [1] following suggestions and code from
> Jason Gunthorpe.  This is lightly tested but seems functional and
> throws no lockdep warnings.  In this series we tremendously simplify
> zapping of vmas mapping device memory using unmap_mapping_range(), we
> create a protocol for looking up a vfio_device from a vma and provide
> an interface to get a reference from that vma, using that device
> reference, the caller can register a notifier for the device to
> trigger on events such as device release.  This notifier is only
> enabled here for vfio-pci, but both the vma policy and the notifier
> trigger should be trivial to add to any vfio bus driver after RFC.
> 
> Does this look more like the direction we should go?

Yep, it seems pretty good already, see my remarks in each patch

For security only vfio_device's that have been enabled for P2P, set
the vm_pgoff to the pfn, and trigger invalidation, should be used with
this mechanism.

I'd add some global opt-in so vfio_device_get_from_vma() will refuse
to return vfio_device's that don't declare they have support.

Adding a flags member to vfio_device_ops would get it done fairly cleanly.

Jason

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range()
  2021-02-22 17:22   ` Jason Gunthorpe
@ 2021-02-24 21:55     ` Alex Williamson
  2021-02-25  0:57       ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2021-02-24 21:55 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: cohuck, kvm, linux-kernel, peterx

On Mon, 22 Feb 2021 13:22:30 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 22, 2021 at 09:51:13AM -0700, Alex Williamson wrote:
> 
> > +	vfio_device_unmap_mapping_range(vdev->device,
> > +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
> > +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
> > +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));  
> 
> Isn't this the same as invalidating everything? I see in
> vfio_pci_mmap():
> 
> 	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> 		return -EINVAL;

No, immediately above that is:

        if (index >= VFIO_PCI_NUM_REGIONS) {
                int regnum = index - VFIO_PCI_NUM_REGIONS;
                struct vfio_pci_region *region = vdev->region + regnum;

                if (region && region->ops && region->ops->mmap &&
                    (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
                        return region->ops->mmap(vdev, region, vma);
                return -EINVAL;
        }

We can have device specific regions that can support mmap, but those
regions aren't necessarily on-device memory so we can't assume they're
tied to the memory bit in the command register.
 
> > @@ -2273,15 +2112,13 @@ static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
> >  
> >  	vdev = vfio_device_data(device);
> >  
> > -	/*
> > -	 * Locking multiple devices is prone to deadlock, runaway and
> > -	 * unwind if we hit contention.
> > -	 */
> > -	if (!vfio_pci_zap_and_vma_lock(vdev, true)) {
> > +	if (!down_write_trylock(&vdev->memory_lock)) {
> >  		vfio_device_put(device);
> >  		return -EBUSY;
> >  	}  
> 
> And this is only done as part of VFIO_DEVICE_PCI_HOT_RESET?

Yes.

> It looks like VFIO_DEVICE_PCI_HOT_RESET effects the entire slot?

Yes.

> How about putting the inode on the reflck structure, which is also
> per-slot, and then a single unmap_mapping_range() will take care of
> everything, no need to iterate over things in the driver core.
>
> Note the vm->pg_off space doesn't have any special meaning, it is
> fine that two struct vfio_pci_device's are sharing the same address
> space and using an incompatible overlapping pg_offs

Ok, but how does this really help us, unless you're also proposing some
redesign of the memory_lock semaphore?  Even if we're zapping all the
affected devices for a bus reset, that doesn't eliminate the need for
device level granularity for other events.  Maybe there's some
layering of the inodes that you're implying that allows both, but it
still feels like a minor optimization if we need to traverse devices
for the memory_lock.

> > diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> > index 9cd1882a05af..ba37f4eeefd0 100644
> > +++ b/drivers/vfio/pci/vfio_pci_private.h
> > @@ -101,6 +101,7 @@ struct vfio_pci_mmap_vma {
> >  
> >  struct vfio_pci_device {
> >  	struct pci_dev		*pdev;
> > +	struct vfio_device	*device;  
> 
> Ah, I did this too, but I didn't use a pointer :)

vfio_device is embedded in vfio.c, so that worries me.

> All the places trying to call vfio_device_put() when they really want
> a vfio_pci_device * become simpler now. Eg struct vfio_devices wants
> to have an array of vfio_pci_device, and get_pf_vdev() only needs to
> return one pointer.

Sure, that example would be a good simplification.  I'm not sure I see
other cases where we're going out of our way to manage the vfio_device
versus vfio_pci_device objects though.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup
  2021-02-22 17:29   ` Jason Gunthorpe
@ 2021-02-24 21:55     ` Alex Williamson
  2021-02-25  0:06       ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2021-02-24 21:55 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: cohuck, kvm, linux-kernel, peterx

On Mon, 22 Feb 2021 13:29:13 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 22, 2021 at 09:51:25AM -0700, Alex Williamson wrote:
> 
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index da212425ab30..399c42b77fbb 100644
> > +++ b/drivers/vfio/vfio.c
> > @@ -572,6 +572,15 @@ void vfio_device_unmap_mapping_range(struct vfio_device *device,
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_device_unmap_mapping_range);
> >  
> > +/*
> > + * A VFIO bus driver using this open callback will provide a
> > + * struct vfio_device pointer in the vm_private_data field.  
> 
> The vfio_device pointer should be stored in the struct file
> 
> > +struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
> > +{
> > +	struct vfio_device *device;
> > +
> > +	if (vma->vm_ops->open != vfio_device_vma_open)
> > +		return ERR_PTR(-ENODEV);
> > +  
> 
> Having looked at VFIO alot more closely last week, this is even more
> trivial - VFIO only creates mmaps of memory we want to invalidate, so
> this is just very simple:
> 
> struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
> {
>        if (!vma->vm_file || vma->vm_file->f_op != &vfio_device_fops)
> 	   return ERR_PTR(-ENODEV);
>        return vma->vm_file->private_data;
> }

That's pretty clever.

> The only use of the special ops would be if there are multiple types
> of mmap's going on, but for this narrow use case those would be safely
> distinguished by the vm_pgoff instead

We potentially do have device specific regions which can support mmap,
for example the migration region.  We'll need to think about how we
could even know if portions of those regions map to a device.  We could
use the notifier to announce it and require the code supporting those
device specific regions manage it.

I'm not really clear what you're getting at with vm_pgoff though, could
you explain further?

> > +extern void vfio_device_vma_open(struct vm_area_struct *vma);
> > +extern struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma);  
> 
> No externs on function prototypes in new code please, we've been
> slowly deleting them..

For now it's consistent with what we have there already, I'd prefer a
separate cleanup of that before or after rather than introducing
inconsistency here.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 10/10] vfio/type1: Register device notifier
  2021-02-22 17:55   ` Jason Gunthorpe
@ 2021-02-24 21:55     ` Alex Williamson
  2021-02-25  0:22       ` Jason Gunthorpe
  2021-02-26  5:47     ` Christoph Hellwig
  1 sibling, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2021-02-24 21:55 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: cohuck, kvm, linux-kernel, peterx

On Mon, 22 Feb 2021 13:55:23 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 22, 2021 at 09:52:32AM -0700, Alex Williamson wrote:
> > Introduce a new default strict MMIO mapping mode where the vma for
> > a VM_PFNMAP mapping must be backed by a vfio device.  This allows
> > holding a reference to the device and registering a notifier for the
> > device, which additionally keeps the device in an IOMMU context for
> > the extent of the DMA mapping.  On notification of device release,
> > automatically drop the DMA mappings for it.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >  drivers/vfio/vfio_iommu_type1.c |  124 +++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 123 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index b34ee4b96a4a..2a16257bd5b6 100644
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -61,6 +61,11 @@ module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
> >  MODULE_PARM_DESC(dma_entry_limit,
> >  		 "Maximum number of user DMA mappings per container (65535).");
> >  
> > +static bool strict_mmio_maps = true;
> > +module_param_named(strict_mmio_maps, strict_mmio_maps, bool, 0644);
> > +MODULE_PARM_DESC(strict_mmio_maps,
> > +		 "Restrict to safe DMA mappings of device memory (true).");  
> 
> I think this should be a kconfig, historically we've required kconfig
> to opt-in to unsafe things that could violate kernel security. Someone
> building a secure boot trusted kernel system should not have an
> options for userspace to just turn off protections.

It could certainly be further protected, such that this option doesn't
even exist, based on a Kconfig, but I think we're already risking
breaking some existing users and I'd rather allow it with an opt-in
(like we already do for lack of interrupt isolation), possibly even
with a kernel taint if used, if necessary.

> > +/* Req separate object for async removal from notifier vs dropping vfio_dma */
> > +struct pfnmap_obj {
> > +	struct notifier_block	nb;
> > +	struct work_struct	work;
> > +	struct vfio_iommu	*iommu;
> > +	struct vfio_device	*device;
> > +};  
> 
> So this is basically the dmabuf, I think it would be simple enough to
> go in here and change it down the road if someone had interest.
> 
> > +static void unregister_device_bg(struct work_struct *work)
> > +{
> > +	struct pfnmap_obj *pfnmap = container_of(work, struct pfnmap_obj, work);
> > +
> > +	vfio_device_unregister_notifier(pfnmap->device, &pfnmap->nb);
> > +	vfio_device_put(pfnmap->device);  
> 
> The device_put keeps the device from becoming unregistered, but what
> happens during the hot reset case? Is this what the cover letter
> was talking about? CPU access is revoked but P2P is still possible?

Yes, this only addresses cutting off DMA mappings to a released device
where we can safely drop the DMA mapping entirely.  I think we can
consider complete removal of the mapping object safe in this case
because the user has no ongoing access to the device and after
re-opening the device it would be reasonable to expect mappings would
need to be re-established.

Doing the same around disabling device memory or a reset has a much
greater potential to break userspace.  In some of these cases QEMU will
do the right thing, but reset with dependent devices gets into
scenarios that I can't be sure about.  Other userspace drivers also
exist and I can't verify them all.  So ideally we'd want to temporarily
remove the IOMMU mapping, leaving the mapping object, and restoring it
on the other side.  But I don't think we have a way to do that in the
IOMMU API currently, other than unmap and later remap.  So we might
need to support a zombie mode for the mapping object or enhance the
IOMMU API where we could leave the iotlb entry in place but drop r+w
permissions.

At this point we're also just trying to define which error we get,
perhaps an unsupported request if we do nothing or an IOMMU fault if we
invalidate or suspend the mapping.  It's not guaranteed that one of
these has better behavior on the system than the other.
 
> > +static int vfio_device_nb_cb(struct notifier_block *nb,
> > +			     unsigned long action, void *unused)
> > +{
> > +	struct pfnmap_obj *pfnmap = container_of(nb, struct pfnmap_obj, nb);
> > +
> > +	switch (action) {
> > +	case VFIO_DEVICE_RELEASE:
> > +	{
> > +		struct vfio_dma *dma, *dma_last = NULL;
> > +		int retries = 0;
> > +again:
> > +		mutex_lock(&pfnmap->iommu->lock);
> > +		dma = pfnmap_find_dma(pfnmap);  
> 
> Feels a bit strange that the vfio_dma isn't linked to the pfnmap_obj
> instead of searching the entire list?

It does, I've been paranoid about whether we can trust that the
vfio_dma is still valid in all cases.  I had myself convinced that if
our notifier actions expand we could get another callback before our
workqueue removes the notifier, but that might be more simply handled
by having a vfio_dma pointer that gets cleared once we remove the
vfio_dma.  I'll take another look.


> > @@ -549,8 +625,48 @@ static int vaddr_get_pfn(struct vfio_iommu *iommu, struct vfio_dma *dma,
> >  		if (ret == -EAGAIN)
> >  			goto retry;  
> 
> I'd prefer this was written a bit differently, I would like it very
> much if this doesn't mis-use follow_pte() by returning pfn outside
> the lock.
> 
> vaddr_get_bar_pfn(..)
> {
>         vma = find_vma_intersection(mm, vaddr, vaddr + 1);
> 	if (!vma)
>            return -ENOENT;
>         if ((vma->vm_flags & VM_DENYWRITE) && (prot & PROT_WRITE)) // Check me
>            return -EFAULT;
>         device = vfio_device_get_from_vma(vma);
> 	if (!device)
>            return -ENOENT;
> 
> 	/*
>          * Now do the same as vfio_pci_mmap_fault() - the vm_pgoff must
> 	 * be the physical pfn when using this mechanism.  Delete
> 	 * follow_pte() entirely.
>          */
>         pfn = (vaddr - vma->vm_start) / PAGE_SIZE + vma->vm_pgoff;
> 	
>         /* de-dup device and record that we are using device's pages in the
> 	   pfnmap */
>         ...
> }


This seems to undo both:

5cbf3264bc71 ("vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()")

(which also suggests we are going to break users without the module
option opt-in above)

And:

41311242221e ("vfio/type1: Support faulting PFNMAP vmas")

So we'd have an alternate path in the un-safe mode and we'd lose the
ability to fault in mappings.

> This would be significantly better if it could do whole ranges instead
> of page at a time.

There are some patches I just put in from Daniel Jordan that use
batching.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup
  2021-02-24 21:55     ` Alex Williamson
@ 2021-02-25  0:06       ` Jason Gunthorpe
  2021-02-25 22:21         ` Alex Williamson
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-25  0:06 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Wed, Feb 24, 2021 at 02:55:06PM -0700, Alex Williamson wrote:

> > The only use of the special ops would be if there are multiple types
> > of mmap's going on, but for this narrow use case those would be safely
> > distinguished by the vm_pgoff instead
> 
> We potentially do have device specific regions which can support mmap,
> for example the migration region.  We'll need to think about how we
> could even know if portions of those regions map to a device.  We could
> use the notifier to announce it and require the code supporting those
> device specific regions manage it.

So, the above basically says any VFIO VMA is allowed for VFIO to map
to the IOMMU.

If there are places creating mmaps for VFIO that should not go to the
IOMMU then they need to return NULL from this function.

> I'm not really clear what you're getting at with vm_pgoff though, could
> you explain further?

Ah, so I have to take a side discussion to explain what I meant.

The vm_pgoff is a bit confused because we change it here in vfio_pci:

    vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;

But the address_space invalidation assumes it still has the region
based encoding:

+	vfio_device_unmap_mapping_range(vdev->device,
+			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
+			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
+			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));

Those three indexes are in the vm_pgoff numberspace and so vm_pgoff
must always be set to the same thing - either the
VFIO_PCI_INDEX_TO_OFFSET() coding or the physical pfn. 

Since you say we need a limited invalidation this looks like a bug to
me - and it must always be the VFIO_PCI_INDEX_TO_OFFSET coding.

So, the PCI vma needs to get switched to use the
VFIO_PCI_INDEX_TO_OFFSET coding and then we can always extract the
region number from the vm_pgoff and thus access any additional data,
such as the base pfn or a flag saying this cannot be mapped to the
IOMMU. Do the reverse of VFIO_PCI_INDEX_TO_OFFSET and consult
information attached to that region ID.

All places creating vfio mmaps have to set the vm_pgoff to
VFIO_PCI_INDEX_TO_OFFSET().

But we have these violations that need fixing:

drivers/vfio/fsl-mc/vfio_fsl_mc.c:      vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;
drivers/vfio/platform/vfio_platform_common.c:   vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;

Couldn't see any purpose to this code, cargo cult copy? Just delete
it.

drivers/vfio/pci/vfio_pci.c:    vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;

Used to implement fault() but we could get the region number and
extract the pfn from the vfio_pci_device's data easy enough.

I manually checked that other parts of VFIO not under drivers/vfio are
doing it OK, looks fine.

Jason

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 10/10] vfio/type1: Register device notifier
  2021-02-24 21:55     ` Alex Williamson
@ 2021-02-25  0:22       ` Jason Gunthorpe
  2021-02-25 17:54         ` Peter Xu
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-25  0:22 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Wed, Feb 24, 2021 at 02:55:08PM -0700, Alex Williamson wrote:

> > > +static bool strict_mmio_maps = true;
> > > +module_param_named(strict_mmio_maps, strict_mmio_maps, bool, 0644);
> > > +MODULE_PARM_DESC(strict_mmio_maps,
> > > +		 "Restrict to safe DMA mappings of device memory (true).");  
> > 
> > I think this should be a kconfig, historically we've required kconfig
> > to opt-in to unsafe things that could violate kernel security. Someone
> > building a secure boot trusted kernel system should not have an
> > options for userspace to just turn off protections.
> 
> It could certainly be further protected, such that this option doesn't
> even exist, based on a Kconfig, but I think we're already risking
> breaking some existing users and I'd rather allow it with an opt-in
> (like we already do for lack of interrupt isolation), possibly even
> with a kernel taint if used, if necessary.

Makes me nervous, security should not be optional.
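
To be concrete, the sort of Kconfig gate I have in mind is roughly
this sketch (symbol name and wording are hypothetical):

```
config VFIO_IOMMU_TYPE1_UNSAFE_MMIO_MAPS
	bool "Allow DMA mapping of unvalidated PFNMAP memory (UNSAFE)"
	depends on VFIO_IOMMU_TYPE1
	default n
	help
	  Allow userspace to DMA map VM_PFNMAP vmas that are not backed
	  by a vfio device and so cannot be invalidated on device
	  release.  This can violate kernel security guarantees.  If
	  unsure, say N.
```

Without the symbol, the strict_mmio_maps module parameter simply
wouldn't exist.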

> > I'd prefer this was written a bit differently, I would like it very
> > much if this doesn't mis-use follow_pte() by returning pfn outside
> > the lock.
> > 
> > vaddr_get_bar_pfn(..)
> > {
> >         vma = find_vma_intersection(mm, vaddr, vaddr + 1);
> > 	if (!vma)
> >            return -ENOENT;
> >         if ((vma->vm_flags & VM_DENYWRITE) && (prot & PROT_WRITE)) // Check me
> >            return -EFAULT;
> >         device = vfio_device_get_from_vma(vma);
> > 	if (!device)
> >            return -ENOENT;
> > 
> > 	/*
> >          * Now do the same as vfio_pci_mmap_fault() - the vm_pgoff must
> > 	 * be the physical pfn when using this mechanism.  Delete
> > 	 * follow_pte() entirely.
> >          */
> >         pfn = (vaddr - vma->vm_start) / PAGE_SIZE + vma->vm_pgoff;
> > 	
> >         /* de-dup device and record that we are using device's pages in the
> > 	   pfnmap */
> >         ...
> > }
> 
> 
> This seems to undo both:
> 
> 5cbf3264bc71 ("vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()")

No, the bug this commit described is fixed by calling
vfio_device_get_from_vma() which excludes all non-VFIO VMAs already.

We can assert that the vm_pgoff is in a specific format because it is
a VFIO owned VMA and must follow the rules to be part of the address
space. See my last email

Here I was suggesting to use the vm_pgoff == PFN rule, but since
you've clarified that doesn't work we'd have to determine the PFN from
the region number through the vfio_device instead.

> (which also suggests we are going to break users without the module
> option opt-in above)

Not necessarily; that commit is complaining that vfio crashes, it
doesn't say anyone actually needed the IOMMU to work on those VMAs
because they are doing P2P DMA.

I think, if this does break someone, they are on a real fringe and
must have already modified their kernel, so a kconfig is the right
approach. It is pretty hard to get non-GUP'able DMA'able memory into a
process with the stock kernel.

Generally speaking, I think Linus has felt security bug fixes like
this are more on the OK side of things to break fringe users.

> And:
> 
> 41311242221e ("vfio/type1: Support faulting PFNMAP vmas")
> 
> So we'd have an alternate path in the un-safe mode and we'd lose the
> ability to fault in mappings.

As above we already exclude VMAs that are not from VFIO, and VFIO
sourced VMA's do not meaningfully implement fault for this use
case. So calling fixup_user_fault() is pointless.

Peter just did this so we could ask him what it was for..

I feel pretty strongly that removing the call to follow_pte is
important here. Even if we do cover all the issues with mis-using the
API it just makes a maintenance problem to leave it in.

Jason

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range()
  2021-02-24 21:55     ` Alex Williamson
@ 2021-02-25  0:57       ` Jason Gunthorpe
  0 siblings, 0 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-25  0:57 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Wed, Feb 24, 2021 at 02:55:05PM -0700, Alex Williamson wrote:

> Ok, but how does this really help us, unless you're also proposing some
> redesign of the memory_lock semaphore?  Even if we're zapping all the
> affected devices for a bus reset that doesn't eliminate that we still
> require device level granularity for other events.

Ok, I missed the device level one, forget this remark about the reflk
then, per vfio_pci_device is the right granularity

> > >  struct vfio_pci_device {
> > >  	struct pci_dev		*pdev;
> > > +	struct vfio_device	*device;  
> > 
> > Ah, I did this too, but I didn't use a pointer :)
> 
> vfio_device is embedded in vfio.c, so that worries me.

I'm working on what we talked about in the other thread to show how
VFIO looks if it follows the normal Linux container_of idiom and to
then remove the vfio_mdev.c indirection as an illustration.

I want to directly make the case the VFIO is better off following the
standard kernel designs and driver core norms so we can find an
agreement to get Max's work on vfio_pci going forward.

You were concerned about hand waving and maintainability, so I'm
willing to put in some more work to make it concrete.

I hope you'll keep an open mind

> > All the places trying to call vfio_device_put() when they really want
> > a vfio_pci_device * become simpler now. Eg struct vfio_devices wants
> > to have an array of vfio_pci_device, and get_pf_vdev() only needs to
> > return one pointer.
> 
> Sure, that example would be a good simplification.  I'm not sure I see
> other cases where we're going out of our way to manage the vfio_device
> versus vfio_pci_device objects though.

What I've found is there are lots of little costs all over the
place. The above was just easy to describe in this context.

In my mind the biggest negative cost is the type erasure. Lots of
places are using 'device *', 'void *', 'kobj *' as generic handles to
things that are actually already concretely typed. In the kernel I
would say concretely typing things is considered a virtue.

Having gone through a lot of VFIO now, as it relates to the
vfio_device, I think the type erasure directly hurts readability and
maintainability.  It just takes too much information away from the
reader and the compiler.  The core code is manageable, but when you
chuck mdev, which follows the same pattern, into the mix, it really
starts hurting.

Jason

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 10/10] vfio/type1: Register device notifier
  2021-02-25  0:22       ` Jason Gunthorpe
@ 2021-02-25 17:54         ` Peter Xu
  2021-02-25 18:19           ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Xu @ 2021-02-25 17:54 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Alex Williamson, cohuck, kvm, linux-kernel

On Wed, Feb 24, 2021 at 08:22:16PM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 24, 2021 at 02:55:08PM -0700, Alex Williamson wrote:
> 
> > > > +static bool strict_mmio_maps = true;
> > > > +module_param_named(strict_mmio_maps, strict_mmio_maps, bool, 0644);
> > > > +MODULE_PARM_DESC(strict_mmio_maps,
> > > > +		 "Restrict to safe DMA mappings of device memory (true).");  
> > > 
> > > I think this should be a kconfig, historically we've required kconfig
> > > to opt-in to unsafe things that could violate kernel security. Someone
> > > building a secure boot trusted kernel system should not have an
> > > options for userspace to just turn off protections.
> > 
> > It could certainly be further protected, such that this option doesn't
> > even exist, based on a Kconfig, but I think we're already risking
> > breaking some existing users and I'd rather allow it with an opt-in
> > (like we already do for lack of interrupt isolation), possibly even
> > with a kernel taint if used, if necessary.
> 
> Makes me nervous, security should not be optional.
> 
> > > I'd prefer this was written a bit differently, I would like it very
> > > much if this doesn't mis-use follow_pte() by returning pfn outside
> > > the lock.
> > > 
> > > vaddr_get_bar_pfn(..)
> > > {
> > >         vma = find_vma_intersection(mm, vaddr, vaddr + 1);
> > > 	if (!vma)
> > >            return -ENOENT;
> > >         if ((vma->vm_flags & VM_DENYWRITE) && (prot & PROT_WRITE)) // Check me
> > >            return -EFAULT;
> > >         device = vfio_device_get_from_vma(vma);
> > > 	if (!device)
> > >            return -ENOENT;
> > > 
> > > 	/*
> > >          * Now do the same as vfio_pci_mmap_fault() - the vm_pgoff must
> > > 	 * be the physical pfn when using this mechanism.  Delete
> > > 	 * follow_pte() entirely.
> > >          */
> > >         pfn = (vaddr - vma->vm_start) / PAGE_SIZE + vma->vm_pgoff;
> > > 	
> > >         /* de-dup device and record that we are using device's pages in the
> > > 	   pfnmap */
> > >         ...
> > > }
> > 
> > 
> > This seems to undo both:
> > 
> > 5cbf3264bc71 ("vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()")
> 
> No, the bug this commit described is fixed by calling
> vfio_device_get_from_vma() which excludes all non-VFIO VMAs already.
> 
> We can assert that the vm_pgoff is in a specific format because it is
> a VFIO owned VMA and must follow the rules to be part of the address
> space. See my last email
> 
> Here I was suggesting to use the vm_pgoff == PFN rule, but since
> you've clarified that doesn't work we'd have to determine the PFN from
> the region number through the vfio_device instead.
> 
> > (which also suggests we are going to break users without the module
> > option opt-in above)
> 
> Not necessarily; that commit is complaining that vfio crashes, it
> doesn't say anyone actually needed the IOMMU to work on those VMAs
> because they are doing P2P DMA.
> 
> I think, if this does break someone, they are on a real fringe and
> must have already modified their kernel, so a kconfig is the right
> approach. It is pretty hard to get non-GUP'able DMA'able memory into a
> process with the stock kernel.
> 
> Generally speaking, I think Linus has felt security bug fixes like
> this are more on the OK side of things to break fringe users.
> 
> > And:
> > 
> > 41311242221e ("vfio/type1: Support faulting PFNMAP vmas")
> > 
> > So we'd have an alternate path in the un-safe mode and we'd lose the
> > ability to fault in mappings.
> 
> As above we already exclude VMAs that are not from VFIO, and VFIO
> sourced VMA's do not meaningfully implement fault for this use
> case. So calling fixup_user_fault() is pointless.
> 
> Peter just did this so we could ask him what it was for..
> 
> I feel pretty strongly that removing the call to follow_pte is
> important here. Even if we do cover all the issues with mis-using the
> API it just makes a maintenance problem to leave it in.

I can't say I fully understand the whole rationale behind 5cbf3264bc71, but that
commit still sounds reasonable to me, since I don't see why VFIO cannot do
VFIO_IOMMU_MAP_DMA upon another memory range that's neither anonymous memory
nor vfio mapped MMIO range.  In those cases, vm_pgoff namespace defined by vfio
may not be true anymore, iiuc.

Then, with that follow_pfn() kept for non-vfio mappings, it seems also very
reasonable to have 41311242221e or similar, as proposed by Alex, to make sure a
pte is installed before calling it, for either vfio or other vma providers.

Or does it mean that we don't want to allow VFIO dma to those unknown memory
backends, for some reason?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 10/10] vfio/type1: Register device notifier
  2021-02-25 17:54         ` Peter Xu
@ 2021-02-25 18:19           ` Jason Gunthorpe
  2021-02-25 19:06             ` Peter Xu
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-25 18:19 UTC (permalink / raw)
  To: Peter Xu; +Cc: Alex Williamson, cohuck, kvm, linux-kernel

On Thu, Feb 25, 2021 at 12:54:57PM -0500, Peter Xu wrote:
 
> I can't say I fully understand the whole rationale behind 5cbf3264bc71, but that
> commit still sounds reasonable to me, since I don't see why VFIO cannot do
> VFIO_IOMMU_MAP_DMA upon another memory range that's neither anonymous memory
> nor vfio mapped MMIO range.

It is not so much that it can't, more that it doesn't and doesn't need to.

> In those cases, vm_pgoff namespace defined by vfio may not be true
> anymore, iiuc.

Since this series is proposing linking the VMA to an address_space all
the vm_pgoffs must be in the same namespace

> Or does it mean that we don't want to allow VFIO dma to those unknown memory
> backends, for some reason?

Correct. VFIO can map into the IOMMU PFNs it can get a reference
to. pin_user_pages() works for the majority, special VFIO VMAs cover
the rest, and everything else must be blocked for security.

Jason

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 10/10] vfio/type1: Register device notifier
  2021-02-25 18:19           ` Jason Gunthorpe
@ 2021-02-25 19:06             ` Peter Xu
  2021-02-25 19:17               ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Xu @ 2021-02-25 19:06 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Alex Williamson, cohuck, kvm, linux-kernel

On Thu, Feb 25, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 25, 2021 at 12:54:57PM -0500, Peter Xu wrote:
>  
> > I can't say I fully understand the whole rationale behind 5cbf3264bc71, but that
> > commit still sounds reasonable to me, since I don't see why VFIO cannot do
> > VFIO_IOMMU_MAP_DMA upon another memory range that's neither anonymous memory
> > nor vfio mapped MMIO range.
> 
> It is not so much that it can't, more that it doesn't and doesn't need to.

OK.

> 
> > In those cases, vm_pgoff namespace defined by vfio may not be true
> > anymore, iiuc.
> 
> Since this series is proposing linking the VMA to an address_space, all
> the vm_pgoffs must be in the same namespace.

Agreed.  I saw discussions around redefining the vm_pgoff namespace; I can't
say I followed those closely either, but yes, it definitely makes sense to
always use a unified namespace.  Maybe we should even comment somewhere on how
vm_pgoff is encoded?

> 
> > Or does it mean that we don't want to allow VFIO dma to those unknown memory
> > backends, for some reason?
> 
> Correct. VFIO can map into the IOMMU PFNs it can get a reference
> to. pin_user_pages() works for the majority, special VFIO VMAs cover
> the rest, and everything else must be blocked for security.

If we all agree that the current follow_pfn() should only apply to vfio
internal vmas, then it seems we can drop it indeed, as long as the crash
reported in 5cbf3264b would fail gracefully at e.g. VFIO_IOMMU_MAP_DMA rather
than triggering a kernel warning somehow.

However I'm still confused about why it's more secure - the process doing
VFIO_IOMMU_MAP_DMA should at least have proper permission for everything
being set up, including the special vma, right?  Say, if the process can
write to that memory, then shouldn't we also allow it to grant this write
permission to other devices too?

Thanks,

-- 
Peter Xu



* Re: [RFC PATCH 10/10] vfio/type1: Register device notifier
  2021-02-25 19:06             ` Peter Xu
@ 2021-02-25 19:17               ` Jason Gunthorpe
  2021-02-25 19:54                 ` Peter Xu
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-25 19:17 UTC (permalink / raw)
  To: Peter Xu; +Cc: Alex Williamson, cohuck, kvm, linux-kernel

On Thu, Feb 25, 2021 at 02:06:46PM -0500, Peter Xu wrote:

> Agreed.  I saw discussions around redefining the vm_pgoff namespace; I can't
> say I followed those closely either, but yes, it definitely makes sense to
> always use a unified namespace.  Maybe we should even comment somewhere on how
> vm_pgoff is encoded?

Yes, it should be described, it is subtle
 
> > Correct. VFIO can map into the IOMMU PFNs it can get a reference
> > to. pin_user_pages() works for the majority, special VFIO VMAs cover
> > the rest, and everything else must be blocked for security.
> 
> If we all agree that the current follow_pfn() should only apply to vfio
> internal vmas, 

I want to remove follow_pfn(). Internal VMAs can deduce the PFN from
the vm_pgoff; they don't need to do a follow.

> then it seems we can drop it indeed, as long as the crash reported
> in 5cbf3264b would fail gracefully at e.g. VFIO_IOMMU_MAP_DMA rather
> than triggering a kernel warning somehow.

Yes, this will just fail the ioctl because pin_user_pages() failed and
the VMA was not VFIO.

> However I'm still confused about why it's more secure - the process doing
> VFIO_IOMMU_MAP_DMA should at least have proper permission for everything
> being set up, including the special vma, right?  Say, if the process can
> write to that memory, then shouldn't we also allow it to grant this write
> permission to other devices too?

It is a use-after-free. Once the PFN is programmed into the IOMMU it
becomes completely divorced from the VMA. Remember there is no
pin_user_pages() here, so the PFN has no reference count.

If the owner of the VMA decides to zap it, or otherwise reclaim it, then
the IOMMU access keeps going - but now the owner thinks the PFN is freed
and nobody is referencing it. Goes bad.

Jason


* Re: [RFC PATCH 10/10] vfio/type1: Register device notifier
  2021-02-25 19:17               ` Jason Gunthorpe
@ 2021-02-25 19:54                 ` Peter Xu
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Xu @ 2021-02-25 19:54 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Alex Williamson, cohuck, kvm, linux-kernel

On Thu, Feb 25, 2021 at 03:17:14PM -0400, Jason Gunthorpe wrote:
> It is a use-after-free. Once the PFN is programmed into the IOMMU it
> becomes completely divorced from the VMA. Remember there is no
> pin_user_pages() here, so the PFN has no reference count.
> 
> If the owner of the VMA decides to zap it, or otherwise reclaim it, then
> the IOMMU access keeps going - but now the owner thinks the PFN is freed
> and nobody is referencing it. Goes bad.

Sounds reasonable.  Thanks,

-- 
Peter Xu



* Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup
  2021-02-25  0:06       ` Jason Gunthorpe
@ 2021-02-25 22:21         ` Alex Williamson
  2021-02-25 23:49           ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2021-02-25 22:21 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: cohuck, kvm, linux-kernel, peterx

On Wed, 24 Feb 2021 20:06:10 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Feb 24, 2021 at 02:55:06PM -0700, Alex Williamson wrote:
> 
> > > The only use of the special ops would be if there are multiple types
> > > of mmap's going on, but for this narrow use case those would be safely
> > > distinguished by the vm_pgoff instead  
> > 
> > We potentially do have device specific regions which can support mmap,
> > for example the migration region.  We'll need to think about how we
> > could even know if portions of those regions map to a device.  We could
> > use the notifier to announce it and require the code supporting those
> > device specific regions manage it.  
> 
> So, the above basically says any VFIO VMA is allowed for VFIO to map
> to the IOMMU.
> 
> If there are places creating mmaps for VFIO that should not go to the
> IOMMU then they need to return NULL from this function.
> 
> > I'm not really clear what you're getting at with vm_pgoff though, could
> > you explain further?  
> 
> Ah, so I have to take a side discussion to explain what I meant.
> 
> The vm_pgoff is a bit confused because we change it here in vfio_pci:
> 
>     vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> 
> But the address_space invalidation assumes it still has the region
> based encoding:
> 
> +	vfio_device_unmap_mapping_range(vdev->device,
> +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
> +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
> +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));
> 
> Those three indexes are in the vm_pgoff numberspace and so vm_pgoff
> must always be set to the same thing - either the
> VFIO_PCI_INDEX_TO_OFFSET() coding or the physical pfn. 

Aha, I hadn't made that connection.

> Since you say we need a limited invalidation this looks like a bug to
> me - and it must always be the VFIO_PCI_INDEX_TO_OFFSET coding.

Yes, this must have only worked in testing because I mmap'd BAR0 which
is at index/offset zero, so the pfn range overlapped the user offset.
I'm glad you caught that...

> So, the PCI vma needs to get switched to use the
> VFIO_PCI_INDEX_TO_OFFSET coding and then we can always extract the
> region number from the vm_pgoff and thus access any additional data,
> such as the base pfn or a flag saying this cannot be mapped to the
> IOMMU. Do the reverse of VFIO_PCI_INDEX_TO_OFFSET and consult
> information attached to that region ID.
> 
> All places creating vfio mmaps have to set the vm_pgoff to
> VFIO_PCI_INDEX_TO_OFFSET().

This is where it gets tricky.  The vm_pgoff we get from
file_operations.mmap is already essentially describing an offset from
the base of a specific resource.  We could convert that from an absolute
offset to a pfn offset, but it's only the bus driver code (ex.
vfio-pci) that knows how to get the base, assuming there is a single
base per region (we can't assume enough bits per region to store
absolute pfn).  Also note that you're suggesting that all vfio mmaps
would need to standardize on the vfio-pci implementation of region
layouts.  Not that most drivers haven't copied vfio-pci, but we've
specifically avoided exposing it as a fixed uAPI such that we could have
the flexibility for a bus driver to implement regions offsets however
they need.

So I'm not really sure what this looks like.  Within vfio-pci we could
keep the index bits in place to allow unmap_mapping_range() to
selectively zap matching vm_pgoffs but expanding that to a vfio
standard such that the IOMMU backend can also extract a pfn looks very
limiting, or ugly.  Thanks,

Alex

> But we have these violations that need fixing:
> 
> drivers/vfio/fsl-mc/vfio_fsl_mc.c:      vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;
> drivers/vfio/platform/vfio_platform_common.c:   vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;
> 
> Couldn't see any purpose to this code, cargo cult copy? Just delete
> it.
> 
> drivers/vfio/pci/vfio_pci.c:    vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> 
> Used to implement fault() but we could get the region number and
> extract the pfn from the vfio_pci_device's data easy enough.
> 
> I manually checked that other parts of VFIO not under drivers/vfio are
> doing it OK, looks fine.
> 
> Jason
> 



* Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup
  2021-02-25 22:21         ` Alex Williamson
@ 2021-02-25 23:49           ` Jason Gunthorpe
  2021-03-04 21:37             ` Alex Williamson
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-25 23:49 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Thu, Feb 25, 2021 at 03:21:13PM -0700, Alex Williamson wrote:

> This is where it gets tricky.  The vm_pgoff we get from
> file_operations.mmap is already essentially describing an offset from
> the base of a specific resource.  We could convert that from an absolute
> offset to a pfn offset, but it's only the bus driver code (ex.
> vfio-pci) that knows how to get the base, assuming there is a single
> base per region (we can't assume enough bits per region to store
> absolute pfn).  Also note that you're suggesting that all vfio mmaps
> would need to standardize on the vfio-pci implementation of region
> layouts.  Not that most drivers haven't copied vfio-pci, but we've
> specifically avoided exposing it as a fixed uAPI such that we could have
> the flexibility for a bus driver to implement regions offsets however
> they need.

Okay, well the bus driver owns the address space and the bus driver is
in control of the vm_pgoff.  If it doesn't want to zap then it doesn't
need to do anything.

vfio-pci can consistently use the index encoding and be fine.
 
> So I'm not really sure what this looks like.  Within vfio-pci we could
> keep the index bits in place to allow unmap_mapping_range() to
> selectively zap matching vm_pgoffs but expanding that to a vfio
> standard such that the IOMMU backend can also extract a pfn looks very
> limiting, or ugly.  Thanks,

Lets add a op to convert a vma into a PFN range. The map code will
pass the vma to the op and get back a pfn (or failure).

pci will switch the vm_pgoff to an index, find the bar base and
compute the pfn.

It is starting to look more and more like dma buf though

Jason


* Re: [RFC PATCH 01/10] vfio: Create vfio_fs_type with inode per device
  2021-02-22 16:50 ` [RFC PATCH 01/10] vfio: Create vfio_fs_type with inode per device Alex Williamson
@ 2021-02-26  5:38   ` Christoph Hellwig
  2021-02-26 13:15     ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2021-02-26  5:38 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx, viro

On Mon, Feb 22, 2021 at 09:50:35AM -0700, Alex Williamson wrote:
> By linking all the device fds we provide to userspace to an
> address space through a new pseudo fs, we can use tools like
> unmap_mapping_range() to zap all vmas associated with a device.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Adding Al:

I hate how we are growing these tiny file systems just to allocate an
anonymous inode all over.  Shouldn't we enhance fs/anon_inodes.c to add
a new API to allocate a new specific inode from anon_inodefs instead?


* Re: [RFC PATCH 10/10] vfio/type1: Register device notifier
  2021-02-22 17:55   ` Jason Gunthorpe
  2021-02-24 21:55     ` Alex Williamson
@ 2021-02-26  5:47     ` Christoph Hellwig
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2021-02-26  5:47 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Alex Williamson, cohuck, kvm, linux-kernel, peterx

On Mon, Feb 22, 2021 at 01:55:23PM -0400, Jason Gunthorpe wrote:
> > +static bool strict_mmio_maps = true;
> > +module_param_named(strict_mmio_maps, strict_mmio_maps, bool, 0644);
> > +MODULE_PARM_DESC(strict_mmio_maps,
> > +		 "Restrict to safe DMA mappings of device memory (true).");
> 
> I think this should be a kconfig, historically we've required kconfig
> to opt-in to unsafe things that could violate kernel security. Someone
> building a secure boot trusted kernel system should not have an
> options for userspace to just turn off protections.

Agreed, but I'd go one step further:  Why should we allow the unsafe
mode at all?


* Re: [RFC PATCH 01/10] vfio: Create vfio_fs_type with inode per device
  2021-02-26  5:38   ` Christoph Hellwig
@ 2021-02-26 13:15     ` Jason Gunthorpe
  0 siblings, 0 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2021-02-26 13:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Williamson, cohuck, kvm, linux-kernel, peterx, viro

On Fri, Feb 26, 2021 at 05:38:04AM +0000, Christoph Hellwig wrote:
> On Mon, Feb 22, 2021 at 09:50:35AM -0700, Alex Williamson wrote:
> > By linking all the device fds we provide to userspace to an
> > address space through a new pseudo fs, we can use tools like
> > unmap_mapping_range() to zap all vmas associated with a device.
> > 
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
> Adding Al:
> 
> I hate how we are growing these tiny file systems just to allocate an
> anonymous inode all over.  Shouldn't we enhance fs/anon_inodes.c to add
> a new API to allocate a new specific inode from anon_inodefs instead?

+1.  When I was researching this I also felt it was a lot of
boilerplate just to get an inode.  With vfio and rdma adding this, it
is at least 5 places now.

Jason


* Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup
  2021-02-25 23:49           ` Jason Gunthorpe
@ 2021-03-04 21:37             ` Alex Williamson
  2021-03-04 23:16               ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2021-03-04 21:37 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: cohuck, kvm, linux-kernel, peterx

On Thu, 25 Feb 2021 19:49:49 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Feb 25, 2021 at 03:21:13PM -0700, Alex Williamson wrote:
> 
> > This is where it gets tricky.  The vm_pgoff we get from
> > file_operations.mmap is already essentially describing an offset from
> > the base of a specific resource.  We could convert that from an absolute
> > offset to a pfn offset, but it's only the bus driver code (ex.
> > vfio-pci) that knows how to get the base, assuming there is a single
> > base per region (we can't assume enough bits per region to store
> > absolute pfn).  Also note that you're suggesting that all vfio mmaps
> > would need to standardize on the vfio-pci implementation of region
> > layouts.  Not that most drivers haven't copied vfio-pci, but we've
> > specifically avoided exposing it as a fixed uAPI such that we could have
> > the flexibility for a bus driver to implement regions offsets however
> > they need.  
> 
> Okay, well the bus driver owns the address space and the bus driver is
> in control of the vm_pgoff. If it doesn't want to zap then it doesn't
> need to do anything
> 
> vfio-pci can consistently use the index encoding and be fine
>  
> > So I'm not really sure what this looks like.  Within vfio-pci we could
> > keep the index bits in place to allow unmap_mapping_range() to
> > selectively zap matching vm_pgoffs but expanding that to a vfio
> > standard such that the IOMMU backend can also extract a pfn looks very
> > limiting, or ugly.  Thanks,  
> 
> Lets add a op to convert a vma into a PFN range. The map code will
> pass the vma to the op and get back a pfn (or failure).
> 
> pci will switch the vm_pgoff to an index, find the bar base and
> compute the pfn.
> 
> It is starting to look more and more like dma buf though

How terrible would it be if vfio-core used a shared vm_private_data
structure and inserted itself into the vm_ops call chain to reference
count that structure?  We already wrap the device file descriptor via
vfio_device_fops.mmap, so we could:

	struct vfio_vma_private_data *priv;

	priv = kzalloc(...
	
	priv->device = device;
	vma->vm_private_data = priv;

	device->ops->mmap(device->device_data, vma);

	if (vma->vm_private_data == priv) {
		priv->vm_ops = vma->vm_ops;
		vma->vm_ops = &vfio_vm_ops;
	} else
		kfree(priv);

Where:

struct vfio_vma_private_data {
	struct vfio_device *device;
	unsigned long pfn_base;
	void *private_data; // maybe not needed
	const struct vm_operations_struct *vm_ops;
	struct kref kref;
	unsigned int release_notification:1;
};

Therefore unless a bus driver opts-out by replacing vm_private_data, we
can identify participating vmas by the vm_ops and have flags indicating
if the vma maps device memory such that vfio_get_device_from_vma()
should produce a device reference.  The vfio IOMMU backends would also
consume this, ie. if they get a valid vfio_device from the vma, use the
pfn_base field directly.  vfio_vm_ops would wrap the bus driver
callbacks and provide reference counting on open/close to release this
object.

I'm not thrilled with a vfio_device_ops callback plumbed through
vfio-core to do vma-to-pfn translation, so I thought this might be a
better alternative.  Thanks,

Alex



* Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup
  2021-03-04 21:37             ` Alex Williamson
@ 2021-03-04 23:16               ` Jason Gunthorpe
  2021-03-05  0:07                 ` Alex Williamson
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2021-03-04 23:16 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Thu, Mar 04, 2021 at 02:37:57PM -0700, Alex Williamson wrote:

> Therefore unless a bus driver opts-out by replacing vm_private_data, we
> can identify participating vmas by the vm_ops and have flags indicating
> if the vma maps device memory such that vfio_get_device_from_vma()
> should produce a device reference.  The vfio IOMMU backends would also
> consume this, ie. if they get a valid vfio_device from the vma, use the
> pfn_base field directly.  vfio_vm_ops would wrap the bus driver
> callbacks and provide reference counting on open/close to release this
> object.

> I'm not thrilled with a vfio_device_ops callback plumbed through
> vfio-core to do vma-to-pfn translation, so I thought this might be a
> better alternative.  Thanks,

Maybe you could explain why, because I'm looking at this idea and
thinking it looks very complicated compared to a simple driver op
callback?

The implementation of such an op for vfio_pci is one line trivial, why
do we need allocated memory and an entire shim layer instead?

Shim layers are bad.

We still need a driver op of some kind because only the driver can
convert a pg_off into a PFN. Remember the big point here is to remove
the sketchy follow_pte()...

Jason


* Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup
  2021-03-04 23:16               ` Jason Gunthorpe
@ 2021-03-05  0:07                 ` Alex Williamson
  2021-03-05  0:36                   ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2021-03-05  0:07 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: cohuck, kvm, linux-kernel, peterx

On Thu, 4 Mar 2021 19:16:33 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Mar 04, 2021 at 02:37:57PM -0700, Alex Williamson wrote:
> 
> > Therefore unless a bus driver opts-out by replacing vm_private_data, we
> > can identify participating vmas by the vm_ops and have flags indicating
> > if the vma maps device memory such that vfio_get_device_from_vma()
> > should produce a device reference.  The vfio IOMMU backends would also
> > consume this, ie. if they get a valid vfio_device from the vma, use the
> > pfn_base field directly.  vfio_vm_ops would wrap the bus driver
> > callbacks and provide reference counting on open/close to release this
> > object.  
> 
> > I'm not thrilled with a vfio_device_ops callback plumbed through
> > vfio-core to do vma-to-pfn translation, so I thought this might be a
> > better alternative.  Thanks,  
> 
> Maybe you could explain why, because I'm looking at this idea and
> thinking it looks very complicated compared to a simple driver op
> callback?

vfio-core needs to export a vfio_vma_to_pfn() which I think assumes the
caller has already used vfio_device_get_from_vma(), but should still
validate the vma is one from a vfio device before calling this new
vfio_device_ops callback.  vfio-pci needs to validate the vm_pgoff
value falls within a BAR region, mask off the index and get the
pci_resource_start() for the BAR index.

Then we need a solution for how vfio_device_get_from_vma() determines
whether to grant a device reference for a given vma, where that vma may
map something other than device memory.  Are you imagining that we hand
out device references independently and vfio_vma_to_pfn() would return
an errno for vm_pgoff values that don't map device memory and the IOMMU
driver would release the reference?

It all seems rather ad-hoc.
 
> The implementation of such an op for vfio_pci is one line trivial, why
> do we need allocated memory and a entire shim layer instead? 
> 
> Shim layers are bad.

The only thing here I'd consider a shim layer is overriding vm_ops,
which just seemed like a cleaner and simpler solution than exporting
open/close functions, validating that the bus driver installs them, and
handling the error path should they not.

> We still need a driver op of some kind because only the driver can
> convert a pg_off into a PFN. Remember the big point here is to remove
> the sketchy follow_pte()...

The bus driver simply writes the base_pfn value in the example
structure I outlined in its .mmap callback.  I'm just looking for an
alternative place to store our former vm_pgoff in a way that doesn't
prevent using unmap_mapping_range().  The IOMMU backend, once it has a
vfio_device via vfio_device_get_from_vma() can know the format of
vm_private_data, cast it as a vfio_vma_private_data and directly use
base_pfn, accomplishing the big point.  They're all operating in the
agreed upon vm_private_data format.  Thanks,

Alex



* Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup
  2021-03-05  0:07                 ` Alex Williamson
@ 2021-03-05  0:36                   ` Jason Gunthorpe
  0 siblings, 0 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2021-03-05  0:36 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Thu, Mar 04, 2021 at 05:07:31PM -0700, Alex Williamson wrote:
> On Thu, 4 Mar 2021 19:16:33 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Thu, Mar 04, 2021 at 02:37:57PM -0700, Alex Williamson wrote:
> > 
> > > Therefore unless a bus driver opts-out by replacing vm_private_data, we
> > > can identify participating vmas by the vm_ops and have flags indicating
> > > if the vma maps device memory such that vfio_get_device_from_vma()
> > > should produce a device reference.  The vfio IOMMU backends would also
> > > consume this, ie. if they get a valid vfio_device from the vma, use the
> > > pfn_base field directly.  vfio_vm_ops would wrap the bus driver
> > > callbacks and provide reference counting on open/close to release this
> > > object.  
> > 
> > > I'm not thrilled with a vfio_device_ops callback plumbed through
> > > vfio-core to do vma-to-pfn translation, so I thought this might be a
> > > better alternative.  Thanks,  
> > 
> > Maybe you could explain why, because I'm looking at this idea and
> > thinking it looks very complicated compared to a simple driver op
> > callback?
> 
> vfio-core needs to export a vfio_vma_to_pfn() which I think assumes the
> caller has already used vfio_device_get_from_vma(), but should still
> validate the vma is one from a vfio device before calling this new
> vfio_device_ops callback.

Huh? Validate? Why?

Something like this in the IOMMU stuff:

   struct vfio_device *vfio = vfio_device_get_from_vma(vma)

   if (!vfio->vma_to_pfn)
        return -EINVAL;
   vfio->ops->vma_to_pfn(vfio, vma, offset_from_vma_start)

Is fine, why do we need to over complicate?

I don't need to check that the vma belongs to the vfio device because it
is the API contract that the caller will guarantee that.

This is the kernel, I can (and do!) check these sorts of things by
code inspection when working on stuff - I can look at every
implementation and every call site to prove these things.

IMHO doing an expensive check like that is a style of defensive
programming the kernel community frowns upon.

> vfio-pci needs to validate the vm_pgoff value falls within a BAR
> region, mask off the index and get the pci_resource_start() for the
> BAR index.

It needs to do the same thing fault() already does, which is currently
one line..

> Then we need a solution for how vfio_device_get_from_vma() determines
> whether to grant a device reference for a given vma, where that vma may
> map something other than device memory. Are you imagining that we hand
> out device references independently and vfio_vma_to_pfn() would return
> an errno for vm_pgoff values that don't map device memory and the IOMMU
> driver would release the reference?

That seems a reasonable place to start

> prevent using unmap_mapping_range().  The IOMMU backend, once it has a
> vfio_device via vfio_device_get_from_vma() can know the format of
> vm_private_data, cast it as a vfio_vma_private_data and directly use
> base_pfn, accomplishing the big point.  They're all operating in the
> agreed upon vm_private_data format.  Thanks,

If we force all drivers into a mandatory (!) vm_private_data format
then every driver has to be audited and updated before the new pfn
code can be done. If any driver in the future makes a mistake here
(and omitting the unique vm_private_data magics is a very easy mistake
to make) then it will cause a kernel crash in an obscure scenario.

It is the "design the API to be hard to use wrong" philosophy.

Jason


end of thread, other threads:[~2021-03-05  0:36 UTC | newest]

Thread overview: 36+ messages
2021-02-22 16:50 [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Alex Williamson
2021-02-22 16:50 ` [RFC PATCH 01/10] vfio: Create vfio_fs_type with inode per device Alex Williamson
2021-02-26  5:38   ` Christoph Hellwig
2021-02-26 13:15     ` Jason Gunthorpe
2021-02-22 16:50 ` [RFC PATCH 02/10] vfio: Update vfio_add_group_dev() API Alex Williamson
2021-02-22 17:01   ` Jason Gunthorpe
2021-02-22 16:50 ` [RFC PATCH 03/10] vfio: Export unmap_mapping_range() wrapper Alex Williamson
2021-02-22 16:51 ` [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range() Alex Williamson
2021-02-22 17:22   ` Jason Gunthorpe
2021-02-24 21:55     ` Alex Williamson
2021-02-25  0:57       ` Jason Gunthorpe
2021-02-22 16:51 ` [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup Alex Williamson
2021-02-22 17:29   ` Jason Gunthorpe
2021-02-24 21:55     ` Alex Williamson
2021-02-25  0:06       ` Jason Gunthorpe
2021-02-25 22:21         ` Alex Williamson
2021-02-25 23:49           ` Jason Gunthorpe
2021-03-04 21:37             ` Alex Williamson
2021-03-04 23:16               ` Jason Gunthorpe
2021-03-05  0:07                 ` Alex Williamson
2021-03-05  0:36                   ` Jason Gunthorpe
2021-02-22 16:51 ` [RFC PATCH 06/10] vfio: Add a device notifier interface Alex Williamson
2021-02-22 16:51 ` [RFC PATCH 07/10] vfio/pci: Notify on device release Alex Williamson
2021-02-22 16:52 ` [RFC PATCH 08/10] vfio/type1: Refactor pfn_list clearing Alex Williamson
2021-02-22 16:52 ` [RFC PATCH 09/10] vfio/type1: Pass iommu and dma objects through to vaddr_get_pfn Alex Williamson
2021-02-22 16:52 ` [RFC PATCH 10/10] vfio/type1: Register device notifier Alex Williamson
2021-02-22 17:55   ` Jason Gunthorpe
2021-02-24 21:55     ` Alex Williamson
2021-02-25  0:22       ` Jason Gunthorpe
2021-02-25 17:54         ` Peter Xu
2021-02-25 18:19           ` Jason Gunthorpe
2021-02-25 19:06             ` Peter Xu
2021-02-25 19:17               ` Jason Gunthorpe
2021-02-25 19:54                 ` Peter Xu
2021-02-26  5:47     ` Christoph Hellwig
2021-02-22 18:00 ` [RFC PATCH 00/10] vfio: Device memory DMA mapping improvements Jason Gunthorpe
