linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] vfio-pci: Block user access to disabled device MMIO
@ 2020-05-01 21:38 Alex Williamson
  2020-05-01 21:39 ` [PATCH 1/3] vfio/type1: Support faulting PFNMAP vmas Alex Williamson
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Alex Williamson @ 2020-05-01 21:38 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, cohuck, jgg, peterx

Add tracking of the device memory enable bit and block/fault accesses
to device MMIO space while disabled.  This provides synchronous fault
handling for CPU accesses to the device and prevents the user from
triggering platform level error handling present on some systems.
Device reset and MSI-X vector table accesses are also included such
that access is blocked across reset and vector table accesses do not
depend on the user configuration of the device.

This is based on the vfio for-linus branch currently in next, making
use of follow_pfn() in vaddr_get_pfn() and therefore requiring patch
1/ to force the user fault in the case that a PFNMAP vma might be
DMA mapped before user access.  Further PFNMAP iommu invalidation
tracking is not yet included here.

As noted in the comments, I'm copying quite a bit of the logic from
rdma code for performing the zap_vma_ptes() calls and I'm also
attempting to resolve lock ordering issues in the fault handler to
lockdep's satisfaction.  I appreciate extra eyes on these sections in
particular.

I expect this to be functionally equivalent for any well behaved
userspace driver, but obviously there is a potential for the user to
get -EIO or SIGBUS on device access.  The device is provided to the
user enabled and device resets will restore the command register, so
by my evaluation a user would need to explicitly disable the memory
enable bit to trigger these faults.  We could potentially remap vmas
to a zero page rather than SIGBUS if we experience regressions, but
without known code requiring that, SIGBUS seems the appropriate
response to this condition.  Thanks,

Alex
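
For illustration only, a minimal userspace sketch of the sequence that
would trigger the new behavior: clear the memory enable bit through the
config region, then touch BAR0 both through read() and through an
existing mmap.  It assumes a vfio device fd from the usual
container/group setup (not shown), omits error handling, and the helper
name is made up:

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/pci_regs.h>
#include <linux/vfio.h>

/* 'device' is a vfio device fd from the usual container/group setup */
static void poke_disabled_bar(int device)
{
        struct vfio_region_info cfg = { .argsz = sizeof(cfg),
                                        .index = VFIO_PCI_CONFIG_REGION_INDEX };
        struct vfio_region_info bar = { .argsz = sizeof(bar),
                                        .index = VFIO_PCI_BAR0_REGION_INDEX };
        volatile uint8_t *map;
        uint16_t cmd;

        ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &cfg);
        ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &bar);

        /* Map BAR0 while the device is still enabled */
        map = mmap(NULL, bar.size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   device, bar.offset);

        /* Explicitly clear the memory enable bit via the config region */
        pread(device, &cmd, sizeof(cmd), cfg.offset + PCI_COMMAND);
        cmd &= ~PCI_COMMAND_MEMORY;
        pwrite(device, &cmd, sizeof(cmd), cfg.offset + PCI_COMMAND);

        /* With this series, a region read now fails with -EIO... */
        if (pread(device, (void *)&cmd, sizeof(cmd), bar.offset) < 0)
                perror("BAR read while disabled");

        /* ...and a CPU access through the mapping raises SIGBUS */
        (void)map[0];
}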

---

Alex Williamson (3):
      vfio/type1: Support faulting PFNMAP vmas
      vfio-pci: Fault mmaps to enable vma tracking
      vfio-pci: Invalidate mmaps and block MMIO access on disabled memory


 drivers/vfio/pci/vfio_pci.c         |  268 +++++++++++++++++++++++++++++++++--
 drivers/vfio/pci/vfio_pci_config.c  |   31 ++++
 drivers/vfio/pci/vfio_pci_intrs.c   |   18 ++
 drivers/vfio/pci/vfio_pci_private.h |   11 +
 drivers/vfio/pci/vfio_pci_rdwr.c    |   12 ++
 drivers/vfio/vfio_iommu_type1.c     |   36 ++++-
 6 files changed, 356 insertions(+), 20 deletions(-)



* [PATCH 1/3] vfio/type1: Support faulting PFNMAP vmas
  2020-05-01 21:38 [PATCH 0/3] vfio-pci: Block user access to disabled device MMIO Alex Williamson
@ 2020-05-01 21:39 ` Alex Williamson
  2020-05-01 23:50   ` Jason Gunthorpe
  2020-05-01 21:39 ` [PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking Alex Williamson
  2020-05-01 21:39 ` [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory Alex Williamson
  2 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2020-05-01 21:39 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, cohuck, jgg, peterx

With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
the range being faulted into the vma.  Add support to manually provide
that, in the same way as done on KVM with hva_to_pfn_remapped().

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c |   36 +++++++++++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index cc1d64765ce7..4a4cb7cd86b2 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
+static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
+			    unsigned long vaddr, unsigned long *pfn,
+			    bool write_fault)
+{
+	int ret;
+
+	ret = follow_pfn(vma, vaddr, pfn);
+	if (ret) {
+		bool unlocked = false;
+
+		ret = fixup_user_fault(NULL, mm, vaddr,
+				       FAULT_FLAG_REMOTE |
+				       (write_fault ?  FAULT_FLAG_WRITE : 0),
+				       &unlocked);
+		if (unlocked)
+			return -EAGAIN;
+
+		if (ret)
+			return ret;
+
+		ret = follow_pfn(vma, vaddr, pfn);
+	}
+
+	return ret;
+}
+
 static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 			 int prot, unsigned long *pfn)
 {
@@ -339,12 +365,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 
 	vaddr = untagged_addr(vaddr);
 
+retry:
 	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
-		if (!follow_pfn(vma, vaddr, pfn) &&
-		    is_invalid_reserved_pfn(*pfn))
-			ret = 0;
+		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
+		if (ret == -EAGAIN)
+			goto retry;
+
+		if (!ret && !is_invalid_reserved_pfn(*pfn))
+			ret = -EFAULT;
 	}
 done:
 	up_read(&mm->mmap_sem);



* [PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking
  2020-05-01 21:38 [PATCH 0/3] vfio-pci: Block user access to disabled device MMIO Alex Williamson
  2020-05-01 21:39 ` [PATCH 1/3] vfio/type1: Support faulting PFNMAP vmas Alex Williamson
@ 2020-05-01 21:39 ` Alex Williamson
  2020-05-01 23:25   ` Jason Gunthorpe
  2020-05-01 21:39 ` [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory Alex Williamson
  2 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2020-05-01 21:39 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, cohuck, jgg, peterx

Rather than calling remap_pfn_range() when a region is mmap'd, setup
a vm_ops handler to support dynamic faulting of the range on access.
This allows us to manage a list of vmas actively mapping the area that
we can later use to invalidate those mappings.  The open callback
invalidates the vma range so that all tracking is inserted in the
fault handler and removed in the close handler.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c         |   76 ++++++++++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_private.h |    7 +++
 2 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6c6b37b5c04e..da2fef666d9c 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1299,6 +1299,70 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
+static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
+			    struct vm_area_struct *vma)
+{
+	struct vfio_pci_mmap_vma *mmap_vma;
+
+	mmap_vma = kzalloc(sizeof(*mmap_vma), GFP_KERNEL);
+	if (!mmap_vma)
+		return -ENOMEM;
+
+	mmap_vma->vma = vma;
+
+	mutex_lock(&vdev->vma_lock);
+	list_add(&mmap_vma->vma_next, &vdev->vma_list);
+	mutex_unlock(&vdev->vma_lock);
+
+	return 0;
+}
+
+/*
+ * Zap mmaps on open so that we can fault them in on access and therefore
+ * our vma_list only tracks mappings accessed since last zap.
+ */
+static void vfio_pci_mmap_open(struct vm_area_struct *vma)
+{
+	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+}
+
+static void vfio_pci_mmap_close(struct vm_area_struct *vma)
+{
+	struct vfio_pci_device *vdev = vma->vm_private_data;
+	struct vfio_pci_mmap_vma *mmap_vma;
+
+	mutex_lock(&vdev->vma_lock);
+	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+		if (mmap_vma->vma == vma) {
+			list_del(&mmap_vma->vma_next);
+			kfree(mmap_vma);
+			break;
+		}
+	}
+	mutex_unlock(&vdev->vma_lock);
+}
+
+static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct vfio_pci_device *vdev = vma->vm_private_data;
+
+	if (vfio_pci_add_vma(vdev, vma))
+		return VM_FAULT_OOM;
+
+	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+			    vma->vm_end - vma->vm_start, vma->vm_page_prot))
+		return VM_FAULT_SIGBUS;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vfio_pci_mmap_ops = {
+	.open = vfio_pci_mmap_open,
+	.close = vfio_pci_mmap_close,
+	.fault = vfio_pci_mmap_fault,
+};
+
 static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
 	struct vfio_pci_device *vdev = device_data;
@@ -1357,8 +1421,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
 
-	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       req_len, vma->vm_page_prot);
+	/*
+	 * See remap_pfn_range(), called from vfio_pci_fault() but we can't
+	 * change vm_flags within the fault handler.  Set them now.
+	 */
+	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+	vma->vm_ops = &vfio_pci_mmap_ops;
+
+	return 0;
 }
 
 static void vfio_pci_request(void *device_data, unsigned int count)
@@ -1608,6 +1678,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	spin_lock_init(&vdev->irqlock);
 	mutex_init(&vdev->ioeventfds_lock);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	mutex_init(&vdev->vma_lock);
+	INIT_LIST_HEAD(&vdev->vma_list);
 
 	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
 	if (ret)
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 36ec69081ecd..9b25f9f6ce1d 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -92,6 +92,11 @@ struct vfio_pci_vf_token {
 	int			users;
 };
 
+struct vfio_pci_mmap_vma {
+	struct vm_area_struct	*vma;
+	struct list_head	vma_next;
+};
+
 struct vfio_pci_device {
 	struct pci_dev		*pdev;
 	void __iomem		*barmap[PCI_STD_NUM_BARS];
@@ -132,6 +137,8 @@ struct vfio_pci_device {
 	struct list_head	ioeventfds_list;
 	struct vfio_pci_vf_token	*vf_token;
 	struct notifier_block	nb;
+	struct mutex		vma_lock;
+	struct list_head	vma_list;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)



* [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
  2020-05-01 21:38 [PATCH 0/3] vfio-pci: Block user access to disabled device MMIO Alex Williamson
  2020-05-01 21:39 ` [PATCH 1/3] vfio/type1: Support faulting PFNMAP vmas Alex Williamson
  2020-05-01 21:39 ` [PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking Alex Williamson
@ 2020-05-01 21:39 ` Alex Williamson
  2020-05-01 23:48   ` Jason Gunthorpe
  2 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2020-05-01 21:39 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, cohuck, jgg, peterx

Accessing the disabled memory space of a PCI device would typically
result in a master abort response on conventional PCI, or an
unsupported request on PCI express.  The user would generally see
these as a -1 response for the read return data and the write would be
silently discarded, possibly with an uncorrected, non-fatal AER error
triggered on the host.  Some systems however take it upon themselves
to bring down the entire system when they see something that might
indicate a loss of data, such as this discarded write to a disabled
memory space.

To avoid this, we want to try to block the user from accessing memory
spaces while they're disabled.  We start with a semaphore around the
memory enable bit, where writers modify the memory enable state and
must be serialized, while readers make use of the memory region and
can access in parallel.  Writers include direct manipulation via the
command register, any reset path where the internal mechanics of the
reset may both explicitly and implicitly disable memory access, and
manipulation of the MSI-X configuration, where the MSI-X vector table
resides in the MMIO space of the device.  Readers
include the read and write file ops to access the vfio device fd
offsets as well as memory mapped access.  In the latter case, we make
use of our new vma list support to zap, or invalidate, those memory
mappings in order to force them to be faulted back in on access.

Our semaphore usage will stall user access to MMIO spaces across
internal operations like reset, but the user might experience new
behavior when trying to access the MMIO space while disabled via the
PCI command register.  Access via read or write while disabled will
return -EIO and access via memory maps will result in a SIGBUS.  This
is expected to be compatible with known use cases and potentially
provides better error handling capabilities than present in the
hardware, while avoiding the more readily accessible and severe
platform error responses that might otherwise occur.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c         |  200 ++++++++++++++++++++++++++++++++---
 drivers/vfio/pci/vfio_pci_config.c  |   31 +++++
 drivers/vfio/pci/vfio_pci_intrs.c   |   18 +++
 drivers/vfio/pci/vfio_pci_private.h |    4 +
 drivers/vfio/pci/vfio_pci_rdwr.c    |   12 ++
 5 files changed, 246 insertions(+), 19 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index da2fef666d9c..ce2bb3e62b18 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -26,6 +26,7 @@
 #include <linux/vfio.h>
 #include <linux/vgaarb.h>
 #include <linux/nospec.h>
+#include <linux/sched/mm.h>
 
 #include "vfio_pci_private.h"
 
@@ -184,6 +185,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
 static void vfio_pci_disable(struct vfio_pci_device *vdev);
+static int vfio_pci_lock_mem(struct pci_dev *pdev, void *data);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -736,6 +738,12 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 	return 0;
 }
 
+struct vfio_devices {
+	struct vfio_device **devices;
+	int cur_index;
+	int max_index;
+};
+
 static long vfio_pci_ioctl(void *device_data,
 			   unsigned int cmd, unsigned long arg)
 {
@@ -984,8 +992,17 @@ static long vfio_pci_ioctl(void *device_data,
 		return ret;
 
 	} else if (cmd == VFIO_DEVICE_RESET) {
-		return vdev->reset_works ?
-			pci_try_reset_function(vdev->pdev) : -EINVAL;
+		int ret;
+
+		if (!vdev->reset_works)
+			return -EINVAL;
+
+		down_write(&vdev->memory_lock);
+		vfio_pci_zap_mmap_vmas(vdev);
+		ret = pci_try_reset_function(vdev->pdev);
+		up_write(&vdev->memory_lock);
+
+		return ret;
 
 	} else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
 		struct vfio_pci_hot_reset_info hdr;
@@ -1065,6 +1082,7 @@ static long vfio_pci_ioctl(void *device_data,
 		int32_t *group_fds;
 		struct vfio_pci_group_entry *groups;
 		struct vfio_pci_group_info info;
+		struct vfio_devices devs = { .cur_index = 0 };
 		bool slot = false;
 		int i, count = 0, ret = 0;
 
@@ -1153,11 +1171,39 @@ static long vfio_pci_ioctl(void *device_data,
 		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
 						    vfio_pci_validate_devs,
 						    &info, slot);
-		if (!ret)
-			/* User has access, do the reset */
-			ret = pci_reset_bus(vdev->pdev);
+		if (ret)
+			goto hot_reset_release;
+
+		devs.max_index = count;
+		devs.devices = kcalloc(count, sizeof(struct vfio_device *),
+				       GFP_KERNEL);
+		if (!devs.devices) {
+			ret = -ENOMEM;
+			goto hot_reset_release;
+		}
+
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_lock_mem,
+						    &devs, slot);
+		if (ret)
+			goto hot_reset_release;
+
+		/* User has access, do the reset */
+		ret = pci_reset_bus(vdev->pdev);
 
 hot_reset_release:
+		while (devs.cur_index) {
+			struct vfio_device *device;
+			struct vfio_pci_device *tmp;
+
+			device = devs.devices[--devs.cur_index];
+			tmp = vfio_device_data(device);
+
+			up_write(&tmp->memory_lock);
+			vfio_device_put(device);
+		}
+		kfree(devs.devices);
+
 		for (i--; i >= 0; i--)
 			vfio_group_put_external_user(groups[i].group);
 
@@ -1299,6 +1345,64 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
+/* Zap and remove vma tracking */
+void vfio_pci_zap_mmap_vmas(struct vfio_pci_device *vdev)
+{
+	struct vfio_pci_mmap_vma *mmap_vma, *tmp;
+
+	/*
+	 * vma_lock is necessarily nested under the mmap_sem as the latter
+	 * is implicitly held for the vm_ops callbacks.  Therefore we need
+	 * to do a little dance to keep the locks in the same order here.
+	 * All vmas will typically use the same mm.  Trickery derived from
+	 * uverbs_user_mmap_disassociate()
+	 */
+	while (1) {
+		struct mm_struct *mm = NULL;
+
+		mutex_lock(&vdev->vma_lock);
+		while (!list_empty(&vdev->vma_list)) {
+			mmap_vma = list_first_entry(&vdev->vma_list,
+						    struct vfio_pci_mmap_vma,
+						    vma_next);
+			mm = mmap_vma->vma->vm_mm;
+			if (mmget_not_zero(mm))
+				break;
+
+			list_del(&mmap_vma->vma_next);
+			kfree(mmap_vma);
+			mm = NULL;
+		}
+		mutex_unlock(&vdev->vma_lock);
+
+		if (!mm)
+			return;
+
+		down_read(&mm->mmap_sem);
+		if (!mmget_still_valid(mm))
+			goto skip_mm;
+
+		mutex_lock(&vdev->vma_lock);
+		list_for_each_entry_safe(mmap_vma, tmp,
+					 &vdev->vma_list, vma_next) {
+			struct vm_area_struct *vma = mmap_vma->vma;
+
+			if (vma->vm_mm != mm)
+				continue;
+
+			list_del(&mmap_vma->vma_next);
+			kfree(mmap_vma);
+
+			zap_vma_ptes(vma, vma->vm_start,
+				     vma->vm_end - vma->vm_start);
+		}
+		mutex_unlock(&vdev->vma_lock);
+skip_mm:
+		up_read(&mm->mmap_sem);
+		mmput(mm);
+	}
+}
+
 static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
 			    struct vm_area_struct *vma)
 {
@@ -1346,15 +1450,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_device *vdev = vma->vm_private_data;
+	vm_fault_t ret = VM_FAULT_NOPAGE;
 
-	if (vfio_pci_add_vma(vdev, vma))
-		return VM_FAULT_OOM;
+	/*
+	 * Zap callers hold memory_lock and acquire mmap_sem, we hold
+	 * mmap_sem and need to acquire memory_lock to avoid races with
+	 * memory bit settings.  Release mmap_sem, wait, and retry, or fail.
+	 */
+	if (unlikely(!down_read_trylock(&vdev->memory_lock))) {
+		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
+			if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
+				return VM_FAULT_RETRY;
+
+			up_read(&vma->vm_mm->mmap_sem);
+
+			if (vmf->flags & FAULT_FLAG_KILLABLE) {
+				if (!down_read_killable(&vdev->memory_lock))
+					up_read(&vdev->memory_lock);
+			} else {
+				down_read(&vdev->memory_lock);
+				up_read(&vdev->memory_lock);
+			}
+			return VM_FAULT_RETRY;
+		}
+		return VM_FAULT_SIGBUS;
+	}
+
+	if (!__vfio_pci_memory_enabled(vdev)) {
+		ret = VM_FAULT_SIGBUS;
+		goto up_out;
+	}
+
+	if (vfio_pci_add_vma(vdev, vma)) {
+		ret = VM_FAULT_OOM;
+		goto up_out;
+	}
 
 	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
 			    vma->vm_end - vma->vm_start, vma->vm_page_prot))
-		return VM_FAULT_SIGBUS;
+		ret = VM_FAULT_SIGBUS;
 
-	return VM_FAULT_NOPAGE;
+up_out:
+	up_read(&vdev->memory_lock);
+	return ret;
 }
 
 static const struct vm_operations_struct vfio_pci_mmap_ops = {
@@ -1680,6 +1818,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
 	mutex_init(&vdev->vma_lock);
 	INIT_LIST_HEAD(&vdev->vma_list);
+	init_rwsem(&vdev->memory_lock);
 
 	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
 	if (ret)
@@ -1933,12 +2072,6 @@ static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
 	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
 }
 
-struct vfio_devices {
-	struct vfio_device **devices;
-	int cur_index;
-	int max_index;
-};
-
 static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 {
 	struct vfio_devices *devs = data;
@@ -1969,6 +2102,43 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 	return 0;
 }
 
+static int vfio_pci_lock_mem(struct pci_dev *pdev, void *data)
+{
+	struct vfio_devices *devs = data;
+	struct vfio_device *device;
+	struct vfio_pci_device *vdev;
+	int locked;
+
+	if (devs->cur_index == devs->max_index)
+		return -ENOSPC;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (!device)
+		return -EINVAL;
+
+	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	vdev = vfio_device_data(device);
+
+	/*
+	 * Locking multiple devices is prone to deadlock, runaway and
+	 * unwind if we hit contention.
+	 */
+	locked = down_write_trylock(&vdev->memory_lock);
+	if (!locked) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	vfio_pci_zap_mmap_vmas(vdev);
+
+	devs->devices[devs->cur_index++] = device;
+	return 0;
+}
+
 /*
  * If a bus or slot reset is available for the provided device and:
  *  - All of the devices affected by that bus or slot reset are unused
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 90c0b80f8acf..87d0cc8c86ad 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -395,6 +395,14 @@ static inline void p_setd(struct perm_bits *p, int off, u32 virt, u32 write)
 	*(__le32 *)(&p->write[off]) = cpu_to_le32(write);
 }
 
+/* Caller should hold memory_lock semaphore */
+bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev)
+{
+	u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
+
+	return cmd & PCI_COMMAND_MEMORY;
+}
+
 /*
  * Restore the *real* BARs after we detect a FLR or backdoor reset.
  * (backdoor = some device specific technique that we didn't catch)
@@ -560,6 +568,10 @@ static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
 		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
 		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
 
+		down_write(&vdev->memory_lock);
+		if (!new_mem)
+			vfio_pci_zap_mmap_vmas(vdev);
+
 		phys_io = !!(phys_cmd & PCI_COMMAND_IO);
 		virt_io = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_IO);
 		new_io = !!(new_cmd & PCI_COMMAND_IO);
@@ -579,8 +591,11 @@ static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
 	}
 
 	count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
-	if (count < 0)
+	if (count < 0) {
+		if (offset == PCI_COMMAND)
+			up_write(&vdev->memory_lock);
 		return count;
+	}
 
 	/*
 	 * Save current memory/io enable bits in vconfig to allow for
@@ -591,6 +606,8 @@ static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
 
 		*virt_cmd &= cpu_to_le16(~mask);
 		*virt_cmd |= cpu_to_le16(new_cmd & mask);
+
+		up_write(&vdev->memory_lock);
 	}
 
 	/* Emulate INTx disable */
@@ -828,8 +845,12 @@ static int vfio_exp_config_write(struct vfio_pci_device *vdev, int pos,
 						 pos - offset + PCI_EXP_DEVCAP,
 						 &cap);
 
-		if (!ret && (cap & PCI_EXP_DEVCAP_FLR))
+		if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
+			down_write(&vdev->memory_lock);
+			vfio_pci_zap_mmap_vmas(vdev);
 			pci_try_reset_function(vdev->pdev);
+			up_write(&vdev->memory_lock);
+		}
 	}
 
 	/*
@@ -907,8 +928,12 @@ static int vfio_af_config_write(struct vfio_pci_device *vdev, int pos,
 						pos - offset + PCI_AF_CAP,
 						&cap);
 
-		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP))
+		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
+			down_write(&vdev->memory_lock);
+			vfio_pci_zap_mmap_vmas(vdev);
 			pci_try_reset_function(vdev->pdev);
+			up_write(&vdev->memory_lock);
+		}
 	}
 
 	return count;
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 2056f3f85f59..54102a7eb9d3 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -626,6 +626,8 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
 	int (*func)(struct vfio_pci_device *vdev, unsigned index,
 		    unsigned start, unsigned count, uint32_t flags,
 		    void *data) = NULL;
+	int ret;
+	u16 cmd;
 
 	switch (index) {
 	case VFIO_PCI_INTX_IRQ_INDEX:
@@ -673,5 +675,19 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
 	if (!func)
 		return -ENOTTY;
 
-	return func(vdev, index, start, count, flags, data);
+	if (index == VFIO_PCI_MSIX_IRQ_INDEX) {
+		down_write(&vdev->memory_lock);
+		pci_read_config_word(vdev->pdev, PCI_COMMAND, &cmd);
+		pci_write_config_word(vdev->pdev, PCI_COMMAND,
+				      cmd | PCI_COMMAND_MEMORY);
+	}
+
+	ret = func(vdev, index, start, count, flags, data);
+
+	if (index == VFIO_PCI_MSIX_IRQ_INDEX) {
+		pci_write_config_word(vdev->pdev, PCI_COMMAND, cmd);
+		up_write(&vdev->memory_lock);
+	}
+
+	return ret;
 }
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 9b25f9f6ce1d..9e10e6ba8682 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -139,6 +139,7 @@ struct vfio_pci_device {
 	struct notifier_block	nb;
 	struct mutex		vma_lock;
 	struct list_head	vma_list;
+	struct rw_semaphore	memory_lock;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
@@ -181,6 +182,9 @@ extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
 				    pci_power_t state);
 
+extern bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev);
+extern void vfio_pci_zap_mmap_vmas(struct vfio_pci_device *vdev);
+
 #ifdef CONFIG_VFIO_PCI_IGD
 extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
 #else
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index a87992892a9f..f58c45308682 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -162,6 +162,7 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	size_t x_start = 0, x_end = 0;
 	resource_size_t end;
 	void __iomem *io;
+	struct resource *res = &vdev->pdev->resource[bar];
 	ssize_t done;
 
 	if (pci_resource_start(pdev, bar))
@@ -200,8 +201,19 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 		x_end = vdev->msix_offset + vdev->msix_size;
 	}
 
+	if (res->flags & IORESOURCE_MEM) {
+		down_read(&vdev->memory_lock);
+		if (!__vfio_pci_memory_enabled(vdev)) {
+			up_read(&vdev->memory_lock);
+			return -EIO;
+		}
+	}
+
 	done = do_io_rw(io, buf, pos, count, x_start, x_end, iswrite);
 
+	if (res->flags & IORESOURCE_MEM)
+		up_read(&vdev->memory_lock);
+
 	if (done >= 0)
 		*ppos += done;
 



* Re: [PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking
  2020-05-01 21:39 ` [PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking Alex Williamson
@ 2020-05-01 23:25   ` Jason Gunthorpe
  2020-05-04 14:20     ` Alex Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2020-05-01 23:25 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, cohuck, peterx

On Fri, May 01, 2020 at 03:39:19PM -0600, Alex Williamson wrote:
> Rather than calling remap_pfn_range() when a region is mmap'd, setup
> a vm_ops handler to support dynamic faulting of the range on access.
> This allows us to manage a list of vmas actively mapping the area that
> we can later use to invalidate those mappings.  The open callback
> invalidates the vma range so that all tracking is inserted in the
> fault handler and removed in the close handler.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  drivers/vfio/pci/vfio_pci.c         |   76 ++++++++++++++++++++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_private.h |    7 +++
>  2 files changed, 81 insertions(+), 2 deletions(-)

> +static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_device *vdev = vma->vm_private_data;
> +
> +	if (vfio_pci_add_vma(vdev, vma))
> +		return VM_FAULT_OOM;
> +
> +	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> +			    vma->vm_end - vma->vm_start, vma->vm_page_prot))
> +		return VM_FAULT_SIGBUS;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct vfio_pci_mmap_ops = {
> +	.open = vfio_pci_mmap_open,
> +	.close = vfio_pci_mmap_close,
> +	.fault = vfio_pci_mmap_fault,
> +};
> +
>  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  {
>  	struct vfio_pci_device *vdev = device_data;
> @@ -1357,8 +1421,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>  	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
>  
> -	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> -			       req_len, vma->vm_page_prot);
> +	/*
> +	 * See remap_pfn_range(), called from vfio_pci_fault() but we can't
> +	 * change vm_flags within the fault handler.  Set them now.
> +	 */
> +	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> +	vma->vm_ops = &vfio_pci_mmap_ops;

Perhaps do the vfio_pci_add_vma & remap_pfn_range combo here if the
BAR is activated?  That way a fully populated BAR is presented in the
common case and avoids taking a fault path?
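
A rough sketch of that alternative, purely illustrative and not part of
this series (it checks the physical command register directly and falls
back to the fault path if anything fails):

        u16 cmd;

        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
        vma->vm_ops = &vfio_pci_mmap_ops;

        /* Pre-populate while the BAR is enabled, else leave it to .fault */
        pci_read_config_word(pdev, PCI_COMMAND, &cmd);
        if ((cmd & PCI_COMMAND_MEMORY) && !vfio_pci_add_vma(vdev, vma))
                (void)remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                                      req_len, vma->vm_page_prot);

        return 0;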

But it does seem OK as is

Jason


* Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
  2020-05-01 21:39 ` [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory Alex Williamson
@ 2020-05-01 23:48   ` Jason Gunthorpe
  2020-05-04 18:26     ` Alex Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2020-05-01 23:48 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, cohuck, peterx

On Fri, May 01, 2020 at 03:39:30PM -0600, Alex Williamson wrote:

>  static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
>  			    struct vm_area_struct *vma)
>  {
> @@ -1346,15 +1450,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct vfio_pci_device *vdev = vma->vm_private_data;
> +	vm_fault_t ret = VM_FAULT_NOPAGE;
>  
> -	if (vfio_pci_add_vma(vdev, vma))
> -		return VM_FAULT_OOM;
> +	/*
> +	 * Zap callers hold memory_lock and acquire mmap_sem, we hold
> +	 * mmap_sem and need to acquire memory_lock to avoid races with
> +	 * memory bit settings.  Release mmap_sem, wait, and retry, or fail.
> +	 */
> +	if (unlikely(!down_read_trylock(&vdev->memory_lock))) {
> +		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
> +			if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> +				return VM_FAULT_RETRY;
> +
> +			up_read(&vma->vm_mm->mmap_sem);
> +
> +			if (vmf->flags & FAULT_FLAG_KILLABLE) {
> +				if (!down_read_killable(&vdev->memory_lock))
> +					up_read(&vdev->memory_lock);
> +			} else {
> +				down_read(&vdev->memory_lock);
> +				up_read(&vdev->memory_lock);
> +			}
> +			return VM_FAULT_RETRY;
> +		}
> +		return VM_FAULT_SIGBUS;
> +	}

So, why have the wait? It isn't reliable - if this gets faulted from a
call site that can't handle retry then it will SIGBUS anyhow?

The weird use of an rwsem as a completion suggests that perhaps using
wait_event might improve things:

disable:
  // Clean out the vma list with zap, then:

  down_read(mm->mmap_sem)
  mutex_lock(vma_lock);
  list_for_each_entry_safe()
     // zap and remove all vmas

  pause_faults = true;
  mutex_write(vma_lock);

fault:
  // Already have down_read(mmap_sem)
  mutex_lock(vma_lock);
  while (pause_faults) {
     mutex_unlock(vma_lock)
     wait_event(..., !pause_faults)
     mutex_lock(vma_lock)
  }
  list_add()
  remap_pfn()
  mutex_unlock(vma_lock)

enable:
  pause_faults = false
  wake_event()

The only requirement here is that while inside the write side of
memory_lock you cannot touch user pages (ie no copy_from_user/etc)

Jason


* Re: [PATCH 1/3] vfio/type1: Support faulting PFNMAP vmas
  2020-05-01 21:39 ` [PATCH 1/3] vfio/type1: Support faulting PFNMAP vmas Alex Williamson
@ 2020-05-01 23:50   ` Jason Gunthorpe
  2020-05-04 14:06     ` Alex Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2020-05-01 23:50 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, cohuck, peterx

On Fri, May 01, 2020 at 03:39:08PM -0600, Alex Williamson wrote:
> With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> the range being faulted into the vma.  Add support to manually provide
> that, in the same way as done on KVM with hva_to_pfn_remapped().
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>  drivers/vfio/vfio_iommu_type1.c |   36 +++++++++++++++++++++++++++++++++---
>  1 file changed, 33 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index cc1d64765ce7..4a4cb7cd86b2 100644
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
>  	return 0;
>  }
>  
> +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> +			    unsigned long vaddr, unsigned long *pfn,
> +			    bool write_fault)
> +{
> +	int ret;
> +
> +	ret = follow_pfn(vma, vaddr, pfn);
> +	if (ret) {
> +		bool unlocked = false;
> +
> +		ret = fixup_user_fault(NULL, mm, vaddr,
> +				       FAULT_FLAG_REMOTE |
> +				       (write_fault ?  FAULT_FLAG_WRITE : 0),
> +				       &unlocked);
> +		if (unlocked)
> +			return -EAGAIN;
> +
> +		if (ret)
> +			return ret;
> +
> +		ret = follow_pfn(vma, vaddr, pfn);
> +	}
> +
> +	return ret;
> +}
> +
>  static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  			 int prot, unsigned long *pfn)
>  {
> @@ -339,12 +365,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  
>  	vaddr = untagged_addr(vaddr);
>  
> +retry:
>  	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
> -		if (!follow_pfn(vma, vaddr, pfn) &&
> -		    is_invalid_reserved_pfn(*pfn))
> -			ret = 0;
> +		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> +		if (ret == -EAGAIN)
> +			goto retry;
> +
> +		if (!ret && !is_invalid_reserved_pfn(*pfn))
> +			ret = -EFAULT;

I suggest checking vma->vm_ops == &vfio_pci_mmap_ops and adding a
comment that this is racy and needs to be fixed up. The ops check
makes this only used by other vfio bars and should prevent some
abuses of this hacky thing

However, I wonder if this could just link itself into the
vma->private data so that when the vfio that owns the bar goes away,
so does the iommu mapping?

I feel like this patch set is not complete unless it also handles the
shootdown of this path too?

Jason


* Re: [PATCH 1/3] vfio/type1: Support faulting PFNMAP vmas
  2020-05-01 23:50   ` Jason Gunthorpe
@ 2020-05-04 14:06     ` Alex Williamson
  2020-05-04 15:02       ` Jason Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2020-05-04 14:06 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: kvm, linux-kernel, cohuck, peterx

On Fri, 1 May 2020 20:50:33 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Fri, May 01, 2020 at 03:39:08PM -0600, Alex Williamson wrote:
> > With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> > the range being faulted into the vma.  Add support to manually provide
> > that, in the same way as done on KVM with hva_to_pfn_remapped().
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >  drivers/vfio/vfio_iommu_type1.c |   36 +++++++++++++++++++++++++++++++++---
> >  1 file changed, 33 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index cc1d64765ce7..4a4cb7cd86b2 100644
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> >  	return 0;
> >  }
> >  
> > +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > +			    unsigned long vaddr, unsigned long *pfn,
> > +			    bool write_fault)
> > +{
> > +	int ret;
> > +
> > +	ret = follow_pfn(vma, vaddr, pfn);
> > +	if (ret) {
> > +		bool unlocked = false;
> > +
> > +		ret = fixup_user_fault(NULL, mm, vaddr,
> > +				       FAULT_FLAG_REMOTE |
> > +				       (write_fault ?  FAULT_FLAG_WRITE : 0),
> > +				       &unlocked);
> > +		if (unlocked)
> > +			return -EAGAIN;
> > +
> > +		if (ret)
> > +			return ret;
> > +
> > +		ret = follow_pfn(vma, vaddr, pfn);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >  static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >  			 int prot, unsigned long *pfn)
> >  {
> > @@ -339,12 +365,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >  
> >  	vaddr = untagged_addr(vaddr);
> >  
> > +retry:
> >  	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
> >  
> >  	if (vma && vma->vm_flags & VM_PFNMAP) {
> > -		if (!follow_pfn(vma, vaddr, pfn) &&
> > -		    is_invalid_reserved_pfn(*pfn))
> > -			ret = 0;
> > +		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> > +		if (ret == -EAGAIN)
> > +			goto retry;
> > +
> > +		if (!ret && !is_invalid_reserved_pfn(*pfn))
> > +			ret = -EFAULT;  
> 
> I suggest checking vma->vm_ops == &vfio_pci_mmap_ops and adding a
> comment that this is racy and needs to be fixed up. The ops check
> makes this only used by other vfio bars and should prevent some
> abuses of this hacky thing

We can't do that, vfio-pci is only one bus driver within the vfio
ecosystem.

> However, I wonder if this chould just link itself into the
> vma->private data so that when the vfio that owns the bar goes away,
> so does the iommu mapping?

I don't really see why we wouldn't use mmu notifiers so that the vfio
iommu backend and vfio bus driver remain independent.

> I feel like this patch set is not complete unless it also handles the
> shootdown of this path too?

It would be nice to solve both issues and I'll start working on the mmu
notifier side of things, but this series does solve a real issue on
its own and we're not changing the iommu mapping behavior here.  Thanks,

Alex



* Re: [PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking
  2020-05-01 23:25   ` Jason Gunthorpe
@ 2020-05-04 14:20     ` Alex Williamson
  2020-05-04 15:05       ` Jason Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2020-05-04 14:20 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: kvm, linux-kernel, cohuck, peterx

On Fri, 1 May 2020 20:25:50 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Fri, May 01, 2020 at 03:39:19PM -0600, Alex Williamson wrote:
> > Rather than calling remap_pfn_range() when a region is mmap'd, setup
> > a vm_ops handler to support dynamic faulting of the range on access.
> > This allows us to manage a list of vmas actively mapping the area that
> > we can later use to invalidate those mappings.  The open callback
> > invalidates the vma range so that all tracking is inserted in the
> > fault handler and removed in the close handler.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  drivers/vfio/pci/vfio_pci.c         |   76 ++++++++++++++++++++++++++++++++++-
> >  drivers/vfio/pci/vfio_pci_private.h |    7 +++
> >  2 files changed, 81 insertions(+), 2 deletions(-)  
> 
> > +static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > +{
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct vfio_pci_device *vdev = vma->vm_private_data;
> > +
> > +	if (vfio_pci_add_vma(vdev, vma))
> > +		return VM_FAULT_OOM;
> > +
> > +	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > +			    vma->vm_end - vma->vm_start, vma->vm_page_prot))
> > +		return VM_FAULT_SIGBUS;
> > +
> > +	return VM_FAULT_NOPAGE;
> > +}
> > +
> > +static const struct vm_operations_struct vfio_pci_mmap_ops = {
> > +	.open = vfio_pci_mmap_open,
> > +	.close = vfio_pci_mmap_close,
> > +	.fault = vfio_pci_mmap_fault,
> > +};
> > +
> >  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> >  {
> >  	struct vfio_pci_device *vdev = device_data;
> > @@ -1357,8 +1421,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> >  	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> >  	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> >  
> > -	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > -			       req_len, vma->vm_page_prot);
> > +	/*
> > +	 * See remap_pfn_range(), called from vfio_pci_fault() but we can't
> > +	 * change vm_flags within the fault handler.  Set them now.
> > +	 */
> > +	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> > +	vma->vm_ops = &vfio_pci_mmap_ops;  
> 
> Perhaps do the vfio_pci_add_vma & remap_pfn_range combo here if the
> BAR is activated ? That way a fully populated BAR is presented in the
> common case and avoids taking a fault path?
> 
> But it does seem OK as is

Thanks for reviewing.  There's also an argument that we defer
remap_pfn_range() until the device is actually touched, which might
reduce the startup latency.  It's also a bit inconsistent with the
vm_ops.open() path where I can't return error, so I can't call
vfio_pci_add_vma(), I can only zap the vma so that the fault handler
can return an error if necessary.  Therefore it felt more consistent,
with potential startup latency improvements, to defer all mappings to
the fault handler.  If there's a good reason to do otherwise, I can
make the change, but I doubt I'd have encountered the dma mapping of an
unfaulted vma issue had I done it this way, so maybe there's a test
coverage argument as well.  Thanks,

Alex



* Re: [PATCH 1/3] vfio/type1: Support faulting PFNMAP vmas
  2020-05-04 14:06     ` Alex Williamson
@ 2020-05-04 15:02       ` Jason Gunthorpe
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2020-05-04 15:02 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, cohuck, peterx

On Mon, May 04, 2020 at 08:06:30AM -0600, Alex Williamson wrote:
> On Fri, 1 May 2020 20:50:33 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Fri, May 01, 2020 at 03:39:08PM -0600, Alex Williamson wrote:
> > > With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> > > the range being faulted into the vma.  Add support to manually provide
> > > that, in the same way as done on KVM with hva_to_pfn_remapped().
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > >  drivers/vfio/vfio_iommu_type1.c |   36 +++++++++++++++++++++++++++++++++---
> > >  1 file changed, 33 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index cc1d64765ce7..4a4cb7cd86b2 100644
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> > >  	return 0;
> > >  }
> > >  
> > > +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > > +			    unsigned long vaddr, unsigned long *pfn,
> > > +			    bool write_fault)
> > > +{
> > > +	int ret;
> > > +
> > > +	ret = follow_pfn(vma, vaddr, pfn);
> > > +	if (ret) {
> > > +		bool unlocked = false;
> > > +
> > > +		ret = fixup_user_fault(NULL, mm, vaddr,
> > > +				       FAULT_FLAG_REMOTE |
> > > +				       (write_fault ?  FAULT_FLAG_WRITE : 0),
> > > +				       &unlocked);
> > > +		if (unlocked)
> > > +			return -EAGAIN;
> > > +
> > > +		if (ret)
> > > +			return ret;
> > > +
> > > +		ret = follow_pfn(vma, vaddr, pfn);
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> > >  			 int prot, unsigned long *pfn)
> > >  {
> > > @@ -339,12 +365,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> > >  
> > >  	vaddr = untagged_addr(vaddr);
> > >  
> > > +retry:
> > >  	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
> > >  
> > >  	if (vma && vma->vm_flags & VM_PFNMAP) {
> > > -		if (!follow_pfn(vma, vaddr, pfn) &&
> > > -		    is_invalid_reserved_pfn(*pfn))
> > > -			ret = 0;
> > > +		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> > > +		if (ret == -EAGAIN)
> > > +			goto retry;
> > > +
> > > +		if (!ret && !is_invalid_reserved_pfn(*pfn))
> > > +			ret = -EFAULT;  
> > 
> > I suggest checking vma->vm_ops == &vfio_pci_mmap_ops and adding a
> > comment that this is racy and needs to be fixed up. The ops check
> > makes this only used by other vfio bars and should prevent some
> > abuses of this hacky thing
> 
> We can't do that, vfio-pci is only one bus driver within the vfio
> ecosystem.

Given this flow is already hacky, maybe it is OK?

> > However, I wonder if this chould just link itself into the
> > vma->private data so that when the vfio that owns the bar goes away,
> > so does the iommu mapping?
> 
> I don't really see why we wouldn't use mmu notifiers so that the vfio
> iommu backend and vfio bus driver remain independent.

mmu notifiers have tended to be complicated enough that if they can be
avoided it is usually better.

e.g. you can't just use mmu notifiers here, you have to use an entire
pinless page faulting scheme with locking like hmm_range_fault uses.

You also have to be very very careful with locking around invalidation
of the iommu to avoid deadlock. For instance the notifier invalidate
cannot do GFP_KERNEL memory allocations.

Jason


* Re: [PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking
  2020-05-04 14:20     ` Alex Williamson
@ 2020-05-04 15:05       ` Jason Gunthorpe
  2020-05-04 15:53         ` Alex Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2020-05-04 15:05 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, cohuck, peterx

On Mon, May 04, 2020 at 08:20:55AM -0600, Alex Williamson wrote:
> On Fri, 1 May 2020 20:25:50 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Fri, May 01, 2020 at 03:39:19PM -0600, Alex Williamson wrote:
> > > Rather than calling remap_pfn_range() when a region is mmap'd, setup
> > > a vm_ops handler to support dynamic faulting of the range on access.
> > > This allows us to manage a list of vmas actively mapping the area that
> > > we can later use to invalidate those mappings.  The open callback
> > > invalidates the vma range so that all tracking is inserted in the
> > > fault handler and removed in the close handler.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > >  drivers/vfio/pci/vfio_pci.c         |   76 ++++++++++++++++++++++++++++++++++-
> > >  drivers/vfio/pci/vfio_pci_private.h |    7 +++
> > >  2 files changed, 81 insertions(+), 2 deletions(-)  
> > 
> > > +static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > > +{
> > > +	struct vm_area_struct *vma = vmf->vma;
> > > +	struct vfio_pci_device *vdev = vma->vm_private_data;
> > > +
> > > +	if (vfio_pci_add_vma(vdev, vma))
> > > +		return VM_FAULT_OOM;
> > > +
> > > +	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > > +			    vma->vm_end - vma->vm_start, vma->vm_page_prot))
> > > +		return VM_FAULT_SIGBUS;
> > > +
> > > +	return VM_FAULT_NOPAGE;
> > > +}
> > > +
> > > +static const struct vm_operations_struct vfio_pci_mmap_ops = {
> > > +	.open = vfio_pci_mmap_open,
> > > +	.close = vfio_pci_mmap_close,
> > > +	.fault = vfio_pci_mmap_fault,
> > > +};
> > > +
> > >  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > >  {
> > >  	struct vfio_pci_device *vdev = device_data;
> > > @@ -1357,8 +1421,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > >  	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > >  	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> > >  
> > > -	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > > -			       req_len, vma->vm_page_prot);
> > > +	/*
> > > +	 * See remap_pfn_range(), called from vfio_pci_fault() but we can't
> > > +	 * change vm_flags within the fault handler.  Set them now.
> > > +	 */
> > > +	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> > > +	vma->vm_ops = &vfio_pci_mmap_ops;  
> > 
> > Perhaps do the vfio_pci_add_vma & remap_pfn_range combo here if the
> > BAR is activated ? That way a fully populated BAR is presented in the
> > common case and avoids taking a fault path?
> > 
> > But it does seem OK as is
> 
> Thanks for reviewing.  There's also an argument that we defer
> remap_pfn_range() until the device is actually touched, which might
> reduce the startup latency.

But not startup to a functional VM as that will now have to take the
slower fault path.

> It's also a bit inconsistent with the vm_ops.open() path where I
> can't return error, so I can't call vfio_pci_add_vma(), I can only
> zap the vma so that the fault handler can return an error if
> necessary.

open could allocate memory so the zap isn't needed. If allocation
fails then do the zap and take the slow path.

> handler.  If there's a good reason to do otherwise, I can make the
> change, but I doubt I'd have encountered the dma mapping of an
> unfaulted vma issue had I done it this way, so maybe there's a test
> coverage argument as well.  Thanks,

This test is best done by having one thread disable the other bar
while another thread is trying to map it

Jason


* Re: [PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking
  2020-05-04 15:05       ` Jason Gunthorpe
@ 2020-05-04 15:53         ` Alex Williamson
  0 siblings, 0 replies; 18+ messages in thread
From: Alex Williamson @ 2020-05-04 15:53 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: kvm, linux-kernel, cohuck, peterx

On Mon, 4 May 2020 12:05:56 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, May 04, 2020 at 08:20:55AM -0600, Alex Williamson wrote:
> > On Fri, 1 May 2020 20:25:50 -0300
> > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >   
> > > On Fri, May 01, 2020 at 03:39:19PM -0600, Alex Williamson wrote:  
> > > > Rather than calling remap_pfn_range() when a region is mmap'd, setup
> > > > a vm_ops handler to support dynamic faulting of the range on access.
> > > > This allows us to manage a list of vmas actively mapping the area that
> > > > we can later use to invalidate those mappings.  The open callback
> > > > invalidates the vma range so that all tracking is inserted in the
> > > > fault handler and removed in the close handler.
> > > > 
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > >  drivers/vfio/pci/vfio_pci.c         |   76 ++++++++++++++++++++++++++++++++++-
> > > >  drivers/vfio/pci/vfio_pci_private.h |    7 +++
> > > >  2 files changed, 81 insertions(+), 2 deletions(-)    
> > >   
> > > > +static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > > > +{
> > > > +	struct vm_area_struct *vma = vmf->vma;
> > > > +	struct vfio_pci_device *vdev = vma->vm_private_data;
> > > > +
> > > > +	if (vfio_pci_add_vma(vdev, vma))
> > > > +		return VM_FAULT_OOM;
> > > > +
> > > > +	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > > > +			    vma->vm_end - vma->vm_start, vma->vm_page_prot))
> > > > +		return VM_FAULT_SIGBUS;
> > > > +
> > > > +	return VM_FAULT_NOPAGE;
> > > > +}
> > > > +
> > > > +static const struct vm_operations_struct vfio_pci_mmap_ops = {
> > > > +	.open = vfio_pci_mmap_open,
> > > > +	.close = vfio_pci_mmap_close,
> > > > +	.fault = vfio_pci_mmap_fault,
> > > > +};
> > > > +
> > > >  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > > >  {
> > > >  	struct vfio_pci_device *vdev = device_data;
> > > > @@ -1357,8 +1421,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > > >  	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > >  	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> > > >  
> > > > -	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > > > -			       req_len, vma->vm_page_prot);
> > > > +	/*
> > > > +	 * See remap_pfn_range(), called from vfio_pci_fault() but we can't
> > > > +	 * change vm_flags within the fault handler.  Set them now.
> > > > +	 */
> > > > +	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> > > > +	vma->vm_ops = &vfio_pci_mmap_ops;    
> > > 
> > > Perhaps do the vfio_pci_add_vma & remap_pfn_range combo here if the
> > > BAR is activated ? That way a fully populated BAR is presented in the
> > > common case and avoids taking a fault path?
> > > 
> > > But it does seem OK as is  
> > 
> > Thanks for reviewing.  There's also an argument that we defer
> > remap_pfn_range() until the device is actually touched, which might
> > reduce the startup latency.  
> 
> But not startup to a functional VM as that will now have to take the
> slower fault path.

We need to take the fault path regardless because a VM will size and
(virtually) map the BARs, toggling the memory enable bit.  As provided
here, we don't trigger the fault unless the user attempts to access the
BAR or we DMA map the BAR.  That defers the fault until the VM is (to
some extent) up and running, and has a better chance for multi-threaded
faulting than does QEMU initialization. 
 
> > It's also a bit inconsistent with the vm_ops.open() path where I
> > can't return error, so I can't call vfio_pci_add_vma(), I can only
> > zap the vma so that the fault handler can return an error if
> > necessary.  
> 
> open could allocate memory so the zap isn't needed. If allocation
> fails then do the zap and take the slow path.

That's a good idea, but it also gives us one more initialization
variation.  I thought it was a rather nice feature that our vma_list
includes only vmas that have actually faulted in the mapping since it
was last zapped.  This is our steady state, so why not get there
immediately rather than putting every mmap or open into the list,
regardless of whether that particular vma ever accesses the mapping?
If we have vmas in our vma_list that have never accessed the mapping,
we're increasing our latency on zap.

> > handler.  If there's a good reason to do otherwise, I can make the
> > change, but I doubt I'd have encountered the dma mapping of an
> > unfaulted vma issue had I done it this way, so maybe there's a test
> > coverage argument as well.  Thanks,  
> 
> This test is best done by having one thread disable the other bar
> while another thread is trying to map it

We can split hairs on the best mechanism to trigger this, but the best
test is the one that actually gets run, and this is triggered by simply
starting an assigned device VM.  Thanks,

Alex



* Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
  2020-05-01 23:48   ` Jason Gunthorpe
@ 2020-05-04 18:26     ` Alex Williamson
  2020-05-04 18:44       ` Jason Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2020-05-04 18:26 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: kvm, linux-kernel, cohuck, peterx

On Fri, 1 May 2020 20:48:49 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Fri, May 01, 2020 at 03:39:30PM -0600, Alex Williamson wrote:
> 
> >  static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
> >  			    struct vm_area_struct *vma)
> >  {
> > @@ -1346,15 +1450,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> > +	vm_fault_t ret = VM_FAULT_NOPAGE;
> >  
> > -	if (vfio_pci_add_vma(vdev, vma))
> > -		return VM_FAULT_OOM;
> > +	/*
> > +	 * Zap callers hold memory_lock and acquire mmap_sem, we hold
> > +	 * mmap_sem and need to acquire memory_lock to avoid races with
> > +	 * memory bit settings.  Release mmap_sem, wait, and retry, or fail.
> > +	 */
> > +	if (unlikely(!down_read_trylock(&vdev->memory_lock))) {
> > +		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
> > +			if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> > +				return VM_FAULT_RETRY;
> > +
> > +			up_read(&vma->vm_mm->mmap_sem);
> > +
> > +			if (vmf->flags & FAULT_FLAG_KILLABLE) {
> > +				if (!down_read_killable(&vdev->memory_lock))
> > +					up_read(&vdev->memory_lock);
> > +			} else {
> > +				down_read(&vdev->memory_lock);
> > +				up_read(&vdev->memory_lock);
> > +			}
> > +			return VM_FAULT_RETRY;
> > +		}
> > +		return VM_FAULT_SIGBUS;
> > +	}  
> 
> So, why have the wait? It isn't reliable - if this gets faulted from a
> call site that can't handle retry then it will SIGBUS anyhow?

Do such call sites exist?  My assumption was that half of the branch
was unlikely to ever occur.

> The weird use of a rwsem as a completion suggest that perhaps using
> wait_event might improve things:
> 
> disable:
>   // Clean out the vma list with zap, then:
> 
>   down_read(mm->mmap_sem)

I assume this is simplifying the dance we do in zapping to first take
vma_lock in order to walk vma_list, to find a vma from which we can
acquire the mm, drop vma_lock, get mmap_sem, then re-get vma_lock
below.  Also accounting that vma_list might be empty and we might need
to drop and re-acquire vma_lock to get to another mm, so we really
probably want to set pause_faults at the start rather than at the end.

>   mutex_lock(vma_lock);
>   list_for_each_entry_safe()
>      // zap and remove all vmas
> 
>   pause_faults = true;
>   mutex_write(vma_lock);
> 
> fault:
>   // Already have down_read(mmap_sem)
>   mutex_lock(vma_lock);
>   while (pause_faults) {
>      mutex_unlock(vma_lock)
>      wait_event(..., !pause_faults)
>      mutex_lock(vma_lock)
>   }

Nit, we need to test the memory enable bit setting somewhere under this
lock since it seems to be the only thing protecting it now.

>   list_add()
>   remap_pfn()
>   mutex_unlock(vma_lock)

The read and write file ops would need similar mechanisms.

> enable:
>   pause_faults = false
>   wake_event()

Hmm, vma_lock was dropped above and not re-acquired here.  I'm not sure
if it was an oversight that pause_faults was not tested in the disable
path, but this combination appears to lead to concurrent writers and
serialized readers??

So yeah, this might resolve a theoretical SIGBUS if we can't retry to
get the memory_lock ordering correct, but we also lose the concurrency
that memory_lock provided us.

> 
> The only requirement here is that while inside the write side of
> memory_lock you cannot touch user pages (ie no copy_from_user/etc)

I'm lost at this statement, I can only figure the above works if we
remove memory_lock.  Are you referring to a different lock?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
  2020-05-04 18:26     ` Alex Williamson
@ 2020-05-04 18:44       ` Jason Gunthorpe
  2020-05-04 19:35         ` Alex Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2020-05-04 18:44 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, cohuck, peterx

On Mon, May 04, 2020 at 12:26:43PM -0600, Alex Williamson wrote:
> On Fri, 1 May 2020 20:48:49 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Fri, May 01, 2020 at 03:39:30PM -0600, Alex Williamson wrote:
> > 
> > >  static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
> > >  			    struct vm_area_struct *vma)
> > >  {
> > > @@ -1346,15 +1450,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > >  {
> > >  	struct vm_area_struct *vma = vmf->vma;
> > >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> > > +	vm_fault_t ret = VM_FAULT_NOPAGE;
> > >  
> > > -	if (vfio_pci_add_vma(vdev, vma))
> > > -		return VM_FAULT_OOM;
> > > +	/*
> > > +	 * Zap callers hold memory_lock and acquire mmap_sem, we hold
> > > +	 * mmap_sem and need to acquire memory_lock to avoid races with
> > > +	 * memory bit settings.  Release mmap_sem, wait, and retry, or fail.
> > > +	 */
> > > +	if (unlikely(!down_read_trylock(&vdev->memory_lock))) {
> > > +		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
> > > +			if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> > > +				return VM_FAULT_RETRY;
> > > +
> > > +			up_read(&vma->vm_mm->mmap_sem);
> > > +
> > > +			if (vmf->flags & FAULT_FLAG_KILLABLE) {
> > > +				if (!down_read_killable(&vdev->memory_lock))
> > > +					up_read(&vdev->memory_lock);
> > > +			} else {
> > > +				down_read(&vdev->memory_lock);
> > > +				up_read(&vdev->memory_lock);
> > > +			}
> > > +			return VM_FAULT_RETRY;
> > > +		}
> > > +		return VM_FAULT_SIGBUS;
> > > +	}  
> > 
> > So, why have the wait? It isn't reliable - if this gets faulted from a
> > call site that can't handle retry then it will SIGBUS anyhow?
> 
> Do such call sites exist?  My assumption was that half of the branch
> was unlikely to ever occur.

hmm_range_fault() for instance doesn't set ALLOW_RETRY, I assume there
are enough other cases to care about, but am not so sure.

> > The weird use of a rwsem as a completion suggest that perhaps using
> > wait_event might improve things:
> > 
> > disable:
> >   // Clean out the vma list with zap, then:
> > 
> >   down_read(mm->mmap_sem)
> 
> I assume this is simplifying the dance we do in zapping to first take
> vma_lock in order to walk vma_list, to find a vma from which we can
> acquire the mm, drop vma_lock, get mmap_sem, then re-get vma_lock
> below.  

No, that has to stay..

> Also accounting that vma_list might be empty and we might need
> to drop and re-acquire vma_lock to get to another mm, so we really
> probably want to set pause_faults at the start rather than at the end.

New vmas should not be created/faulted while vma_lock is held, so the
order shouldn't matter..

> >   mutex_lock(vma_lock);
> >   list_for_each_entry_safe()
> >      // zap and remove all vmas
> > 
> >   pause_faults = true;
> >   mutex_write(vma_lock);
> > 
> > fault:
> >   // Already have down_read(mmap_sem)
> >   mutex_lock(vma_lock);
> >   while (pause_faults) {
> >      mutex_unlock(vma_lock)
> >      wait_event(..., !pause_faults)
> >      mutex_lock(vma_lock)
> >   }
> 
> Nit, we need to test the memory enable bit setting somewhere under this
> lock since it seems to be the only thing protecting it now.

I was thinking you'd keep the same locking for the memory enable bit,
the pause_faults is a shadow of that bit with locking connected to
vma_lock..

> >   list_add()
> >   remap_pfn()
> >   mutex_unlock(vma_lock)
> 
> The read and write file ops would need similar mechanisms.

Keep using the rwsem?

> > enable:
> >   pause_faults = false
> >   wake_event()
> 
> Hmm, vma_lock was dropped above and not re-acquired here.

I was thinking this would be under a continuous rwlock

> I'm not sure if it was an oversight that pause_faults was not tested
> in the disable path, but this combination appears to lead to
> concurrent writers and serialized readers??

? pause_faults only exists to prevent the vm_ops fault callback from
progressing to a fault. I don't think any concurrency is lost.

> > The only requirement here is that while inside the write side of
> > memory_lock you cannot touch user pages (ie no copy_from_user/etc)
> 
> I'm lost at this statement, I can only figure the above works if we
> remove memory_lock.  Are you referring to a different lock?  Thanks,

No

This is just an approach to avoid the ABBA deadlock problem when using
a rwsem by using a looser form of lock combined with the already
correctly nested vma_lock.

Stated another way, you can keep the existing memory_lock as is, if it
is structured like this:

disable:
 down_read(mmap_sem)
 mutex_lock(vma_lock)
 list_for_each_entry_safe()
      // zap and remove all vmas
 down_write(memory_lock)   // Now inside vma_lock!
 mutex_unlock(vma_lock)
 up_read(mmap_sem)

 [ do the existing stuff under memory_lock ]


fault:
  mutex_lock(vma_lock)
  down_write(memory_lock)
  remap_pfn
  up_write(memory_lock)
  mutex_unlock(vma_lock)

enable:
  up_write(memory_lock)

Ie the key is to organize things to move the down_write(memory_lock)
to be under the mmap_sem/vma_lock

Jason

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
  2020-05-04 18:44       ` Jason Gunthorpe
@ 2020-05-04 19:35         ` Alex Williamson
  2020-05-04 20:01           ` Jason Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2020-05-04 19:35 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: kvm, linux-kernel, cohuck, peterx

On Mon, 4 May 2020 15:44:36 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, May 04, 2020 at 12:26:43PM -0600, Alex Williamson wrote:
> > On Fri, 1 May 2020 20:48:49 -0300
> > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >   
> > > On Fri, May 01, 2020 at 03:39:30PM -0600, Alex Williamson wrote:
> > >   
> > > >  static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
> > > >  			    struct vm_area_struct *vma)
> > > >  {
> > > > @@ -1346,15 +1450,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > > >  {
> > > >  	struct vm_area_struct *vma = vmf->vma;
> > > >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> > > > +	vm_fault_t ret = VM_FAULT_NOPAGE;
> > > >  
> > > > -	if (vfio_pci_add_vma(vdev, vma))
> > > > -		return VM_FAULT_OOM;
> > > > +	/*
> > > > +	 * Zap callers hold memory_lock and acquire mmap_sem, we hold
> > > > +	 * mmap_sem and need to acquire memory_lock to avoid races with
> > > > +	 * memory bit settings.  Release mmap_sem, wait, and retry, or fail.
> > > > +	 */
> > > > +	if (unlikely(!down_read_trylock(&vdev->memory_lock))) {
> > > > +		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
> > > > +			if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> > > > +				return VM_FAULT_RETRY;
> > > > +
> > > > +			up_read(&vma->vm_mm->mmap_sem);
> > > > +
> > > > +			if (vmf->flags & FAULT_FLAG_KILLABLE) {
> > > > +				if (!down_read_killable(&vdev->memory_lock))
> > > > +					up_read(&vdev->memory_lock);
> > > > +			} else {
> > > > +				down_read(&vdev->memory_lock);
> > > > +				up_read(&vdev->memory_lock);
> > > > +			}
> > > > +			return VM_FAULT_RETRY;
> > > > +		}
> > > > +		return VM_FAULT_SIGBUS;
> > > > +	}    
> > > 
> > > So, why have the wait? It isn't reliable - if this gets faulted from a
> > > call site that can't handle retry then it will SIGBUS anyhow?  
> > 
> > Do such call sites exist?  My assumption was that half of the branch
> > was unlikely to ever occur.  
> 
> hmm_range_fault() for instance doesn't set ALLOW_RETRY, I assume there
> are enough other cases to care about, but am not so sure.
> 
> > > The weird use of a rwsem as a completion suggest that perhaps using
> > > wait_event might improve things:
> > > 
> > > disable:
> > >   // Clean out the vma list with zap, then:
> > > 
> > >   down_read(mm->mmap_sem)  
> > 
> > I assume this is simplifying the dance we do in zapping to first take
> > vma_lock in order to walk vma_list, to find a vma from which we can
> > acquire the mm, drop vma_lock, get mmap_sem, then re-get vma_lock
> > below.    
> 
> No, that has to stay..

Sorry, I stated that unclearly; I'm assuming we keep that and that it's
been omitted from this pseudocode for simplicity.
 
> > Also accounting that vma_list might be empty and we might need
> > to drop and re-acquire vma_lock to get to another mm, so we really
> > probably want to set pause_faults at the start rather than at the end.  
> 
> New vmas should not be created/faulted while vma_lock is held, so the
> order shouldn't matter..

Technically that's true, but if vfio_pci_zap_mmap_vmas() drops vma_lock
to go back and get another mm, then vm_ops.fault() could get another
vma into the list while we're trying to zap and clear them all.  The
result is the same, but we might be doing unnecessary work versus
holding off the fault from the start.
 
> > >   mutex_lock(vma_lock);
> > >   list_for_each_entry_safe()
> > >      // zap and remove all vmas
> > > 
> > >   pause_faults = true;
> > >   mutex_write(vma_lock);
> > > 
> > > fault:
> > >   // Already have down_read(mmap_sem)
> > >   mutex_lock(vma_lock);
> > >   while (pause_faults) {
> > >      mutex_unlock(vma_lock)
> > >      wait_event(..., !pause_faults)
> > >      mutex_lock(vma_lock)
> > >   }  
> > 
> > Nit, we need to test the memory enable bit setting somewhere under this
> > lock since it seems to be the only thing protecting it now.  
> 
> I was thinking you'd keep the same locking for the memory enable bit,
> the pause_faults is a shadow of that bit with locking connected to
> vma_lock..

Oh!  I totally did not get that!

> > >   list_add()
> > >   remap_pfn()
> > >   mutex_unlock(vma_lock)  
> > 
> > The read and write file ops would need similar mechanisms.  
> 
> Keep using the rwsem?
> 
> > > enable:
> > >   pause_faults = false
> > >   wake_event()  
> > 
> > Hmm, vma_lock was dropped above and not re-acquired here.  
> 
> I was thinking this would be under a continuous rwlock
> 
> > I'm not sure if it was an oversight that pause_faults was not tested
> > in the disable path, but this combination appears to lead to
> > concurrent writers and serialized readers??  
> 
> ? pause_faults only exists to prevent the vm_ops fault callback from
> progressing to a fault. I don't think any concurrency is lost.
> 
> > > The only requirement here is that while inside the write side of
> > > memory_lock you cannot touch user pages (ie no copy_from_user/etc)  
> > 
> > I'm lost at this statement, I can only figure the above works if we
> > remove memory_lock.  Are you referring to a different lock?  Thanks,  
> 
> No
> 
> This is just an approach to avoid the ABBA deadlock problem when using
> a rwsem by using a looser form of lock combined with the already
> correctly nested vma_lock.
> 
> Stated another way, you can keep the existing memory_lock as is, if it
> is structured like this:
> 
> disable:
>  down_read(mmap_sem)
>  mutex_lock(vma_lock)
>  list_for_each_entry_safe()
>       // zap and remove all vmas
>  down_write(memory_lock)   // Now inside vma_lock!
>  mutex_unlock(vma_lock)
>  up_read(mmap_sem)
> 
>  [ do the existing stuff under memory_lock ]
> 
> 
> fault:
>   mutex_lock(vma_lock)
>   down_write(memory_lock)
>   remap_pfn
>   up_write(memory_lock)
>   mutex_unlock(vma_lock)
> 
> enable:
>   up_write(memory_lock)
> 
> Ie the key is to organize things to move the down_write(memory_lock)
> to be under the mmap_sem/vma_lock

Ok, this all makes a lot more sense with memory_lock still in the
picture.  And it looks like you're not insisting on the wait_event, we
can block on memory_lock so long as we don't have an ordering issue.
I'll see what I can do.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
  2020-05-04 19:35         ` Alex Williamson
@ 2020-05-04 20:01           ` Jason Gunthorpe
  2020-05-05 17:12             ` Alex Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2020-05-04 20:01 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, cohuck, peterx

On Mon, May 04, 2020 at 01:35:52PM -0600, Alex Williamson wrote:

> Ok, this all makes a lot more sense with memory_lock still in the
> picture.  And it looks like you're not insisting on the wait_event, we
> can block on memory_lock so long as we don't have an ordering issue.
> I'll see what I can do.  Thanks,

Right, you can block on the rwsem if it is ordered properly vs
mmap_sem.

Jason

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
  2020-05-04 20:01           ` Jason Gunthorpe
@ 2020-05-05 17:12             ` Alex Williamson
  2020-05-05 18:36               ` Jason Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2020-05-05 17:12 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: kvm, linux-kernel, cohuck, peterx

On Mon, 4 May 2020 17:01:23 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, May 04, 2020 at 01:35:52PM -0600, Alex Williamson wrote:
> 
> > Ok, this all makes a lot more sense with memory_lock still in the
> > picture.  And it looks like you're not insisting on the wait_event, we
> > can block on memory_lock so long as we don't have an ordering issue.
> > I'll see what I can do.  Thanks,  
> 
> Right, you can block on the rwsem if it is ordered properly vs
> mmap_sem.

This is what I've come up with, please see if you agree with the logic:

void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device *vdev)
{
        struct vfio_pci_mmap_vma *mmap_vma, *tmp;

        /*
         * Lock ordering:
         * vma_lock is nested under mmap_sem for vm_ops callback paths.
         * The memory_lock semaphore is used by both code paths calling
         * into this function to zap vmas and the vm_ops.fault callback
         * to protect the memory enable state of the device.
         *
         * When zapping vmas we need to maintain the mmap_sem => vma_lock
         * ordering, which requires using vma_lock to walk vma_list to
         * acquire an mm, then dropping vma_lock to get the mmap_sem and
         * reacquiring vma_lock.  This logic is derived from similar
         * requirements in uverbs_user_mmap_disassociate().
         *
         * mmap_sem must always be the top-level lock when it is taken.
         * Therefore we can only hold the memory_lock write lock when
         * vma_list is empty, as we'd need to take mmap_sem to clear
         * entries.  vma_list can only be guaranteed empty when holding
         * vma_lock, thus memory_lock is nested under vma_lock.
         *
         * This enables the vm_ops.fault callback to acquire vma_lock,
         * followed by memory_lock read lock, while already holding
         * mmap_sem without risk of deadlock.
         */
        while (1) {
                struct mm_struct *mm = NULL;

                mutex_lock(&vdev->vma_lock);
                while (!list_empty(&vdev->vma_list)) {
                        mmap_vma = list_first_entry(&vdev->vma_list,
                                                    struct vfio_pci_mmap_vma,
                                                    vma_next);
                        mm = mmap_vma->vma->vm_mm;
                        if (mmget_not_zero(mm))
                                break;

                        list_del(&mmap_vma->vma_next);
                        kfree(mmap_vma);
                        mm = NULL;
                }

                if (!mm)
                        break;
                mutex_unlock(&vdev->vma_lock);

                down_read(&mm->mmap_sem);
                if (mmget_still_valid(mm)) {
                        mutex_lock(&vdev->vma_lock);
                        list_for_each_entry_safe(mmap_vma, tmp,
                                                 &vdev->vma_list, vma_next) {
                                struct vm_area_struct *vma = mmap_vma->vma;

                                if (vma->vm_mm != mm)
                                        continue;

                                list_del(&mmap_vma->vma_next);
                                kfree(mmap_vma);

                                zap_vma_ptes(vma, vma->vm_start,
                                             vma->vm_end - vma->vm_start);
                        }
                        mutex_unlock(&vdev->vma_lock);
                }
                up_read(&mm->mmap_sem);
                mmput(mm);
        }

        down_write(&vdev->memory_lock);
        mutex_unlock(&vdev->vma_lock);
}

As noted in the comment, the fault handler can simply do:

mutex_lock(&vdev->vma_lock);
down_read(&vdev->memory_lock);

This should be deadlock free now, so we can drop the retry handling
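
For completeness, the whole fault handler would then look something
like the following, where __vfio_pci_memory_enabled() is a stand-in for
however we end up testing the shadowed memory enable state, and
vm_pgoff is assumed to already hold the BAR pfn from mmap:

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct vfio_pci_device *vdev = vma->vm_private_data;
        vm_fault_t ret = VM_FAULT_NOPAGE;

        mutex_lock(&vdev->vma_lock);
        down_read(&vdev->memory_lock);

        if (!__vfio_pci_memory_enabled(vdev)) {         /* stand-in test */
                ret = VM_FAULT_SIGBUS;
                goto up_out;
        }

        if (vfio_pci_add_vma(vdev, vma)) {
                ret = VM_FAULT_OOM;
                goto up_out;
        }

        if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                               vma->vm_end - vma->vm_start,
                               vma->vm_page_prot))
                ret = VM_FAULT_SIGBUS;

up_out:
        up_read(&vdev->memory_lock);
        mutex_unlock(&vdev->vma_lock);
        return ret;
}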

Paths needing to acquire memory_lock with vmas zapped (device reset,
memory bit *->0 transition) call this function, perform their
operation, then simply release with up_write(&vdev->memory_lock).  Both
the read and write versions of acquiring memory_lock can still occur
outside this function for operations that don't require flushing all
vmas or otherwise touch vma_lock or mmap_sem (ex. read/write, MSI-X
vector table access, writing *->1 to memory enable bit).
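
For example, a read path touching BAR space only needs the read side
around the access itself, along these lines (the helper name and the
__vfio_pci_memory_enabled() test are again only illustrative):

static int vfio_pci_bar_read32(struct vfio_pci_device *vdev,
                               void __iomem *io, u32 *val)
{
        int ret = 0;

        down_read(&vdev->memory_lock);
        if (__vfio_pci_memory_enabled(vdev))
                *val = ioread32(io);
        else
                ret = -EIO;             /* memory decode disabled */
        up_read(&vdev->memory_lock);

        return ret;
}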

I still need to work on the bus reset path, as acquiring memory_lock
write locks across multiple devices seems to require try-lock
behavior, which is clearly complicated, or at least messy in the above
function.
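
The shape that seems to be needed there is a try-lock sweep with unwind
across the affected devices, something like this sketch (not the
eventual implementation):

static int vfio_pci_try_down_write_all(struct vfio_pci_device **devs,
                                       int count)
{
        int i;

        for (i = 0; i < count; i++) {
                if (!down_write_trylock(&devs[i]->memory_lock)) {
                        /* Contention; unwind and let the caller retry. */
                        while (--i >= 0)
                                up_write(&devs[i]->memory_lock);
                        return -EBUSY;
                }
        }

        return 0;
}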

Does this seem like it's going in a reasonable direction?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
  2020-05-05 17:12             ` Alex Williamson
@ 2020-05-05 18:36               ` Jason Gunthorpe
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2020-05-05 18:36 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, cohuck, peterx

On Tue, May 05, 2020 at 11:12:27AM -0600, Alex Williamson wrote:
> 
> As noted in the comment, the fault handler can simply do:
> 
> mutex_lock(&vdev->vma_lock);
> down_read(&vdev->memory_lock);
> 
> This should be deadlock free now, so we can drop the retry handling

That does look like the right direction; because the memory_lock can
be taken at the very end, it doesn't need to be nested inside
mmap_sem.

This is much cleaner!

Jason

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-05-05 18:36 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
2020-05-01 21:38 [PATCH 0/3] vfio-pci: Block user access to disabled device MMIO Alex Williamson
2020-05-01 21:39 ` [PATCH 1/3] vfio/type1: Support faulting PFNMAP vmas Alex Williamson
2020-05-01 23:50   ` Jason Gunthorpe
2020-05-04 14:06     ` Alex Williamson
2020-05-04 15:02       ` Jason Gunthorpe
2020-05-01 21:39 ` [PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking Alex Williamson
2020-05-01 23:25   ` Jason Gunthorpe
2020-05-04 14:20     ` Alex Williamson
2020-05-04 15:05       ` Jason Gunthorpe
2020-05-04 15:53         ` Alex Williamson
2020-05-01 21:39 ` [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory Alex Williamson
2020-05-01 23:48   ` Jason Gunthorpe
2020-05-04 18:26     ` Alex Williamson
2020-05-04 18:44       ` Jason Gunthorpe
2020-05-04 19:35         ` Alex Williamson
2020-05-04 20:01           ` Jason Gunthorpe
2020-05-05 17:12             ` Alex Williamson
2020-05-05 18:36               ` Jason Gunthorpe
