From: Alex Williamson <alex.williamson@redhat.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cohuck@redhat.com, peterx@redhat.com
Subject: Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
Date: Tue, 5 May 2020 11:12:27 -0600
Message-ID: <20200505111227.02ac9cee@w520.home>
In-Reply-To: <20200504200123.GA26002@ziepe.ca>

On Mon, 4 May 2020 17:01:23 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, May 04, 2020 at 01:35:52PM -0600, Alex Williamson wrote:
> 
> > Ok, this all makes a lot more sense with memory_lock still in the
> > picture.  And it looks like you're not insisting on the wait_event, we
> > can block on memory_lock so long as we don't have an ordering issue.
> > I'll see what I can do.  Thanks,  
> 
> Right, you can block on the rwsem if it is ordered properly vs
> mmap_sem.

This is what I've come up with; please see if you agree with the logic:

void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device *vdev)
{
        struct vfio_pci_mmap_vma *mmap_vma, *tmp;

        /*
         * Lock ordering:
         * vma_lock is nested under mmap_sem for vm_ops callback paths.
         * The memory_lock semaphore is used by both code paths calling
         * into this function to zap vmas and the vm_ops.fault callback
         * to protect the memory enable state of the device.
         *
         * When zapping vmas we need to maintain the mmap_sem => vma_lock
         * ordering, which requires using vma_lock to walk vma_list to
         * acquire an mm, then dropping vma_lock to get the mmap_sem and
         * reacquiring vma_lock.  This logic is derived from similar
         * requirements in uverbs_user_mmap_disassociate().
         *
         * mmap_sem must always be the top-level lock when it is taken.
         * Therefore we can only hold the memory_lock write lock when
         * vma_list is empty, as we'd need to take mmap_sem to clear
         * entries.  vma_list can only be guaranteed empty when holding
         * vma_lock, thus memory_lock is nested under vma_lock.
         *
         * This enables the vm_ops.fault callback to acquire vma_lock,
         * followed by memory_lock read lock, while already holding
         * mmap_sem without risk of deadlock.
         */
        while (1) {
                struct mm_struct *mm = NULL;

                mutex_lock(&vdev->vma_lock);
                while (!list_empty(&vdev->vma_list)) {
                        mmap_vma = list_first_entry(&vdev->vma_list,
                                                    struct vfio_pci_mmap_vma,
                                                    vma_next);
                        mm = mmap_vma->vma->vm_mm;
                        if (mmget_not_zero(mm))
                                break;

                        list_del(&mmap_vma->vma_next);
                        kfree(mmap_vma);
                        mm = NULL;
                }

                if (!mm)
                        break;
                mutex_unlock(&vdev->vma_lock);

                down_read(&mm->mmap_sem);
                if (mmget_still_valid(mm)) {
                        mutex_lock(&vdev->vma_lock);
                        list_for_each_entry_safe(mmap_vma, tmp,
                                                 &vdev->vma_list, vma_next) {
                                struct vm_area_struct *vma = mmap_vma->vma;

                                if (vma->vm_mm != mm)
                                        continue;

                                list_del(&mmap_vma->vma_next);
                                kfree(mmap_vma);

                                zap_vma_ptes(vma, vma->vm_start,
                                             vma->vm_end - vma->vm_start);
                        }
                        mutex_unlock(&vdev->vma_lock);
                }
                up_read(&mm->mmap_sem);
                mmput(mm);
        }

        down_write(&vdev->memory_lock);
        mutex_unlock(&vdev->vma_lock);
}

As noted in the comment, the fault handler can simply do:

mutex_lock(&vdev->vma_lock);
down_read(&vdev->memory_lock);

This should be deadlock-free now, so we can drop the retry handling.
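Spelled out a bit more, the full fault path might then look roughly like
the sketch below.  __vfio_pci_memory_enabled() and __vfio_pci_add_vma()
are only placeholder names here for the memory enable state check and
the vma tracking added earlier in the series; mmap_sem is already held
by the fault path, so this ordering is exactly the one described in the
comment above:

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct vfio_pci_device *vdev = vma->vm_private_data;
        vm_fault_t ret = VM_FAULT_NOPAGE;

        mutex_lock(&vdev->vma_lock);
        down_read(&vdev->memory_lock);

        /* Placeholder: test the virtualized memory enable state */
        if (!__vfio_pci_memory_enabled(vdev)) {
                ret = VM_FAULT_SIGBUS;
                goto out_unlock;
        }

        /* Placeholder: record this vma on vma_list for later zapping */
        if (__vfio_pci_add_vma(vdev, vma)) {
                ret = VM_FAULT_OOM;
                goto out_unlock;
        }

        if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                            vma->vm_end - vma->vm_start,
                            vma->vm_page_prot))
                ret = VM_FAULT_SIGBUS;

out_unlock:
        up_read(&vdev->memory_lock);
        mutex_unlock(&vdev->vma_lock);

        return ret;
}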

Paths that need to acquire memory_lock with vmas zapped (device reset,
memory bit *->0 transition) call this function, perform their
operation, then simply release with up_write(&vdev->memory_lock).  Both
the read and write versions of acquiring memory_lock can still occur
outside this function for operations that don't require flushing all
vmas or otherwise touching vma_lock or mmap_sem (e.g. read/write, MSI-X
vector table access, writing *->1 to the memory enable bit).
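For example, a memory enable bit 1->0 write in config space would then
follow a pattern roughly like (error handling elided):

        vfio_pci_zap_and_down_write_memory_lock(vdev);

        /* ... emulate the config write with user MMIO access blocked ... */

        up_write(&vdev->memory_lock);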

I still need to work on the bus reset path, as acquiring memory_lock
write locks across multiple devices seems to require try-lock behavior,
which is clearly complicated, or at least messy, in the above function.
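As a very rough, untested sketch (the devs array naming just follows
the existing bus reset walk code as an assumption, and the per-device
vma zapping step is elided), the multi-device case might look
something like:

        /*
         * Hypothetical sketch only: take memory_lock for write on every
         * affected device, unwinding if any trylock fails.  Each device
         * would also need its vmas zapped first, which is the messy part.
         */
        for (i = 0; i < devs.cur_index; i++) {
                struct vfio_pci_device *tmp =
                                vfio_device_data(devs.devices[i]);

                if (!down_write_trylock(&tmp->memory_lock)) {
                        /* Unwind the locks already taken and bail/retry */
                        while (i--) {
                                tmp = vfio_device_data(devs.devices[i]);
                                up_write(&tmp->memory_lock);
                        }
                        ret = -EBUSY;
                        break;
                }
        }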

Does this seem like it's going in a reasonable direction?  Thanks,

Alex

