From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: <pbonzini@redhat.com>, <kraxel@redhat.com>, <cjia@nvidia.com>,
	<qemu-devel@nongnu.org>, <kvm@vger.kernel.org>,
	<kevin.tian@intel.com>, <jike.song@intel.com>,
	<bjsdjshi@linux.vnet.ibm.com>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v11 10/22] vfio iommu type1: Add support for mediated devices
Date: Tue, 8 Nov 2016 10:05:39 -0700	[thread overview]
Message-ID: <20161108100539.4f03572e@t450s.home> (raw)
In-Reply-To: <f99b668b-e938-9fde-d166-cacfc7b07a53@nvidia.com>

On Tue, 8 Nov 2016 20:36:34 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/8/2016 4:46 AM, Alex Williamson wrote:
> > On Sat, 5 Nov 2016 02:40:44 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> ...
> 
> >> -static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >> +static int __vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
> >> +				    int prot, unsigned long *pfn_base,
> >> +				    bool do_accounting)
> >> +{
> >> +	struct task_struct *task = dma->task;
> >> +	unsigned long limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >> +	bool lock_cap = dma->mlock_cap;
> >> +	struct mm_struct *mm = dma->addr_space->mm;
> >> +	int ret;
> >> +	bool rsvd;
> >> +
> >> +	ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	rsvd = is_invalid_reserved_pfn(*pfn_base);
> >> +
> >> +	if (!rsvd && !lock_cap && mm->locked_vm + 1 > limit) {
> >> +		put_pfn(*pfn_base, prot);
> >> +		pr_warn("%s: Task %s (%d) RLIMIT_MEMLOCK (%ld) exceeded\n",
> >> +			__func__, task->comm, task_pid_nr(task),
> >> +			limit << PAGE_SHIFT);
> >> +		return -ENOMEM;
> >> +	}
> >> +
> >> +	if (!rsvd && do_accounting)
> >> +		vfio_lock_acct(mm, 1);
> >> +
> >> +	return 1;
> >> +}
> >> +
> >> +static void __vfio_unpin_page_external(struct vfio_addr_space *addr_space,
> >> +				       unsigned long pfn, int prot,
> >> +				       bool do_accounting)
> >> +{
> >> +	put_pfn(pfn, prot);
> >> +
> >> +	if (do_accounting)
> >> +		vfio_lock_acct(addr_space->mm, -1);  
> > 
> > Can't we batch this like we do elsewhere?  Intel folks, AIUI you intend
> > to pin all VM memory through this side channel, have you tested the
> > scalability and performance of this with larger VMs?  Our vfio_pfn
> > data structure alone is 40 bytes per pinned page, which means for
> > each 1GB of VM memory, we have 10MB worth of struct vfio_pfn!
> > Additionally, unmapping each 1GB of VM memory will result in 256k
> > separate vfio_lock_acct() callbacks.  I'm concerned that we're not
> > being efficient enough in either space or time.
> > 
> > One thought might be whether we really need to save the pfn; we should
> > always get the same result if we pin it again, or maybe we can just do
> > a lookup through the mm at that point without re-pinning.  Could we get
> > to the point where we only need an atomic_t ref count per page in a
> > linear array relative to the IOVA?
> 
> Ok. Is System RAM hot-plug supported? How is System RAM hot-plug
> handled? Are there DMA_MAP calls on such hot-plug for the additional
> range? If we have a linear array/memory, will we have to realloc it on
> memory hot-plug?

I was thinking a linear array for each IOVA page within a vfio_dma.
The array would track the number of references (pins) of each page.  It
might actually need to be a page table, given that a single vfio_dma
can nearly map the entire 64-bit address space.  I don't think RAM
hotplug is a factor here; we need to support and properly account for
multiple IOVAs mapping to the same pfn, but the typical case will be a
1:1 mapping, and I think that's what we'd optimize for.
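
For instance, a rough sketch of the refcounting half of that idea (the
pin_count field and helper names below are hypothetical, not from your
patch, and the flat array would need to become a page table to handle
sparse or very large mappings):

static atomic_t *vfio_alloc_pin_counts(struct vfio_dma *dma)
{
	/* 4 bytes per 4K page: the 1MB per 1GB overhead noted below */
	return vzalloc((dma->size >> PAGE_SHIFT) * sizeof(atomic_t));
}

static void vfio_pin_one(struct vfio_dma *dma, dma_addr_t iova)
{
	unsigned long idx = (iova - dma->iova) >> PAGE_SHIFT;

	/* account only on the 0 -> 1 transition; this could be batched */
	if (atomic_inc_return(&dma->pin_count[idx]) == 1)
		vfio_lock_acct(dma->addr_space->mm, 1);
}

static void vfio_unpin_one(struct vfio_dma *dma, dma_addr_t iova)
{
	unsigned long idx = (iova - dma->iova) >> PAGE_SHIFT;

	if (atomic_dec_and_test(&dma->pin_count[idx]))
		vfio_lock_acct(dma->addr_space->mm, -1);
}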

> >  That would give us 1MB per 1GB
> > overhead. The semantics of the pin and unpin would make more sense then
> > too, both would take an IOVA range, only pinning would need a return
> > mechanism. For instance:
> > 
> > int pin_pages(void *iommu_data, dma_addr_t iova_base,
> > 	      int npage, unsigned long *pfn_base);
> > 
> > This would pin physically contiguous pages up to npage, returning the
> > base pfn and the number of pages pinned (<= npage).  The vendor
> > driver would make multiple calls to fill the necessary range.
> 
> 
> With the current patch, the input is a user_pfn[] array and npage:
> 
> int vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
>                    int npage, int prot, unsigned long *phys_pfn)
> 
> When the guest allocates memory with malloc(), the gfns would not be
> contiguous, right? These gfns (user_pfns) are passed as the argument
> here. Is there any case where we could get a pin/unpin request for
> contiguous pages?

It would depend on whether the user within the guest is actually
optimizing for hugepages.
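
To make the contiguous interface concrete, here's a hypothetical
vendor-driver loop against the pin_pages() prototype above and the
unpin_pages() one below (none of this is existing code): scattered gfns
degrade to one call per page, while hugepage-backed ranges collapse to
a handful of calls.

static int vendor_pin_range(void *iommu_data, dma_addr_t iova,
			    int npage, unsigned long *pfns)
{
	int done = 0, ret = 0, i;

	while (done < npage) {
		unsigned long pfn_base;

		ret = pin_pages(iommu_data,
				iova + ((dma_addr_t)done << PAGE_SHIFT),
				npage - done, &pfn_base);
		if (ret <= 0)
			goto unwind;

		/* pin_pages() pinned 'ret' physically contiguous pages */
		for (i = 0; i < ret; i++)
			pfns[done + i] = pfn_base + i;
		done += ret;
	}
	return 0;

unwind:
	unpin_pages(iommu_data, iova, done);
	return ret ? ret : -EFAULT;
}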

> > Unpin would then simply be:
> > 
> > void unpin_pages(void *iommu_data, dma_addr_t iova_base, int npage);
> > 
> > Hugepage usage would really make such an interface shine (ie. 2MB+
> > contiguous ranges).  A downside would be the overhead of getting the
> > group and container reference in vfio for each callback, perhaps we'd
> > need to figure out how the vendor driver could hold that reference.  
> 
> In the very initial phases of this proposal, I had suggested keeping a
> pointer to container->iommu_data in struct mdev_device, but that was
> discarded.

The referencing is definitely tricky; we run into the same problem
Alexey had in trying to do that, where holding a reference to the
container doesn't actually keep the container alive.  The user can
tear down the container or remove groups from the container, which
automatically removes the iommu backend.

> > The current API of passing around pfn arrays, further increasing the
> > overhead of the whole ecosystem, just makes me cringe, though.
> >   
> 
> ...
> 
> >> +		if (ret <= 0) {
> >> +			WARN_ON(!ret);
> >> +			goto pin_unwind;
> >> +		}
> >> +
> >> +		mutex_lock(&dma->addr_space->pfn_list_lock);
> >> +
> >> +		/* search if pfn exist */
> >> +		p = vfio_find_pfn(dma->addr_space, pfn[i]);
> >> +		if (p) {
> >> +			atomic_inc(&p->ref_count);  
> > 
> > We never test whether (p->prot == prot); shouldn't we be doing
> > something in that case?  In fact, why do we allow the side-channel
> > through the .{un}pin_pages to specify page protection flags that might
> > be different than the user specified for the DMA_MAP?  If the user
> > specified read-only, the vendor driver should not be allowed to
> > override with read-write access.
> >   
> 
> The user-specified protection flags for DMA_MAP could be
> prot = IOMMU_WRITE | IOMMU_READ;
> 
> but the vendor driver can request to pin a page read-only, i.e.
> IOMMU_READ. In that case, pinning the pages should be allowed, right?
> 
> Then the check should be if (p->prot & prot).

That doesn't solve the problem: we might have an existing pfn with
prot R while we're asking for RW; (R & RW) is true, but that's not what
we asked for.  I think the solution would be to test the prot requested
by the vendor driver against vfio_dma.prot, but always pin the pages
using the vfio_dma.prot flags.  Thus if the vendor driver asks for
read-only and the vfio_dma.prot is read-write, these are compatible and
the page is pinned read-write, since that's the user mapping.
Otherwise you need to save prot per page (which you already do), but
you're also going to need to handle promoting pages if multiple vendor
drivers ask for different prot attributes.
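
Roughly, as an illustration (requested_prot below stands for whatever
prot the vendor driver passed in; this isn't code from the patch):

	/* refuse a request wider than the user's DMA_MAP prot... */
	if ((requested_prot & dma->prot) != requested_prot)
		return -EPERM;

	/* ...but always pin with the user's prot, never the vendor's */
	ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);

Thanks,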

Alex
