From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: <pbonzini@redhat.com>, <kraxel@redhat.com>, <cjia@nvidia.com>,
	<qemu-devel@nongnu.org>, <kvm@vger.kernel.org>,
	<kevin.tian@intel.com>, <jike.song@intel.com>,
	<bjsdjshi@linux.vnet.ibm.com>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v11 10/22] vfio iommu type1: Add support for mediated devices
Date: Tue, 8 Nov 2016 10:05:39 -0700	[thread overview]
Message-ID: <20161108100539.4f03572e@t450s.home> (raw)
In-Reply-To: <f99b668b-e938-9fde-d166-cacfc7b07a53@nvidia.com>

On Tue, 8 Nov 2016 20:36:34 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/8/2016 4:46 AM, Alex Williamson wrote:
> > On Sat, 5 Nov 2016 02:40:44 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> ...
> 
> >> -static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >> +static int __vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
> >> +				    int prot, unsigned long *pfn_base,
> >> +				    bool do_accounting)
> >> +{
> >> +	struct task_struct *task = dma->task;
> >> +	unsigned long limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >> +	bool lock_cap = dma->mlock_cap;
> >> +	struct mm_struct *mm = dma->addr_space->mm;
> >> +	int ret;
> >> +	bool rsvd;
> >> +
> >> +	ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	rsvd = is_invalid_reserved_pfn(*pfn_base);
> >> +
> >> +	if (!rsvd && !lock_cap && mm->locked_vm + 1 > limit) {
> >> +		put_pfn(*pfn_base, prot);
> >> +		pr_warn("%s: Task %s (%d) RLIMIT_MEMLOCK (%ld) exceeded\n",
> >> +			__func__, task->comm, task_pid_nr(task),
> >> +			limit << PAGE_SHIFT);
> >> +		return -ENOMEM;
> >> +	}
> >> +
> >> +	if (!rsvd && do_accounting)
> >> +		vfio_lock_acct(mm, 1);
> >> +
> >> +	return 1;
> >> +}
> >> +
> >> +static void __vfio_unpin_page_external(struct vfio_addr_space *addr_space,
> >> +				       unsigned long pfn, int prot,
> >> +				       bool do_accounting)
> >> +{
> >> +	put_pfn(pfn, prot);
> >> +
> >> +	if (do_accounting)
> >> +		vfio_lock_acct(addr_space->mm, -1);  
> > 
> > Can't we batch this like we do elsewhere?  Intel folks, AIUI you intend
> > to pin all VM memory through this side channel, have you tested the
> > scalability and performance of this with larger VMs?  Our vfio_pfn
> > data structure alone is 40 bytes per pinned page, which means for
> > each 1GB of VM memory, we have 10MB worth of struct vfio_pfn!
> > Additionally, unmapping each 1GB of VM memory will result in 256k
> > separate vfio_lock_acct() callbacks.  I'm concerned that we're not
> > being efficient enough in either space or time.
> > 
> > One thought might be whether we really need to save the pfn; we should
> > always get the same result if we pin it again, or maybe we can just do
> > a lookup through the mm at that point without re-pinning.  Could we get
> > to the point where we only need an atomic_t ref count per page in a
> > linear array relative to the IOVA?
> 
> Ok. Is System RAM hot-plug supported? How is System RAM hot-plug
> handled? Are there DMA_MAP calls on such hot-plug for the additional
> range? If we have a linear array/memory, will we have to realloc it on
> memory hot-plug?

I was thinking a linear array for each IOVA page within a vfio_dma.
The array would track the number of references (pins) of each page.  It
might actually need to be a page table, given that a single vfio_dma
can nearly map the entire 64-bit address space.  I don't think RAM
hotplug is a factor here; we need to support and properly account for
multiple IOVAs mapping to the same pfn, but the typical case will be a
1:1 mapping, and I think that's what we'd optimize for.
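
For instance, a rough sketch of the refcounting half of that idea (the
pin_count field and helper names below are hypothetical, not from your
patch, and the flat array would need to become a page table to handle
sparse or very large mappings):

static atomic_t *vfio_alloc_pin_counts(struct vfio_dma *dma)
{
	/* 4 bytes per 4K page: the 1MB per 1GB overhead noted below */
	return vzalloc((dma->size >> PAGE_SHIFT) * sizeof(atomic_t));
}

static void vfio_pin_one(struct vfio_dma *dma, dma_addr_t iova)
{
	unsigned long idx = (iova - dma->iova) >> PAGE_SHIFT;

	/* account only on the 0 -> 1 transition; this could be batched */
	if (atomic_inc_return(&dma->pin_count[idx]) == 1)
		vfio_lock_acct(dma->addr_space->mm, 1);
}

static void vfio_unpin_one(struct vfio_dma *dma, dma_addr_t iova)
{
	unsigned long idx = (iova - dma->iova) >> PAGE_SHIFT;

	if (atomic_dec_and_test(&dma->pin_count[idx]))
		vfio_lock_acct(dma->addr_space->mm, -1);
}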

> >  That would give us 1MB per 1GB
> > overhead. The semantics of the pin and unpin would make more sense then
> > too, both would take an IOVA range, only pinning would need a return
> > mechanism. For instance:
> > 
> > int pin_pages(void *iommu_data, dma_addr_t iova_base,
> > 	      int npage, unsigned long *pfn_base);
> > 
> > This would pin physically contiguous pages up to npage, returning the
> > base pfn and the number of pages pinned (<= npage).  The vendor
> > driver would make multiple calls to fill the necessary range.
> 
> 
> With the current patch, the input is a user_pfn[] array and npage:
> 
> int vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
>                    int npage, int prot, unsigned long *phys_pfn)
> 
> When the guest allocates memory with malloc(), the gfns would not be
> contiguous, right? These gfns (user_pfns) are passed as the argument
> here. Is there any case where we could get a pin/unpin request for
> contiguous pages?

It would depend on whether the user within the guest is actually
optimizing for hugepages.
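
To make the contiguous interface concrete, here's a hypothetical
vendor-driver loop against the pin_pages() prototype above and the
unpin_pages() one below (none of this is existing code): scattered gfns
degrade to one call per page, while hugepage-backed ranges collapse to
a handful of calls.

static int vendor_pin_range(void *iommu_data, dma_addr_t iova,
			    int npage, unsigned long *pfns)
{
	int done = 0, ret = 0, i;

	while (done < npage) {
		unsigned long pfn_base;

		ret = pin_pages(iommu_data,
				iova + ((dma_addr_t)done << PAGE_SHIFT),
				npage - done, &pfn_base);
		if (ret <= 0)
			goto unwind;

		/* pin_pages() pinned 'ret' physically contiguous pages */
		for (i = 0; i < ret; i++)
			pfns[done + i] = pfn_base + i;
		done += ret;
	}
	return 0;

unwind:
	unpin_pages(iommu_data, iova, done);
	return ret ? ret : -EFAULT;
}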

> > Unpin would then simply be:
> > 
> > void unpin_pages(void *iommu_data, dma_addr_t iova_base, int npage);
> > 
> > Hugepage usage would really make such an interface shine (ie. 2MB+
> > contiguous ranges).  A downside would be the overhead of getting the
> > group and container reference in vfio for each callback, perhaps we'd
> > need to figure out how the vendor driver could hold that reference.  
> 
> In the very initial phases of this proposal, I had suggested keeping a
> pointer to container->iommu_data in struct mdev_device, but that was
> discarded.

The referencing is definitely tricky; we run into the same problem
Alexey had in trying to do that, where holding a reference to the
container doesn't actually keep the container alive.  The user can
tear down the container or remove groups from the container, which
automatically removes the iommu backend.

> > The current API of passing around pfn arrays, further increasing the
> > overhead of the whole ecosystem, just makes me cringe, though.
> >   
> 
> ...
> 
> >> +		if (ret <= 0) {
> >> +			WARN_ON(!ret);
> >> +			goto pin_unwind;
> >> +		}
> >> +
> >> +		mutex_lock(&dma->addr_space->pfn_list_lock);
> >> +
> >> +		/* search if pfn exist */
> >> +		p = vfio_find_pfn(dma->addr_space, pfn[i]);
> >> +		if (p) {
> >> +			atomic_inc(&p->ref_count);  
> > 
> > We never test whether (p->prot == prot); shouldn't we be doing
> > something in that case?  In fact, why do we allow the side-channel
> > through the .{un}pin_pages to specify page protection flags that might
> > be different than the user specified for the DMA_MAP?  If the user
> > specified read-only, the vendor driver should not be allowed to
> > override with read-write access.
> >   
> 
> The user-specified protection flags for DMA_MAP could be
> prot = IOMMU_WRITE | IOMMU_READ;
> 
> but the vendor driver can request to pin a page read-only, i.e.
> IOMMU_READ. In that case, pinning the pages should be allowed, right?
> 
> Then the check should be if (p->prot & prot).

That doesn't solve the problem: we might have an existing pfn with
prot R while we're asking for RW; (R & RW) is true, but that's not what
we asked for.  I think the solution would be to test the prot requested
by the vendor driver against vfio_dma.prot, but always pin the pages
using the vfio_dma.prot flags.  Thus if the vendor driver asks for
read-only and the vfio_dma.prot is read-write, these are compatible and
the page is pinned read-write, since that's the user mapping.
Otherwise you need to save prot per page (which you already do), but
you're also going to need to handle promoting pages if multiple vendor
drivers ask for different prot attributes.
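
Roughly, as an illustration (requested_prot below stands for whatever
prot the vendor driver passed in; this isn't code from the patch):

	/* refuse a request wider than the user's DMA_MAP prot... */
	if ((requested_prot & dma->prot) != requested_prot)
		return -EPERM;

	/* ...but always pin with the user's prot, never the vendor's */
	ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);

Thanks,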

Alex
