From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757073AbcJXCch (ORCPT ); Sun, 23 Oct 2016 22:32:37 -0400 Received: from mx1.redhat.com ([209.132.183.28]:35552 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756927AbcJXCcT (ORCPT ); Sun, 23 Oct 2016 22:32:19 -0400 Date: Sun, 23 Oct 2016 20:32:16 -0600 From: Alex Williamson To: Kirti Wankhede Cc: pbonzini@redhat.com, kraxel@redhat.com, cjia@nvidia.com, qemu-devel@nongnu.org, kvm@vger.kernel.org, kevin.tian@intel.com, jike.song@intel.com, bjsdjshi@linux.vnet.ibm.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices Message-ID: <20161023203216.70b2fde3@t450s.home> In-Reply-To: <3aee74c1-f984-74e3-502b-18b5e1ea6fc7@nvidia.com> References: <1476739332-4911-1-git-send-email-kwankhede@nvidia.com> <1476739332-4911-5-git-send-email-kwankhede@nvidia.com> <20161019150231.19e32b84@t450s.home> <3aee74c1-f984-74e3-502b-18b5e1ea6fc7@nvidia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.28]); Mon, 24 Oct 2016 02:32:18 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 21 Oct 2016 01:47:25 +0530 Kirti Wankhede wrote: > Alex, > > Addressing your comments other than the invalidation part. > > On 10/20/2016 2:32 AM, Alex Williamson wrote: > > On Tue, 18 Oct 2016 02:52:04 +0530 > > Kirti Wankhede wrote: > > > ... > >> Tested by assigning below combinations of devices to a single VM: > >> - GPU pass through only > >> - vGPU device only > >> - One GPU pass through and one vGPU device > >> - Linux VM hot plug and unplug vGPU device while GPU pass through device > >> exists > >> - Linux VM hot plug and unplug GPU pass through device while vGPU device > >> exists > > > > Were you able to do these with the locked memory limit of the user set > > to the minimum required for existing GPU assignment? > > > > No, is there a way to set the memory limit through libvirt so that it would > set the memory limit to the system memory assigned to the VM? Not that I know of, but I also don't know how you're making use of an mdev device through libvirt yet since they don't have support for the vfio-pci sysfsdev option. I would recommend testing with QEMU manually. > ... > >> + container = group->container; > >> + if (IS_ERR(container)) { > > > > I don't see that we ever use an ERR_PTR to set group->container, it > > should either be NULL or valid and the fact that we added ourselves to > > container_users should mean that it's valid. The paranoia test here > > would be if container is NULL, but IS_ERR() doesn't check NULL. If we > > need that paranoia test, maybe we should just: > > > > if (WARN_ON(!container)) { > > > > I'm not fully convinced it's needed though. > > > > Ok, removing this check. > > >> + ret = PTR_ERR(container); > >> + goto err_pin_pages; > >> + } > >> + > >> + down_read(&container->group_lock); > >> + > >> + driver = container->iommu_driver; > >> + if (likely(driver && driver->ops->pin_pages)) > >> + ret = driver->ops->pin_pages(container->iommu_data, user_pfn, > >> + npage, prot, phys_pfn); > > > > The caller is going to need to provide some means for us to call back to > > invalidate pinned pages. > > > > ret has already been used, so it's zero at this point. I expect the > > original intention was to let the initialization above fall through > > here so that the caller gets an errno if the driver doesn't support > > pin_pages.
Returning zero without actually doing anything seems like > > an unexpected return value. > > > > Yes, changing it to: > > driver = container->iommu_driver; > if (likely(driver && driver->ops->pin_pages)) > ret = driver->ops->pin_pages(container->iommu_data, user_pfn, > npage, prot, phys_pfn); > else > ret = -EINVAL; > > > > > >> +static int vfio_pfn_account(struct vfio_iommu *iommu, unsigned long pfn) > >> +{ > >> + struct vfio_pfn *p; > >> + struct vfio_domain *domain = iommu->local_domain; > >> + int ret = 1; > >> + > >> + if (!domain) > >> + return 1; > >> + > >> + mutex_lock(&domain->local_addr_space->pfn_list_lock); > >> + > >> + p = vfio_find_pfn(domain, pfn); > >> + if (p) > >> + ret = 0; > >> + > >> + mutex_unlock(&domain->local_addr_space->pfn_list_lock); > >> + return ret; > >> +} > > > > So if the vfio_pfn for a given pfn exists, return 0, else return 1. > > But do we know that the vfio_pfn exists at the point where we actually > > do that accounting? > > > > Only the below functions call vfio_pfn_account(): > __vfio_pin_pages_remote() -> vfio_pfn_account() > __vfio_unpin_pages_remote() -> vfio_pfn_account() > > Consider the case when an mdev device is already assigned to a VM: run some > app in the VM that pins some pages, then hotplug a pass through device. > Then __vfio_pin_pages_remote() is called when the iommu capable domain is > attached to the container to pin all pages from vfio_iommu_replay(). So if > at this time the vfio_pfn exists, it means that the page is pinned through > local_domain when the iommu capable domain was not present, so accounting > was already done for those pages. Hence 0 is returned here, which means don't > add this page in accounting. Right, I see that's the intention, I can't pick any holes in the concept, but I'll continue to try to look for bugs. > >> + > >> struct vwork { > >> struct mm_struct *mm; > >> long npage; > >> @@ -150,17 +269,17 @@ static void vfio_lock_acct_bg(struct work_struct *work) > >> kfree(vwork); > >> } > >> > >> -static void vfio_lock_acct(long npage) > >> +static void vfio_lock_acct(struct task_struct *task, long npage) > >> { > >> struct vwork *vwork; > >> struct mm_struct *mm; > >> > >> - if (!current->mm || !npage) > >> + if (!task->mm || !npage) > >> return; /* process exited or nothing to do */ > >> > >> - if (down_write_trylock(&current->mm->mmap_sem)) { > >> - current->mm->locked_vm += npage; > >> - up_write(&current->mm->mmap_sem); > >> + if (down_write_trylock(&task->mm->mmap_sem)) { > >> + task->mm->locked_vm += npage; > >> + up_write(&task->mm->mmap_sem); > >> return; > >> } > >> > >> @@ -172,7 +291,7 @@ static void vfio_lock_acct(long npage) > >> vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL); > >> if (!vwork) > >> return; > >> - mm = get_task_mm(current); > >> + mm = get_task_mm(task); > >> if (!mm) { > >> kfree(vwork); > >> return; > >> @@ -228,20 +347,31 @@ static int put_pfn(unsigned long pfn, int prot) > >> return 0; > >> } > > > > This conversion of vfio_lock_acct() to pass a task_struct and updating > > existing callers to pass current would be a great separate, easily > > review-able patch. > > > > Ok. I'll split this into a separate commit. > > > >> > >> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn) > >> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, > >> + int prot, unsigned long *pfn) > >> { > >> struct page *page[1]; > >> struct vm_area_struct *vma; > >> + struct mm_struct *local_mm = (mm ?
mm : current->mm); > >> int ret = -EFAULT; > >> > >> - if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) { > >> + if (mm) { > >> + down_read(&local_mm->mmap_sem); > >> + ret = get_user_pages_remote(NULL, local_mm, vaddr, 1, > >> + !!(prot & IOMMU_WRITE), 0, page, NULL); > >> + up_read(&local_mm->mmap_sem); > >> + } else > >> + ret = get_user_pages_fast(vaddr, 1, > >> + !!(prot & IOMMU_WRITE), page); > >> + > >> + if (ret == 1) { > >> *pfn = page_to_pfn(page[0]); > >> return 0; > >> } > >> > >> - down_read(¤t->mm->mmap_sem); > >> + down_read(&local_mm->mmap_sem); > >> > >> - vma = find_vma_intersection(current->mm, vaddr, vaddr + 1); > >> + vma = find_vma_intersection(local_mm, vaddr, vaddr + 1); > >> > >> if (vma && vma->vm_flags & VM_PFNMAP) { > >> *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; > >> @@ -249,7 +379,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn) > >> ret = 0; > >> } > >> > >> - up_read(¤t->mm->mmap_sem); > >> + up_read(&local_mm->mmap_sem); > >> > >> return ret; > >> } > > > > This would also be a great separate patch. > > Ok. > > > Have you considered > > renaming the mm_struct function arg to "remote_mm" and making the local > > variable simply "mm"? It seems like it would tie nicely with the > > remote_mm path using get_user_pages_remote() while passing NULL for > > remote_mm uses current->mm and the existing path (and avoid the general > > oddness of passing local_mm to a "remote" function). > > > > Yes, your suggestion looks good. Updating. > > > >> @@ -259,33 +389,37 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn) > >> * the iommu can only map chunks of consecutive pfns anyway, so get the > >> * first page and all consecutive pages with the same locking. > >> */ > >> -static long vfio_pin_pages(unsigned long vaddr, long npage, > >> - int prot, unsigned long *pfn_base) > >> +static long __vfio_pin_pages_remote(struct vfio_iommu *iommu, > >> + unsigned long vaddr, long npage, > >> + int prot, unsigned long *pfn_base) > > ... > > > >> @@ -303,8 +437,10 @@ static long vfio_pin_pages(unsigned long vaddr, long npage, > >> break; > >> } > >> > >> + lock_acct += vfio_pfn_account(iommu, pfn); > >> + > > > > I take it that this is the new technique for keeping the accounting > > accurate, we only increment the locked accounting by the amount not > > already pinned in a vfio_pfn. > > > > That's correct. > > > >> if (!rsvd && !lock_cap && > >> - current->mm->locked_vm + i + 1 > limit) { > >> + current->mm->locked_vm + lock_acct > limit) { > >> put_pfn(pfn, prot); > >> pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", > >> __func__, limit << PAGE_SHIFT); > >> @@ -313,23 +449,216 @@ static long vfio_pin_pages(unsigned long vaddr, long npage, > >> } > >> > >> if (!rsvd) > >> - vfio_lock_acct(i); > >> + vfio_lock_acct(current, lock_acct); > >> > >> return i; > >> } > >> > >> -static long vfio_unpin_pages(unsigned long pfn, long npage, > >> - int prot, bool do_accounting) > >> +static long __vfio_unpin_pages_remote(struct vfio_iommu *iommu, > >> + unsigned long pfn, long npage, int prot, > >> + bool do_accounting) > > > > Have you noticed that it's kind of confusing that > > __vfio_{un}pin_pages_remote() uses current, which does a > > get_user_pages_fast() while "local" uses a provided task_struct and > > uses get_user_pages_*remote*()? And also what was effectively local > > (ie. 
we're pinning for our own use here) is now "remote" and pinning > > for a remote, vendor driver consumer, is now "local". It's not very > > intuitive. > > > > 'local' in local_domain was suggested to describe the domain for local > page tracking. Earlier suggestions to have 'mdev' or 'noimmu' in this > name were discarded. Maybe we should revisit what the name should be. > Any suggestion? > > For local_domain, to pin pages, the flow is: > > for local_domain > |- vfio_pin_pages() > |- vfio_iommu_type1_pin_pages() > |- __vfio_pin_page_local() > |- vaddr_get_pfn(task->mm) > |- get_user_pages_remote() > > __vfio_pin_page_local() --> get_user_pages_remote() In vfio.c we have the concept of an external user, perhaps that could be continued here. An mdev driver would be doing an external, or remote, pinning. > >> static int vfio_iommu_type1_attach_group(void *iommu_data, > >> struct iommu_group *iommu_group) > >> { > >> struct vfio_iommu *iommu = iommu_data; > >> - struct vfio_group *group, *g; > >> + struct vfio_group *group; > >> struct vfio_domain *domain, *d; > >> struct bus_type *bus = NULL; > >> int ret; > >> @@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > >> mutex_lock(&iommu->lock); > >> > >> list_for_each_entry(d, &iommu->domain_list, next) { > >> - list_for_each_entry(g, &d->group_list, next) { > >> - if (g->iommu_group != iommu_group) > >> - continue; > >> + if (find_iommu_group(d, iommu_group)) { > >> + mutex_unlock(&iommu->lock); > >> + return -EINVAL; > >> + } > >> + } > > > > The find_iommu_group() conversion would also be an easy separate patch. > > > > Ok. > > >> > >> + if (iommu->local_domain) { > >> + if (find_iommu_group(iommu->local_domain, iommu_group)) { > >> mutex_unlock(&iommu->lock); > >> return -EINVAL; > >> } > >> @@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > >> if (ret) > >> goto out_free; > >> > >> + if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) && > >> + (bus == &mdev_bus_type)) { > >> + if (!iommu->local_domain) { > >> + domain->local_addr_space = > >> + kzalloc(sizeof(*domain->local_addr_space), > >> + GFP_KERNEL); > >> + if (!domain->local_addr_space) { > >> + ret = -ENOMEM; > >> + goto out_free; > >> + } > >> + > >> + domain->local_addr_space->task = current; > >> + INIT_LIST_HEAD(&domain->group_list); > >> + domain->local_addr_space->pfn_list = RB_ROOT; > >> + mutex_init(&domain->local_addr_space->pfn_list_lock); > >> + iommu->local_domain = domain; > >> + } else > >> + kfree(domain); > >> + > >> + list_add(&group->next, &domain->group_list); > > > > I think you mean s/domain/iommu->local_domain/ here, we just freed > > domain in the else path. > > > > Yes, corrected.
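For the archives, the corrected tail of that hunk would look roughly like this -- untested sketch only, with the field and helper names simply taken from the quoted patch in this series, not from anything upstream:

	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
	    (bus == &mdev_bus_type)) {
		if (!iommu->local_domain) {
			/* First mdev group: this domain becomes the local domain */
			domain->local_addr_space =
				kzalloc(sizeof(*domain->local_addr_space),
					GFP_KERNEL);
			if (!domain->local_addr_space) {
				ret = -ENOMEM;
				goto out_free;
			}

			domain->local_addr_space->task = current;
			INIT_LIST_HEAD(&domain->group_list);
			domain->local_addr_space->pfn_list = RB_ROOT;
			mutex_init(&domain->local_addr_space->pfn_list_lock);
			iommu->local_domain = domain;
		} else {
			/* Local domain already exists, drop the spare allocation */
			kfree(domain);
		}

		/*
		 * Always link through iommu->local_domain here; 'domain' may
		 * have just been freed in the else branch above.
		 */
		list_add(&group->next, &iommu->local_domain->group_list);
		mutex_unlock(&iommu->lock);
		return 0;
	}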
> > >> + mutex_unlock(&iommu->lock); > >> + return 0; > >> + } > >> + > >> domain->domain = iommu_domain_alloc(bus); > >> if (!domain->domain) { > >> ret = -EIO; > >> @@ -859,6 +1277,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu) > >> vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node)); > >> } > >> > >> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu) > >> +{ > >> + struct vfio_domain *domain = iommu->local_domain; > >> + struct vfio_dma *dma, *tdma; > >> + struct rb_node *n; > >> + long locked = 0; > >> + > >> + rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list, > >> + node) { > >> + vfio_unmap_unpin(iommu, dma); > >> + } > >> + > >> + mutex_lock(&domain->local_addr_space->pfn_list_lock); > >> + > >> + n = rb_first(&domain->local_addr_space->pfn_list); > >> + > >> + for (; n; n = rb_next(n)) > >> + locked++; > >> + > >> + vfio_lock_acct(domain->local_addr_space->task, locked); > >> + mutex_unlock(&domain->local_addr_space->pfn_list_lock); > >> +} > > > > Couldn't a properly timed mlock by the user allow them to lock more > > memory than they're allowed here? For instance, imagine the vendor > > driver has pinned the entire VM memory and the user has exactly the > > locked memory limit for that VM. During the gap here between unpinning > > the entire vfio_dma list and re-accounting for the pfn_list, the user > > can mlock up to their limit again and now they've doubled the locked > > memory they're allowed. > > > > As per the original code, vfio_unmap_unpin() calls > __vfio_unpin_pages_remote(.., false) with do_accounting set to false; > why is that so? Because vfio_dma tracks the user granularity of calling MAP_DMA, not the granularity with which the iommu mapping was actually done. There might be multiple non-contiguous chunks to make that mapping and we don't know how the iommu chose to map a given chunk to support large page sizes. If we chose to do accounting on the iommu_unmap() granularity, we might account for every 4k page separately. We choose not to do accounting there so that we can batch the accounting into one update per range. > Here, if accounting is set to true, then we don't have to do re-accounting > here. If vfio_unmap_unpin() did not do accounting, you could update accounting once with the difference between what was pinned and what remains pinned via the mdev and avoid the gap caused by de-accounting everything and then re-accounting only for the mdev pinnings.
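Roughly like this, as an untested sketch of that idea only -- it assumes vfio_unmap_unpin() is reworked to skip accounting and instead return the number of pages it de-accounts today, and it glosses over exactly how pages already counted via vfio_pfn_account() would factor into the arithmetic:

	static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
	{
		struct vfio_domain *domain = iommu->local_domain;
		struct vfio_dma *dma, *tdma;
		struct rb_node *n;
		long unlocked = 0, locked = 0;

		rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
						     node) {
			/* assumed: returns pages unpinned, does no accounting */
			unlocked += vfio_unmap_unpin(iommu, dma);
		}

		mutex_lock(&domain->local_addr_space->pfn_list_lock);
		for (n = rb_first(&domain->local_addr_space->pfn_list);
		     n; n = rb_next(n))
			locked++;
		mutex_unlock(&domain->local_addr_space->pfn_list_lock);

		/*
		 * One adjustment for the difference, rather than dropping the
		 * whole count and re-adding the mdev-pinned part, so there is
		 * no window where locked_vm is under-reported and the user can
		 * mlock past their limit.
		 */
		vfio_lock_acct(domain->local_addr_space->task, locked - unlocked);
	}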
> >> + > >> +static void vfio_local_unpin_all(struct vfio_domain *domain) > >> +{ > >> + struct rb_node *node; > >> + > >> + mutex_lock(&domain->local_addr_space->pfn_list_lock); > >> + while ((node = rb_first(&domain->local_addr_space->pfn_list))) > >> + vfio_unpin_pfn(domain, > >> + rb_entry(node, struct vfio_pfn, node), false); > >> + > >> + mutex_unlock(&domain->local_addr_space->pfn_list_lock); > >> +} > >> + > >> static void vfio_iommu_type1_detach_group(void *iommu_data, > >> struct iommu_group *iommu_group) > >> { > >> @@ -868,31 +1321,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data, > >> > >> mutex_lock(&iommu->lock); > >> > >> - list_for_each_entry(domain, &iommu->domain_list, next) { > >> - list_for_each_entry(group, &domain->group_list, next) { > >> - if (group->iommu_group != iommu_group) > >> - continue; > >> - > >> - iommu_detach_group(domain->domain, iommu_group); > >> + if (iommu->local_domain) { > >> + domain = iommu->local_domain; > >> + group = find_iommu_group(domain, iommu_group); > >> + if (group) { > >> list_del(&group->next); > >> kfree(group); > >> - /* > >> - * Group ownership provides privilege, if the group > >> - * list is empty, the domain goes away. If it's the > >> - * last domain, then all the mappings go away too. > >> - */ > >> + > >> if (list_empty(&domain->group_list)) { > >> - if (list_is_singular(&iommu->domain_list)) > >> + vfio_local_unpin_all(domain); > >> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) > >> vfio_iommu_unmap_unpin_all(iommu); > >> - iommu_domain_free(domain->domain); > >> - list_del(&domain->next); > >> kfree(domain); > >> + iommu->local_domain = NULL; > >> + } > > > > > > I can't quite wrap my head around this: if we have mdev groups attached > > and this iommu group matches an mdev group, remove from list and free > > the group. If there are now no more groups in the mdev group list, > > then for each vfio_pfn, unpin the pfn, /without/ doing accounting > > updates > > corrected the code to do accounting here. > > > and remove the vfio_pfn, but only if the ref_count is now > > zero. > > Yes. If you see the loop in vfio_local_unpin_all(), it iterates as long as a > node in the rb tree exists > > >> + while ((node = rb_first(&domain->local_addr_space->pfn_list))) > >> + vfio_unpin_pfn(domain, > >> + rb_entry(node, struct vfio_pfn, node), false); > >> + > > > and vfio_unpin_pfn() only removes the node from the rb tree if the ref count is > zero. > > static int vfio_unpin_pfn(struct vfio_domain *domain, > struct vfio_pfn *vpfn, bool do_accounting) > { > __vfio_unpin_page_local(domain, vpfn->pfn, vpfn->prot, > do_accounting); > > if (atomic_dec_and_test(&vpfn->ref_count)) > vfio_remove_from_pfn_list(domain, vpfn); > > return 1; > } > > So, for example, if a vfio_pfn's ref_count is 2, the first iteration would be: > - call __vfio_unpin_page_local() > - atomic_dec(ref_count), so now ref_count is 1, but node is not removed > from rb tree. > > In the next iteration: > - call __vfio_unpin_page_local() > - atomic_dec(ref_count), so now ref_count is 0, remove node from rb tree. Ok, I missed that, thanks. > > We free the domain, so if the ref_count was non-zero we've now > > just leaked memory. I think that means that if a vendor driver pins a > > given page twice, that leak occurs. Furthermore, if there is not an > > iommu capable domain in the container, we remove all the vfio_dma > > entries as well, ok. Maybe the only issue is those leaked vfio_pfns.
> > > > So if vendor driver pins a page twice, vfio_unpin_pfn() would get called > twice and only when ref count is zero that node is removed from rb tree. > So there is no memory leak. Ok
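Also, to close the loop on the vaddr_get_pfn() naming discussion above, an untested sketch of what the agreed "remote_mm" rename might look like; the GUP calls are exactly as in the quoted diff, nothing new is assumed beyond the rename and resetting ret before the VM_PFNMAP fallback:

	static int vaddr_get_pfn(struct mm_struct *remote_mm, unsigned long vaddr,
				 int prot, unsigned long *pfn)
	{
		struct page *page[1];
		struct vm_area_struct *vma;
		/* NULL remote_mm means "pin on behalf of current", as before */
		struct mm_struct *mm = remote_mm ? remote_mm : current->mm;
		int ret;

		if (remote_mm) {
			down_read(&mm->mmap_sem);
			ret = get_user_pages_remote(NULL, mm, vaddr, 1,
						    !!(prot & IOMMU_WRITE), 0,
						    page, NULL);
			up_read(&mm->mmap_sem);
		} else {
			ret = get_user_pages_fast(vaddr, 1,
						  !!(prot & IOMMU_WRITE), page);
		}

		if (ret == 1) {
			*pfn = page_to_pfn(page[0]);
			return 0;
		}

		ret = -EFAULT;

		down_read(&mm->mmap_sem);
		vma = find_vma_intersection(mm, vaddr, vaddr + 1);
		if (vma && vma->vm_flags & VM_PFNMAP) {
			*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
			if (is_invalid_reserved_pfn(*pfn))
				ret = 0;
		}
		up_read(&mm->mmap_sem);

		return ret;
	}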