From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757073AbcJXCch (ORCPT ); Sun, 23 Oct 2016 22:32:37 -0400 Received: from mx1.redhat.com ([209.132.183.28]:35552 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756927AbcJXCcT (ORCPT ); Sun, 23 Oct 2016 22:32:19 -0400 Date: Sun, 23 Oct 2016 20:32:16 -0600 From: Alex Williamson To: Kirti Wankhede Cc: pbonzini@redhat.com, kraxel@redhat.com, cjia@nvidia.com, qemu-devel@nongnu.org, kvm@vger.kernel.org, kevin.tian@intel.com, jike.song@intel.com, bjsdjshi@linux.vnet.ibm.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices Message-ID: <20161023203216.70b2fde3@t450s.home> In-Reply-To: <3aee74c1-f984-74e3-502b-18b5e1ea6fc7@nvidia.com> References: <1476739332-4911-1-git-send-email-kwankhede@nvidia.com> <1476739332-4911-5-git-send-email-kwankhede@nvidia.com> <20161019150231.19e32b84@t450s.home> <3aee74c1-f984-74e3-502b-18b5e1ea6fc7@nvidia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.28]); Mon, 24 Oct 2016 02:32:18 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 21 Oct 2016 01:47:25 +0530 Kirti Wankhede wrote: > Alex, > > Addressing your comments other than the invalidation part. > > On 10/20/2016 2:32 AM, Alex Williamson wrote: > > On Tue, 18 Oct 2016 02:52:04 +0530 > > Kirti Wankhede wrote: > > > ... > >> Tested by assigning below combinations of devices to a single VM: > >> - GPU pass through only > >> - vGPU device only > >> - One GPU pass through and one vGPU device > >> - Linux VM hot plug and unplug vGPU device while GPU pass through device > >> exists > >> - Linux VM hot plug and unplug GPU pass through device while vGPU device > >> exists > > > > Were you able to do these with the locked memory limit of the user set > > to the minimum required for existing GPU assignment? > > > > No, is there a way to set the memory limit through libvirt so that it would > set the memory limit to the system memory assigned to the VM? Not that I know of, but I also don't know how you're making use of an mdev device through libvirt yet since they don't have support for the vfio-pci sysfsdev option. I would recommend testing with QEMU manually. > ... > >> + container = group->container; > >> + if (IS_ERR(container)) { > > > > I don't see that we ever use an ERR_PTR to set group->container, it > > should either be NULL or valid and the fact that we added ourselves to > > container_users should mean that it's valid. The paranoia test here > > would be if container is NULL, but IS_ERR() doesn't check NULL. If we > > need that paranoia test, maybe we should just: > > > > if (WARN_ON(!container)) { > > > > I'm not fully convinced it's needed though. > > > > Ok, removing this check. > > >> + ret = PTR_ERR(container); > >> + goto err_pin_pages; > >> + } > >> + > >> + down_read(&container->group_lock); > >> + > >> + driver = container->iommu_driver; > >> + if (likely(driver && driver->ops->pin_pages)) > >> + ret = driver->ops->pin_pages(container->iommu_data, user_pfn, > >> + npage, prot, phys_pfn); > > > > The caller is going to need to provide some means for us to call back to > > invalidate pinned pages. > > > > ret has already been used, so it's zero at this point. I expect the > > original intention was to let the initialization above fall through > > here so that the caller gets an errno if the driver doesn't support > > pin_pages.
Returning zero without actually doing anything seems like > > an unexpected return value. > > > > Yes, changing it to: > > driver = container->iommu_driver; > if (likely(driver && driver->ops->pin_pages)) > ret = driver->ops->pin_pages(container->iommu_data, user_pfn, > npage, prot, phys_pfn); > else > ret = -EINVAL; > > > > > >> +static int vfio_pfn_account(struct vfio_iommu *iommu, unsigned long pfn) > >> +{ > >> + struct vfio_pfn *p; > >> + struct vfio_domain *domain = iommu->local_domain; > >> + int ret = 1; > >> + > >> + if (!domain) > >> + return 1; > >> + > >> + mutex_lock(&domain->local_addr_space->pfn_list_lock); > >> + > >> + p = vfio_find_pfn(domain, pfn); > >> + if (p) > >> + ret = 0; > >> + > >> + mutex_unlock(&domain->local_addr_space->pfn_list_lock); > >> + return ret; > >> +} > > > > So if the vfio_pfn for a given pfn exists, return 0, else return 1. > > But do we know that the vfio_pfn exists at the point where we actually > > do that accounting? > > > > Only the below functions call vfio_pfn_account(): > __vfio_pin_pages_remote() -> vfio_pfn_account() > __vfio_unpin_pages_remote() -> vfio_pfn_account() > > Consider the case when an mdev device is already assigned to a VM: run some > app in the VM that pins some pages, then hotplug a pass through device. > Then __vfio_pin_pages_remote() is called when the iommu capable domain is > attached to the container to pin all pages from vfio_iommu_replay(). So if > at this time the vfio_pfn exists, it means that the page is pinned through > local_domain when the iommu capable domain was not present, so accounting > was already done for those pages. Hence 0 is returned here, which means don't > add this page in accounting. Right, I see that's the intention, I can't pick any holes in the concept, but I'll continue to try to look for bugs. > >> + > >> struct vwork { > >> struct mm_struct *mm; > >> long npage; > >> @@ -150,17 +269,17 @@ static void vfio_lock_acct_bg(struct work_struct *work) > >> kfree(vwork); > >> } > >> > >> -static void vfio_lock_acct(long npage) > >> +static void vfio_lock_acct(struct task_struct *task, long npage) > >> { > >> struct vwork *vwork; > >> struct mm_struct *mm; > >> > >> - if (!current->mm || !npage) > >> + if (!task->mm || !npage) > >> return; /* process exited or nothing to do */ > >> > >> - if (down_write_trylock(&current->mm->mmap_sem)) { > >> - current->mm->locked_vm += npage; > >> - up_write(&current->mm->mmap_sem); > >> + if (down_write_trylock(&task->mm->mmap_sem)) { > >> + task->mm->locked_vm += npage; > >> + up_write(&task->mm->mmap_sem); > >> return; > >> } > >> > >> @@ -172,7 +291,7 @@ static void vfio_lock_acct(long npage) > >> vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL); > >> if (!vwork) > >> return; > >> - mm = get_task_mm(current); > >> + mm = get_task_mm(task); > >> if (!mm) { > >> kfree(vwork); > >> return; > >> @@ -228,20 +347,31 @@ static int put_pfn(unsigned long pfn, int prot) > >> return 0; > >> } > > > > This conversion of vfio_lock_acct() to pass a task_struct and updating > > existing callers to pass current would be a great separate, easily > > review-able patch. > > > > Ok. I'll split this into a separate commit. > > > >> > >> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn) > >> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, > >> + int prot, unsigned long *pfn) > >> { > >> struct page *page[1]; > >> struct vm_area_struct *vma; > >> + struct mm_struct *local_mm = (mm ?
mm : current->mm); > >> int ret = -EFAULT; > >> > >> - if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) { > >> + if (mm) { > >> + down_read(&local_mm->mmap_sem); > >> + ret = get_user_pages_remote(NULL, local_mm, vaddr, 1, > >> + !!(prot & IOMMU_WRITE), 0, page, NULL); > >> + up_read(&local_mm->mmap_sem); > >> + } else > >> + ret = get_user_pages_fast(vaddr, 1, > >> + !!(prot & IOMMU_WRITE), page); > >> + > >> + if (ret == 1) { > >> *pfn = page_to_pfn(page[0]); > >> return 0; > >> } > >> > >> - down_read(¤t->mm->mmap_sem); > >> + down_read(&local_mm->mmap_sem); > >> > >> - vma = find_vma_intersection(current->mm, vaddr, vaddr + 1); > >> + vma = find_vma_intersection(local_mm, vaddr, vaddr + 1); > >> > >> if (vma && vma->vm_flags & VM_PFNMAP) { > >> *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; > >> @@ -249,7 +379,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn) > >> ret = 0; > >> } > >> > >> - up_read(¤t->mm->mmap_sem); > >> + up_read(&local_mm->mmap_sem); > >> > >> return ret; > >> } > > > > This would also be a great separate patch. > > Ok. > > > Have you considered > > renaming the mm_struct function arg to "remote_mm" and making the local > > variable simply "mm"? It seems like it would tie nicely with the > > remote_mm path using get_user_pages_remote() while passing NULL for > > remote_mm uses current->mm and the existing path (and avoid the general > > oddness of passing local_mm to a "remote" function). > > > > Yes, your suggestion looks good. Updating. > > > >> @@ -259,33 +389,37 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn) > >> * the iommu can only map chunks of consecutive pfns anyway, so get the > >> * first page and all consecutive pages with the same locking. > >> */ > >> -static long vfio_pin_pages(unsigned long vaddr, long npage, > >> - int prot, unsigned long *pfn_base) > >> +static long __vfio_pin_pages_remote(struct vfio_iommu *iommu, > >> + unsigned long vaddr, long npage, > >> + int prot, unsigned long *pfn_base) > > ... > > > >> @@ -303,8 +437,10 @@ static long vfio_pin_pages(unsigned long vaddr, long npage, > >> break; > >> } > >> > >> + lock_acct += vfio_pfn_account(iommu, pfn); > >> + > > > > I take it that this is the new technique for keeping the accounting > > accurate, we only increment the locked accounting by the amount not > > already pinned in a vfio_pfn. > > > > That's correct. > > > >> if (!rsvd && !lock_cap && > >> - current->mm->locked_vm + i + 1 > limit) { > >> + current->mm->locked_vm + lock_acct > limit) { > >> put_pfn(pfn, prot); > >> pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", > >> __func__, limit << PAGE_SHIFT); > >> @@ -313,23 +449,216 @@ static long vfio_pin_pages(unsigned long vaddr, long npage, > >> } > >> > >> if (!rsvd) > >> - vfio_lock_acct(i); > >> + vfio_lock_acct(current, lock_acct); > >> > >> return i; > >> } > >> > >> -static long vfio_unpin_pages(unsigned long pfn, long npage, > >> - int prot, bool do_accounting) > >> +static long __vfio_unpin_pages_remote(struct vfio_iommu *iommu, > >> + unsigned long pfn, long npage, int prot, > >> + bool do_accounting) > > > > Have you noticed that it's kind of confusing that > > __vfio_{un}pin_pages_remote() uses current, which does a > > get_user_pages_fast() while "local" uses a provided task_struct and > > uses get_user_pages_*remote*()? And also what was effectively local > > (ie. 
we're pinning for our own use here) is now "remote" and pinning > > for a remote, vendor driver consumer, is now "local". It's not very > > intuitive. > > > > 'local' in local_domain was suggested to describe the domain for local > page tracking. Earlier suggestions to have 'mdev' or 'noimmu' in this > name were discarded. Maybe we should revisit what the name should be. > Any suggestion? > > For local_domain, to pin pages, the flow is: > > for local_domain > |- vfio_pin_pages() > |- vfio_iommu_type1_pin_pages() > |- __vfio_pin_page_local() > |- vaddr_get_pfn(task->mm) > |- get_user_pages_remote() > > __vfio_pin_page_local() --> get_user_pages_remote() In vfio.c we have the concept of an external user, perhaps that could be continued here. An mdev driver would be doing an external, or remote, pinning. > >> static int vfio_iommu_type1_attach_group(void *iommu_data, > >> struct iommu_group *iommu_group) > >> { > >> struct vfio_iommu *iommu = iommu_data; > >> - struct vfio_group *group, *g; > >> + struct vfio_group *group; > >> struct vfio_domain *domain, *d; > >> struct bus_type *bus = NULL; > >> int ret; > >> @@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > >> mutex_lock(&iommu->lock); > >> > >> list_for_each_entry(d, &iommu->domain_list, next) { > >> - list_for_each_entry(g, &d->group_list, next) { > >> - if (g->iommu_group != iommu_group) > >> - continue; > >> + if (find_iommu_group(d, iommu_group)) { > >> + mutex_unlock(&iommu->lock); > >> + return -EINVAL; > >> + } > >> + } > > > > The find_iommu_group() conversion would also be an easy separate patch. > > > > Ok. > > >> > >> + if (iommu->local_domain) { > >> + if (find_iommu_group(iommu->local_domain, iommu_group)) { > >> mutex_unlock(&iommu->lock); > >> return -EINVAL; > >> } > >> @@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > >> if (ret) > >> goto out_free; > >> > >> + if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) && > >> + (bus == &mdev_bus_type)) { > >> + if (!iommu->local_domain) { > >> + domain->local_addr_space = > >> + kzalloc(sizeof(*domain->local_addr_space), > >> + GFP_KERNEL); > >> + if (!domain->local_addr_space) { > >> + ret = -ENOMEM; > >> + goto out_free; > >> + } > >> + > >> + domain->local_addr_space->task = current; > >> + INIT_LIST_HEAD(&domain->group_list); > >> + domain->local_addr_space->pfn_list = RB_ROOT; > >> + mutex_init(&domain->local_addr_space->pfn_list_lock); > >> + iommu->local_domain = domain; > >> + } else > >> + kfree(domain); > >> + > >> + list_add(&group->next, &domain->group_list); > > > > I think you mean s/domain/iommu->local_domain/ here, we just freed > > domain in the else path. > > > > Yes, corrected.
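For the archives, the corrected tail of that hunk would look roughly like this -- untested sketch only, with the field and helper names simply taken from the quoted patch in this series, not from anything upstream:

	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
	    (bus == &mdev_bus_type)) {
		if (!iommu->local_domain) {
			/* First mdev group: this domain becomes the local domain */
			domain->local_addr_space =
				kzalloc(sizeof(*domain->local_addr_space),
					GFP_KERNEL);
			if (!domain->local_addr_space) {
				ret = -ENOMEM;
				goto out_free;
			}

			domain->local_addr_space->task = current;
			INIT_LIST_HEAD(&domain->group_list);
			domain->local_addr_space->pfn_list = RB_ROOT;
			mutex_init(&domain->local_addr_space->pfn_list_lock);
			iommu->local_domain = domain;
		} else {
			/* Local domain already exists, drop the spare allocation */
			kfree(domain);
		}

		/*
		 * Always link through iommu->local_domain here; 'domain' may
		 * have just been freed in the else branch above.
		 */
		list_add(&group->next, &iommu->local_domain->group_list);
		mutex_unlock(&iommu->lock);
		return 0;
	}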
> > >> + mutex_unlock(&iommu->lock); > >> + return 0; > >> + } > >> + > >> domain->domain = iommu_domain_alloc(bus); > >> if (!domain->domain) { > >> ret = -EIO; > >> @@ -859,6 +1277,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu) > >> vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node)); > >> } > >> > >> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu) > >> +{ > >> + struct vfio_domain *domain = iommu->local_domain; > >> + struct vfio_dma *dma, *tdma; > >> + struct rb_node *n; > >> + long locked = 0; > >> + > >> + rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list, > >> + node) { > >> + vfio_unmap_unpin(iommu, dma); > >> + } > >> + > >> + mutex_lock(&domain->local_addr_space->pfn_list_lock); > >> + > >> + n = rb_first(&domain->local_addr_space->pfn_list); > >> + > >> + for (; n; n = rb_next(n)) > >> + locked++; > >> + > >> + vfio_lock_acct(domain->local_addr_space->task, locked); > >> + mutex_unlock(&domain->local_addr_space->pfn_list_lock); > >> +} > > > > Couldn't a properly timed mlock by the user allow them to lock more > > memory than they're allowed here? For instance, imagine the vendor > > driver has pinned the entire VM memory and the user has exactly the > > locked memory limit for that VM. During the gap here between unpinning > > the entire vfio_dma list and re-accounting for the pfn_list, the user > > can mlock up to their limit again and now they've doubled the locked > > memory they're allowed. > > > > As per the original code, vfio_unmap_unpin() calls > __vfio_unpin_pages_remote(.., false) with do_accounting set to false; > why is that so? Because vfio_dma tracks the user granularity of calling MAP_DMA, not the granularity with which the iommu mapping was actually done. There might be multiple non-contiguous chunks to make that mapping and we don't know how the iommu chose to map a given chunk to support large page sizes. If we chose to do accounting on the iommu_unmap() granularity, we might account for every 4k page separately. We choose not to do accounting there so that we can batch the accounting into one update per range. > Here, if accounting is set to true, then we don't have to do re-accounting > here. If vfio_unmap_unpin() did not do accounting, you could update accounting once with the difference between what was pinned and what remains pinned via the mdev and avoid the gap caused by de-accounting everything and then re-accounting only for the mdev pinnings.
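Roughly like this, as an untested sketch of that idea only -- it assumes vfio_unmap_unpin() is reworked to skip accounting and instead return the number of pages it de-accounts today, and it glosses over exactly how pages already counted via vfio_pfn_account() would factor into the arithmetic:

	static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
	{
		struct vfio_domain *domain = iommu->local_domain;
		struct vfio_dma *dma, *tdma;
		struct rb_node *n;
		long unlocked = 0, locked = 0;

		rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
						     node) {
			/* assumed: returns pages unpinned, does no accounting */
			unlocked += vfio_unmap_unpin(iommu, dma);
		}

		mutex_lock(&domain->local_addr_space->pfn_list_lock);
		for (n = rb_first(&domain->local_addr_space->pfn_list);
		     n; n = rb_next(n))
			locked++;
		mutex_unlock(&domain->local_addr_space->pfn_list_lock);

		/*
		 * One adjustment for the difference, rather than dropping the
		 * whole count and re-adding the mdev-pinned part, so there is
		 * no window where locked_vm is under-reported and the user can
		 * mlock past their limit.
		 */
		vfio_lock_acct(domain->local_addr_space->task, locked - unlocked);
	}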
> >> + > >> +static void vfio_local_unpin_all(struct vfio_domain *domain) > >> +{ > >> + struct rb_node *node; > >> + > >> + mutex_lock(&domain->local_addr_space->pfn_list_lock); > >> + while ((node = rb_first(&domain->local_addr_space->pfn_list))) > >> + vfio_unpin_pfn(domain, > >> + rb_entry(node, struct vfio_pfn, node), false); > >> + > >> + mutex_unlock(&domain->local_addr_space->pfn_list_lock); > >> +} > >> + > >> static void vfio_iommu_type1_detach_group(void *iommu_data, > >> struct iommu_group *iommu_group) > >> { > >> @@ -868,31 +1321,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data, > >> > >> mutex_lock(&iommu->lock); > >> > >> - list_for_each_entry(domain, &iommu->domain_list, next) { > >> - list_for_each_entry(group, &domain->group_list, next) { > >> - if (group->iommu_group != iommu_group) > >> - continue; > >> - > >> - iommu_detach_group(domain->domain, iommu_group); > >> + if (iommu->local_domain) { > >> + domain = iommu->local_domain; > >> + group = find_iommu_group(domain, iommu_group); > >> + if (group) { > >> list_del(&group->next); > >> kfree(group); > >> - /* > >> - * Group ownership provides privilege, if the group > >> - * list is empty, the domain goes away. If it's the > >> - * last domain, then all the mappings go away too. > >> - */ > >> + > >> if (list_empty(&domain->group_list)) { > >> - if (list_is_singular(&iommu->domain_list)) > >> + vfio_local_unpin_all(domain); > >> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) > >> vfio_iommu_unmap_unpin_all(iommu); > >> - iommu_domain_free(domain->domain); > >> - list_del(&domain->next); > >> kfree(domain); > >> + iommu->local_domain = NULL; > >> + } > > > > > > I can't quite wrap my head around this: if we have mdev groups attached > > and this iommu group matches an mdev group, remove from list and free > > the group. If there are now no more groups in the mdev group list, > > then for each vfio_pfn, unpin the pfn, /without/ doing accounting > > updates > > corrected the code to do accounting here. > > > and remove the vfio_pfn, but only if the ref_count is now > > zero. > > Yes. If you see the loop in vfio_local_unpin_all(), it iterates as long as a > node in the rb tree exists > > >> + while ((node = rb_first(&domain->local_addr_space->pfn_list))) > >> + vfio_unpin_pfn(domain, > >> + rb_entry(node, struct vfio_pfn, node), false); > >> + > > > and vfio_unpin_pfn() only removes the node from the rb tree if the ref count is > zero. > > static int vfio_unpin_pfn(struct vfio_domain *domain, > struct vfio_pfn *vpfn, bool do_accounting) > { > __vfio_unpin_page_local(domain, vpfn->pfn, vpfn->prot, > do_accounting); > > if (atomic_dec_and_test(&vpfn->ref_count)) > vfio_remove_from_pfn_list(domain, vpfn); > > return 1; > } > > So, for example, if a vfio_pfn's ref_count is 2, the first iteration would be: > - call __vfio_unpin_page_local() > - atomic_dec(ref_count), so now ref_count is 1, but node is not removed > from rb tree. > > In the next iteration: > - call __vfio_unpin_page_local() > - atomic_dec(ref_count), so now ref_count is 0, remove node from rb tree. Ok, I missed that, thanks. > > We free the domain, so if the ref_count was non-zero we've now > > just leaked memory. I think that means that if a vendor driver pins a > > given page twice, that leak occurs. Furthermore, if there is not an > > iommu capable domain in the container, we remove all the vfio_dma > > entries as well, ok. Maybe the only issue is those leaked vfio_pfns.
> > > > So if vendor driver pins a page twice, vfio_unpin_pfn() would get called > twice and only when ref count is zero that node is removed from rb tree. > So there is no memory leak. Ok
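Also, to close the loop on the vaddr_get_pfn() naming discussion above, an untested sketch of what the agreed "remote_mm" rename might look like; the GUP calls are exactly as in the quoted diff, nothing new is assumed beyond the rename and resetting ret before the VM_PFNMAP fallback:

	static int vaddr_get_pfn(struct mm_struct *remote_mm, unsigned long vaddr,
				 int prot, unsigned long *pfn)
	{
		struct page *page[1];
		struct vm_area_struct *vma;
		/* NULL remote_mm means "pin on behalf of current", as before */
		struct mm_struct *mm = remote_mm ? remote_mm : current->mm;
		int ret;

		if (remote_mm) {
			down_read(&mm->mmap_sem);
			ret = get_user_pages_remote(NULL, mm, vaddr, 1,
						    !!(prot & IOMMU_WRITE), 0,
						    page, NULL);
			up_read(&mm->mmap_sem);
		} else {
			ret = get_user_pages_fast(vaddr, 1,
						  !!(prot & IOMMU_WRITE), page);
		}

		if (ret == 1) {
			*pfn = page_to_pfn(page[0]);
			return 0;
		}

		ret = -EFAULT;

		down_read(&mm->mmap_sem);
		vma = find_vma_intersection(mm, vaddr, vaddr + 1);
		if (vma && vma->vm_flags & VM_PFNMAP) {
			*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
			if (is_invalid_reserved_pfn(*pfn))
				ret = 0;
		}
		up_read(&mm->mmap_sem);

		return ret;
	}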