From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753619AbcKNPhD (ORCPT ); Mon, 14 Nov 2016 10:37:03 -0500 Received: from mx1.redhat.com ([209.132.183.28]:52250 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752246AbcKNPhC (ORCPT ); Mon, 14 Nov 2016 10:37:02 -0500 Date: Mon, 14 Nov 2016 08:37:00 -0700 From: Alex Williamson To: Kirti Wankhede Cc: , , , , , , , , Subject: Re: [PATCH v11 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP Message-ID: <20161114083700.7df5c8d3@t450s.home> In-Reply-To: <6f8c77c4-2155-d3ae-75ad-931364fba16b@nvidia.com> References: <1478293856-8191-1-git-send-email-kwankhede@nvidia.com> <1478293856-8191-12-git-send-email-kwankhede@nvidia.com> <20161107164544.57ab1a92@t450s.home> <20161108104619.6f76b918@t450s.home> <675f0d46-aa2b-7404-701e-d9ce17b64549@nvidia.com> <20161108142810.44c97dd2@t450s.home> <6f8c77c4-2155-d3ae-75ad-931364fba16b@nvidia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.38]); Mon, 14 Nov 2016 15:37:01 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 14 Nov 2016 13:22:08 +0530 Kirti Wankhede wrote: > On 11/9/2016 2:58 AM, Alex Williamson wrote: > > On Wed, 9 Nov 2016 01:29:19 +0530 > > Kirti Wankhede wrote: > > > >> On 11/8/2016 11:16 PM, Alex Williamson wrote: > >>> On Tue, 8 Nov 2016 21:56:29 +0530 > >>> Kirti Wankhede wrote: > >>> > >>>> On 11/8/2016 5:15 AM, Alex Williamson wrote: > >>>>> On Sat, 5 Nov 2016 02:40:45 +0530 > >>>>> Kirti Wankhede wrote: > >>>>> > >>>> ... > >>>>>> > >>>>>> +int vfio_register_notifier(struct device *dev, struct notifier_block *nb) > >>>>> > >>>>> Is the expectation here that this is a generic notifier for all > >>>>> vfio->mdev signaling? That should probably be made clear in the mdev > >>>>> API to avoid vendor drivers assuming their notifier callback only > >>>>> occurs for unmaps, even if that's currently the case. > >>>>> > >>>> > >>>> Ok. Adding comment about notifier callback in mdev_device which is part > >>>> of next patch. > >>>> > >>>> ... > >>>> > >>>>>> mutex_lock(&iommu->lock); > >>>>>> > >>>>>> - if (!iommu->external_domain) { > >>>>>> + /* Fail if notifier list is empty */ > >>>>>> + if ((!iommu->external_domain) || (!iommu->notifier.head)) { > >>>>>> ret = -EINVAL; > >>>>>> goto pin_done; > >>>>>> } > >>>>>> @@ -867,6 +870,11 @@ unlock: > >>>>>> /* Report how much was unmapped */ > >>>>>> unmap->size = unmapped; > >>>>>> > >>>>>> + if (unmapped && iommu->external_domain) > >>>>>> + blocking_notifier_call_chain(&iommu->notifier, > >>>>>> + VFIO_IOMMU_NOTIFY_DMA_UNMAP, > >>>>>> + unmap); > >>>>> > >>>>> This is after the fact, there's already a gap here where pages are > >>>>> unpinned and the mdev device is still running. > >>>> > >>>> Oh, there is a bug here, now unpin_pages() take user_pfn as argument and > >>>> find vfio_dma. If its not found, it doesn't unpin pages. We have to call > >>>> this notifier before vfio_remove_dma(). But if we call this before > >>>> vfio_remove_dma() there will be deadlock since iommu->lock is already > >>>> held here and vfio_iommu_type1_unpin_pages() will also try to hold > >>>> iommu->lock. > >>>> If we want to call blocking_notifier_call_chain() before > >>>> vfio_remove_dma(), sequence should be: > >>>> > >>>> unmapped += dma->size; > >>>> mutex_unlock(&iommu->lock); > >>>> if (iommu->external_domain)) { > >>>> struct vfio_iommu_type1_dma_unmap nb_unmap; > >>>> > >>>> nb_unmap.iova = dma->iova; > >>>> nb_unmap.size = dma->size; > >>>> blocking_notifier_call_chain(&iommu->notifier, > >>>> VFIO_IOMMU_NOTIFY_DMA_UNMAP, > >>>> &nb_unmap); > >>>> } > >>>> mutex_lock(&iommu->lock); > >>>> vfio_remove_dma(iommu, dma); > >>> > >>> It seems like it would be worthwhile to have the rb-tree rooted in the > >>> vfio-dma, then we only need to call the notifier if there are pages > >>> pinned within that vfio-dma (ie. the rb-tree is not empty). We can > >>> then release the lock call the notifier, re-acquire the lock, and > >>> BUG_ON if the rb-tree still is not empty. We might get duplicate pfns > >>> between separate vfio_dma structs, but as I mentioned in other replies, > >>> that seems like an exception that we don't need to optimize for. > >>> > >> > >> If we don't optimize for the case where iova from different vfio_dma are > >> mapped to same pfn and we would not consider this case for page > >> accounting then: > > > > Just to clarify, the current code (not handling mdevs) will pin and do > > page accounting per iova, regardless of whether the iova translates to a > > unique pfn. As long as we do no worse than that, I'm ok. > > > >> - have rb tree of pinned iova, where key would be iova, in each vfio_dma > >> structure. > >> - iova tracking structure would have iova and ref_count only. > >> - page accounting would only count number of iova's in rb_tree, case > >> where different iova could map to same pfn would not be considered in > >> this implementation for now. > >> - vfio_unpin_pages() would have user_pfn and pfn as input, we would > >> validate that iova exist in rb tree and trust vendor driver that > >> corresponding pfn is correct, there is no validation of pfn. If want > >> validate pfn, call GUP, verify pfn and call put_pfn(). > >> - In .release() or .detach_group() path, if there are entries in this rb > >> tree, call GUP again using that iova, get pfn and then call > >> put_pfn(pfn) for ref_count+1 times. This is because we are not keeping > >> pfn in our tracking logic. > > > > Wait a sec, if we detach a group from the container and it's not the > > last group in the container (which would trigger a release), we can't > > assume anything about which vfio_dma entries were associated with that > > device. The vendor driver, through the release of the device(s) within > > that group, needs to unpin. In a container release, we need to send a > > notifier to the vendor driver(s) to cause an unpin. This is the only > > mechanism we have to ensure that vendor drivers are not leaking > > references. If during the release, after the notifier, if any > > vfio_pfns remain, we need to BUG_ON, just like we need to do for any > > other DMA_UNMAP. > > > > Also, I'll say it again, I also don't like this API of passing around > > potentially giant arrays, and especially the API of relying on the > > vendor driver to tell us an arbitrary pfn to unpin. If we make the > > assumption that vendor drivers do not pin lots and lots of memory, > > perhaps we could use a struct vfio_pfn as: > > > > struct vfio_pfn { > > struct rb_node node; > > dma_addr_t iova; /* key */ > > unsigned long pfn; > > atomic_t ref_count; > > }; > > > > This puts us at 44-bytes per pfn, which isn't great, but I think it > > puts us in a better position with the API that we could make use of a > > page-table or sparse array in the future that would eliminate the > > rb_node and make the iova implicit in the location of the data > > structure. That would leave only the pfn and ref_count, which could > > potentially be combined into a single 8-byte field if we had per > > vfio_dma (or higher) locking to avoid the atomic_t (and we're happy that > > the reference count is always less than PAGE_SIZE, ie. we could fail > > pinning if we get to that point). > > > > Ok. > - I'll have above structure to track pinned pfn for now and a rb-tree in > vfio_dma structure that would keep track of pages pinned in that range, > dma->iova to dma->iova + dma->size. > - Key for pfn_list rb-tree would be iova, instead of pfn. > - Removing address space structure. vfio_dma keeps task structure, which > would be used to get mm structure (using get_task_mm(task) and > mmput(mm)) for pin/unpin and page accounting. > - vfio_unpin_pages() would have array of user_pfns as input argument, > instead of array of pfns. > - On vfio_pin_pages(), pinning would happen once. On later call to > vfio_pin_pages() with same user_pfn, if iova is found in pfn_list, only > ref_count would be incremented. > - In vfio_unpin_pages(), ref_count is decremented and page will be > unpinned when ref_count is 0. > - For vfio_pin_pages() and vfio_unpin_pages() input array, number of > elements in array should be less that PAGE_SIZE. If vendor driver wants > to use for more pages, array should be split it in chunks of PAGE_SIZE. Yes, this is what we discussed offline, the size of the arrays should never exceed PAGE_SIZE, therefore the number of entries should never exceed PAGE_SIZE/sizeof(pfn). The iommu driver should fault with -E2BIG if the vendor driver attempts to exceed this. > - Updating page accounting logic with above changes. Thanks, Alex