From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753911AbcKNH60 (ORCPT <rfc822;w@1wt.eu>);
        Mon, 14 Nov 2016 02:58:26 -0500
Received: from hqemgate14.nvidia.com ([216.228.121.143]:12499 "EHLO
        hqemgate14.nvidia.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1752250AbcKNH6Y (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 14 Nov 2016 02:58:24 -0500
X-PGP-Universal: processed;
        by hqpgpgate101.nvidia.com on Sun, 13 Nov 2016 23:52:19 -0800
Subject: Re: [PATCH v11 11/22] vfio iommu: Add blocking notifier to notify
 DMA_UNMAP
To: Alex Williamson <alex.williamson@redhat.com>
References: <1478293856-8191-1-git-send-email-kwankhede@nvidia.com>
 <1478293856-8191-12-git-send-email-kwankhede@nvidia.com>
 <20161107164544.57ab1a92@t450s.home>
 <a2e3a1dd-696f-2d2b-023d-ccc662e90e95@nvidia.com>
 <20161108104619.6f76b918@t450s.home>
 <675f0d46-aa2b-7404-701e-d9ce17b64549@nvidia.com>
 <20161108142810.44c97dd2@t450s.home>
CC: <pbonzini@redhat.com>, <kraxel@redhat.com>, <cjia@nvidia.com>,
        <qemu-devel@nongnu.org>, <kvm@vger.kernel.org>, <kevin.tian@intel.com>,
        <jike.song@intel.com>, <bjsdjshi@linux.vnet.ibm.com>,
        <linux-kernel@vger.kernel.org>
X-Nvconfidentiality: public
From: Kirti Wankhede <kwankhede@nvidia.com>
Message-ID: <6f8c77c4-2155-d3ae-75ad-931364fba16b@nvidia.com>
Date: Mon, 14 Nov 2016 13:22:08 +0530
MIME-Version: 1.0
In-Reply-To: <20161108142810.44c97dd2@t450s.home>
X-Originating-IP: [10.24.216.210]
X-ClientProxiedBy: DRUKMAIL102.nvidia.com (10.25.59.20) To
 bgmail102.nvidia.com (10.25.59.11)
Content-Type: text/plain; charset="windows-1252"
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


On 11/9/2016 2:58 AM, Alex Williamson wrote:
> On Wed, 9 Nov 2016 01:29:19 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/8/2016 11:16 PM, Alex Williamson wrote:
>>> On Tue, 8 Nov 2016 21:56:29 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 11/8/2016 5:15 AM, Alex Williamson wrote:  
>>>>> On Sat, 5 Nov 2016 02:40:45 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>> ...  
>>>>>>  
>>>>>> +int vfio_register_notifier(struct device *dev, struct notifier_block *nb)    
>>>>>
>>>>> Is the expectation here that this is a generic notifier for all
>>>>> vfio->mdev signaling?  That should probably be made clear in the mdev
>>>>> API to avoid vendor drivers assuming their notifier callback only
>>>>> occurs for unmaps, even if that's currently the case.
>>>>>     
>>>>
>>>> Ok. Adding comment about notifier callback in mdev_device which is part
>>>> of next patch.
>>>>
>>>> ...
>>>>  
>>>>>>  	mutex_lock(&iommu->lock);
>>>>>>  
>>>>>> -	if (!iommu->external_domain) {
>>>>>> +	/* Fail if notifier list is empty */
>>>>>> +	if ((!iommu->external_domain) || (!iommu->notifier.head)) {
>>>>>>  		ret = -EINVAL;
>>>>>>  		goto pin_done;
>>>>>>  	}
>>>>>> @@ -867,6 +870,11 @@ unlock:
>>>>>>  	/* Report how much was unmapped */
>>>>>>  	unmap->size = unmapped;
>>>>>>  
>>>>>> +	if (unmapped && iommu->external_domain)
>>>>>> +		blocking_notifier_call_chain(&iommu->notifier,
>>>>>> +					     VFIO_IOMMU_NOTIFY_DMA_UNMAP,
>>>>>> +					     unmap);    
>>>>>
>>>>> This is after the fact, there's already a gap here where pages are
>>>>> unpinned and the mdev device is still running.    
>>>>
>>>> Oh, there is a bug here: unpin_pages() now takes user_pfn as its argument
>>>> and looks up the vfio_dma. If it's not found, it doesn't unpin pages. We have to call
>>>> this notifier before vfio_remove_dma(). But if we call this before
>>>> vfio_remove_dma() there will be deadlock since iommu->lock is already
>>>> held here and vfio_iommu_type1_unpin_pages() will also try to hold
>>>> iommu->lock.
>>>> If we want to call blocking_notifier_call_chain() before
>>>> vfio_remove_dma(), sequence should be:
>>>>
>>>> unmapped += dma->size;
>>>> mutex_unlock(&iommu->lock);
>>>> if (iommu->external_domain) {
>>>> 	struct vfio_iommu_type1_dma_unmap nb_unmap;
>>>>
>>>> 	nb_unmap.iova = dma->iova;
>>>> 	nb_unmap.size = dma->size;
>>>> 	blocking_notifier_call_chain(&iommu->notifier,
>>>> 				     VFIO_IOMMU_NOTIFY_DMA_UNMAP,
>>>> 				     &nb_unmap);
>>>> }
>>>> mutex_lock(&iommu->lock);
>>>> vfio_remove_dma(iommu, dma);  
>>>
>>> It seems like it would be worthwhile to have the rb-tree rooted in the
>>> vfio-dma, then we only need to call the notifier if there are pages
>>> pinned within that vfio-dma (ie. the rb-tree is not empty).  We can
>>> then release the lock, call the notifier, re-acquire the lock, and
>>> BUG_ON if the rb-tree still is not empty.  We might get duplicate pfns
>>> between separate vfio_dma structs, but as I mentioned in other replies,
>>> that seems like an exception that we don't need to optimize for.
>>>   
>>
>> If we don't optimize for the case where iova from different vfio_dma are
>> mapped to same pfn and we would not consider this case for page
>> accounting then:
> 
> Just to clarify, the current code (not handling mdevs) will pin and do
> page accounting per iova, regardless of whether the iova translates to a
> unique pfn.  As long as we do no worse than that, I'm ok.
> 
>> - have rb tree of pinned iova, where key would be iova, in each vfio_dma
>> structure.
>> - iova tracking structure would have iova and ref_count only.
>> - page accounting would only count number of iova's in rb_tree, case
>> where different iova could map to same pfn would not be considered in
>> this implementation for now.
>> - vfio_unpin_pages() would have user_pfn and pfn as input, we would
>> validate that iova exist in rb tree and trust vendor driver that
>> corresponding pfn is correct, there is no validation of pfn. If we want to
>> validate pfn, we call GUP, verify pfn and call put_pfn().
>> - In .release() or .detach_group() path, if there are entries in this rb
>> tree, call GUP again using that iova, get pfn and then call
>> put_pfn(pfn) for ref_count+1 times. This is because we are not keeping
>> pfn in our tracking logic.
> 
> Wait a sec, if we detach a group from the container and it's not the
> last group in the container (which would trigger a release), we can't
> assume anything about which vfio_dma entries were associated with that
> device.  The vendor driver, through the release of the device(s) within
> that group, needs to unpin.  In a container release, we need to send a
> notifier to the vendor driver(s) to cause an unpin.  This is the only
> mechanism we have to ensure that vendor drivers are not leaking
> references.  If, during the release and after the notifier, any
> vfio_pfns remain, we need to BUG_ON, just like we need to do for any
> other DMA_UNMAP.
> 
> Also, I'll say it again, I also don't like this API of passing around
> potentially giant arrays, and especially the API of relying on the
> vendor driver to tell us an arbitrary pfn to unpin.  If we make the
> assumption that vendor drivers do not pin lots and lots of memory,
> perhaps we could use a struct vfio_pfn as:
> 
> struct vfio_pfn {
> 	struct rb_node		node;
> 	dma_addr_t		iova;		/* key */
> 	unsigned long		pfn;
> 	atomic_t		ref_count;
> };
> 
> This puts us at 44-bytes per pfn, which isn't great, but I think it
> puts us in a better position with the API that we could make use of a
> page-table or sparse array in the future that would eliminate the
> rb_node and make the iova implicit in the location of the data
> structure.  That would leave only the pfn and ref_count, which could
> potentially be combined into a single 8-byte field if we had per
> vfio_dma (or higher) locking to avoid the atomic_t (and we're happy that
> the reference count is always less than PAGE_SIZE, ie. we could fail
> pinning if we get to that point).
> 
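
To check my reading of the combined 8-byte field, here is a rough userspace
sketch, not kernel code. It assumes 4 KiB pages, so the low 12 bits can hold a
reference count that stays below PAGE_SIZE and the upper 52 bits hold the pfn.
All names (SKETCH_PAGE_SHIFT, packed_*) are made up for illustration, and the
caller is assumed to hold the per-vfio_dma lock, so nothing here is atomic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, so a ref_count below PAGE_SIZE fits in 12 bits. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_REF_MASK   ((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1)

/* Pack a freshly pinned pfn with an initial reference count of 1. */
static inline uint64_t packed_make(uint64_t pfn)
{
	return (pfn << SKETCH_PAGE_SHIFT) | 1;
}

static inline uint64_t packed_pfn(uint64_t v)
{
	return v >> SKETCH_PAGE_SHIFT;
}

static inline uint64_t packed_ref(uint64_t v)
{
	return v & SKETCH_REF_MASK;
}

/*
 * Take one more reference. Returns false when the count is already at its
 * maximum (PAGE_SIZE - 1), in which case the pin request should fail.
 */
static inline bool packed_get(uint64_t *v)
{
	if (packed_ref(*v) == SKETCH_REF_MASK)
		return false;
	(*v)++;
	return true;
}

/*
 * Drop one reference. Returns true when the last reference is gone and the
 * page can really be unpinned. The caller must hold at least one reference.
 */
static inline bool packed_put(uint64_t *v)
{
	assert(packed_ref(*v) > 0);
	(*v)--;
	return packed_ref(*v) == 0;
}
```

Since the count lives in the low bits, a plain increment or decrement never
touches the pfn bits as long as the bounds checks above hold.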

Ok.
- I'll use the above structure to track pinned pfns for now, with an rb-tree in
the vfio_dma structure that keeps track of the pages pinned in that range,
dma->iova to dma->iova + dma->size.
- Key for pfn_list rb-tree would be iova, instead of pfn.
- Removing address space structure. vfio_dma keeps task structure, which
would be used to get mm structure (using get_task_mm(task) and
mmput(mm)) for pin/unpin and page accounting.
- vfio_unpin_pages() would have array of user_pfns as input argument,
instead of array of pfns.
- On vfio_pin_pages(), pinning would happen once. On a later call to
vfio_pin_pages() with the same user_pfn, if the iova is found in pfn_list, only
ref_count would be incremented.
- In vfio_unpin_pages(), ref_count is decremented and the page is
unpinned when ref_count reaches 0.
- For the vfio_pin_pages() and vfio_unpin_pages() input arrays, the number of
elements should be less than PAGE_SIZE. If a vendor driver wants to pin more
pages, it should split the array into chunks of fewer than PAGE_SIZE elements.
- Updating page accounting logic with above changes.
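
The pin/unpin reference counting described in the list above could look
roughly like the userspace sketch below. It is only an illustration under
stated assumptions: a small fixed array stands in for the per-vfio_dma
rb-tree, sketch_fake_pin() stands in for the real page pinning, locking is
omitted, and every sketch_* name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PINNED 64	/* arbitrary capacity for the sketch */

/* One tracked page, keyed by iova. */
struct sketch_pfn {
	uint64_t iova;
	uint64_t pfn;
	unsigned int ref_count;
};

/* Stand-in for vfio_dma: an array plays the role of the rb-tree here. */
struct sketch_dma {
	struct sketch_pfn pinned[SKETCH_MAX_PINNED];
	size_t nr_pinned;	/* also the number of pages accounted */
};

static struct sketch_pfn *sketch_find(struct sketch_dma *dma, uint64_t iova)
{
	for (size_t i = 0; i < dma->nr_pinned; i++)
		if (dma->pinned[i].iova == iova)
			return &dma->pinned[i];
	return NULL;
}

/* Stand-in for real pinning: invents a pfn from the iova. */
static uint64_t sketch_fake_pin(uint64_t iova)
{
	return (iova >> 12) + 0x1000;
}

/* Pin once per iova; later calls for the same iova only bump ref_count. */
static int sketch_pin(struct sketch_dma *dma, uint64_t iova, uint64_t *pfn)
{
	struct sketch_pfn *p = sketch_find(dma, iova);

	if (p) {
		p->ref_count++;
		*pfn = p->pfn;
		return 0;
	}
	if (dma->nr_pinned == SKETCH_MAX_PINNED)
		return -1;

	p = &dma->pinned[dma->nr_pinned++];
	p->iova = iova;
	p->pfn = sketch_fake_pin(iova);
	p->ref_count = 1;
	*pfn = p->pfn;
	return 0;
}

/*
 * Returns 1 if the page was really unpinned (last reference dropped),
 * 0 if only ref_count was decremented, -1 if the iova is not tracked.
 */
static int sketch_unpin(struct sketch_dma *dma, uint64_t iova)
{
	struct sketch_pfn *p = sketch_find(dma, iova);

	if (!p)
		return -1;
	if (--p->ref_count)
		return 0;

	/* Drop the entry by moving the last one into its slot. */
	*p = dma->pinned[--dma->nr_pinned];
	return 1;
}
```

With this shape the unpin path needs only the iova, and any entries still
present after the DMA_UNMAP notifier returns would indicate a leaked reference.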

Thanks,
Kirti

> That would allow for the unpin call to not provide the pfn, so could we
> then look at whether we need the batching provided by the iova array at
> all?  I don't have a feel for the size of memory that gets pinned by
> the vendor driver, the frequency of pinning, or whether usage of
> hugepages for the guest is likely to translate into contiguous memory
> requests through this API.  What's your feeling?  Thanks,
>