linux-kernel.vger.kernel.org archive mirror
From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: <pbonzini@redhat.com>, <kraxel@redhat.com>, <cjia@nvidia.com>,
	<qemu-devel@nongnu.org>, <kvm@vger.kernel.org>,
	<kevin.tian@intel.com>, <jike.song@intel.com>,
	<bjsdjshi@linux.vnet.ibm.com>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v11 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP
Date: Mon, 14 Nov 2016 08:37:00 -0700
Message-ID: <20161114083700.7df5c8d3@t450s.home>
In-Reply-To: <6f8c77c4-2155-d3ae-75ad-931364fba16b@nvidia.com>

On Mon, 14 Nov 2016 13:22:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/9/2016 2:58 AM, Alex Williamson wrote:
> > On Wed, 9 Nov 2016 01:29:19 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 11/8/2016 11:16 PM, Alex Williamson wrote:  
> >>> On Tue, 8 Nov 2016 21:56:29 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 11/8/2016 5:15 AM, Alex Williamson wrote:    
> >>>>> On Sat, 5 Nov 2016 02:40:45 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>       
> >>>> ...    
> >>>>>>  
> >>>>>> +int vfio_register_notifier(struct device *dev, struct notifier_block *nb)      
> >>>>>
> >>>>> Is the expectation here that this is a generic notifier for all
> >>>>> vfio->mdev signaling?  That should probably be made clear in the mdev
> >>>>> API to avoid vendor drivers assuming their notifier callback only
> >>>>> occurs for unmaps, even if that's currently the case.
> >>>>>       
> >>>>
> >>>> Ok. Adding comment about notifier callback in mdev_device which is part
> >>>> of next patch.
> >>>>
> >>>> ...
> >>>>    
> >>>>>>  	mutex_lock(&iommu->lock);
> >>>>>>  
> >>>>>> -	if (!iommu->external_domain) {
> >>>>>> +	/* Fail if notifier list is empty */
> >>>>>> +	if ((!iommu->external_domain) || (!iommu->notifier.head)) {
> >>>>>>  		ret = -EINVAL;
> >>>>>>  		goto pin_done;
> >>>>>>  	}
> >>>>>> @@ -867,6 +870,11 @@ unlock:
> >>>>>>  	/* Report how much was unmapped */
> >>>>>>  	unmap->size = unmapped;
> >>>>>>  
> >>>>>> +	if (unmapped && iommu->external_domain)
> >>>>>> +		blocking_notifier_call_chain(&iommu->notifier,
> >>>>>> +					     VFIO_IOMMU_NOTIFY_DMA_UNMAP,
> >>>>>> +					     unmap);      
> >>>>>
> >>>>> This is after the fact, there's already a gap here where pages are
> >>>>> unpinned and the mdev device is still running.      
> >>>>
> >>>> Oh, there is a bug here: unpin_pages() now takes user_pfn as an argument
> >>>> and looks up the vfio_dma. If it's not found, it doesn't unpin pages. We
> >>>> have to call this notifier before vfio_remove_dma(). But if we call it
> >>>> before vfio_remove_dma() there will be a deadlock, since iommu->lock is
> >>>> already held here and vfio_iommu_type1_unpin_pages() will also try to
> >>>> take iommu->lock.
> >>>> If we want to call blocking_notifier_call_chain() before
> >>>> vfio_remove_dma(), sequence should be:
> >>>>
> >>>> unmapped += dma->size;
> >>>> mutex_unlock(&iommu->lock);
> >>>> if (iommu->external_domain) {
> >>>> 	struct vfio_iommu_type1_dma_unmap nb_unmap;
> >>>>
> >>>> 	nb_unmap.iova = dma->iova;
> >>>> 	nb_unmap.size = dma->size;
> >>>> 	blocking_notifier_call_chain(&iommu->notifier,
> >>>> 				     VFIO_IOMMU_NOTIFY_DMA_UNMAP,
> >>>> 				     &nb_unmap);
> >>>> }
> >>>> mutex_lock(&iommu->lock);
> >>>> vfio_remove_dma(iommu, dma);    
> >>>
> >>> It seems like it would be worthwhile to have the rb-tree rooted in the
> >>> vfio-dma, then we only need to call the notifier if there are pages
> >>> pinned within that vfio-dma (i.e. the rb-tree is not empty).  We can
> >>> then release the lock, call the notifier, re-acquire the lock, and
> >>> BUG_ON if the rb-tree is still not empty.  We might get duplicate pfns
> >>> between separate vfio_dma structs, but as I mentioned in other replies,
> >>> that seems like an exception that we don't need to optimize for.
> >>>     
> >>
> >> If we don't optimize for the case where iova from different vfio_dma are
> >> mapped to same pfn and we would not consider this case for page
> >> accounting then:  
> > 
> > Just to clarify, the current code (not handling mdevs) will pin and do
> > page accounting per iova, regardless of whether the iova translates to a
> > unique pfn.  As long as we do no worse than that, I'm ok.
> >   
> >> - have rb tree of pinned iova, where key would be iova, in each vfio_dma
> >> structure.
> >> - iova tracking structure would have iova and ref_count only.
> >> - page accounting would only count number of iova's in rb_tree, case
> >> where different iova could map to same pfn would not be considered in
> >> this implementation for now.
> >> - vfio_unpin_pages() would have user_pfn and pfn as input, we would
> >> validate that the iova exists in the rb tree and trust the vendor driver
> >> that the corresponding pfn is correct; there is no validation of pfn. If
> >> we want to validate the pfn, call GUP, verify the pfn and call put_pfn().
> >> - In .release() or .detach_group() path, if there are entries in this rb
> >> tree, call GUP again using that iova, get pfn and then call
> >> put_pfn(pfn) for ref_count+1 times. This is because we are not keeping
> >> pfn in our tracking logic.  
> > 
> > Wait a sec, if we detach a group from the container and it's not the
> > last group in the container (which would trigger a release), we can't
> > assume anything about which vfio_dma entries were associated with that
> > device.  The vendor driver, through the release of the device(s) within
> > that group, needs to unpin.  In a container release, we need to send a
> > notifier to the vendor driver(s) to cause an unpin.  This is the only
> > mechanism we have to ensure that vendor drivers are not leaking
> > references.  If any vfio_pfns remain during the release, after the
> > notifier has run, we need to BUG_ON, just like we need to do for any
> > other DMA_UNMAP.
> > 
> > Also, I'll say it again, I also don't like this API of passing around
> > potentially giant arrays, and especially the API of relying on the
> > vendor driver to tell us an arbitrary pfn to unpin.  If we make the
> > assumption that vendor drivers do not pin lots and lots of memory,
> > perhaps we could use a struct vfio_pfn as:
> > 
> > struct vfio_pfn {
> > 	struct rb_node		node;
> > 	dma_addr_t		iova;		/* key */
> > 	unsigned long		pfn;
> > 	atomic_t		ref_count;
> > };
> > 
> > This puts us at 44-bytes per pfn, which isn't great, but I think it
> > puts us in a better position with the API that we could make use of a
> > page-table or sparse array in the future that would eliminate the
> > rb_node and make the iova implicit in the location of the data
> > structure.  That would leave only the pfn and ref_count, which could
> > potentially be combined into a single 8-byte field if we had per
> > vfio_dma (or higher) locking to avoid the atomic_t (and we're happy that
> > the reference count is always less than PAGE_SIZE, ie. we could fail
> > pinning if we get to that point).
> >   
> 
> Ok.
> - I'll have above structure to track pinned pfn for now and a rb-tree in
> vfio_dma structure that would keep track of pages pinned in that range,
> dma->iova to dma->iova + dma->size.
> - Key for pfn_list rb-tree would be iova, instead of pfn.
> - Removing address space structure. vfio_dma keeps task structure, which
> would be used to get mm structure (using get_task_mm(task) and
> mmput(mm)) for pin/unpin and page accounting.
> - vfio_unpin_pages() would have array of user_pfns as input argument,
> instead of array of pfns.
> - On vfio_pin_pages(), pinning would happen once. On later call to
> vfio_pin_pages() with same user_pfn, if iova is found in pfn_list, only
> ref_count would be incremented.
> - In vfio_unpin_pages(), ref_count is decremented and page will be
> unpinned when ref_count is 0.
> - For vfio_pin_pages() and vfio_unpin_pages() input array, number of
> elements in array should be less than PAGE_SIZE. If vendor driver wants
> to use it for more pages, the array should be split into chunks of PAGE_SIZE.

Yes, this is what we discussed offline, the size of the arrays should
never exceed PAGE_SIZE, therefore the number of entries should never
exceed PAGE_SIZE/sizeof(pfn).  The iommu driver should fault with -E2BIG
if the vendor driver attempts to exceed this.
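
As a rough illustration of that limit, here is a minimal userspace sketch,
not the actual type1 code.  It assumes 4 KiB pages and that a pfn is an
unsigned long; SKETCH_PAGE_SIZE, SKETCH_MAX_ENTRIES and
sketch_check_npage() are hypothetical names made up for this example.

```c
#include <errno.h>
#include <stddef.h>

/* Assumption: 4 KiB pages.  In the kernel, PAGE_SIZE would be used. */
#define SKETCH_PAGE_SIZE	4096UL

/* Largest number of pfn entries whose array still fits in one page. */
#define SKETCH_MAX_ENTRIES	(SKETCH_PAGE_SIZE / sizeof(unsigned long))

/*
 * Hypothetical helper: returns 0 when the request fits within one page
 * worth of pfn entries, -E2BIG when it does not.
 */
static int sketch_check_npage(size_t npage)
{
	if (npage > SKETCH_MAX_ENTRIES)
		return -E2BIG;

	return 0;
}
```

A vendor driver with a larger request would be expected to split it into
several calls, each at or below the limit.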

> - Updating page accounting logic with above changes.
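
For reference, the pin-once / ref_count behaviour listed above could be
sketched in userspace roughly as follows.  This is only an illustration
under assumptions: a small fixed table stands in for the per-vfio_dma
rb-tree, there is no locking, the pfn is faked from the iova with a 4 KiB
page shift, and every sketch_* name is hypothetical.

```c
#include <stddef.h>

#define SKETCH_SLOTS	8

/* Simplified stand-in for struct vfio_pfn, without the rb_node. */
struct sketch_pfn {
	unsigned long	iova;		/* key */
	unsigned long	pfn;		/* fake translation, see sketch_pin() */
	unsigned int	ref_count;	/* 0 means the slot is unused */
};

static struct sketch_pfn sketch_table[SKETCH_SLOTS];
static unsigned int sketch_pinned_pages;	/* pages currently pinned */

static struct sketch_pfn *sketch_find(unsigned long iova)
{
	for (size_t i = 0; i < SKETCH_SLOTS; i++)
		if (sketch_table[i].ref_count && sketch_table[i].iova == iova)
			return &sketch_table[i];
	return NULL;
}

/* Returns the new reference count, or -1 if the table is full. */
static int sketch_pin(unsigned long iova)
{
	struct sketch_pfn *p = sketch_find(iova);

	if (p)				/* already pinned: only take a reference */
		return (int)++p->ref_count;

	for (size_t i = 0; i < SKETCH_SLOTS; i++) {
		if (!sketch_table[i].ref_count) {
			sketch_table[i].iova = iova;
			sketch_table[i].pfn = iova >> 12;	/* assumption: 4 KiB pages */
			sketch_table[i].ref_count = 1;
			sketch_pinned_pages++;	/* the real pin happens once */
			return 1;
		}
	}
	return -1;
}

/* Returns the remaining reference count, or -1 if the iova is not tracked. */
static int sketch_unpin(unsigned long iova)
{
	struct sketch_pfn *p = sketch_find(iova);

	if (!p)
		return -1;

	if (--p->ref_count == 0)	/* last reference: the page is unpinned */
		sketch_pinned_pages--;

	return (int)p->ref_count;
}
```

Pinning the same iova twice leaves one pinned page with a count of two, and
the page is only released on the second unpin.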

Thanks,
Alex
