linux-kernel.vger.kernel.org archive mirror
From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: <pbonzini@redhat.com>, <kraxel@redhat.com>, <cjia@nvidia.com>,
	<qemu-devel@nongnu.org>, <kvm@vger.kernel.org>,
	<kevin.tian@intel.com>, <jike.song@intel.com>,
	<bjsdjshi@linux.vnet.ibm.com>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v11 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP
Date: Tue, 8 Nov 2016 14:28:10 -0700
Message-ID: <20161108142810.44c97dd2@t450s.home>
In-Reply-To: <675f0d46-aa2b-7404-701e-d9ce17b64549@nvidia.com>

On Wed, 9 Nov 2016 01:29:19 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/8/2016 11:16 PM, Alex Williamson wrote:
> > On Tue, 8 Nov 2016 21:56:29 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 11/8/2016 5:15 AM, Alex Williamson wrote:  
> >>> On Sat, 5 Nov 2016 02:40:45 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >> ...  
> >>>>  
> >>>> +int vfio_register_notifier(struct device *dev, struct notifier_block *nb)    
> >>>
> >>> Is the expectation here that this is a generic notifier for all
> >>> vfio->mdev signaling?  That should probably be made clear in the mdev
> >>> API to avoid vendor drivers assuming their notifier callback only
> >>> occurs for unmaps, even if that's currently the case.
> >>>     
> >>
> >> Ok. Adding comment about notifier callback in mdev_device which is part
> >> of next patch.
> >>
> >> ...
> >>  
> >>>>  	mutex_lock(&iommu->lock);
> >>>>  
> >>>> -	if (!iommu->external_domain) {
> >>>> +	/* Fail if notifier list is empty */
> >>>> +	if ((!iommu->external_domain) || (!iommu->notifier.head)) {
> >>>>  		ret = -EINVAL;
> >>>>  		goto pin_done;
> >>>>  	}
> >>>> @@ -867,6 +870,11 @@ unlock:
> >>>>  	/* Report how much was unmapped */
> >>>>  	unmap->size = unmapped;
> >>>>  
> >>>> +	if (unmapped && iommu->external_domain)
> >>>> +		blocking_notifier_call_chain(&iommu->notifier,
> >>>> +					     VFIO_IOMMU_NOTIFY_DMA_UNMAP,
> >>>> +					     unmap);    
> >>>
> >>> This is after the fact, there's already a gap here where pages are
> >>> unpinned and the mdev device is still running.    
> >>
> >> Oh, there is a bug here: unpin_pages() now takes user_pfn as an argument
> >> and finds the vfio_dma. If it's not found, it doesn't unpin pages. We have to call
> >> this notifier before vfio_remove_dma(). But if we call this before
> >> vfio_remove_dma() there will be deadlock since iommu->lock is already
> >> held here and vfio_iommu_type1_unpin_pages() will also try to hold
> >> iommu->lock.
> >> If we want to call blocking_notifier_call_chain() before
> >> vfio_remove_dma(), sequence should be:
> >>
> >> unmapped += dma->size;
> >> mutex_unlock(&iommu->lock);
> >> if (iommu->external_domain) {
> >> 	struct vfio_iommu_type1_dma_unmap nb_unmap;
> >>
> >> 	nb_unmap.iova = dma->iova;
> >> 	nb_unmap.size = dma->size;
> >> 	blocking_notifier_call_chain(&iommu->notifier,
> >> 				     VFIO_IOMMU_NOTIFY_DMA_UNMAP,
> >> 				     &nb_unmap);
> >> }
> >> mutex_lock(&iommu->lock);
> >> vfio_remove_dma(iommu, dma);  
> > 
> > It seems like it would be worthwhile to have the rb-tree rooted in the
> > vfio-dma, then we only need to call the notifier if there are pages
> > pinned within that vfio-dma (ie. the rb-tree is not empty).  We can
> > then release the lock, call the notifier, re-acquire the lock, and
> > BUG_ON if the rb-tree still is not empty.  We might get duplicate pfns
> > between separate vfio_dma structs, but as I mentioned in other replies,
> > that seems like an exception that we don't need to optimize for.
> >   
> 
> If we don't optimize for the case where iova from different vfio_dma are
> mapped to same pfn and we would not consider this case for page
> accounting then:

Just to clarify, the current code (not handling mdevs) will pin and do
page accounting per iova, regardless of whether the iova translates to a
unique pfn.  As long as we do no worse than that, I'm ok.

> - have rb tree of pinned iova, where key would be iova, in each vfio_dma
> structure.
> - iova tracking structure would have iova and ref_count only.
> - page accounting would only count number of iova's in rb_tree, case
> where different iova could map to same pfn would not be considered in
> this implementation for now.
> - vfio_unpin_pages() would have user_pfn and pfn as input, we would
> validate that iova exists in rb tree and trust vendor driver that
> corresponding pfn is correct, there is no validation of pfn. If we want
> to validate pfn, call GUP, verify pfn and call put_pfn().
> - In .release() or .detach_group() path, if there are entries in this rb
> tree, call GUP again using that iova, get pfn and then call
> put_pfn(pfn) for ref_count+1 times. This is because we are not keeping
> pfn in our tracking logic.

Wait a sec, if we detach a group from the container and it's not the
last group in the container (which would trigger a release), we can't
assume anything about which vfio_dma entries were associated with that
device.  The vendor driver, through the release of the device(s) within
that group, needs to unpin.  In a container release, we need to send a
notifier to the vendor driver(s) to cause an unpin.  This is the only
mechanism we have to ensure that vendor drivers are not leaking
references.  If any vfio_pfns remain during the release, after the
notifier has run, we need to BUG_ON, just like we need to do for any
other DMA_UNMAP.
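
To make the ordering concrete, here is a rough userspace sketch of
"notify, then verify nothing is left".  Every name in it (sketch_dma,
sketch_unmap_dma, the callback type) is hypothetical and only stands in
for the kernel structures; the real path would drop and re-take
iommu->lock around the notifier and BUG_ON rather than return a count:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vfio_dma that tracks its own pins. */
struct sketch_dma {
	size_t pinned;	/* number of vfio_pfn entries still tracked */
};

/* Stand-in for the vendor driver's DMA_UNMAP notifier callback. */
typedef void (*sketch_unmap_cb)(struct sketch_dma *dma);

/* A well-behaved vendor driver drops every pin it holds in the range. */
static void sketch_good_vendor_cb(struct sketch_dma *dma)
{
	dma->pinned = 0;
}

/*
 * Unmap/release path: notify only if something is pinned, then check
 * that nothing remains.  Returns the number of leaked pins; kernel code
 * would BUG_ON a non-zero value instead of returning it.
 */
static size_t sketch_unmap_dma(struct sketch_dma *dma, sketch_unmap_cb cb)
{
	if (dma->pinned && cb) {
		/* The real code would release iommu->lock around this call. */
		cb(dma);
	}
	return dma->pinned;
}
```

A non-zero return here corresponds to the case where a vendor driver
ignored the notifier and is still holding references.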

Also, I'll say it again: I don't like this API of passing around
potentially giant arrays, and especially the API of relying on the
vendor driver to tell us an arbitrary pfn to unpin.  If we make the
assumption that vendor drivers do not pin lots and lots of memory,
perhaps we could use a struct vfio_pfn as:

struct vfio_pfn {
	struct rb_node		node;
	dma_addr_t		iova;		/* key */
	unsigned long		pfn;
	atomic_t		ref_count;
};

This puts us at 44 bytes per pfn, which isn't great, but I think it
puts us in a better position with the API that we could make use of a
page-table or sparse array in the future that would eliminate the
rb_node and make the iova implicit in the location of the data
structure.  That would leave only the pfn and ref_count, which could
potentially be combined into a single 8-byte field if we had per
vfio_dma (or higher) locking to avoid the atomic_t (and we're happy that
the reference count is always less than PAGE_SIZE, ie. we could fail
pinning if we get to that point).
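
As a rough illustration of that combined field, here is a userspace
sketch.  It assumes 4K pages and a 64-bit value, and the helper names
(vfio_pfn_pack and friends) are made up for this mail, not taken from
any posted patch.  The low PAGE_SHIFT bits hold the reference count and
the remaining bits hold the pfn:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, so the low 12 bits are free for a count. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PAGE_SIZE	(UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_REF_MASK		(SKETCH_PAGE_SIZE - 1)

/* Hypothetical helpers: pack a pfn and a count into one 8-byte field. */
static inline uint64_t vfio_pfn_pack(uint64_t pfn, uint64_t ref)
{
	return (pfn << SKETCH_PAGE_SHIFT) | (ref & SKETCH_REF_MASK);
}

static inline uint64_t vfio_pfn_unpack_pfn(uint64_t v)
{
	return v >> SKETCH_PAGE_SHIFT;
}

static inline uint64_t vfio_pfn_unpack_ref(uint64_t v)
{
	return v & SKETCH_REF_MASK;
}

/*
 * Take one more reference.  Returns 0 on success, or -1 if the count
 * would reach PAGE_SIZE, in which case the pin should fail.  The caller
 * is assumed to hold the per-vfio_dma lock, so no atomics are used.
 */
static inline int vfio_pfn_get(uint64_t *v)
{
	if (vfio_pfn_unpack_ref(*v) == SKETCH_REF_MASK)
		return -1;
	*v += 1;
	return 0;
}
```

The failure case in vfio_pfn_get() is the "fail pinning if we get to
that point" condition: the count never overflows into the pfn bits.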

That would allow for the unpin call to not provide the pfn, so could we
then look at whether we need the batching provided by the iova array at
all?  I don't have a feel for the size of memory that gets pinned by
the vendor driver, the frequency of pinning, or whether usage of
hugepages for the guest is likely to translate into contiguous memory
requests through this API.  What's your feeling?  Thanks,

Alex
