On Fri, May 13, 2016 at 04:24:53PM -0600, Alex Williamson wrote:
> On Fri, 13 May 2016 17:16:48 +1000
> Alexey Kardashevskiy wrote:
> 
> > On 05/06/2016 08:39 AM, Alex Williamson wrote:
> > > On Wed, 4 May 2016 16:52:13 +1000
> > > Alexey Kardashevskiy wrote:
> > > 
> > >> This postpones VFIO container deinitialization to let region_del()
> > >> callbacks (called via vfio_listener_release) do proper clean up
> > >> while the group is still attached to the container.
> > > 
> > > Any mappings within the container should clean themselves up when
> > > the container is deprivileged by removing the last group in the
> > > kernel.  Is the issue that that doesn't happen, which would be a
> > > spapr vfio kernel bug, or that our QEMU side structures get all
> > > out of whack if we let that happen?
> > 
> > My mailbase got corrupted, missed that.
> > 
> > This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory:
> > Notify IOMMU about starting/stopping being used by VFIO", I should
> > have put 01/19 and 02/19 right before 17/19, sorry about that.
> 
> Which I object to, it's just ridiculous to have vfio start/stop
> callbacks in a set of generic iommu region ops.

It's ugly, but I don't actually see a better way to do this (the
general concept of having vfio start/stop callbacks, that is, not the
specifics of the patches).

The fact is that how we implement the guest side IOMMU *does* need to
change depending on whether VFIO devices are present or not.  That's
due essentially to incompatibilities between a couple of kernel
mechanisms.  Which in itself is ugly, but nonetheless real.

A (usually blank) vfio on/off callback in the guest side IOMMU ops
seems like the least-bad way to handle this; there's a sketch of what
I mean below the quoted context.

> > Every reboot the spapr machine removes all (i.e. one or two) windows
> > and creates the default one.
> > 
> > I do this by memory_region_del_subregion(iommu_mr) +
> > memory_region_add_subregion(iommu_mr).  Which gets translated to
> > VFIO_IOMMU_SPAPR_TCE_REMOVE + VFIO_IOMMU_SPAPR_TCE_CREATE via
> > vfio_memory_listener if there is VFIO; no direct calls from spapr
> > to vfio => cool.  During the machine reset, the VFIO device is there
> > with the container and groups attached, at some point with no
> > windows.
> > 
> > Now to VFIO plug/unplug.
> > 
> > When VFIO plug happens, vfio_memory_listener is created,
> > region_add() is called, the hardware window is created (via
> > VFIO_IOMMU_SPAPR_TCE_CREATE).  Unplugging should end up doing
> > VFIO_IOMMU_SPAPR_TCE_REMOVE somehow.  If region_del() is not called
> > when the container is being destroyed (as before this patchset),
> > then the kernel cleans and destroys windows when close(container->fd)
> > is called or when qemu is killed (and this fd is naturally closed),
> > I hope this answers the comment from 02/19.
> > 
> > So far so good (right?)
> > 
> > However I also have a guest view of the TCE table, this is what the
> > guest sees and this is what emulated PCI devices use.  This guest
> > view is either allocated in the KVM (so H_PUT_TCE can be handled
> > quickly right in the host kernel, even in real mode) or userspace
> > (VFIO case).
> > 
> > I generally want the guest view to be in the KVM.  However when I
> > plug VFIO, I have to move the table to the userspace.  When I unplug
> > VFIO, I want to do the opposite so I need a way to tell spapr that
> > it can move the table.  region_del() seemed a natural way of doing
> > this as region_add() is already doing the opposite part.
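
Right, and that's exactly where a hook pair could live.  As promised
above, here's roughly the shape I have in mind, following the
direction of 17/19 - take the names and signatures as illustrative,
not as a settled API:

    /* Illustrative sketch only: guest-side IOMMU ops with optional
     * VFIO start/stop hooks; exact names/signatures are placeholders. */
    struct MemoryRegionIOMMUOps {
        /* Existing member: return a TLB entry for the given address. */
        IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr,
                                   bool is_write);

        /* New, optional (usually NULL) hooks, called when the first
         * VFIO user starts using this IOMMU MR and when the last one
         * stops.  For spapr these would move the guest view of the
         * TCE table from KVM to userspace and back again; type1-backed
         * IOMMUs would simply leave them NULL. */
        void (*vfio_start)(MemoryRegion *iommu);
        void (*vfio_stop)(MemoryRegion *iommu);
    };

Most IOMMU implementations never notice the hooks exist; only the ones
with a genuine VFIO/non-VFIO split in behaviour fill them in.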
> > With this patchset, each IOMMU MR gets a usage counter, region_add()
> > does +1, region_del() does -1 (yeah, not extremely optimal during
> > reset).  When the counter goes from 0 to 1, the vfio_start() hook is
> > called, when the counter becomes 0 - vfio_stop().  Note that we may
> > have multiple VFIO containers on the same PHB.
> > 
> > Without 01/19 and 02/19, I'll have to repeat region_del()'s counter
> > decrement steps in vfio_disconnect_container().  And I still cannot
> > move counting from region_add() to vfio_connect_container() so there
> > will be asymmetry which I am fine with, I am just checking here -
> > what would be the best approach?
> 
> You're imposing on other iommu models (type1) that in order to
> release a container we first deregister the listener, which un-plays
> all of the mappings within that region.  That's inefficient when we
> can simply unset the container and move on.  So you're imposing an
> inefficiency on a separate vfio iommu model for the bookkeeping of
> your own.  I don't think that's a reasonable approach.  Has it even
> been tested how that affects type1 users?  When a container is
> closed, clearly it shouldn't be contributing to reference counts, so
> it seems like there must be other ways to handle this.

My first guess is to agree, but I'll look at that more carefully when
I actually get to the patch doing that.

What I really don't understand about this one is what the
group<->container connection - an entirely host side matter - has to
do with the reference counting here, which is per *guest* side IOMMU.

> > >> Signed-off-by: Alexey Kardashevskiy
> > >> ---
> > >>  hw/vfio/common.c | 22 +++++++++++++++-------
> > >>  1 file changed, 15 insertions(+), 7 deletions(-)
> > >> 
> > >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > >> index fe5ec6a..0b40262 100644
> > >> --- a/hw/vfio/common.c
> > >> +++ b/hw/vfio/common.c
> > >> @@ -921,23 +921,31 @@ static void vfio_disconnect_container(VFIOGroup *group)
> > >>  {
> > >>      VFIOContainer *container = group->container;
> > >>  
> > >> -    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> > >> -        error_report("vfio: error disconnecting group %d from container",
> > >> -                     group->groupid);
> > >> -    }
> > >> -
> > >>      QLIST_REMOVE(group, container_next);
> > >> +
> > >> +    if (QLIST_EMPTY(&container->group_list)) {
> > >> +        VFIOGuestIOMMU *giommu;
> > >> +
> > >> +        vfio_listener_release(container);
> > >> +
> > >> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> > >> +            memory_region_unregister_iommu_notifier(&giommu->n);
> > >> +        }
> > >> +    }
> > >> +
> > >>      group->container = NULL;
> > >> +    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> > >> +        error_report("vfio: error disconnecting group %d from container",
> > >> +                     group->groupid);
> > >> +    }
> > >> 
> > >>      if (QLIST_EMPTY(&container->group_list)) {
> > >>          VFIOAddressSpace *space = container->space;
> > >>          VFIOGuestIOMMU *giommu, *tmp;
> > >>  
> > >> -        vfio_listener_release(container);
> > >>          QLIST_REMOVE(container, next);
> > >>  
> > >>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> > >> -            memory_region_unregister_iommu_notifier(&giommu->n);
> > >>              QLIST_REMOVE(giommu, giommu_next);
> > >>              g_free(giommu);
> > >>          }
> > > 
> > > I'm not spotting why this is a 2-pass process vs simply moving the
> > > existing QLIST_EMPTY cleanup above the ioctl.  Thanks,
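
For concreteness, this is (roughly) the single-pass variant I read
that suggestion as: fold the two QLIST_EMPTY blocks into one and do
the whole cleanup before the UNSET_CONTAINER ioctl.  An untested
sketch against the structures in the quoted patch, with the tail of
the container teardown elided:

    /* Untested sketch of a single-pass teardown: all per-container
     * cleanup happens in one QLIST_EMPTY block, before the last group
     * is detached.  The remaining container teardown (address space
     * release, close(container->fd), ...) is elided here. */
    static void vfio_disconnect_container(VFIOGroup *group)
    {
        VFIOContainer *container = group->container;

        QLIST_REMOVE(group, container_next);
        group->container = NULL;

        if (QLIST_EMPTY(&container->group_list)) {
            VFIOGuestIOMMU *giommu, *tmp;

            /* The group is still attached to the container here, so
             * region_del() callbacks can still reach the kernel. */
            vfio_listener_release(container);
            QLIST_REMOVE(container, next);

            QLIST_FOREACH_SAFE(giommu, &container->giommu_list,
                               giommu_next, tmp) {
                memory_region_unregister_iommu_notifier(&giommu->n);
                QLIST_REMOVE(giommu, giommu_next);
                g_free(giommu);
            }
            /* ... remaining container teardown elided ... */
        }

        if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
            error_report("vfio: error disconnecting group %d from container",
                         group->groupid);
        }
    }

Whether that's actually safe for spapr is exactly the question this
patch is circling around, so treat the above as a strawman, not a
counter-proposal.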
-- 
David Gibson                  | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
                              | _way_ _around_!
http://www.ozlabs.org/~dgibson