Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu

From: Avi Kivity <avi@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: aik@ozlabs.ru, qemu-devel@nongnu.org, kvm@vger.kernel.org
Subject: Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu
Date: Tue, 31 Jul 2012 15:34:38 +0300	[thread overview]
Message-ID: <5017D0DE.7090004@redhat.com> (raw)
In-Reply-To: <1343687399.8073.213.camel@ul30vt>

On 07/31/2012 01:29 AM, Alex Williamson wrote:
>> 
>> If the region size is zero, then both memory_region_del_subregion()
>> (assuming the region is parented) and munmap() do nothing.  So you could
>> call this unconditionally.
> 
> I suppose parenting them is the key.  I'm counting on memory_region_size
> of zero for an uninitialized, g_malloc0() MemoryRegion.

That's a no-no.  We have APIs for a reason.  Maybe I'll start encrypting
the contents by xoring with a private variable.

>  Initializing
> them just to have a parent so we can unconditionally remove them here
> seems like it's just shifting complexity from one function to another.
> The majority of BARs aren't even implemented, so we'd actually be
> setting up a lot of dummy infrastructure for a slightly cleaner unmap
> function.  I'll keep looking at this, but I'm not optimistic there's an
> overall simplification here.

Ok.  But use your own bool, don't overload an something from MemoryRegion.

>  
>> >> > +
>> >> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
>> >> > +        size = memory_region_size(&vdev->msix->mmap_mem);
>> >> > +        if (size) {
>> >> > +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
>> >> > +            munmap(vdev->msix->mmap, size);
>> >> > +        }
>> >> > +    }
>> > 
>> > And this one potentially unmaps the overlap after the vector table if
>> > there's any space for one.
>> > 
>> >> Are the three size checks needed? Everything should work without them
>> >> from the memory core point of view.
>> > 
>> > I haven't tried, but I strongly suspect I shouldn't be munmap'ing
>> > NULL... no?
>> 
>> NULL isn't the problem (well some kernels protect against mmaping NULL
>> to avoid kernel exploits), but it seems the kernel doesn't like a zero
>> length.
> 
> in mm/mmap.c:do_munmap() I see:
> 
>         if ((len = PAGE_ALIGN(len)) == 0)
>                 return -EINVAL;
> 
> Before anything scary happens, so that should be ok.  It's not really
> worthwhile to call the munmaps unconditionally if we already have the
> condition tests because the subregions are unparented though.

Yeah.

> 
>> >> > +
>> >> > +    /*
>> >> > +     * We can't mmap areas overlapping the MSIX vector table, so we
>> >> > +     * potentially insert a direct-mapped subregion before and after it.
>> >> > +     */
>> >> 
>> >> This splitting is what the memory core really enjoys.  You can just
>> >> place the MSIX page over the RAM page and let it do the cut-n-paste.
>> > 
>> > Sure, but VFIO won't allow us to mmap over the MSI-X table for security
>> > reasons.  It might be worthwhile to someday make VFIO insert an
>> > anonymous page over the MSI-X table to allow this, but it didn't look
>> > trivial for my novice mm abilities.  Easy to add a flag from the VFIO
>> > kernel structure where we learn about this BAR if we add it in the
>> > future.
>> 
>> I meant due it purely in qemu.  Instead of an emulated region overlaid
>> by two assigned regions, have an assigned region overlaid by the
>> emulated region.  The regions seen by the vfio listener will be the same.
> 
> Sure, that's what KVM device assignment does, but it requires being able
> to mmap the whole BAR, including an MSI-X table.  The VFIO kernel side
> can't assume userspace isn't malicious so it has to prevent this.

I wonder whether it should prevent the mmap(), or let it though and just
SIGBUS on accesses.

>> > 
>> > This is actually kind of complicated.  Opening /dev/vfio/vfio gives us
>> > an instance of a container in the kernel.  A group can only be attached
>> > to one container.  So whoever calls us with passed fds needs to track
>> > this very carefully.  This is also why I've dropped any kind of shared
>> > IOMMU option to give us a hint whether to try to cram everything in the
>> > same container (~= iommu domain).  It's too easy to pass conflicting
>> > info to share a container for one device, but not another... yet they
>> > may be in the same group.  I'll work on the fd passing though and try to
>> > come up with a reasonable model.
>> 
>> I didn't really follow the container stuff so I can't comment here.  But
>> suppose all assigned devices are done via fd passing, isn't it
>> sufficient to just pass the fd for the device (and keep the iommu group
>> fd in the managment tool)?
> 
> Nope.
> 
> containerfd = open(/dev/vfio/vfio)
> groupfd = open(/dev/vfio/$GROUPID)
> devicefd  = ioctl(groupfd, VFIO_GROUP_GET_DEVICE_FD)
> 
> The container provides access to the iommu, the group is the unit of
> ownership and privilege, and device cannot be accessed without iommu
> protection.  Therefore to get to a devicefd, we first need to privilege
> the container by attaching a group to it, that let's us initialize the
> iommu, which allows us to get the device fd.  At a minimum, we'd need
> both container and device fds, which means libvirt would be responsible
> for determining what type of iommu interface to initialize.  Doing that
> makes adding another device tenuous.  It's not impossible, but VFIO is
> design such that /dev/vfio/vfio is completely harmless on it's own, safe
> for mode 0666 access, just like /dev/kvm.  The groupfd is the important
> access point, so maybe it's sufficient that libvirt could pass only that
> and let qemu open /dev/vfio/vfio on it's own.  The only problem then is
> that libvirt needs to pass the same groupfd for each device that gets
> assigned within a group.

What I was thinking was that libvirt would do all the setup, including
attaching the iommu, then pass something that is safe to qemu.  I don't
see an issue with libvirt keeping tracks of groups; libvirt is supposed
to be doing the host-side management anyway.  But I'm not familiar with
the API so I guess it can't be done.  Maybe an extension?

> 
>> >> > +
>> >> > +
>> >> > +typedef struct MSIVector {
>> >> > +    EventNotifier interrupt; /* eventfd triggered on interrupt */
>> >> > +    struct VFIODevice *vdev; /* back pointer to device */
>> >> > +    int vector; /* the vector number for this element */
>> >> > +    int virq; /* KVM irqchip route for Qemu bypass */
>> >> 
>> >> This calls for an abstraction (don't we have a cache where we look those
>> >> up?)
>> > 
>> > I haven't see one, pointer?  I tried to follow vhost's lead here.
>> 
>> See kvm_irqchip_send_msi().  But this isn't integrated with irqfd yet.
> 
> Right, the irqfd is what we're really after.

Ok, I guess both vhost and vfio could use a qemu_irq_eventfd() which
creates an irqfd if available, or emulates it by adding a listener to
that eventfd and injecting the interrupt (either through tcg or kvm) itself.

>  
>> >> > +    bool use;
>> >> > +} MSIVector;
>> >> > +
>> >> > +
>> >> > +typedef struct VFIOContainer {
>> >> > +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> >> > +    struct {
>> >> > +        /* enable abstraction to support various iommu backends */
>> >> > +        union {
>> >> > +            MemoryListener listener; /* Used by type1 iommu */
>> >> > +        };
>> >> 
>> >> The usual was is to have a Type1VFIOContainer deriving from
>> >> VFIOContainer and adding a MemoryListener.
>> > 
>> > Yep, that would work too.  It gets a bit more complicated that way
>> > though because we have to know when the container is allocated what type
>> > it's going to be.  This way we can step though possible iommu types and
>> > support the right one.  Eventually there may be more than one type
>> > supported on the same platform (ex. one that enables PRI).  Do-able, but
>> > I'm not sure it's worth it at this point.
>> 
>> An alternative alternative is to put a pointer to an abstract type here,
>> then you can defer the decision on the concrete type later.  But I agree
>> it's not worth it at this point.  Maybe just drop the union and decide
>> later when a second iommu type is added.
> 
> A pointer doesn't allow us to use container_of to get back to the
> VFIOContainer from the memory listener callback, so we'd have to create
> some new struct just to hold that back pointer.  Alexey's proposed POWER
> support for VFIO already makes use of the union, so it seems like a
> sufficient solution for now.  We'll have to re-evaluate if it's getting
> unwieldy after we get a few though.

Ok.

-- 
error compiling committee.c: too many arguments to function