All of lore.kernel.org
 help / color / mirror / Atom feed
From: Avi Kivity <avi@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: aik@ozlabs.ru, qemu-devel@nongnu.org, kvm@vger.kernel.org
Subject: Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu
Date: Tue, 31 Jul 2012 15:34:38 +0300	[thread overview]
Message-ID: <5017D0DE.7090004@redhat.com> (raw)
In-Reply-To: <1343687399.8073.213.camel@ul30vt>

On 07/31/2012 01:29 AM, Alex Williamson wrote:
>> 
>> If the region size is zero, then both memory_region_del_subregion()
>> (assuming the region is parented) and munmap() do nothing.  So you could
>> call this unconditionally.
> 
> I suppose parenting them is the key.  I'm counting on memory_region_size
> of zero for an uninitialized, g_malloc0() MemoryRegion.

That's a no-no.  We have APIs for a reason.  Maybe I'll start encrypting
the contents by xoring with a private variable.

>  Initializing
> them just to have a parent so we can unconditionally remove them here
> seems like it's just shifting complexity from one function to another.
> The majority of BARs aren't even implemented, so we'd actually be
> setting up a lot of dummy infrastructure for a slightly cleaner unmap
> function.  I'll keep looking at this, but I'm not optimistic there's an
> overall simplification here.

Ok.  But use your own bool, don't overload an something from MemoryRegion.


>  
>> >> > +
>> >> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
>> >> > +        size = memory_region_size(&vdev->msix->mmap_mem);
>> >> > +        if (size) {
>> >> > +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
>> >> > +            munmap(vdev->msix->mmap, size);
>> >> > +        }
>> >> > +    }
>> > 
>> > And this one potentially unmaps the overlap after the vector table if
>> > there's any space for one.
>> > 
>> >> Are the three size checks needed? Everything should work without them
>> >> from the memory core point of view.
>> > 
>> > I haven't tried, but I strongly suspect I shouldn't be munmap'ing
>> > NULL... no?
>> 
>> NULL isn't the problem (well some kernels protect against mmaping NULL
>> to avoid kernel exploits), but it seems the kernel doesn't like a zero
>> length.
> 
> in mm/mmap.c:do_munmap() I see:
> 
>         if ((len = PAGE_ALIGN(len)) == 0)
>                 return -EINVAL;
> 
> Before anything scary happens, so that should be ok.  It's not really
> worthwhile to call the munmaps unconditionally if we already have the
> condition tests because the subregions are unparented though.

Yeah.

> 
>> >> > +
>> >> > +    /*
>> >> > +     * We can't mmap areas overlapping the MSIX vector table, so we
>> >> > +     * potentially insert a direct-mapped subregion before and after it.
>> >> > +     */
>> >> 
>> >> This splitting is what the memory core really enjoys.  You can just
>> >> place the MSIX page over the RAM page and let it do the cut-n-paste.
>> > 
>> > Sure, but VFIO won't allow us to mmap over the MSI-X table for security
>> > reasons.  It might be worthwhile to someday make VFIO insert an
>> > anonymous page over the MSI-X table to allow this, but it didn't look
>> > trivial for my novice mm abilities.  Easy to add a flag from the VFIO
>> > kernel structure where we learn about this BAR if we add it in the
>> > future.
>> 
>> I meant due it purely in qemu.  Instead of an emulated region overlaid
>> by two assigned regions, have an assigned region overlaid by the
>> emulated region.  The regions seen by the vfio listener will be the same.
> 
> Sure, that's what KVM device assignment does, but it requires being able
> to mmap the whole BAR, including an MSI-X table.  The VFIO kernel side
> can't assume userspace isn't malicious so it has to prevent this.

I wonder whether it should prevent the mmap(), or let it though and just
SIGBUS on accesses.

>> > 
>> > This is actually kind of complicated.  Opening /dev/vfio/vfio gives us
>> > an instance of a container in the kernel.  A group can only be attached
>> > to one container.  So whoever calls us with passed fds needs to track
>> > this very carefully.  This is also why I've dropped any kind of shared
>> > IOMMU option to give us a hint whether to try to cram everything in the
>> > same container (~= iommu domain).  It's too easy to pass conflicting
>> > info to share a container for one device, but not another... yet they
>> > may be in the same group.  I'll work on the fd passing though and try to
>> > come up with a reasonable model.
>> 
>> I didn't really follow the container stuff so I can't comment here.  But
>> suppose all assigned devices are done via fd passing, isn't it
>> sufficient to just pass the fd for the device (and keep the iommu group
>> fd in the managment tool)?
> 
> Nope.
> 
> containerfd = open(/dev/vfio/vfio)
> groupfd = open(/dev/vfio/$GROUPID)
> devicefd  = ioctl(groupfd, VFIO_GROUP_GET_DEVICE_FD)
> 
> The container provides access to the iommu, the group is the unit of
> ownership and privilege, and device cannot be accessed without iommu
> protection.  Therefore to get to a devicefd, we first need to privilege
> the container by attaching a group to it, that let's us initialize the
> iommu, which allows us to get the device fd.  At a minimum, we'd need
> both container and device fds, which means libvirt would be responsible
> for determining what type of iommu interface to initialize.  Doing that
> makes adding another device tenuous.  It's not impossible, but VFIO is
> design such that /dev/vfio/vfio is completely harmless on it's own, safe
> for mode 0666 access, just like /dev/kvm.  The groupfd is the important
> access point, so maybe it's sufficient that libvirt could pass only that
> and let qemu open /dev/vfio/vfio on it's own.  The only problem then is
> that libvirt needs to pass the same groupfd for each device that gets
> assigned within a group.

What I was thinking was that libvirt would do all the setup, including
attaching the iommu, then pass something that is safe to qemu.  I don't
see an issue with libvirt keeping tracks of groups; libvirt is supposed
to be doing the host-side management anyway.  But I'm not familiar with
the API so I guess it can't be done.  Maybe an extension?

> 
>> >> > +
>> >> > +
>> >> > +typedef struct MSIVector {
>> >> > +    EventNotifier interrupt; /* eventfd triggered on interrupt */
>> >> > +    struct VFIODevice *vdev; /* back pointer to device */
>> >> > +    int vector; /* the vector number for this element */
>> >> > +    int virq; /* KVM irqchip route for Qemu bypass */
>> >> 
>> >> This calls for an abstraction (don't we have a cache where we look those
>> >> up?)
>> > 
>> > I haven't see one, pointer?  I tried to follow vhost's lead here.
>> 
>> See kvm_irqchip_send_msi().  But this isn't integrated with irqfd yet.
> 
> Right, the irqfd is what we're really after.

Ok, I guess both vhost and vfio could use a qemu_irq_eventfd() which
creates an irqfd if available, or emulates it by adding a listener to
that eventfd and injecting the interrupt (either through tcg or kvm) itself.

>  
>> >> > +    bool use;
>> >> > +} MSIVector;
>> >> > +
>> >> > +
>> >> > +typedef struct VFIOContainer {
>> >> > +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> >> > +    struct {
>> >> > +        /* enable abstraction to support various iommu backends */
>> >> > +        union {
>> >> > +            MemoryListener listener; /* Used by type1 iommu */
>> >> > +        };
>> >> 
>> >> The usual was is to have a Type1VFIOContainer deriving from
>> >> VFIOContainer and adding a MemoryListener.
>> > 
>> > Yep, that would work too.  It gets a bit more complicated that way
>> > though because we have to know when the container is allocated what type
>> > it's going to be.  This way we can step though possible iommu types and
>> > support the right one.  Eventually there may be more than one type
>> > supported on the same platform (ex. one that enables PRI).  Do-able, but
>> > I'm not sure it's worth it at this point.
>> 
>> An alternative alternative is to put a pointer to an abstract type here,
>> then you can defer the decision on the concrete type later.  But I agree
>> it's not worth it at this point.  Maybe just drop the union and decide
>> later when a second iommu type is added.
> 
> A pointer doesn't allow us to use container_of to get back to the
> VFIOContainer from the memory listener callback, so we'd have to create
> some new struct just to hold that back pointer.  Alexey's proposed POWER
> support for VFIO already makes use of the union, so it seems like a
> sufficient solution for now.  We'll have to re-evaluate if it's getting
> unwieldy after we get a few though.

Ok.

-- 
error compiling committee.c: too many arguments to function



  reply	other threads:[~2012-07-31 12:34 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-07-25 17:03 [RFC PATCH] vfio: VFIO PCI driver for Qemu Alex Williamson
2012-07-25 17:03 ` [Qemu-devel] " Alex Williamson
2012-07-25 19:30 ` Avi Kivity
2012-07-25 19:30   ` [Qemu-devel] " Avi Kivity
2012-07-25 19:53   ` Alex Williamson
2012-07-25 19:53     ` [Qemu-devel] " Alex Williamson
2012-07-26  8:35     ` Avi Kivity
2012-07-26  8:35       ` [Qemu-devel] " Avi Kivity
2012-07-26  9:28       ` Andreas Hartmann
2012-07-26  9:40         ` Avi Kivity
2012-07-26 13:45           ` Andreas Hartmann
2012-07-26 14:56       ` Alex Williamson
2012-07-26 14:56         ` [Qemu-devel] " Alex Williamson
2012-07-26 15:59         ` Avi Kivity
2012-07-26 15:59           ` [Qemu-devel] " Avi Kivity
2012-07-26 16:33           ` Alex Williamson
2012-07-26 16:33             ` [Qemu-devel] " Alex Williamson
2012-07-26 16:40             ` Avi Kivity
2012-07-26 16:40               ` [Qemu-devel] " Avi Kivity
2012-07-26 19:11               ` Alex Williamson
2012-07-26 21:23                 ` Andreas Hartmann
2012-07-26 21:37                   ` Alex Williamson
2012-07-26 16:06         ` Avi Kivity
2012-07-26 16:06           ` [Qemu-devel] " Avi Kivity
2012-07-26 16:40           ` Alex Williamson
2012-07-26 16:40             ` [Qemu-devel] " Alex Williamson
2012-07-26 16:47             ` Avi Kivity
2012-07-26 16:47               ` [Qemu-devel] " Avi Kivity
2012-07-26 15:09 ` Alex Williamson
2012-07-26 15:09   ` [Qemu-devel] " Alex Williamson
2012-07-26 16:34 ` Avi Kivity
2012-07-26 16:34   ` [Qemu-devel] " Avi Kivity
2012-07-26 17:40   ` Alex Williamson
2012-07-26 17:40     ` Alex Williamson
2012-07-29 13:47     ` Avi Kivity
2012-07-29 13:47       ` [Qemu-devel] " Avi Kivity
2012-07-30 22:29       ` Alex Williamson
2012-07-30 22:29         ` Alex Williamson
2012-07-31 12:34         ` Avi Kivity [this message]
2012-07-31 16:56           ` Alex Williamson
2012-07-31 16:56             ` [Qemu-devel] " Alex Williamson
2012-07-27 19:22 ` Blue Swirl
2012-07-27 19:22   ` Blue Swirl
2012-07-27 20:28   ` Alex Williamson
2012-07-27 20:28     ` Alex Williamson
2012-07-28  2:55   ` Alexey Kardashevskiy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5017D0DE.7090004@redhat.com \
    --to=avi@redhat.com \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.