All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alex Williamson <alex.williamson@redhat.com>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>,
	Joerg Roedel <joerg.roedel@amd.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] iommu: making IOMMU sysfs nodes API public
Date: Mon, 18 Mar 2013 20:40:44 -0600	[thread overview]
Message-ID: <1363660844.24132.452.camel@bling.home> (raw)
In-Reply-To: <51468FC9.2030203@ozlabs.ru>

On Mon, 2013-03-18 at 14:53 +1100, Alexey Kardashevskiy wrote:
> On 20/02/13 15:33, Alex Williamson wrote:
> > On Wed, 2013-02-20 at 15:20 +1100, Alexey Kardashevskiy wrote:
> >> On 20/02/13 14:47, Alex Williamson wrote:
> >>> On Wed, 2013-02-20 at 13:31 +1100, Alexey Kardashevskiy wrote:
> >>>> On 20/02/13 07:11, Alex Williamson wrote:
> >>>>> On Tue, 2013-02-19 at 18:38 +1100, David Gibson wrote:
> >>>>>> On Mon, Feb 18, 2013 at 10:24:00PM -0700, Alex Williamson wrote:
> >>>>>>> On Mon, 2013-02-18 at 17:15 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 13/02/13 04:15, Alex Williamson wrote:
> >>>>>>>>> On Wed, 2013-02-13 at 01:42 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>> On 12/02/13 16:07, Alex Williamson wrote:
> >>>>>>>>>>> On Tue, 2013-02-12 at 15:06 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>>>> Having this patch in a tree, adding new nodes in sysfs
> >>>>>>>>>>>> for IOMMU groups is going to be easier.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The first candidate for this change is a "dma-window-size"
> >>>>>>>>>>>> property which tells a size of a DMA window of the specific
> >>>>>>>>>>>> IOMMU group which can be used later for locked pages accounting.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm still churning on this one; I'm nervous this would basically creat
> >>>>>>>>>>> a /proc free-for-all under /sys/kernel/iommu_group/$GROUP/ where any
> >>>>>>>>>>> iommu driver can add random attributes.  That can get ugly for
> >>>>>>>>>>> userspace.
> >>>>>>>>>>
> >>>>>>>>>> Is not it exactly what sysfs is for (unlike /proc)? :)
> >>>>>>>>>
> >>>>>>>>> Um, I hope it's a little more thought out than /proc.
> >>>>>>>>>
> >>>>>>>>>>> On the other hand, for the application of userspace knowing how much
> >>>>>>>>>>> memory to lock for vfio use of a group, it's an appealing location to
> >>>>>>>>>>> get that information.  Something like libvirt would already be poking
> >>>>>>>>>>> around here to figure out which devices to bind.  Page limits need to be
> >>>>>>>>>>> setup prior to use through vfio, so sysfs is more convenient than
> >>>>>>>>>>> through vfio ioctls.
> >>>>>>>>>>
> >>>>>>>>>> True. DMA window properties do not change since boot so sysfs is the right
> >>>>>>>>>> place to expose them.
> >>>>>>>>>>
> >>>>>>>>>>> But then is dma-window-size just a vfio requirement leaking over into
> >>>>>>>>>>> iommu groups?  Can we allow iommu driver based attributes without giving
> >>>>>>>>>>> up control of the namespace?  Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Who are you asking these questions? :)
> >>>>>>>>>
> >>>>>>>>> Anyone, including you.  Rather than dropping misc files in sysfs to
> >>>>>>>>> describe things about the group, I think the better solution in your
> >>>>>>>>> case might be a link from the group to an existing sysfs directory
> >>>>>>>>> describing the PE.  I believe your PE is rooted in a PCI bridge, so that
> >>>>>>>>> presumably already has a representation in sysfs.  Can the aperture size
> >>>>>>>>> be determined from something in sysfs for that bridge already?  I'm just
> >>>>>>>>> not ready to create a grab bag of sysfs entries for a group yet.
> >>>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> At the moment there is no information neither in sysfs nor
> >>>>>>>> /proc/device-tree about the dma-window. And adding a sysfs entry per PE
> >>>>>>>> (powerpc partitionable end-point which is often a PHB but not always) just
> >>>>>>>> for VFIO is quite heavy.
> >>>>>>>
> >>>>>>> How do you learn the window size and PE extents in the host kernel?
> >>>>>>>
> >>>>>>>> We could add a ppc64 subfolder under /sys/kernel/iommu/xxx/ and put the
> >>>>>>>> "dma-window" property there. And replace it with a symlink when and if we
> >>>>>>>> add something for PE later. Would work?
> >>>>>>
> >>>>>> Fwiw, I'd suggest a subfolder named for the type of IOMMU, rather than
> >>>>>> "ppc64".
> >>>>>>
> >>>>>>> To be clear, you're suggesting /sys/kernel/iommu_groups/$GROUP/xxx/,
> >>>>>>> right?  A subfolder really only limits the scope of the mess, so it's
> >>>>>>> not much improvement.  What does the interface look like to make those
> >>>>>>> subfolders?
> >>>>>>>
> >>>>>>> The problem we're trying to solve is this call flow:
> >>>>>>>
> >>>>>>> containerfd = open("/dev/vfio/vfio");
> >>>>>>> ioctl(containerfd, VFIO_GET_API_VERSION);
> >>>>>>> ioctl(containerfd, VFIO_CHECK_EXTENSION, ...);
> >>>>>>> groupfd = open("/dev/vfio/$GROUP");
> >>>>>>> ioctl(groupfd, VFIO_GROUP_GET_STATUS);
> >>>>>>> ioctl(groupfd, VFIO_GROUP_SET_CONTAINER, &containerfd);
> >>>>>>>
> >>>>>>> You wanted to lock all the memory for the DMA window here, before we can
> >>>>>>> call VFIO_IOMMU_GET_INFO, but does it need to happen there?  We still
> >>>>>>> have a MAP_DMA hook.  We could do it all on the first mapping.
> >>>>>>
> >>>>>> MAP_DMA isn't quite enough, since the guest can also directly cause
> >>>>>> mappings using hypercalls directly implemented in KVM.  I think it
> >>>>>> would be feasible to lock on the first mapping (either via MAP_DMA, or
> >>>>>> H_PUT_TCE) though it would be a bit ugly and require that the first
> >>>>>> H_PUT_TCE always bounce out to virtual mode (Alexey, correct me if I'm
> >>>>>> wrong here).  IIRC there is also a call to bind the vfio container to
> >>>>>> a (qemu assigned) LIOBN, before the guest can use H_PUT_TCE directly,
> >>>>>> so that might be another place we could do the lock.
> >>>>>
> >>>>> Somehow hypercall mappings have to be gated by the userspace setup,
> >>>>> right?
> >>>>
> >>>>
> >>>> There is a KVM ioctl (and a KVM capability) which hooks LIOBN (PCI bus ID)
> >>>> with IOMMU ID. It basically creates an entry in the list of all LIOBNs and
> >>>> when TCE call occurs, the host finds correct IOMMU group to pass this call to.
> >>>>
> >>>> It happens from spapr_register_vfio_container() in QEMU, i.e. after getting
> >>>> DMA window properties but only if the host supports real mode TCE handling.
> >>>>
> >>>>
> >>>>>>>     It also
> >>>>>>> has a flags field that could augment the behavior to trigger page
> >>>>>>> locking.
> >>>>>>
> >>>>>> I don't see how the flags help us - we can't have userspace choose to
> >>>>>> skip the locked memory accounting.  Or are you suggesting a flag to
> >>>>>> open the container in some sort of dummy mode where only GET_INFO is
> >>>>>> possible, then re-open with the full locking?
> >>>>>
> >>>>> Sort of, I don't think it needs to be re-opened, but we had previously
> >>>>> talked about some kind of enable and disable ioctl.  "enable" would be
> >>>>> the logical place to lock pages, but then we probably got stuck in
> >>>>> questions around what it means to enable an iommu generically.
> >>>>
> >>>> The other question is if a container is ready to work if I add just one
> >>>> group? What happens when I add another one (not supported on ppc64 but
> >>>> still)?
> >>>
> >>> This is also the problem with exposing a dma window under the group in
> >>> sysfs.  Do I require the ability to lock the sum of the window, the
> >>> largest window, what?  If we rely on the ioctls, userspace can figure
> >>> out that they can't be combined and know it's the sum.  I'm not sure
> >>> what your plans are around hotplug of a PHB though.
> >>>
> >>>> Having "enable" method and disabling new attachments when it is
> >>>> "enabled" would keep my brain calm :)
> >>>
> >>> Now I'm not sure whether you're for or against it ;)
> >>
> >>
> >> I am for introducing enable() ioctls :) Or even "lock" ioctl.
> >>
> >>
> >>>>> So what
> >>>>> if instead of a separate enable ioctl we had a flag on DMA_MAP that was
> >>>>> defined as SET_WINDOW where iova and size are passed and specify the
> >>>>> portion of the DMA window that userspace intends to use and which is
> >>>>> therefore locked.  If you don't support subwindows, fine, just fail it.
> >>>>> You could have a matching PUT_WINDOW on DMA_UNMAP if you wanted.
> >>>>
> >>>> DMA_MAP which does not do "map" but does "lock" or "set window"?
> >>>> enable()/disable() look better.
> >>>
> >>> Sure, this is why we have a modular iommu interface, spapr can create an
> >>> enable ioctl if necessary.  I think there are ways to use the
> >>> DMA_MAP/UNMAP ioctl in ways that aren't a complete kludge though.
> >>>
> >>>>>>>     Adding the window size to sysfs seems more readily convenient,
> >>>>>>> but is it so hard for userspace to open the files and call a couple
> >>>>>>> ioctls to get far enough to call IOMMU_GET_INFO?  I'm unconvinced the
> >>>>>>> clutter in sysfs more than just a quick fix.  Thanks,
> >>>>>>
> >>>>>> And finally, as Alexey points out, isn't the point here so we know how
> >>>>>> much rlimit to give qemu?  Using ioctls we'd need a special tool just
> >>>>>> to check the dma window sizes, which seems a bit hideous.
> >>>>>
> >>>>> Is it more hideous that using iommu groups to report a vfio imposed
> >>>>> restriction?  Are a couple open files and a handful of ioctls worse than
> >>>>> code to parse directory entries and the future maintenance of an
> >>>>> unrestricted grab bag of sysfs entries?
> >>>>
> >>>> At the moment DMA32 window properties are static. So I can easily get rid
> >>>> of VFIO_IOMMU_SPAPR_TCE_GET_INFO and be happy.
> >>>
> >>> Like, for instance, every PE always gets 512MB DMA window, fixed base
> >>> address, not configurable, end of story?
> >>
> >>
> >> Almost :) 1GB, starting at 0 (sometime at 2GB). Multiple PCI domains are
> >> supported on ppc64 so it does not make a problem as bus address spaces are
> >> separated. But yes, not flexible at all.
> >
> > Statements like "at the moment...", "[but] sometimes at..." make me
> > think it's best to keep the GET_INFO call.
> >
> >>>> Ah, anyway, how do you see these ioctls to work on a user machine?
> >>>> A separate tool which takes an iommu id, returns DMA window size and
> >>>> adjusts rlimit?
> >>>
> >>> Sure, we need something that provides the function of libvirt and
> >>> unbinds devices from host drivers, re-binds them to vfio-pci.  That tool
> >>> needs to have permissions to manipulate groups, so we're just talking
> >>> about whether it's stepping over the line for it to open the group and a
> >>> container, associate them, and probe the iommu info vs reading a sysfs
> >>> file.  Thanks,
> >>
> >> So the Tool is going to be a part of libvirt but not kernel or qemu, right?
> >> Then implementing "LOCK" (and call it after GET_INFO in QEMU and not call
> >> it from the Tool) should work fine.
> >
> > Right, a probe tool would check the value, close the files and set the
> > locked page limit for qemu, which would take the next step to trigger
> > the in-kernel accounting.  Thanks,
> 
> Continuing the discussion :)
> In meanwhile I added/tested VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE like 
> that (will repost the patch later, may be this week, only few changes there):
> 
> 
> +       case VFIO_IOMMU_ENABLE: {
> +               mutex_lock(&container->lock);
> +               ret = tce_iommu_enable(container);
> +               mutex_unlock(&container->lock);
> +
> +               return ret;
> +       }
> +       case VFIO_IOMMU_DISABLE: {
> +               mutex_lock(&container->lock);
> +               tce_iommu_disable(container);
> +               mutex_unlock(&container->lock);
> +
> +               return 0;
> +       }
> 
> and defined them as (not arch specific):
> 
> +/* IOCTLs to enable/disable IOMMU container usage */
> +#define VFIO_IOMMU_ENABLE      _IO(VFIO_TYPE, VFIO_BASE + 15)
> +#define VFIO_IOMMU_DISABLE     _IO(VFIO_TYPE, VFIO_BASE + 16)
> 
> 
> should be ok, right?

Seems fine.  Can you envision any parameters to either of these in the
future?  Be sure to document in the header and maybe add a SPAPR section
to Documentation/vfio.txt making note of the call flow for this IOMMU
backend.

> There is another question. It is possible to compile
> vfio_iommu_spapr_tce as a module. How/when is it supposed to be loaded?
> A user may not want to do "modprobe vfio_iommu_spapr_tce" manually.

See the request_module() call in drivers/vfio/vfio.c.  I'd think you'd
just add another for vfio_iommu_spapr_tce.  Thanks,

Alex


  reply	other threads:[~2013-03-19  2:40 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-02-11 11:54 [PATCH 0/2] vfio on power Alexey Kardashevskiy
2013-02-11 11:54 ` Alexey Kardashevskiy
2013-02-11 11:54 ` [PATCH 1/2] vfio powerpc: enabled on powernv platform Alexey Kardashevskiy
2013-02-11 11:54   ` Alexey Kardashevskiy
2013-02-11 22:16   ` Alex Williamson
2013-02-11 22:16     ` Alex Williamson
2013-02-11 23:19     ` Alexey Kardashevskiy
2013-02-11 23:19       ` Alexey Kardashevskiy
2013-02-12  0:01       ` Alex Williamson
2013-02-12  0:01         ` Alex Williamson
2013-02-25  2:21     ` Paul Mackerras
2013-02-25  2:21       ` Paul Mackerras
2013-02-11 11:54 ` [PATCH 2/2] vfio powerpc: implemented IOMMU driver for VFIO Alexey Kardashevskiy
2013-02-11 11:54   ` Alexey Kardashevskiy
2013-02-11 22:17   ` Alex Williamson
2013-02-11 22:17     ` Alex Williamson
2013-02-11 23:45     ` Alexey Kardashevskiy
2013-02-11 23:45       ` Alexey Kardashevskiy
2013-02-12  0:25       ` Alex Williamson
2013-02-12  0:25         ` Alex Williamson
2013-02-12  4:06         ` [PATCH] iommu: making IOMMU sysfs nodes API public Alexey Kardashevskiy
2013-02-12  5:07           ` Alex Williamson
2013-02-12 14:42             ` Alexey Kardashevskiy
2013-02-12 17:15               ` Alex Williamson
2013-02-18  6:15                 ` Alexey Kardashevskiy
2013-02-19  5:24                   ` Alex Williamson
2013-02-19  5:48                     ` Alexey Kardashevskiy
2013-02-19 19:53                       ` Alex Williamson
2013-02-19  7:38                     ` David Gibson
2013-02-19 20:11                       ` Alex Williamson
2013-02-20  2:31                         ` Alexey Kardashevskiy
2013-02-20  3:47                           ` Alex Williamson
2013-02-20  4:20                             ` Alexey Kardashevskiy
2013-02-20  4:33                               ` Alex Williamson
2013-03-18  3:53                                 ` Alexey Kardashevskiy
2013-03-19  2:40                                   ` Alex Williamson [this message]
2013-03-19  7:08                                     ` [PATCH] vfio powerpc: implement IOMMU driver for VFIO Alexey Kardashevskiy
2013-03-20 20:45                                       ` Alex Williamson
2013-03-21  0:57                                         ` Alexey Kardashevskiy
2013-03-21  1:29                                           ` Alex Williamson
2013-03-21  1:55                                         ` David Gibson
2013-03-21  3:16                                           ` Alex Williamson
2013-03-25  2:12                                             ` David Gibson
2013-02-22  0:04                         ` [PATCH] iommu: making IOMMU sysfs nodes API public David Gibson
2013-02-22  4:11                           ` Alex Williamson
2013-02-25  0:03                             ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1363660844.24132.452.camel@bling.home \
    --to=alex.williamson@redhat.com \
    --cc=aik@ozlabs.ru \
    --cc=benh@kernel.crashing.org \
    --cc=david@gibson.dropbear.id.au \
    --cc=joerg.roedel@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.