All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Peter Crosthwaite <peter.crosthwaite@xilinx.com>,
	Alexey Kardashevskiy <aik@ozlabs.ru>,
	qemu-devel@nongnu.org, Michael Roth <mdroth@linux.vnet.ibm.com>,
	qemu-ppc@nongnu.org, Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
Date: Fri, 17 Jul 2015 15:20:55 +1000	[thread overview]
Message-ID: <20150717052055.GF25179@voom.redhat.com> (raw)
In-Reply-To: <1437057899.1391.562.camel@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 6964 bytes --]

On Thu, Jul 16, 2015 at 08:44:59AM -0600, Alex Williamson wrote:
> On Thu, 2015-07-16 at 15:11 +1000, David Gibson wrote:
> > On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
> > > This makes use of the new "memory registering" feature. The idea is
> > > to provide the userspace ability to notify the host kernel about pages
> > > which are going to be used for DMA. Having this information, the host
> > > kernel can pin them all once per user process, do locked pages
> > > accounting (once) and not spent time on doing that in real time with
> > > possible failures which cannot be handled nicely in some cases.
> > > 
> > > This adds a guest RAM memory listener which notifies a VFIO container
> > > about memory which needs to be pinned/unpinned. VFIO MMIO regions
> > > (i.e. "skip dump" regions) are skipped.
> > > 
> > > The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> > > are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> > > not call it when v2 is detected and enabled.
> > > 
> > > This does not change the guest visible interface.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > 
> > I've looked at this in more depth now, and attempting to unify the
> > pre-reg and mapping listeners like this can't work - they need to be
> > listening on different address spaces:  mapping actions need to be
> > listening on the PCI address space, whereas the pre-reg needs to be
> > listening on address_space_memory.  For x86 - for now - those end up
> > being the same thing, but on Power they're not.
> > 
> > We do need to be clear about what differences are due to the presence
> > of a guest IOMMU versus which are due to arch or underlying IOMMU
> > type.  For now Power has a guest IOMMU and x86 doesn't, but that could
> > well change in future: we could well implement the guest side IOMMU
> > for x86 in future (or x86 could invent a paravirt IOMMU interface).
> > On the other side, BenH's experimental powernv machine type could
> > introduce Power machines without a guest side IOMMU (or at least an
> > optional guest side IOMMU).
> > 
> > The quick and dirty approach here is:
> >    1. Leave the main listener as is
> >    2. Add a new pre-reg notifier to the spapr iommu specific code,
> >       which listens on address_space_memory, *not* the PCI space
> 
> It is dirty and that's exactly what I've been advising Alexey against
> because we have entirely too much dirty spapr specific code that doesn't
> need to be spapr specific.  I don't see why separate address space
> matters, that's done at the point of registering the listener and so far
> doesn't play much a role in the actual listener behavior, just which
> regions it sees.

Well, there's two parts to this - the different address spaces means
they need to be different listener instances.  They also need
different callbacks - or at least parameterized callback behaviour
because they do different things (one maps, the other preregs).

So I really don't see any sense in which these can be accomplished by
the same listener.  *Maybe* they could share some region walking code,
but I'm not sure there's going to be anything of significant size here.

> > The more generally correct approach, which allows for more complex
> > IOMMU arrangements and the possibility of new IOMMU types with pre-reg
> > is:
> >    1. Have the core implement both a mapping listener and a pre-reg
> >       listener (optionally enabled by a per-iommu-type flag).
> >       Basically the first one sees what *is* mapped, the second sees
> >       what *could* be mapped.
> 
> This just sounds like different address spaces, address_space_memory vs
> address_space_physical_memory

Um.. what?  I'm not even sure what you mean by
address_space_physical_memory (I see no such thing in the source).

The idea was that the (top level) pre-reg listener would spawn more
listeners for any AS which could get (partially) mapped into the PCI
addres space.

But.. I looked a bit closer and realised this scheme doesn't actually
work.  IOMMU memory regions don't actually have a fixed target AS
property (by which I mean the address space the IOMMU translates
*into* rather than translates from - address_space_memory in most
cases).  Instead any individual IOMMU mapping can point to a different
AS supplied in the IOMMUTLB structure.

> >    2. As now, the mapping listener listens on PCI address space, if
> >       RAM blocks are added, immediately map them into the host IOMMU,
> >       if guest IOMMU blocks appear register a notifier which will
> >       mirror guest IOMMU mappings to the host IOMMU (this is what we
> >       do now).
> 
> Right, this is done now, nothing new required.

Yes, I was just spelling that out for comparison with the other part.

> >    3. The pre-reg listener also listens on the PCI address space.  RAM
> >       blocks added are pre-registered immediately.  But, if guest
> >       IOMMU blocks are added, instead of registering a guest-iommu
> >       notifier, we register another listener on the *target* AS of the
> >       guest IOMMU, same callbacks as this one.  In practice that
> >       target AS will almost always resolve to address_space_memory,
> >       but this can at least in theory handle crazy guest setups with
> >       multiple layers of IOMMU.
> > 
> >    4. Have to ensure that the pre-reg callbacks always happen before
> >       the mapping calls.  For a system with an IOMMU backend which
> >       requires pre-registration, but doesn't have a guest IOMMU, we
> >       need to pre-reg, then host-iommu-map RAM blocks that appear in
> >       PCI address space.

> Both of these just seem like details of the iommu-type (pre-reg)
> specific parts of the listener, I'm not understanding what's
> fundamentally different about it that we can't do it now, within a
> single listener, nor do I see how it's all that different from x86.

So, we have two different address spaces to listen on (PCI and
address_space_memory) so we need to different listener instances
registered, clearly.

Those two different listeners need to do different things.  1) maps
memory blocks into VFIO and sets up a notifier to mirror guest IOMMU
mappings into VFIO.  2) preregisters memory blocks, and it would be a
bug if it ever saw a guest IOMMU.

So, what's left to share?

> x86
> mappings are more dynamic, but that's largely in legacy space for x86
> due to some bizarre compatibility things.  x86 also tries to map mmio
> regions for p2p, but these are just minor details to mappings that are
> largely the same.  Thanks,
> 
> Alex
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

  reply	other threads:[~2015-07-17  5:24 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-07-14 12:21 [Qemu-devel] [RFC PATCH qemu v3 0/4] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 1/4] memory: Add reporting of supported page sizes Alexey Kardashevskiy
2015-07-15 18:26   ` Alex Williamson
2015-07-16  1:12     ` Alexey Kardashevskiy
2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 2/4] vfio: Use different page size for different IOMMU types Alexey Kardashevskiy
2015-07-15 18:26   ` Alex Williamson
2015-07-16  1:26     ` Alexey Kardashevskiy
2015-07-16  2:51       ` Alex Williamson
2015-07-16  3:31         ` Alexey Kardashevskiy
2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 3/4] vfio: Store IOMMU type in container Alexey Kardashevskiy
2015-07-15 18:26   ` Alex Williamson
2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
2015-07-16  5:11   ` David Gibson
2015-07-16 14:44     ` Alex Williamson
2015-07-17  5:20       ` David Gibson [this message]
2015-07-17 18:25         ` Alex Williamson
2015-07-18 15:05           ` David Gibson
2015-07-19 15:04             ` Alex Williamson
2015-07-17  7:13     ` Alexey Kardashevskiy
2015-07-17 13:39       ` David Gibson
2015-07-17 15:47         ` Alexey Kardashevskiy
2015-07-18 15:17           ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150717052055.GF25179@voom.redhat.com \
    --to=david@gibson.dropbear.id.au \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=mdroth@linux.vnet.ibm.com \
    --cc=pbonzini@redhat.com \
    --cc=peter.crosthwaite@xilinx.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.