All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Alexander Graf <agraf@suse.de>,
	Paolo Bonzini <pbonzini@redhat.com>,
	qemu-ppc@nongnu.org, qemu-devel@nongnu.org,
	David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
Date: Fri, 28 Mar 2014 15:49:24 +1100	[thread overview]
Message-ID: <5334FF54.5040008@ozlabs.ru> (raw)
In-Reply-To: <1395411438.632.52.camel@ul30vt.home>

On 03/22/2014 01:17 AM, Alex Williamson wrote:
> On Fri, 2014-03-21 at 18:59 +1100, Alexey Kardashevskiy wrote:
>> On 03/20/2014 06:57 AM, Alex Williamson wrote:
>>> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
>>>> From: David Gibson <david@gibson.dropbear.id.au>
>>>>
>>>> This patch uses the new IOMMU notifiers to allow VFIO pass through devices
>>>> to work with guest side IOMMUs, as long as the host-side VFIO iommu has
>>>> sufficient capability and granularity to match the guest side. This works
>>>> by tracking all map and unmap operations on the guest IOMMU using the
>>>> notifiers, and mirroring them into VFIO.
>>>>
>>>> There are a number of FIXMEs, and the scheme involves rather more notifier
>>>> structures than I'd like, but it should make for a reasonable proof of
>>>> concept.
>>>>
>>>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>
>>>> ---
>>>> Changes:
>>>> v4:
>>>> * fixed list objects naming
>>>> * vfio_listener_region_add() reworked to call memory_region_ref() from one
>>>> place only, it is also easier to review the changes
>>>> * fixes boundary check not to fail on sections == 2^64 bytes,
>>>> the "vfio: Fix debug output for int128 values" patch is required;
>>>> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
>>>> for 128 bit memory section sizes" patch proposal
>>>> ---
>>>>  hw/misc/vfio.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>  1 file changed, 120 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>>>> index 038010b..4f6f5da 100644
>>>> --- a/hw/misc/vfio.c
>>>> +++ b/hw/misc/vfio.c
>>>> @@ -159,10 +159,18 @@ typedef struct VFIOContainer {
>>>>          };
>>>>          void (*release)(struct VFIOContainer *);
>>>>      } iommu_data;
>>>> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>>>      QLIST_HEAD(, VFIOGroup) group_list;
>>>>      QLIST_ENTRY(VFIOContainer) next;
>>>>  } VFIOContainer;
>>>>  
>>>> +typedef struct VFIOGuestIOMMU {
>>>> +    VFIOContainer *container;
>>>> +    MemoryRegion *iommu;
>>>> +    Notifier n;
>>>> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>>>> +} VFIOGuestIOMMU;
>>>> +
>>>>  /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
>>>>  typedef struct VFIOMSIXInfo {
>>>>      uint8_t table_bar;
>>>> @@ -2241,8 +2249,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>>>  
>>>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>  {
>>>> -    return !memory_region_is_ram(section->mr) ||
>>>> -           /*
>>>> +    return (!memory_region_is_ram(section->mr) &&
>>>> +            !memory_region_is_iommu(section->mr)) ||
>>>> +        /*
>>>
>>> White space damage
>>>
>>>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>>>              * addresses in the upper part of the 64-bit address space.  These
>>>>              * are never accessed by the CPU and beyond the address width of
>>>> @@ -2251,6 +2260,61 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>             section->offset_within_address_space & (1ULL << 63);
>>>>  }
>>>>  
>>>> +static void vfio_iommu_map_notify(Notifier *n, void *data)
>>>> +{
>>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>> +    VFIOContainer *container = giommu->container;
>>>> +    IOMMUTLBEntry *iotlb = data;
>>>> +    MemoryRegion *mr;
>>>> +    hwaddr xlat;
>>>> +    hwaddr len = iotlb->addr_mask + 1;
>>>> +    void *vaddr;
>>>> +    int ret;
>>>> +
>>>> +    DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>>>> +            iotlb->iova, iotlb->iova + iotlb->addr_mask);
>>>> +
>>>> +    /*
>>>> +     * The IOMMU TLB entry we have just covers translation through
>>>> +     * this IOMMU to its immediate target.  We need to translate
>>>> +     * it the rest of the way through to memory.
>>>> +     */
>>>> +    mr = address_space_translate(&address_space_memory,
>>>> +                                 iotlb->translated_addr,
>>>> +                                 &xlat, &len, iotlb->perm & IOMMU_WO);
>>>
>>> Write-only?  Is this supposed to be read-write to mask just 2 bits?
>>
>>
>> The last parameter of address_space_translate() bool is_write. So I do not
>> really understand the problem here.
> 
> Oops, my bad, I didn't look at what address_space_translate() used that
> for.  Ok.
> 
>>>> +    if (!memory_region_is_ram(mr)) {
>>>> +        DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
>>>> +                xlat);
>>>> +        return;
>>>> +    }
>>>> +    if (len & iotlb->addr_mask) {
>>>> +        DPRINTF("iommu has granularity incompatible with target AS\n");
>>>
>>> Is this possible?  Assuming len is initially a power-of-2, would the
>>> translate function change it?  Maybe worth a comment to explain.
>>
>>
>> Oh. address_space_translate() actually changes @len to min(len,
>> TARGET_PAGE_SIZE) and TARGET_PAGE_SIZE is hardcoded to 4K. So far it was ok
>> but lately I have been implementing a huge DMA window (plus one
>> sPAPRTCETable and one VFIOGuestIOMMU objects) which currently operates with
>> 16MB pages (can do 64K pages too) and now this "granularity incompatible"
>> is happening.
>>
>> I disabled that check but I need to think of better fix...
>>
>> Adding Paolo to cc, may be he picks the context and gives good piece of
>> advise :)
>>
>>
>>
>>>
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    vaddr = memory_region_get_ram_ptr(mr) + xlat;
>>>
>>> This lookup isn't free and the unmap path doesn't need it, maybe move
>>> the variable and lookup into the first branch below?
>>>
>>>> +
>>>> +    if (iotlb->perm != IOMMU_NONE) {
>>>> +        ret = vfio_dma_map(container, iotlb->iova,
>>>> +                           iotlb->addr_mask + 1, vaddr,
>>>> +                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
>>>> +        if (ret) {
>>>> +            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>>>> +                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
>>>> +                         container, iotlb->iova,
>>>> +                         iotlb->addr_mask + 1, vaddr, ret);
>>>> +        }
>>>> +    } else {
>>>> +        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
>>>> +        if (ret) {
>>>> +            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>>>> +                         "0x%"HWADDR_PRIx") = %d (%m)",
>>>> +                         container, iotlb->iova,
>>>> +                         iotlb->addr_mask + 1, ret);
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>>  static void vfio_listener_region_add(MemoryListener *listener,
>>>>                                       MemoryRegionSection *section)
>>>>  {
>>>> @@ -2261,8 +2325,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>      void *vaddr;
>>>>      int ret;
>>>>  
>>>> -    assert(!memory_region_is_iommu(section->mr));
>>>> -
>>>>      if (vfio_listener_skipped_section(section)) {
>>>>          DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
>>>>                  section->offset_within_address_space,
>>>> @@ -2286,15 +2348,47 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>          return;
>>>>      }
>>>>  
>>>> +    memory_region_ref(section->mr);
>>>> +
>>>> +    if (memory_region_is_iommu(section->mr)) {
>>>> +        VFIOGuestIOMMU *giommu;
>>>> +
>>>> +        DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>>>> +                iova, int128_get64(int128_sub(llend, int128_one())));
>>>> +        /*
>>>> +         * FIXME: We should do some checking to see if the
>>>> +         * capabilities of the host VFIO IOMMU are adequate to model
>>>> +         * the guest IOMMU
>>>> +         *
>>>> +         * FIXME: This assumes that the guest IOMMU is empty of
>>>> +         * mappings at this point - we should either enforce this, or
>>>> +         * loop through existing mappings to map them into VFIO.
>>>> +         *
>>>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>>>> +         * avoid bouncing all map/unmaps through qemu this way, this
>>>> +         * would be the right place to wire that up (tell the KVM
>>>> +         * device emulation the VFIO iommu handles to use).
>>>> +         */
>>>
>>> That's a lot of FIXMEs...  The second one in particular looks like it
>>> needs to expand a bit on why this is likely a valid assumption.  The
>>> last one is more of a TODO than a FIXME.
>>>
>>>> +        giommu = g_malloc0(sizeof(*giommu));
>>>> +        giommu->iommu = section->mr;
>>>> +        giommu->container = container;
>>>> +        giommu->n.notify = vfio_iommu_map_notify;
>>>> +        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>>>> +        memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>>>> +
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /* Here we assume that memory_region_is_ram(section->mr)==true */
>>>> +
>>>>      end = int128_get64(llend);
>>>>      vaddr = memory_region_get_ram_ptr(section->mr) +
>>>>              section->offset_within_region +
>>>>              (iova - section->offset_within_address_space);
>>>>  
>>>> -    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
>>>> +    DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
>>>>              iova, end - 1, vaddr);
>>>>  
>>>> -    memory_region_ref(section->mr);
>>>>      ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
>>>>      if (ret) {
>>>>          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>>>> @@ -2338,6 +2432,26 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>>>          return;
>>>>      }
>>>>  
>>>> +    if (memory_region_is_iommu(section->mr)) {
>>>> +        VFIOGuestIOMMU *giommu;
>>>> +
>>>> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>>>> +            if (giommu->iommu == section->mr) {
>>>> +                memory_region_unregister_iommu_notifier(&giommu->n);
>>>> +                QLIST_REMOVE(giommu, giommu_next);
>>>> +                g_free(giommu);
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +
>>>> +        /*
>>>> +         * FIXME: We assume the one big unmap below is adequate to
>>>> +         * remove any individual page mappings in the IOMMU which
>>>> +         * might have been copied into VFIO.  That may not be true for
>>>> +         * all IOMMU types
>>>> +         */
>>>
>>> We assume this because the IOVA that gets unmapped is the same
>>> regardless of whether a guest IOMMU is present?
>>
>>
>> What exactly is meant by "guest IOMMU is present"? Doing the second DMA
>> window, now I am really confused about terminology :(
> 
> The confusion for me is that add_region initializes the giommu and all
> the DMA mapping through VFIO is done in the notifier for the giommu.
> It's therefore asymmetric that add_region doesn't vfio_dma_map anything,
> but region_del does vfio_dma_unmap, which is the basis of my question.
> I thought this comment was trying to address why that is, but apparently
> it's something else entirely, so it would be nice to understand why this
> doesn't return() and decode a bit more clearly what the FIXME is trying
> to say.  Thanks,

I do not mind extending that comment but need some assistance.

My understanding is that:
===
region_del is called on memory region removal so this particular window is
not going to be used anymore and this is the only place where such cleanup
could be done.
===

Asymmetry is here but it does not look terrible.


> Alex
> 
>>>> +    }
>>>> +
>>>>      iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>>>>      end = (section->offset_within_address_space + int128_get64(section->size)) &
>>>>            TARGET_PAGE_MASK;



-- 
Alexey

  parent reply	other threads:[~2014-03-28  4:49 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 01/11] memory: Sanity check that no listeners remain on a destroyed AddressSpace Alexey Kardashevskiy
2014-03-20 10:20   ` Paolo Bonzini
2014-03-20 11:45     ` David Gibson
2014-03-27  5:40     ` Alexey Kardashevskiy
2014-03-27 12:15       ` Paolo Bonzini
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 02/11] int128: add int128_exts64() Alexey Kardashevskiy
2014-03-20 10:19   ` Paolo Bonzini
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 03/11] vfio: Fix 128 bit handling Alexey Kardashevskiy
2014-03-20 10:20   ` Paolo Bonzini
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 04/11] vfio: rework to have error paths Alexey Kardashevskiy
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 05/11] vfio: Introduce VFIO address spaces Alexey Kardashevskiy
2014-03-19 19:57   ` Alex Williamson
2014-03-28  3:42     ` Alexey Kardashevskiy
2014-03-31 19:14       ` Alex Williamson
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 06/11] vfio: Create VFIOAddressSpace objects as needed Alexey Kardashevskiy
2014-03-19 19:57   ` Alex Williamson
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support Alexey Kardashevskiy
2014-03-19 19:57   ` Alex Williamson
2014-03-20  5:25     ` David Gibson
2014-03-28  5:12       ` Alexey Kardashevskiy
2014-03-31 19:59         ` Alex Williamson
2014-03-21  7:59     ` Alexey Kardashevskiy
2014-03-21 14:17       ` Alex Williamson
2014-03-21 14:23         ` Paolo Bonzini
2014-03-28  4:49         ` Alexey Kardashevskiy [this message]
2014-03-31 19:54           ` Alex Williamson
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 08/11] spapr-iommu: add SPAPR VFIO IOMMU device Alexey Kardashevskiy
2014-04-03 12:17   ` Alexander Graf
2014-04-07  4:07     ` Alexey Kardashevskiy
2014-04-10 12:13       ` Alexander Graf
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 09/11] spapr vfio: add vfio_container_spapr_get_info() Alexey Kardashevskiy
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
2014-03-13  8:12   ` [Qemu-devel] [PATCH v6] " Alexey Kardashevskiy
2014-03-19 19:57   ` [Qemu-devel] [PATCH v5 10/11] " Alex Williamson
2014-03-28  6:01     ` Alexey Kardashevskiy
2014-03-31 20:09       ` Alex Williamson
2014-04-01  6:25         ` Alexey Kardashevskiy
2014-04-01 18:21           ` Alex Williamson
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 11/11] spapr-vfio: enable for spapr Alexey Kardashevskiy
2014-03-19 19:57   ` Alex Williamson
2014-03-19 20:12 ` [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alex Williamson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5334FF54.5040008@ozlabs.ru \
    --to=aik@ozlabs.ru \
    --cc=agraf@suse.de \
    --cc=alex.williamson@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.