* [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64
@ 2014-03-12  5:52 Alexey Kardashevskiy
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 01/11] memory: Sanity check that no listeners remain on a destroyed AddressSpace Alexey Kardashevskiy
                   ` (11 more replies)
  0 siblings, 12 replies; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

Yet another attempt at VFIO on SPAPR (server PPC64).
As the previous attempt was a long time ago, I did not bother much with
the change log, since all of this requires review again. Also,
this series depends on these 2 patchsets, which I have not been able to
get reviewed yet (I keep pinging...):
[PATCH] spapr-iommu: extend SPAPR_TCE_TABLE class
[PATCH 0/4] spapr-pci: prepare for vfio

This does not include VFIO KVM device support: the host kernel
part is not there yet because a bigger rework of the host VFIO driver
is going to happen soon.


Alex (Williamson), if you find it possible, please "ack" or "rb" as much
as you can. Thanks!


Changes:
v5:
* rebase on top of the current upstream

v4:
* addressed all comments from Alex Williamson
* moved spapr-pci-phb-vfio-phb to new file
* split spapr-pci-phb-vfio to many smaller patches


Alexey Kardashevskiy (7):
  int128: add int128_exts64()
  vfio: Fix 128 bit handling
  vfio: rework to have error paths
  spapr-iommu: add SPAPR VFIO IOMMU device
  spapr vfio: add vfio_container_spapr_get_info()
  spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio
  spapr-vfio: enable for spapr

David Gibson (4):
  memory: Sanity check that no listeners remain on a destroyed
    AddressSpace
  vfio: Introduce VFIO address spaces
  vfio: Create VFIOAddressSpace objects as needed
  vfio: Add guest side IOMMU support

 hw/misc/vfio.c              | 338 +++++++++++++++++++++++++++++++++++++-------
 hw/ppc/Makefile.objs        |   2 +-
 hw/ppc/spapr_iommu.c        |  97 +++++++++++++
 hw/ppc/spapr_pci_vfio.c     | 206 +++++++++++++++++++++++++++
 include/hw/misc/vfio.h      |  11 ++
 include/hw/pci-host/spapr.h |  13 ++
 include/hw/ppc/spapr.h      |   5 +
 include/qemu/int128.h       |   5 +
 memory.c                    |   7 +
 9 files changed, 633 insertions(+), 51 deletions(-)
 create mode 100644 hw/ppc/spapr_pci_vfio.c
 create mode 100644 include/hw/misc/vfio.h

-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v5 01/11] memory: Sanity check that no listeners remain on a destroyed AddressSpace
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
@ 2014-03-12  5:52 ` Alexey Kardashevskiy
  2014-03-20 10:20   ` Paolo Bonzini
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 02/11] int128: add int128_exts64() Alexey Kardashevskiy
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, Alexander Graf, qemu-ppc, qemu-devel, David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

At the moment, most AddressSpace objects last as long as the guest system
in practice, but that could well change in the future.  In addition, for VFIO
we will be introducing some private per-AddressSpace information, which must
be disposed of before the AddressSpace itself is destroyed.

To reduce the chances of subtle bugs in this area, this patch adds
assertions to ensure that when an AddressSpace is destroyed, there are no
remaining MemoryListeners using that AS as a filter.
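
For illustration only, the invariant described above can be sketched outside
of QEMU. The ToyAddressSpace and ToyListener types and all toy_* helpers are
hypothetical stand-ins, not QEMU's AddressSpace, MemoryListener or QTAILQ
API; the sketch only shows the "no listener still filters on a destroyed
address space" check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for an address space and a listener. */
typedef struct ToyAddressSpace {
    const char *name;
} ToyAddressSpace;

typedef struct ToyListener {
    ToyAddressSpace *address_space_filter; /* NULL means "all spaces" */
    struct ToyListener *next;
} ToyListener;

/* Global singly linked list of registered listeners. */
static ToyListener *toy_listeners;

static void toy_listener_register(ToyListener *l, ToyAddressSpace *filter)
{
    l->address_space_filter = filter;
    l->next = toy_listeners;
    toy_listeners = l;
}

static void toy_listener_unregister(ToyListener *l)
{
    ToyListener **p;

    for (p = &toy_listeners; *p; p = &(*p)->next) {
        if (*p == l) {
            *p = l->next;
            l->next = NULL;
            return;
        }
    }
}

/* True when no registered listener still filters on 'as'. */
static bool toy_as_safe_to_destroy(const ToyAddressSpace *as)
{
    const ToyListener *l;

    for (l = toy_listeners; l; l = l->next) {
        if (l->address_space_filter == as) {
            return false;
        }
    }
    return true;
}

static void toy_address_space_destroy(ToyAddressSpace *as)
{
    /* Catch users that forgot to unregister before teardown. */
    assert(toy_as_safe_to_destroy(as));
    /* ... per-address-space resources would be released here ... */
}
```

A caller is expected to unregister every listener filtered on an address
space before destroying it; otherwise the assertion fires at the point of
destruction rather than at some later use of a stale pointer.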

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 memory.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/memory.c b/memory.c
index 3f1df23..678661e 100644
--- a/memory.c
+++ b/memory.c
@@ -1722,12 +1722,19 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
 
 void address_space_destroy(AddressSpace *as)
 {
+    MemoryListener *listener;
+
     /* Flush out anything from MemoryListeners listening in on this */
     memory_region_transaction_begin();
     as->root = NULL;
     memory_region_transaction_commit();
     QTAILQ_REMOVE(&address_spaces, as, address_spaces_link);
     address_space_destroy_dispatch(as);
+
+    QTAILQ_FOREACH(listener, &memory_listeners, link) {
+        assert(listener->address_space_filter != as);
+    }
+
     flatview_unref(as->current_map);
     g_free(as->name);
     g_free(as->ioeventfds);
-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v5 02/11] int128: add int128_exts64()
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 01/11] memory: Sanity check that no listeners remain on a destroyed AddressSpace Alexey Kardashevskiy
@ 2014-03-12  5:52 ` Alexey Kardashevskiy
  2014-03-20 10:19   ` Paolo Bonzini
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 03/11] vfio: Fix 128 bit handling Alexey Kardashevskiy
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

This adds an inline function that sign-extends a signed 64-bit value to a
signed 128-bit value.
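
As a rough standalone sketch of why sign extension matters here, the
following uses a hypothetical ToyInt128 type with the same lo/hi layout as
the struct in the diff below; the toy_* names are assumptions made for this
example and are not part of int128.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value: unsigned low half, signed high half. */
typedef struct ToyInt128 {
    uint64_t lo;
    int64_t hi;
} ToyInt128;

/* Zero-extend: the high half is always 0. */
static inline ToyInt128 toy_make64(uint64_t a)
{
    return (ToyInt128) { .lo = a, .hi = 0 };
}

/* Sign-extend: a negative input fills the high half with ones. */
static inline ToyInt128 toy_exts64(int64_t a)
{
    return (ToyInt128) { .lo = (uint64_t)a, .hi = (a < 0) ? -1 : 0 };
}

static inline ToyInt128 toy_and(ToyInt128 a, ToyInt128 b)
{
    return (ToyInt128) { a.lo & b.lo, a.hi & b.hi };
}

/* Narrow back to 64 bits; only valid when the high half is empty. */
static inline uint64_t toy_get64(ToyInt128 a)
{
    assert(a.hi == 0);
    return a.lo;
}
```

A page mask such as -4096 must be sign-extended before it is ANDed with a
128-bit value; zero-extending it would silently clear the high half of the
other operand.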

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* (.hi = (a >> 63) ? -1 : 0) changed to (.hi = (a < 0) ? -1 : 0)
---
 include/qemu/int128.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/qemu/int128.h b/include/qemu/int128.h
index 9ed47aa..ef87e5e 100644
--- a/include/qemu/int128.h
+++ b/include/qemu/int128.h
@@ -38,6 +38,11 @@ static inline Int128 int128_2_64(void)
     return (Int128) { 0, 1 };
 }
 
+static inline Int128 int128_exts64(int64_t a)
+{
+    return (Int128) { .lo = a, .hi = (a < 0) ? -1 : 0 };
+}
+
 static inline Int128 int128_and(Int128 a, Int128 b)
 {
     return (Int128) { a.lo & b.lo, a.hi & b.hi };
-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v5 03/11] vfio: Fix 128 bit handling
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 01/11] memory: Sanity check that no listeners remain on a destroyed AddressSpace Alexey Kardashevskiy
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 02/11] int128: add int128_exts64() Alexey Kardashevskiy
@ 2014-03-12  5:52 ` Alexey Kardashevskiy
  2014-03-20 10:20   ` Paolo Bonzini
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 04/11] vfio: rework to have error paths Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

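Upcoming VFIO on SPAPR PPC64 support will initialize the IOMMU
memory region with a size of UINT64_MAX, which the memory API treats
as 2^64 bytes, so int128_get64() will assert on the section size.

This patch computes the page-aligned section end in 128-bit arithmetic
so that the boundary check no longer asserts. The existing type1 IOMMU
code is not expected to map a full 64-bit range of RAM, so the patch
does not touch that part.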
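
The boundary computation can be sketched in isolation as follows. This is
only an illustration under assumptions: ToyInt128, TOY_PAGE_MASK (4 KiB
pages) and the toy_* helpers are hypothetical and merely mimic the shape of
the int128 helpers used in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_MASK (~(int64_t)0xfff)   /* assumed 4 KiB pages */

typedef struct ToyInt128 {
    uint64_t lo;
    int64_t hi;
} ToyInt128;

static ToyInt128 toy_make64(uint64_t a)
{
    return (ToyInt128) { a, 0 };
}

static ToyInt128 toy_exts64(int64_t a)
{
    return (ToyInt128) { (uint64_t)a, (a < 0) ? -1 : 0 };
}

static ToyInt128 toy_2_64(void)
{
    return (ToyInt128) { 0, 1 };
}

static ToyInt128 toy_add(ToyInt128 a, ToyInt128 b)
{
    uint64_t lo = a.lo + b.lo;

    /* Carry into the high half when the low half wraps around. */
    return (ToyInt128) { lo, a.hi + b.hi + (lo < a.lo) };
}

static ToyInt128 toy_and(ToyInt128 a, ToyInt128 b)
{
    return (ToyInt128) { a.lo & b.lo, a.hi & b.hi };
}

static bool toy_ge(ToyInt128 a, ToyInt128 b)
{
    return a.hi > b.hi || (a.hi == b.hi && a.lo >= b.lo);
}

/* Narrow back to 64 bits; only valid when the high half is empty. */
static uint64_t toy_get64(ToyInt128 a)
{
    assert(a.hi == 0);
    return a.lo;
}

/* Page-aligned end of a section, kept in 128 bits. */
static ToyInt128 toy_section_end(uint64_t offset, ToyInt128 size)
{
    ToyInt128 llend = toy_add(toy_make64(offset), size);

    return toy_and(llend, toy_exts64(TOY_PAGE_MASK));
}
```

With a section of 2^64 bytes starting at 0, the end is 2^64, which does not
fit in 64 bits. Truncating it would yield 0 and make every iova look like it
is past the end; comparing in 128 bits keeps the check meaningful.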

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v3:
* 64bit @end is calculated from 128-bit @llend instead of repeating
the same calculation steps

v2:
* used new function int128_exts64()
---
 hw/misc/vfio.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index c2c688c..029a100 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -2251,6 +2251,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
     VFIOContainer *container = container_of(listener, VFIOContainer,
                                             iommu_data.type1.listener);
     hwaddr iova, end;
+    Int128 llend;
     void *vaddr;
     int ret;
 
@@ -2271,13 +2272,15 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
 
     iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
-    end = (section->offset_within_address_space + int128_get64(section->size)) &
-          TARGET_PAGE_MASK;
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
 
-    if (iova >= end) {
+    if (int128_ge(int128_make64(iova), llend)) {
         return;
     }
 
+    end = int128_get64(llend);
     vaddr = memory_region_get_ram_ptr(section->mr) +
             section->offset_within_region +
             (iova - section->offset_within_address_space);
-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v5 04/11] vfio: rework to have error paths
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 03/11] vfio: Fix 128 bit handling Alexey Kardashevskiy
@ 2014-03-12  5:52 ` Alexey Kardashevskiy
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 05/11] vfio: Introduce VFIO address spaces Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

This reworks vfio_connect_container() and vfio_get_group() to have a
common exit path at the end of each function body.
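
The pattern itself, stripped of anything VFIO specific, looks roughly like
the sketch below. ToyStats, toy_connect() and the fail_step parameter are
hypothetical and exist only to make the unwinding order observable; real
code would close descriptors and free memory at the labels.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Counters standing in for real resources so leaks are easy to spot. */
typedef struct ToyStats {
    int fds_open;
    int containers_live;
} ToyStats;

/*
 * Acquire two resources in order.  fail_step selects which step fails
 * (0 = none); each failure jumps to the label that releases exactly
 * what has been acquired so far.
 */
static int toy_connect(ToyStats *st, int fail_step)
{
    int ret;

    assert(st != NULL);

    st->fds_open++;                 /* "open" the container fd */
    if (fail_step == 1) {
        ret = -EINVAL;
        goto close_fd_exit;
    }

    st->containers_live++;          /* "allocate" the container */
    if (fail_step == 2) {
        ret = -EIO;
        goto free_container_exit;
    }

    return 0;

free_container_exit:
    st->containers_live--;

close_fd_exit:
    st->fds_open--;

    return ret;
}
```

The labels are ordered so that falling through from a later label releases
everything acquired before it, which keeps each error site down to setting
ret and a single goto.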

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/misc/vfio.c | 60 ++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 33 insertions(+), 27 deletions(-)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 029a100..6a04c2a 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -3307,8 +3307,8 @@ static int vfio_connect_container(VFIOGroup *group)
     if (ret != VFIO_API_VERSION) {
         error_report("vfio: supported vfio version: %d, "
                      "reported version: %d", VFIO_API_VERSION, ret);
-        close(fd);
-        return -EINVAL;
+        ret = -EINVAL;
+        goto close_fd_exit;
     }
 
     container = g_malloc0(sizeof(*container));
@@ -3318,17 +3318,15 @@ static int vfio_connect_container(VFIOGroup *group)
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
             error_report("vfio: failed to set group container: %m");
-            g_free(container);
-            close(fd);
-            return -errno;
+            ret = -errno;
+            goto free_container_exit;
         }
 
         ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
-            g_free(container);
-            close(fd);
-            return -errno;
+            ret = -errno;
+            goto free_container_exit;
         }
 
         container->iommu_data.type1.listener = vfio_memory_listener;
@@ -3339,20 +3337,15 @@ static int vfio_connect_container(VFIOGroup *group)
 
         if (container->iommu_data.type1.error) {
             ret = container->iommu_data.type1.error;
-            vfio_listener_release(container);
-            g_free(container);
-            close(fd);
-            error_report("vfio: memory listener initialization failed for container\n");
-            return ret;
+            goto listener_release_exit;
         }
 
         container->iommu_data.type1.initialized = true;
 
     } else {
         error_report("vfio: No available IOMMU models");
-        g_free(container);
-        close(fd);
-        return -EINVAL;
+        ret = -EINVAL;
+        goto free_container_exit;
     }
 
     QLIST_INIT(&container->group_list);
@@ -3362,6 +3355,18 @@ static int vfio_connect_container(VFIOGroup *group)
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
 
     return 0;
+
+listener_release_exit:
+    vfio_listener_release(container);
+    error_report("vfio: memory listener initialization failed for container");
+
+free_container_exit:
+    g_free(container);
+
+close_fd_exit:
+    close(fd);
+
+    return ret;
 }
 
 static void vfio_disconnect_container(VFIOGroup *group)
@@ -3405,24 +3410,19 @@ static VFIOGroup *vfio_get_group(int groupid)
     group->fd = qemu_open(path, O_RDWR);
     if (group->fd < 0) {
         error_report("vfio: error opening %s: %m", path);
-        g_free(group);
-        return NULL;
+        goto free_group_exit;
     }
 
     if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
         error_report("vfio: error getting group status: %m");
-        close(group->fd);
-        g_free(group);
-        return NULL;
+        goto close_fd_exit;
     }
 
     if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
         error_report("vfio: error, group %d is not viable, please ensure "
                      "all devices within the iommu_group are bound to their "
                      "vfio bus driver.", groupid);
-        close(group->fd);
-        g_free(group);
-        return NULL;
+        goto close_fd_exit;
     }
 
     group->groupid = groupid;
@@ -3430,9 +3430,7 @@ static VFIOGroup *vfio_get_group(int groupid)
 
     if (vfio_connect_container(group)) {
         error_report("vfio: failed to setup container for group %d", groupid);
-        close(group->fd);
-        g_free(group);
-        return NULL;
+        goto close_fd_exit;
     }
 
     if (QLIST_EMPTY(&group_list)) {
@@ -3444,6 +3442,14 @@ static VFIOGroup *vfio_get_group(int groupid)
     vfio_kvm_device_add_group(group);
 
     return group;
+
+close_fd_exit:
+    close(group->fd);
+
+free_group_exit:
+    g_free(group);
+
+    return NULL;
 }
 
 static void vfio_put_group(VFIOGroup *group)
-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v5 05/11] vfio: Introduce VFIO address spaces
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 04/11] vfio: rework to have error paths Alexey Kardashevskiy
@ 2014-03-12  5:52 ` Alexey Kardashevskiy
  2014-03-19 19:57   ` Alex Williamson
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 06/11] vfio: Create VFIOAddressSpace objects as needed Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, Alexander Graf, qemu-ppc, qemu-devel, David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

The only model so far supported for VFIO passthrough devices is the model
usually used on x86, where all of the guest's RAM is mapped into the
(host) IOMMU and there is no IOMMU visible in the guest.

This patch begins to relax this model, introducing the notion of a
VFIOAddressSpace.  This represents a logical DMA address space which will
be visible to one or more VFIO devices by appropriate mapping in the (host)
IOMMU.  Thus the currently global list of containers becomes local to
a VFIOAddressSpace, and we verify that we don't attempt to add a VFIO
group to multiple address spaces.

For now, only one VFIOAddressSpace is created and used, corresponding to
main system memory; that will change in future patches.
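
A simplified picture of the resulting ownership chain (group, container,
address space) and of the "one group, one address space" check is sketched
below. All Toy* types and toy_lookup_group() are hypothetical; the sketch
ignores file descriptors and ioctls entirely.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyAS { const char *name; } ToyAS;

/* A logical DMA address space, wrapping one ToyAS. */
typedef struct ToySpace { ToyAS *as; } ToySpace;

/* A container belongs to exactly one space. */
typedef struct ToyContainer { ToySpace *space; } ToyContainer;

/* A group is attached to exactly one container. */
typedef struct ToyGroup {
    int groupid;
    ToyContainer *container;
    struct ToyGroup *next;
} ToyGroup;

enum { TOY_GROUP_NEW, TOY_GROUP_REUSE, TOY_GROUP_CONFLICT };

/*
 * Look up 'groupid'.  A known group may only be reused when it already
 * lives in the requested address space; otherwise report a conflict.
 */
static int toy_lookup_group(ToyGroup *list, int groupid, ToyAS *as,
                            ToyGroup **out)
{
    ToyGroup *g;

    assert(out != NULL);
    *out = NULL;

    for (g = list; g; g = g->next) {
        if (g->groupid != groupid) {
            continue;
        }
        if (g->container->space->as == as) {
            *out = g;
            return TOY_GROUP_REUSE;
        }
        return TOY_GROUP_CONFLICT;
    }
    return TOY_GROUP_NEW;
}
```

The conflict case corresponds to the "group used in multiple address
spaces" error: a group that is already mapped through one address space
cannot be handed out again for a different one.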

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

---
Changes:
v5:
* vfio_get_group() now receives AddressSpace*

v4:
* removed redundant checks and asserts
* fixed some return error codes
---
 hw/misc/vfio.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 40 insertions(+), 13 deletions(-)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 6a04c2a..c8236c3 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -133,6 +133,22 @@ enum {
     VFIO_INT_MSIX = 3,
 };
 
+typedef struct VFIOAddressSpace {
+    AddressSpace *as;
+    QLIST_HEAD(, VFIOContainer) containers;
+    QLIST_ENTRY(VFIOAddressSpace) list;
+} VFIOAddressSpace;
+
+QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
+
+static VFIOAddressSpace vfio_address_space_memory;
+
+static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as)
+{
+    space->as = as;
+    QLIST_INIT(&space->containers);
+}
+
 struct VFIOGroup;
 
 typedef struct VFIOType1 {
@@ -142,6 +158,7 @@ typedef struct VFIOType1 {
 } VFIOType1;
 
 typedef struct VFIOContainer {
+    VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     struct {
         /* enable abstraction to support various iommu backends */
@@ -234,9 +251,6 @@ static const VFIORomBlacklistEntry romblacklist[] = {
 
 #define MSIX_CAP_LENGTH 12
 
-static QLIST_HEAD(, VFIOContainer)
-    container_list = QLIST_HEAD_INITIALIZER(container_list);
-
 static QLIST_HEAD(, VFIOGroup)
     group_list = QLIST_HEAD_INITIALIZER(group_list);
 
@@ -3280,16 +3294,15 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
 #endif
 }
 
-static int vfio_connect_container(VFIOGroup *group)
+static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
 {
     VFIOContainer *container;
     int ret, fd;
+    VFIOAddressSpace *space;
 
-    if (group->container) {
-        return 0;
-    }
+    space = &vfio_address_space_memory;
 
-    QLIST_FOREACH(container, &container_list, next) {
+    QLIST_FOREACH(container, &space->containers, next) {
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
             group->container = container;
             QLIST_INSERT_HEAD(&container->group_list, group, container_next);
@@ -3312,6 +3325,7 @@ static int vfio_connect_container(VFIOGroup *group)
     }
 
     container = g_malloc0(sizeof(*container));
+    container->space = space;
     container->fd = fd;
 
     if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
@@ -3349,7 +3363,7 @@ static int vfio_connect_container(VFIOGroup *group)
     }
 
     QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&container_list, container, next);
+    QLIST_INSERT_HEAD(&space->containers, container, next);
 
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
@@ -3392,7 +3406,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 }
 
-static VFIOGroup *vfio_get_group(int groupid)
+static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as)
 {
     VFIOGroup *group;
     char path[32];
@@ -3400,7 +3414,14 @@ static VFIOGroup *vfio_get_group(int groupid)
 
     QLIST_FOREACH(group, &group_list, next) {
         if (group->groupid == groupid) {
-            return group;
+            /* Found it.  Now is it already in the right context? */
+            if (group->container->space->as == as) {
+                return group;
+            } else {
+                error_report("vfio: group %d used in multiple address spaces",
+                             group->groupid);
+                return NULL;
+            }
         }
     }
 
@@ -3428,7 +3449,7 @@ static VFIOGroup *vfio_get_group(int groupid)
     group->groupid = groupid;
     QLIST_INIT(&group->device_list);
 
-    if (vfio_connect_container(group)) {
+    if (vfio_connect_container(group, as)) {
         error_report("vfio: failed to setup container for group %d", groupid);
         goto close_fd_exit;
     }
@@ -3780,7 +3801,12 @@ static int vfio_initfn(PCIDevice *pdev)
     DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
             vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
 
-    group = vfio_get_group(groupid);
+    if (pci_device_iommu_address_space(pdev) != &address_space_memory) {
+        error_report("vfio: DMA address space must be system memory");
+        return -EINVAL;
+    }
+
+    group = vfio_get_group(groupid, &address_space_memory);
     if (!group) {
         error_report("vfio: failed to get group %d", groupid);
         return -ENOENT;
@@ -3994,6 +4020,7 @@ static const TypeInfo vfio_pci_dev_info = {
 
 static void register_vfio_pci_dev_type(void)
 {
+    vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
     type_register_static(&vfio_pci_dev_info);
 }
 
-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v5 06/11] vfio: Create VFIOAddressSpace objects as needed
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 05/11] vfio: Introduce VFIO address spaces Alexey Kardashevskiy
@ 2014-03-12  5:52 ` Alexey Kardashevskiy
  2014-03-19 19:57   ` Alex Williamson
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, Alexander Graf, qemu-ppc, qemu-devel, David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

So far, VFIO has a notion of different logical DMA address spaces, but
only ever uses one (system memory).  This patch extends this, creating
new VFIOAddressSpace objects as necessary, according to the AddressSpace
reported by the PCI subsystem for this device's DMAs.

This isn't enough yet to support guest side IOMMUs with VFIO, but it does
mean we could now support VFIO devices on, for example, a guest side PCI
host bridge which maps system memory at somewhere other than 0 in PCI
space.
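
The lifetime rule, "look up or create on get, free on put once no
containers remain", can be sketched as below. This is a hypothetical
standalone version: ToySpace uses a plain counter in place of the
containers list, and the toy_* names are assumptions for the example.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ToyAS { const char *name; } ToyAS;

typedef struct ToySpace {
    ToyAS *as;
    int ncontainers;            /* stands in for the containers list */
    struct ToySpace *next;
} ToySpace;

static ToySpace *toy_spaces;

/* Return the space wrapping 'as', creating it on first use. */
static ToySpace *toy_get_space(ToyAS *as)
{
    ToySpace *space;

    for (space = toy_spaces; space; space = space->next) {
        if (space->as == as) {
            return space;
        }
    }

    space = calloc(1, sizeof(*space));
    if (!space) {
        abort();                /* mimic an allocator that never fails */
    }
    space->as = as;
    space->next = toy_spaces;
    toy_spaces = space;
    return space;
}

/* Drop the space, but only once nothing uses it any more. */
static void toy_put_space(ToySpace *space)
{
    ToySpace **p;

    assert(space != NULL);
    if (space->ncontainers > 0) {
        return;
    }
    for (p = &toy_spaces; *p; p = &(*p)->next) {
        if (*p == space) {
            *p = space->next;
            break;
        }
    }
    free(space);
}

static int toy_space_count(void)
{
    int n = 0;
    ToySpace *space;

    for (space = toy_spaces; space; space = space->next) {
        n++;
    }
    return n;
}
```

Because put is a no-op while containers remain, it is safe to call on every
error path and on every container teardown without extra bookkeeping at the
call sites.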

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v6:
* vfio_get_address_space() moved to vfio_connect_container()

v5:
* vfio_get_group() now takes AddressSpace* instead of VFIOAddressSpace
---
 hw/misc/vfio.c | 58 +++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 17 deletions(-)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index c8236c3..038010b 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -141,14 +141,6 @@ typedef struct VFIOAddressSpace {
 
 QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
 
-static VFIOAddressSpace vfio_address_space_memory;
-
-static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as)
-{
-    space->as = as;
-    QLIST_INIT(&space->containers);
-}
-
 struct VFIOGroup;
 
 typedef struct VFIOType1 {
@@ -3294,13 +3286,43 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
 #endif
 }
 
+static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
+{
+    VFIOAddressSpace *space;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        if (space->as == as) {
+            return space;
+        }
+    }
+
+    /* No suitable VFIOAddressSpace, create a new one */
+    space = g_malloc0(sizeof(*space));
+    space->as = as;
+    QLIST_INIT(&space->containers);
+
+    QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
+
+    return space;
+}
+
+static void vfio_put_address_space(VFIOAddressSpace *space)
+{
+    if (!QLIST_EMPTY(&space->containers)) {
+        return;
+    }
+
+    QLIST_REMOVE(space, list);
+    g_free(space);
+}
+
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
 {
     VFIOContainer *container;
     int ret, fd;
     VFIOAddressSpace *space;
 
-    space = &vfio_address_space_memory;
+    space = vfio_get_address_space(as);
 
     QLIST_FOREACH(container, &space->containers, next) {
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
@@ -3313,7 +3335,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
     fd = qemu_open("/dev/vfio/vfio", O_RDWR);
     if (fd < 0) {
         error_report("vfio: failed to open /dev/vfio/vfio: %m");
-        return -errno;
+        ret = -errno;
+        goto put_space_exit;
     }
 
     ret = ioctl(fd, VFIO_GET_API_VERSION);
@@ -3380,6 +3403,9 @@ free_container_exit:
 close_fd_exit:
     close(fd);
 
+put_space_exit:
+    vfio_put_address_space(space);
+
     return ret;
 }
 
@@ -3396,6 +3422,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
     group->container = NULL;
 
     if (QLIST_EMPTY(&container->group_list)) {
+        VFIOAddressSpace *space = container->space;
+
         if (container->iommu_data.release) {
             container->iommu_data.release(container);
         }
@@ -3403,6 +3431,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
         DPRINTF("vfio_disconnect_container: close container->fd\n");
         close(container->fd);
         g_free(container);
+
+        vfio_put_address_space(space);
     }
 }
 
@@ -3801,12 +3831,7 @@ static int vfio_initfn(PCIDevice *pdev)
     DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
             vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
 
-    if (pci_device_iommu_address_space(pdev) != &address_space_memory) {
-        error_report("vfio: DMA address space must be system memory");
-        return -EINVAL;
-    }
-
-    group = vfio_get_group(groupid, &address_space_memory);
+    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
     if (!group) {
         error_report("vfio: failed to get group %d", groupid);
         return -ENOENT;
@@ -4020,7 +4045,6 @@ static const TypeInfo vfio_pci_dev_info = {
 
 static void register_vfio_pci_dev_type(void)
 {
-    vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
     type_register_static(&vfio_pci_dev_info);
 }
 
-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 06/11] vfio: Create VFIOAddressSpace objects as needed Alexey Kardashevskiy
@ 2014-03-12  5:52 ` Alexey Kardashevskiy
  2014-03-19 19:57   ` Alex Williamson
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 08/11] spapr-iommu: add SPAPR VFIO IOMMU device Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, Alexander Graf, qemu-ppc, qemu-devel, David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

This patch uses the new IOMMU notifiers to allow VFIO passthrough devices
to work with guest side IOMMUs, as long as the host-side VFIO IOMMU has
sufficient capability and granularity to match the guest side. This works
by tracking all map and unmap operations on the guest IOMMU using the
notifiers, and mirroring them into VFIO.
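
The mirroring idea can be sketched without any QEMU or kernel interfaces.
Everything below is hypothetical: ToyHostIOMMU is a small in-memory table
standing in for the host VFIO container, ToyTLBEntry loosely follows the
fields used in the diff, and the permission values are assumptions chosen
so that "no permission" means unmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

enum { TOY_PERM_NONE = 0, TOY_PERM_RO = 1, TOY_PERM_WO = 2, TOY_PERM_RW = 3 };

typedef struct ToyTLBEntry {
    uint64_t iova;
    uint64_t translated_addr;
    uint64_t addr_mask;         /* size - 1 */
    int perm;
} ToyTLBEntry;

typedef struct ToyNotifier {
    void (*notify)(struct ToyNotifier *n, void *data);
} ToyNotifier;

#define TOY_MAX_MAPPINGS 8

typedef struct ToyMapping {
    uint64_t iova, size, target;
    int readonly;
    int used;
} ToyMapping;

typedef struct ToyHostIOMMU {
    ToyMapping maps[TOY_MAX_MAPPINGS];
} ToyHostIOMMU;

/* One guest IOMMU being mirrored into one host table. */
typedef struct ToyGuestIOMMU {
    ToyHostIOMMU *host;
    ToyNotifier n;
} ToyGuestIOMMU;

static int toy_host_map(ToyHostIOMMU *h, uint64_t iova, uint64_t size,
                        uint64_t target, int readonly)
{
    int i;

    for (i = 0; i < TOY_MAX_MAPPINGS; i++) {
        if (!h->maps[i].used) {
            h->maps[i] = (ToyMapping) { iova, size, target, readonly, 1 };
            return 0;
        }
    }
    return -1;                  /* table full */
}

static int toy_host_unmap(ToyHostIOMMU *h, uint64_t iova, uint64_t size)
{
    int i;

    for (i = 0; i < TOY_MAX_MAPPINGS; i++) {
        if (h->maps[i].used && h->maps[i].iova == iova &&
            h->maps[i].size == size) {
            h->maps[i].used = 0;
            return 0;
        }
    }
    return -1;                  /* nothing to unmap */
}

static int toy_host_count(const ToyHostIOMMU *h)
{
    int i, n = 0;

    for (i = 0; i < TOY_MAX_MAPPINGS; i++) {
        n += h->maps[i].used;
    }
    return n;
}

/* Called for every guest map/unmap; replays it on the host table. */
static void toy_iommu_notify(ToyNotifier *n, void *data)
{
    ToyGuestIOMMU *g = toy_container_of(n, ToyGuestIOMMU, n);
    ToyTLBEntry *e = data;
    uint64_t size = e->addr_mask + 1;

    assert(g->host != NULL);
    if (e->perm != TOY_PERM_NONE) {
        toy_host_map(g->host, e->iova, size, e->translated_addr,
                     !(e->perm & TOY_PERM_WO));
    } else {
        toy_host_unmap(g->host, e->iova, size);
    }
}
```

Embedding the notifier in the per-IOMMU structure lets the callback recover
its context with container_of-style pointer arithmetic, so no extra lookup
is needed to find which host table a guest event belongs to.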

There are a number of FIXMEs, and the scheme involves rather more notifier
structures than I'd like, but it should make for a reasonable proof of
concept.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

---
Changes:
v4:
* fixed list objects naming
* vfio_listener_region_add() reworked to call memory_region_ref() from one
place only, it is also easier to review the changes
* fixes boundary check not to fail on sections == 2^64 bytes,
the "vfio: Fix debug output for int128 values" patch is required;
this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
for 128 bit memory section sizes" patch proposal
---
 hw/misc/vfio.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 120 insertions(+), 6 deletions(-)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 038010b..4f6f5da 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -159,10 +159,18 @@ typedef struct VFIOContainer {
         };
         void (*release)(struct VFIOContainer *);
     } iommu_data;
+    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOGroup) group_list;
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
 
+typedef struct VFIOGuestIOMMU {
+    VFIOContainer *container;
+    MemoryRegion *iommu;
+    Notifier n;
+    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
+} VFIOGuestIOMMU;
+
 /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
 typedef struct VFIOMSIXInfo {
     uint8_t table_bar;
@@ -2241,8 +2249,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
 
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
-    return !memory_region_is_ram(section->mr) ||
-           /*
+    return (!memory_region_is_ram(section->mr) &&
+            !memory_region_is_iommu(section->mr)) ||
+        /*
             * Sizing an enabled 64-bit BAR can cause spurious mappings to
             * addresses in the upper part of the 64-bit address space.  These
             * are never accessed by the CPU and beyond the address width of
@@ -2251,6 +2260,61 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
            section->offset_within_address_space & (1ULL << 63);
 }
 
+static void vfio_iommu_map_notify(Notifier *n, void *data)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    VFIOContainer *container = giommu->container;
+    IOMMUTLBEntry *iotlb = data;
+    MemoryRegion *mr;
+    hwaddr xlat;
+    hwaddr len = iotlb->addr_mask + 1;
+    void *vaddr;
+    int ret;
+
+    DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
+            iotlb->iova, iotlb->iova + iotlb->addr_mask);
+
+    /*
+     * The IOMMU TLB entry we have just covers translation through
+     * this IOMMU to its immediate target.  We need to translate
+     * it the rest of the way through to memory.
+     */
+    mr = address_space_translate(&address_space_memory,
+                                 iotlb->translated_addr,
+                                 &xlat, &len, iotlb->perm & IOMMU_WO);
+    if (!memory_region_is_ram(mr)) {
+        DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
+                xlat);
+        return;
+    }
+    if (len & iotlb->addr_mask) {
+        DPRINTF("iommu has granularity incompatible with target AS\n");
+        return;
+    }
+
+    vaddr = memory_region_get_ram_ptr(mr) + xlat;
+
+    if (iotlb->perm != IOMMU_NONE) {
+        ret = vfio_dma_map(container, iotlb->iova,
+                           iotlb->addr_mask + 1, vaddr,
+                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
+        if (ret) {
+            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                         container, iotlb->iova,
+                         iotlb->addr_mask + 1, vaddr, ret);
+        }
+    } else {
+        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
+        if (ret) {
+            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx") = %d (%m)",
+                         container, iotlb->iova,
+                         iotlb->addr_mask + 1, ret);
+        }
+    }
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -2261,8 +2325,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
     void *vaddr;
     int ret;
 
-    assert(!memory_region_is_iommu(section->mr));
-
     if (vfio_listener_skipped_section(section)) {
         DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
                 section->offset_within_address_space,
@@ -2286,15 +2348,47 @@ static void vfio_listener_region_add(MemoryListener *listener,
         return;
     }
 
+    memory_region_ref(section->mr);
+
+    if (memory_region_is_iommu(section->mr)) {
+        VFIOGuestIOMMU *giommu;
+
+        DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
+                iova, int128_get64(int128_sub(llend, int128_one())));
+        /*
+         * FIXME: We should do some checking to see if the
+         * capabilities of the host VFIO IOMMU are adequate to model
+         * the guest IOMMU
+         *
+         * FIXME: This assumes that the guest IOMMU is empty of
+         * mappings at this point - we should either enforce this, or
+         * loop through existing mappings to map them into VFIO.
+         *
+         * FIXME: For VFIO iommu types which have KVM acceleration to
+         * avoid bouncing all map/unmaps through qemu this way, this
+         * would be the right place to wire that up (tell the KVM
+         * device emulation the VFIO iommu handles to use).
+         */
+        giommu = g_malloc0(sizeof(*giommu));
+        giommu->iommu = section->mr;
+        giommu->container = container;
+        giommu->n.notify = vfio_iommu_map_notify;
+        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
+        memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
+
+        return;
+    }
+
+    /* Here we assume that memory_region_is_ram(section->mr)==true */
+
     end = int128_get64(llend);
     vaddr = memory_region_get_ram_ptr(section->mr) +
             section->offset_within_region +
             (iova - section->offset_within_address_space);
 
-    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
+    DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
             iova, end - 1, vaddr);
 
-    memory_region_ref(section->mr);
     ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
     if (ret) {
         error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
@@ -2338,6 +2432,26 @@ static void vfio_listener_region_del(MemoryListener *listener,
         return;
     }
 
+    if (memory_region_is_iommu(section->mr)) {
+        VFIOGuestIOMMU *giommu;
+
+        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+            if (giommu->iommu == section->mr) {
+                memory_region_unregister_iommu_notifier(&giommu->n);
+                QLIST_REMOVE(giommu, giommu_next);
+                g_free(giommu);
+                break;
+            }
+        }
+
+        /*
+         * FIXME: We assume the one big unmap below is adequate to
+         * remove any individual page mappings in the IOMMU which
+         * might have been copied into VFIO.  That may not be true for
+         * all IOMMU types
+         */
+    }
+
     iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
     end = (section->offset_within_address_space + int128_get64(section->size)) &
           TARGET_PAGE_MASK;
-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v5 08/11] spapr-iommu: add SPAPR VFIO IOMMU device
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support Alexey Kardashevskiy
@ 2014-03-12  5:52 ` Alexey Kardashevskiy
  2014-04-03 12:17   ` Alexander Graf
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 09/11] spapr vfio: add vfio_container_spapr_get_info() Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

This adds an sPAPR VFIO IOMMU device to support DMA operations
for VFIO devices.
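
For readers following along, here is a minimal sketch of how a TCE written
by the guest could be turned into an IOMMU notification entry, in the way
put_tce_vfio() does in the diff below. The struct, the function names and
the constant values (4K TCE pages, read bit 1, write bit 2) are assumptions
made for illustration; the patch itself takes them from
include/hw/ppc/spapr.h and uses IOMMUTLBEntry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the patch uses the SPAPR_TCE_* macros. */
#define TCE_PAGE_SHIFT 12
#define TCE_PAGE_MASK  ((UINT64_C(1) << TCE_PAGE_SHIFT) - 1)
#define TCE_RO 1u /* assumed: device may read */
#define TCE_WO 2u /* assumed: device may write */

enum { PERM_NONE = 0, PERM_RO = 1, PERM_WO = 2 };

/* Hypothetical stand-in for QEMU's IOMMUTLBEntry. */
struct tlb_entry {
    uint64_t iova;            /* page-aligned bus address */
    uint64_t translated_addr; /* page-aligned guest physical address */
    uint64_t addr_mask;       /* covers exactly one TCE page */
    unsigned perm;
};

/* Split a TCE into an aligned address and permission bits. */
static struct tlb_entry tce_to_entry(uint64_t ioba, uint64_t tce)
{
    struct tlb_entry e;

    e.iova = ioba & ~TCE_PAGE_MASK;
    e.translated_addr = tce & ~TCE_PAGE_MASK;
    e.addr_mask = TCE_PAGE_MASK;
    e.perm = PERM_NONE;
    if (tce & TCE_RO) {
        e.perm |= PERM_RO;
    }
    if (tce & TCE_WO) {
        e.perm |= PERM_WO;
    }
    return e;
}

/* An entry without permissions asks the listener to unmap the page. */
static int entry_is_unmap(const struct tlb_entry *e)
{
    assert(e != NULL);
    return e->perm == PERM_NONE;
}
```

With these assumed values, a TCE of 0 yields an entry without permissions,
which is the case the VFIO notifier from patch 07 turns into
vfio_dma_unmap().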

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v7:
* fixed to adjust changes to support VFIO KVM device
---
 hw/ppc/spapr_iommu.c   | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h |  5 +++
 2 files changed, 102 insertions(+)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index a54f96f..f39cc4a 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -445,9 +445,106 @@ static TypeInfo spapr_tce_table_info = {
     .instance_finalize = spapr_tce_table_finalize,
 };
 
+/*
+ * SPAPR TCE VFIO IOMMU
+ */
+static IOMMUTLBEntry spapr_vfio_translate_iommu(MemoryRegion *iommu,
+                                                hwaddr addr)
+{
+    IOMMUTLBEntry entry;
+    /*
+     * This callback would normally be used by a QEMU device for DMA
+     * but in this case the vfio-pci device does not do any DMA.
+     * Instead, the real hardware does DMA and hardware TCE table
+     * performs the address translation.
+     */
+    assert(0);
+    return entry;
+}
+
+static MemoryRegionIOMMUOps spapr_vfio_iommu_ops = {
+    .translate = spapr_vfio_translate_iommu,
+};
+
+static int spapr_tce_table_vfio_realize(DeviceState *dev)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
+
+    memory_region_init_iommu(&tcet->iommu, NULL, &spapr_vfio_iommu_ops,
+                             "iommu-vfio-spapr", UINT64_MAX);
+
+    QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
+
+    return 0;
+}
+
+sPAPRTCETable *spapr_vfio_new_table(DeviceState *owner, uint32_t liobn)
+{
+    sPAPRTCETable *tcet;
+
+    if (spapr_tce_find_by_liobn(liobn)) {
+        fprintf(stderr, "Attempted to create TCE table with duplicate"
+                " LIOBN 0x%x\n", liobn);
+        return NULL;
+    }
+    tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE_VFIO));
+    tcet->liobn = liobn;
+    object_property_add_child(OBJECT(owner), "tce-table", OBJECT(tcet), NULL);
+
+    object_property_set_bool(OBJECT(tcet), true, "realized", NULL);
+
+    return tcet;
+}
+
+static target_ulong put_tce_vfio(sPAPRTCETable *tcet, target_ulong ioba,
+                                 target_ulong tce)
+{
+    IOMMUTLBEntry entry;
+
+    entry.iova = ioba & ~SPAPR_TCE_PAGE_MASK;
+    entry.translated_addr = tce & ~SPAPR_TCE_PAGE_MASK;
+    entry.addr_mask = SPAPR_TCE_PAGE_MASK;
+    entry.perm = 0;
+    if ((tce & SPAPR_TCE_RO) == SPAPR_TCE_RO) {
+        entry.perm |= IOMMU_RO;
+    }
+    if ((tce & SPAPR_TCE_WO) == SPAPR_TCE_WO) {
+        entry.perm |= IOMMU_WO;
+    }
+    memory_region_notify_iommu(&tcet->iommu, entry);
+
+    return H_SUCCESS;
+}
+
+static void spapr_tce_table_vfio_finalize(Object *obj)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(obj);
+
+    QLIST_REMOVE(tcet, list);
+}
+
+static void spapr_tce_table_vfio_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    sPAPRTCETableClass *stc = SPAPR_TCE_TABLE_CLASS(klass);
+
+    dc->init = spapr_tce_table_vfio_realize;
+    stc->put_tce = put_tce_vfio;
+}
+
+static TypeInfo spapr_tce_table_vfio_info = {
+    .name = TYPE_SPAPR_TCE_TABLE_VFIO,
+    .parent = TYPE_SPAPR_TCE_TABLE,
+    .instance_size = sizeof(sPAPRTCETable),
+    .class_init = spapr_tce_table_vfio_class_init,
+    .class_size = sizeof(sPAPRTCETableClass),
+    .instance_finalize = spapr_tce_table_vfio_finalize,
+};
+
 static void register_types(void)
 {
     type_register_static(&spapr_tce_table_info);
+    type_register_static(&spapr_tce_table_vfio_info);
 }
 
 type_init(register_types);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index ebcef7f..ceda354 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -383,6 +383,10 @@ typedef struct sPAPRTCETable sPAPRTCETable;
 #define SPAPR_TCE_TABLE(obj) \
     OBJECT_CHECK(sPAPRTCETable, (obj), TYPE_SPAPR_TCE_TABLE)
 
+#define TYPE_SPAPR_TCE_TABLE_VFIO "spapr-tce-table-vfio"
+#define SPAPR_TCE_TABLE_VFIO(obj) \
+    OBJECT_CHECK(sPAPRTCETable, (obj), TYPE_SPAPR_TCE_TABLE_VFIO)
+
 #define SPAPR_TCE_TABLE_CLASS(klass) \
      OBJECT_CLASS_CHECK(sPAPRTCETableClass, (klass), TYPE_SPAPR_TCE_TABLE)
 #define SPAPR_TCE_TABLE_GET_CLASS(obj) \
@@ -411,6 +415,7 @@ void spapr_events_init(sPAPREnvironment *spapr);
 void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
 sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
                                    size_t window_size);
+sPAPRTCETable *spapr_vfio_new_table(DeviceState *owner, uint32_t liobn);
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
 void spapr_tce_set_bypass(sPAPRTCETable *tcet, bool bypass);
 int spapr_dma_dt(void *fdt, int node_off, const char *propname,
-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v5 09/11] spapr vfio: add vfio_container_spapr_get_info()
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 08/11] spapr-iommu: add SPAPR VFIO IOMMU device Alexey Kardashevskiy
@ 2014-03-12  5:52 ` Alexey Kardashevskiy
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

As the sPAPR platform supports DMA windows on a PCI bus, the information
about their location and size should be passed to the guest via
the device tree.

The patch adds a helper to read this info from the container fd.
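
As a rough illustration of the query itself, the sketch below issues the
VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctl on a container fd. The helper name and
the 0 / -errno return convention are assumptions of this sketch, not what
the patch does (the helper in the diff below returns -1 on failure); it
also assumes a Linux host with <linux/vfio.h> available.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: read the 32-bit DMA window description from an
 * sPAPR TCE container. Returns 0 on success or a negative errno value.
 */
static int spapr_query_dma_window(int container_fd,
                                  struct vfio_iommu_spapr_tce_info *info)
{
    assert(info != NULL);

    /* The kernel checks argsz to learn how much of the struct we know. */
    memset(info, 0, sizeof(*info));
    info->argsz = sizeof(*info);

    if (ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info) < 0) {
        return -errno;
    }
    /* On success dma32_window_start/dma32_window_size are filled in. */
    return 0;
}
```

Returning a negative errno value is one possible convention; it would let a
caller report the actual cause of the failure.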

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v6:
* added dup() to protect group_fd from accidental disposal

v5:
* reworked to reflect the change to vfio_get_group() made in one
of the previous patches

v4:
* fixed possible leaks on error paths
---
 hw/misc/vfio.c         | 36 ++++++++++++++++++++++++++++++++++++
 include/hw/misc/vfio.h | 11 +++++++++++
 2 files changed, 47 insertions(+)
 create mode 100644 include/hw/misc/vfio.h

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 4f6f5da..6dee090 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -39,6 +39,7 @@
 #include "qemu/range.h"
 #include "sysemu/kvm.h"
 #include "sysemu/sysemu.h"
+#include "hw/misc/vfio.h"
 
 /* #define DEBUG_VFIO */
 #ifdef DEBUG_VFIO
@@ -4163,3 +4164,38 @@ static void register_vfio_pci_dev_type(void)
 }
 
 type_init(register_vfio_pci_dev_type)
+
+int vfio_container_spapr_get_info(AddressSpace *as, uint64_t liobn,
+                                  int32_t groupid,
+                                  struct vfio_iommu_spapr_tce_info *info)
+{
+    VFIOGroup *group;
+    VFIOContainer *container;
+    int ret, fd;
+
+    group = vfio_get_group(groupid, as);
+    if (!group) {
+        return -1;
+    }
+    container = group->container;
+    if (!group->container) {
+        goto put_group_exit;
+    }
+    fd = container->fd;
+    if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+        goto put_group_exit;
+    }
+    ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
+    if (ret) {
+        error_report("vfio: failed to get iommu info for container: %s",
+                     strerror(errno));
+        goto put_group_exit;
+    }
+
+    return 0;
+
+put_group_exit:
+    vfio_put_group(group);
+
+    return -1;
+}
diff --git a/include/hw/misc/vfio.h b/include/hw/misc/vfio.h
new file mode 100644
index 0000000..ec59989
--- /dev/null
+++ b/include/hw/misc/vfio.h
@@ -0,0 +1,11 @@
+#ifndef VFIO_API_H
+#define VFIO_API_H
+
+#include "qemu/typedefs.h"
+#include <linux/vfio.h>
+
+extern int vfio_container_spapr_get_info(AddressSpace *as, uint64_t liobn,
+                                         int32_t groupid,
+                                         struct vfio_iommu_spapr_tce_info *info);
+
+#endif
-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 09/11] spapr vfio: add vfio_container_spapr_get_info() Alexey Kardashevskiy
@ 2014-03-12  5:52 ` Alexey Kardashevskiy
  2014-03-13  8:12   ` [Qemu-devel] [PATCH v6] " Alexey Kardashevskiy
  2014-03-19 19:57   ` [Qemu-devel] [PATCH v5 10/11] " Alex Williamson
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 11/11] spapr-vfio: enable for spapr Alexey Kardashevskiy
  2014-03-19 20:12 ` [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alex Williamson
  11 siblings, 2 replies; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

The patch adds a spapr-pci-vfio-host-bridge device type
which is a PCI Host Bridge with VFIO support. The new device
inherits from the spapr-pci-host-bridge device and adds
the following properties:
	iommu - IOMMU group ID which represents a Partitionable
		Endpoint. QEMU/ppc64 uses a separate PHB for each
		IOMMU group, so the guest kernel has to have
		PCI Domain support enabled;
	forceaddr (optional, 0 by default) - forces QEMU to copy
		device:function from the host address, as
		certain guest drivers expect devices to appear in
		particular locations;
	mf (optional, 0 by default) - forces the multifunction bit for
		function #0 of a found device; only makes sense
		for multifunction devices and only with the forceaddr
		property set. It would not be required if there
		were a way to know in advance whether a device is
		multifunctional or not;
	scan (optional, 1 by default) - if non-zero, the new PHB walks
		through all non-bridge devices in the group and tries
		adding them to the PHB; if zero, all devices in the group
		have to be configured manually via the QEMU command line.

Examples of use:
1) Scan and add all devices from IOMMU group with ID=1 to QEMU's PHB #6:
	-device spapr-pci-vfio-host-bridge,id=DEVICENAME,iommu=1,index=6

2) Configure and add 3 functions of a multifunction device to QEMU
(the NEC PCI USB card is used as an example here):
	-device spapr-pci-vfio-host-bridge,id=USB,iommu=4,scan=0,index=7 \
	-device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true \
	-device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB \
	-device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB
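
To make the forceaddr behaviour concrete, here is a small sketch of how a
directory entry name from /sys/kernel/iommu_groups/N/devices/ can be split
into its PCI address parts and turned into the guest "addr" value, along
the lines of what the scan loop in the diff below does with sscanf() and
snprintf(). The helper name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse a host PCI address "DDDD:BB:DD.F" and write
 * the "device.function" string used for the guest "addr" property.
 * Returns 0 on success, -1 if the name is not a PCI address.
 */
static int host_to_guest_addr(const char *name, char *addr, size_t len)
{
    unsigned domain, bus, dev, fn;

    assert(name != NULL && addr != NULL);

    /* Entries such as "." and ".." fail to match and are skipped. */
    if (sscanf(name, "%x:%x:%x.%x", &domain, &bus, &dev, &fn) != 4) {
        return -1;
    }
    /* Domain and bus are dropped: the device sits on the new PHB's bus. */
    snprintf(addr, len, "%x.%x", dev, fn);
    return 0;
}
```

For instance, the host address 0004:00:01.2 would give the guest address
"1.2", matching the manual configuration in example 2 above.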

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v5:
* added handling of possible failure of spapr_vfio_new_table()

v4:
* moved IOMMU changes to separate patches
* moved spapr-pci-vfio-host-bridge to new file
---
 hw/ppc/Makefile.objs        |   2 +-
 hw/ppc/spapr_pci_vfio.c     | 206 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |  13 +++
 3 files changed, 220 insertions(+), 1 deletion(-)
 create mode 100644 hw/ppc/spapr_pci_vfio.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index ea747f0..2239192 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -3,7 +3,7 @@ obj-y += ppc.o ppc_booke.o
 # IBM pSeries (sPAPR)
 obj-$(CONFIG_PSERIES) += spapr.o spapr_vio.o spapr_events.o
 obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
-obj-$(CONFIG_PSERIES) += spapr_pci.o
+obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_pci_vfio.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
new file mode 100644
index 0000000..40f6673
--- /dev/null
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -0,0 +1,206 @@
+/*
+ * QEMU sPAPR PCI host for VFIO
+ *
+ * Copyright (c) 2011 Alexey Kardashevskiy, IBM Corporation.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#include <sys/types.h>
+#include <dirent.h>
+
+#include "hw/hw.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "hw/misc/vfio.h"
+#include "hw/pci/pci_bus.h"
+#include "trace.h"
+#include "qemu/error-report.h"
+
+/* sPAPR VFIO */
+static Property spapr_phb_vfio_properties[] = {
+    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
+    DEFINE_PROP_UINT8("scan", sPAPRPHBVFIOState, scan, 1),
+    DEFINE_PROP_UINT8("mf", sPAPRPHBVFIOState, enable_multifunction, 0),
+    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBVFIOState, force_addr, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void spapr_pci_vfio_scan(sPAPRPHBVFIOState *svphb, Error **errp)
+{
+    PCIHostState *phb = PCI_HOST_BRIDGE(svphb);
+    char *iommupath;
+    DIR *dirp;
+    struct dirent *entry;
+    Error *error = NULL;
+
+    if (!svphb->scan) {
+        trace_spapr_pci("autoscan disabled for ", svphb->phb.dtbusname);
+        return;
+    }
+
+    iommupath = g_strdup_printf("/sys/kernel/iommu_groups/%d/devices/",
+                                svphb->iommugroupid);
+    if (!iommupath) {
+        return;
+    }
+
+    dirp = opendir(iommupath);
+    if (!dirp) {
+        error_report("spapr-vfio: vfio scan failed on opendir: %m");
+        g_free(iommupath);
+        return;
+    }
+
+    while ((entry = readdir(dirp)) != NULL) {
+        Error *err = NULL;
+        char *tmp;
+        FILE *deviceclassfile;
+        unsigned deviceclass = 0, domainid, busid, devid, fnid;
+        char addr[32];
+        DeviceState *dev;
+
+        if (sscanf(entry->d_name, "%X:%X:%X.%x",
+                   &domainid, &busid, &devid, &fnid) != 4) {
+            continue;
+        }
+
+        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
+        trace_spapr_pci("Reading device class from ", tmp);
+
+        deviceclassfile = fopen(tmp, "r");
+        if (deviceclassfile) {
+            int ret = fscanf(deviceclassfile, "%x", &deviceclass);
+            fclose(deviceclassfile);
+            if (ret != 1) {
+                continue;
+            }
+        }
+        g_free(tmp);
+
+        if (!deviceclass) {
+            continue;
+        }
+        if ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8)) {
+            /* Skip bridges */
+            continue;
+        }
+        trace_spapr_pci("Creating device from ", entry->d_name);
+
+        dev = qdev_create(&phb->bus->qbus, "vfio-pci");
+        if (!dev) {
+            trace_spapr_pci("failed to create vfio-pci", entry->d_name);
+            continue;
+        }
+        object_property_parse(OBJECT(dev), entry->d_name, "host", &err);
+        if (err != NULL) {
+            object_unref(OBJECT(dev));
+            continue;
+        }
+        if (svphb->force_addr) {
+            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
+            err = NULL;
+            object_property_parse(OBJECT(dev), addr, "addr", &err);
+            if (err != NULL) {
+                object_unref(OBJECT(dev));
+                continue;
+            }
+        }
+        if (svphb->enable_multifunction) {
+            qdev_prop_set_bit(dev, "multifunction", 1);
+        }
+        object_property_set_bool(OBJECT(dev), true, "realized", &error);
+        if (error) {
+            object_unref(OBJECT(dev));
+            error_propagate(errp, error);
+            break;
+        }
+    }
+    closedir(dirp);
+    g_free(iommupath);
+}
+
+static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
+{
+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
+    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
+    int ret;
+    Error *error = NULL;
+
+    if (svphb->iommugroupid == -1) {
+        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
+        return;
+    }
+
+    svphb->phb.tcet = spapr_vfio_new_table(DEVICE(sphb), svphb->phb.dma_liobn);
+
+    if (!svphb->phb.tcet) {
+        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
+        return;
+    }
+
+    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
+                       sphb->dtbusname);
+
+    ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
+                                        sphb->dma_liobn, svphb->iommugroupid,
+                                        &info);
+    if (ret) {
+        error_setg_errno(errp, -ret,
+                         "spapr-vfio: get info from container failed");
+        return;
+    }
+
+    svphb->phb.dma_window_start = info.dma32_window_start;
+    svphb->phb.dma_window_size = info.dma32_window_size;
+
+    spapr_pci_vfio_scan(svphb, &error);
+    if (error) {
+        error_propagate(errp, error);
+    }
+}
+
+static void spapr_phb_vfio_reset(DeviceState *qdev)
+{
+    /* Do nothing */
+}
+
+static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
+
+    dc->props = spapr_phb_vfio_properties;
+    dc->reset = spapr_phb_vfio_reset;
+    spc->finish_realize = spapr_phb_vfio_finish_realize;
+}
+
+static const TypeInfo spapr_phb_vfio_info = {
+    .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
+    .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
+    .instance_size = sizeof(sPAPRPHBVFIOState),
+    .class_init    = spapr_phb_vfio_class_init,
+    .class_size    = sizeof(sPAPRPHBClass),
+};
+
+static void spapr_pci_vfio_register_types(void)
+{
+    type_register_static(&spapr_phb_vfio_info);
+}
+
+type_init(spapr_pci_vfio_register_types)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 0f428a1..18acb67 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -30,10 +30,14 @@
 #define SPAPR_MSIX_MAX_DEVS 32
 
 #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
+#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
 
 #define SPAPR_PCI_HOST_BRIDGE(obj) \
     OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
 
+#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
+    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
+
 #define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
      OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass), TYPE_SPAPR_PCI_HOST_BRIDGE)
 #define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
@@ -41,6 +45,7 @@
 
 typedef struct sPAPRPHBClass sPAPRPHBClass;
 typedef struct sPAPRPHBState sPAPRPHBState;
+typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
 
 struct sPAPRPHBClass {
     PCIHostBridgeClass parent_class;
@@ -78,6 +83,14 @@ struct sPAPRPHBState {
     QLIST_ENTRY(sPAPRPHBState) list;
 };
 
+struct sPAPRPHBVFIOState {
+    sPAPRPHBState phb;
+
+    struct VFIOContainer *container;
+    int32_t iommugroupid;
+    uint8_t scan, enable_multifunction, force_addr;
+};
+
 #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
 
 #define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v5 11/11] spapr-vfio: enable for spapr
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
@ 2014-03-12  5:52 ` Alexey Kardashevskiy
  2014-03-19 19:57   ` Alex Williamson
  2014-03-19 20:12 ` [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alex Williamson
  11 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-12  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

This turns the sPAPR support on and enables VFIO container use
in the kernel.
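
The order of the kernel calls matters here, so a condensed sketch of the
sequence may help. It mirrors the ioctls in the diff below but leaves out
the QEMU side (the memory listener registration). The helper name, the
-ENODEV result for a missing extension and the 0 / -errno convention are
assumptions of this sketch, and it needs a Linux host with <linux/vfio.h>.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: bring up an sPAPR TCE container for a group.
 * Returns 0 on success or a negative errno value.
 */
static int spapr_container_enable(int container_fd, int group_fd)
{
    int ret;

    /* 0. Make sure the host kernel offers the sPAPR TCE IOMMU model. */
    ret = ioctl(container_fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU);
    if (ret < 0) {
        return -errno;
    }
    if (ret == 0) {
        return -ENODEV; /* assumed error code for "not supported" */
    }

    /* 1. Attach the group; a container needs a group before SET_IOMMU. */
    if (ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fd) < 0) {
        return -errno;
    }

    /* 2. Select the IOMMU model for the container. */
    if (ioctl(container_fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU) < 0) {
        return -errno;
    }

    /* 3. sPAPR containers must be enabled explicitly before DMA mapping. */
    if (ioctl(container_fd, VFIO_IOMMU_ENABLE) < 0) {
        return -errno;
    }
    return 0;
}
```

The sketch reads errno immediately after each failing ioctl, before any
other library call is made, so the value returned is the one the ioctl set.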

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v5:
* multiple returns converted to gotos

v4:
* fixed format string to use %m which is a glibc extension:
"Print output of strerror(errno). No argument is required."
---
 hw/misc/vfio.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 6dee090..e7b2b36 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -3494,6 +3494,34 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
 
         container->iommu_data.type1.initialized = true;
 
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %m");
+            ret = -errno;
+            goto free_container_exit;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %m");
+            ret = -errno;
+            goto free_container_exit;
+        }
+
+        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+        if (ret) {
+            error_report("vfio: failed to enable container: %m");
+            ret = -errno;
+            goto free_container_exit;
+        }
+
+        container->iommu_data.type1.listener = vfio_memory_listener;
+        container->iommu_data.release = vfio_listener_release;
+
+        memory_listener_register(&container->iommu_data.type1.listener,
+                                 container->space->as);
+
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
-- 
1.8.4.rc4


* [Qemu-devel] [PATCH v6] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
@ 2014-03-13  8:12   ` Alexey Kardashevskiy
  2014-03-19 19:57   ` [Qemu-devel] [PATCH v5 10/11] " Alex Williamson
  1 sibling, 0 replies; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-13  8:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

The patch adds a spapr-pci-vfio-host-bridge device type
which is a PCI Host Bridge with VFIO support. The new device
inherits from the spapr-pci-host-bridge device and adds
the following properties:
	iommu - IOMMU group ID which represents a Partitionable
		Endpoint. QEMU/ppc64 uses a separate PHB for each
		IOMMU group, so the guest kernel has to have
		PCI Domain support enabled;
	forceaddr (optional, 0 by default) - forces QEMU to copy
		device:function from the host address, as
		certain guest drivers expect devices to appear in
		particular locations;
	mf (optional, 0 by default) - forces the multifunction bit for
		function #0 of a found device; only makes sense
		for multifunction devices and only with the forceaddr
		property set. It would not be required if there
		were a way to know in advance whether a device is
		multifunctional or not;
	scan (optional, 1 by default) - if non-zero, the new PHB walks
		through all non-bridge devices in the group and tries
		adding them to the PHB; if zero, all devices in the group
		have to be configured manually via the QEMU command line.

Examples of use:
1) Scan and add all devices from IOMMU group with ID=1 to QEMU's PHB #6:
	-device spapr-pci-vfio-host-bridge,id=DEVICENAME,iommu=1,index=6

2) Configure and add 3 functions of a multifunction device to QEMU
(the NEC PCI USB card is used as an example here):
	-device spapr-pci-vfio-host-bridge,id=USB,iommu=4,scan=0,index=7 \
	-device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true \
	-device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB \
	-device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v6:
* moved device creation to a separate function
* reworked traces

v5:
* added handling of possible failure of spapr_vfio_new_table()

v4:
* moved IOMMU changes to separate patches
* moved spapr-pci-vfio-host-bridge to new file
---
 hw/ppc/Makefile.objs        |   2 +-
 hw/ppc/spapr_pci_vfio.c     | 225 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |  13 +++
 trace-events                |   6 +-
 4 files changed, 244 insertions(+), 2 deletions(-)
 create mode 100644 hw/ppc/spapr_pci_vfio.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index ea747f0..2239192 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -3,7 +3,7 @@ obj-y += ppc.o ppc_booke.o
 # IBM pSeries (sPAPR)
 obj-$(CONFIG_PSERIES) += spapr.o spapr_vio.o spapr_events.o
 obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
-obj-$(CONFIG_PSERIES) += spapr_pci.o
+obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_pci_vfio.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
new file mode 100644
index 0000000..ec16187
--- /dev/null
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -0,0 +1,225 @@
+/*
+ * QEMU sPAPR PCI host for VFIO
+ *
+ * Copyright (c) 2011 Alexey Kardashevskiy, IBM Corporation.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#include <sys/types.h>
+#include <dirent.h>
+
+#include "hw/hw.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "hw/misc/vfio.h"
+#include "hw/pci/pci_bus.h"
+#include "trace.h"
+#include "qemu/error-report.h"
+#include "qapi/qmp/qerror.h"
+
+/* sPAPR VFIO */
+static Property spapr_phb_vfio_properties[] = {
+    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
+    DEFINE_PROP_UINT8("scan", sPAPRPHBVFIOState, scan, 1),
+    DEFINE_PROP_UINT8("mf", sPAPRPHBVFIOState, enable_multifunction, 0),
+    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBVFIOState, force_addr, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void spapr_pci_vfio_add_device(sPAPRPHBVFIOState *svphb,
+                                      const char *hostaddr,
+                                      unsigned devid, unsigned fnid,
+                                      Error **errp)
+{
+    PCIHostState *phb = PCI_HOST_BRIDGE(svphb);
+    DeviceState *dev;
+    Error *err = NULL;
+    char addr[32];
+
+    dev = qdev_create(&phb->bus->qbus, "vfio-pci");
+    if (!dev) {
+        error_set(&err, QERR_DEVICE_INIT_FAILED, hostaddr);
+        goto error_exit;
+    }
+
+    object_property_parse(OBJECT(dev), hostaddr, "host", &err);
+    if (err) {
+        goto unref_exit;
+    }
+
+    if (svphb->force_addr) {
+        snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
+        object_property_parse(OBJECT(dev), addr, "addr", &err);
+        if (err) {
+            goto unref_exit;
+        }
+    }
+
+    if (svphb->enable_multifunction) {
+        qdev_prop_set_bit(dev, "multifunction", 1);
+    }
+
+    object_property_set_bool(OBJECT(dev), true, "realized", &err);
+    if (err) {
+        goto unref_exit;
+    }
+
+    trace_spapr_vfio_device_created(hostaddr);
+    return;
+
+unref_exit:
+    object_unref(OBJECT(dev));
+
+error_exit:
+    error_propagate(errp, err);
+    trace_spapr_vfio_device_failed(hostaddr);
+}
+
+static void spapr_pci_vfio_scan(sPAPRPHBVFIOState *svphb, Error **errp)
+{
+    char *iommupath;
+    DIR *dirp;
+    struct dirent *entry;
+
+    iommupath = g_strdup_printf("/sys/kernel/iommu_groups/%d/devices/",
+                                svphb->iommugroupid);
+    if (!iommupath) {
+        return;
+    }
+
+    dirp = opendir(iommupath);
+    if (!dirp) {
+        error_report("spapr-vfio: vfio scan failed on opendir: %m");
+        g_free(iommupath);
+        return;
+    }
+
+    while ((entry = readdir(dirp)) != NULL) {
+        char *tmp;
+        FILE *deviceclassfile;
+        unsigned deviceclass = 0, domainid, busid, devid, fnid;
+        Error *err = NULL;
+
+        if (sscanf(entry->d_name, "%X:%X:%X.%x",
+                   &domainid, &busid, &devid, &fnid) != 4) {
+            continue;
+        }
+
+        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
+        trace_spapr_vfio_try_read_class(tmp);
+
+        deviceclassfile = fopen(tmp, "r");
+        g_free(tmp); /* free the path before any early continue */
+        if (deviceclassfile) {
+            int ret = fscanf(deviceclassfile, "%x", &deviceclass);
+            fclose(deviceclassfile);
+            if (ret != 1) {
+                continue;
+            }
+        }
+
+        if (!deviceclass) {
+            continue;
+        }
+        if ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8)) {
+            /* Skip bridges */
+            continue;
+        }
+
+        spapr_pci_vfio_add_device(svphb, entry->d_name, devid, fnid, &err);
+        if (err) {
+            error_report("%s", error_get_pretty(err));
+            error_free(err);
+        }
+    }
+    closedir(dirp);
+    g_free(iommupath);
+}
+
+static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
+{
+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
+    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
+    int ret;
+    Error *error = NULL;
+
+    if (svphb->iommugroupid == -1) {
+        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
+        return;
+    }
+
+    svphb->phb.tcet = spapr_vfio_new_table(DEVICE(sphb), svphb->phb.dma_liobn);
+
+    if (!svphb->phb.tcet) {
+        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
+        return;
+    }
+
+    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
+                       sphb->dtbusname);
+
+    ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
+                                        sphb->dma_liobn, svphb->iommugroupid,
+                                        &info);
+    if (ret) {
+        error_setg_errno(errp, -ret,
+                         "spapr-vfio: get info from container failed");
+        return;
+    }
+
+    svphb->phb.dma_window_start = info.dma32_window_start;
+    svphb->phb.dma_window_size = info.dma32_window_size;
+
+    if (svphb->scan) {
+        spapr_pci_vfio_scan(svphb, &error);
+        if (error) {
+            error_propagate(errp, error);
+        }
+    }
+}
+
+static void spapr_phb_vfio_reset(DeviceState *qdev)
+{
+    /* Do nothing */
+}
+
+static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
+
+    dc->props = spapr_phb_vfio_properties;
+    dc->reset = spapr_phb_vfio_reset;
+    spc->finish_realize = spapr_phb_vfio_finish_realize;
+}
+
+static const TypeInfo spapr_phb_vfio_info = {
+    .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
+    .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
+    .instance_size = sizeof(sPAPRPHBVFIOState),
+    .class_init    = spapr_phb_vfio_class_init,
+    .class_size    = sizeof(sPAPRPHBClass),
+};
+
+static void spapr_pci_vfio_register_types(void)
+{
+    type_register_static(&spapr_phb_vfio_info);
+}
+
+type_init(spapr_pci_vfio_register_types)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 0f428a1..18acb67 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -30,10 +30,14 @@
 #define SPAPR_MSIX_MAX_DEVS 32
 
 #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
+#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
 
 #define SPAPR_PCI_HOST_BRIDGE(obj) \
     OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
 
+#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
+    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
+
 #define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
      OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass), TYPE_SPAPR_PCI_HOST_BRIDGE)
 #define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
@@ -41,6 +45,7 @@
 
 typedef struct sPAPRPHBClass sPAPRPHBClass;
 typedef struct sPAPRPHBState sPAPRPHBState;
+typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
 
 struct sPAPRPHBClass {
     PCIHostBridgeClass parent_class;
@@ -78,6 +83,14 @@ struct sPAPRPHBState {
     QLIST_ENTRY(sPAPRPHBState) list;
 };
 
+struct sPAPRPHBVFIOState {
+    sPAPRPHBState phb;
+
+    struct VFIOContainer *container;
+    int32_t iommugroupid;
+    uint8_t scan, enable_multifunction, force_addr;
+};
+
 #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
 
 #define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
diff --git a/trace-events b/trace-events
index 944a798..f6b50b8 100644
--- a/trace-events
+++ b/trace-events
@@ -1124,7 +1124,6 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
 qxl_render_update_area_done(void *cookie) "%p"
 
 # hw/ppc/spapr_pci.c
-spapr_pci(const char *msg1, const char *msg2) "%s%s"
 spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
 spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
 spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"
@@ -1132,6 +1131,11 @@ spapr_pci_rtas_ibm_query_interrupt_source_number(unsigned ioa, unsigned intr) "q
 spapr_pci_msi_write(uint64_t addr, uint64_t data, uint32_t dt_irq) "@%"PRIx64"<=%"PRIx64" IRQ %u"
 spapr_pci_lsi_set(const char *busname, int pin, uint32_t irq) "%s PIN%d IRQ %u"
 
+# hw/ppc/spapr_pci_vfio.c
+spapr_vfio_try_read_class(const char *s) "from %s"
+spapr_vfio_device_created(const char *s) "%s"
+spapr_vfio_device_failed(const char *s) "%s"
+
 # hw/intc/xics.c
 xics_icp_check_ipi(int server, uint8_t mfrr) "CPU %d can take IPI mfrr=%#x"
 xics_icp_accept(uint32_t old_xirr, uint32_t new_xirr) "icp_accept: XIRR %#"PRIx32"->%#"PRIx32
-- 
1.8.4.rc4

* Re: [Qemu-devel] [PATCH v5 11/11] spapr-vfio: enable for spapr
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 11/11] spapr-vfio: enable for spapr Alexey Kardashevskiy
@ 2014-03-19 19:57   ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2014-03-19 19:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> This turns the sPAPR support on and enables VFIO container use
> in the kernel.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v5:
> * multiple returns converted to gotos
> 
> v4:
> * fixed format string to use %m which is a glibc extension:
> "Print output of strerror(errno). No argument is required."
> ---
>  hw/misc/vfio.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index 6dee090..e7b2b36 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -3494,6 +3494,34 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>  
>          container->iommu_data.type1.initialized = true;
>  
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> +        if (ret) {
> +            error_report("vfio: failed to set group container: %m");
> +            ret = -errno;
> +            goto free_container_exit;
> +        }
> +
> +        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        if (ret) {
> +            error_report("vfio: failed to set iommu for container: %m");
> +            ret = -errno;
> +            goto free_container_exit;
> +        }
> +
> +        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +        if (ret) {
> +            error_report("vfio: failed to enable container: %m");
> +            ret = -errno;
> +            goto free_container_exit;
> +        }
> +
> +        container->iommu_data.type1.listener = vfio_memory_listener;


Hmm, seems sloppy to use the type1 part of the union here.  Should we
pull the listener out of the union?

> +        container->iommu_data.release = vfio_listener_release;
> +
> +        memory_listener_register(&container->iommu_data.type1.listener,
> +                                 container->space->as);
> +
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;

* Re: [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
  2014-03-13  8:12   ` [Qemu-devel] [PATCH v6] " Alexey Kardashevskiy
@ 2014-03-19 19:57   ` Alex Williamson
  2014-03-28  6:01     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2014-03-19 19:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> The patch adds a spapr-pci-vfio-host-bridge device type
> which is a PCI Host Bridge with VFIO support. The new device
> inherits from the spapr-pci-host-bridge device and adds
> the following properties:
> 	iommu - IOMMU group ID which represents a Partitionable
> 		Endpoint; QEMU/ppc64 uses a separate PHB for
> 		each IOMMU group, so the guest kernel has to have
> 		PCI Domain support enabled.
> 	forceaddr (optional, 0 by default) - forces QEMU to copy
> 		device:function from the host address as
> 		certain guest drivers expect devices to appear in
> 		particular locations;
> 	mf (optional, 0 by default) - forces multifunction bit for
> 		the function #0 of a found device, only makes sense
> 		for multifunction devices and only with the forceaddr
> 		property set. It would not be required if there
> 		was a way to know in advance whether a device is
> 		multifunctional or not.
> 	scan (optional, 1 by default) - if non-zero, the new PHB walks
> 		through all non-bridge devices in the group and tries
> 		adding them to the PHB; if zero, all devices in the group
> 		have to be configured manually via the QEMU command line.
> 
> Examples of use:
> 1) Scan and add all devices from IOMMU group with ID=1 to QEMU's PHB #6:
> 	-device spapr-pci-vfio-host-bridge,id=DEVICENAME,iommu=1,index=6
> 
> 2) Configure and add 3 functions of a multifunction device to QEMU
> (the NEC PCI USB card is used as an example here):
> 	-device spapr-pci-vfio-host-bridge,id=USB,iommu=4,scan=0,index=7 \
> 	-device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true \
> 	-device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB \
> 	-device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB

I won't pretend to predict the reaction of QEMU device architects to
this, but it seems like the kind of assembly we expect from config files
or outside utilities, e.g. libvirt.  I don't doubt this makes QEMU
command-line usage more palatable, but it seems contrary to some of the
other things we do in QEMU with devices.  If this is something we need,
why is it specific to spapr?  An IOMMU group can contain multiple devices
on any platform.  On x86 we could do something similar with a p2p
bridge, switch, or root port.

BTW, the code skips bridges, but doesn't that mean you'll have a hard
time with forceaddr as you potentially try to overlay devfn from
multiple buses onto a single bus?  It also makes the value of this a bit
more questionable since it seems to fall apart so easily.  Thanks,

Alex
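
To make the forceaddr concern concrete, here is a minimal standalone
sketch.  It assumes the guest address is built the way the patch does it
(snprintf of "%x.%x" from the host device and function numbers, with the
host bus number dropped).  The helper names and the host addresses used
to exercise it are hypothetical, not taken from the patch or from a real
system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse a sysfs-style host address "DDDD:BB:DD.F"; returns 0 on success. */
static int parse_host_addr(const char *name, unsigned *domain, unsigned *bus,
                           unsigned *dev, unsigned *fn)
{
    return sscanf(name, "%x:%x:%x.%x", domain, bus, dev, fn) == 4 ? 0 : -1;
}

/* Build the guest "dev.fn" string as forceaddr would: the bus is discarded. */
static int forced_guest_addr(const char *name, char *buf, size_t len)
{
    unsigned domain, bus, dev, fn;

    if (parse_host_addr(name, &domain, &bus, &dev, &fn)) {
        return -1;
    }
    snprintf(buf, len, "%x.%x", dev, fn);
    return 0;
}

/* Non-zero if two host devices would be forced onto the same guest dev.fn. */
static int addrs_collide(const char *a, const char *b)
{
    char ga[32], gb[32];

    assert(a && b);
    if (forced_guest_addr(a, ga, sizeof(ga)) ||
        forced_guest_addr(b, gb, sizeof(gb))) {
        return 0;
    }
    return strcmp(ga, gb) == 0;
}
```

Two devices that sit behind a skipped bridge on different host buses,
for example 0001:01:00.0 and 0001:02:00.0, both come out as "0.0" here,
while two functions of the same device stay distinct.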

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v5:
> * added handling of possible failure of spapr_vfio_new_table()
> 
> v4:
> * moved IOMMU changes to separate patches
> * moved spapr-pci-vfio-host-bridge to new file
> ---
>  hw/ppc/Makefile.objs        |   2 +-
>  hw/ppc/spapr_pci_vfio.c     | 206 ++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci-host/spapr.h |  13 +++
>  3 files changed, 220 insertions(+), 1 deletion(-)
>  create mode 100644 hw/ppc/spapr_pci_vfio.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index ea747f0..2239192 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -3,7 +3,7 @@ obj-y += ppc.o ppc_booke.o
>  # IBM pSeries (sPAPR)
>  obj-$(CONFIG_PSERIES) += spapr.o spapr_vio.o spapr_events.o
>  obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
> -obj-$(CONFIG_PSERIES) += spapr_pci.o
> +obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_pci_vfio.o
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> new file mode 100644
> index 0000000..40f6673
> --- /dev/null
> +++ b/hw/ppc/spapr_pci_vfio.c
> @@ -0,0 +1,206 @@
> +/*
> + * QEMU sPAPR PCI host for VFIO
> + *
> + * Copyright (c) 2011 Alexey Kardashevskiy, IBM Corporation.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +#include <sys/types.h>
> +#include <dirent.h>
> +
> +#include "hw/hw.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "hw/misc/vfio.h"
> +#include "hw/pci/pci_bus.h"
> +#include "trace.h"
> +#include "qemu/error-report.h"
> +
> +/* sPAPR VFIO */
> +static Property spapr_phb_vfio_properties[] = {
> +    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
> +    DEFINE_PROP_UINT8("scan", sPAPRPHBVFIOState, scan, 1),
> +    DEFINE_PROP_UINT8("mf", sPAPRPHBVFIOState, enable_multifunction, 0),
> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBVFIOState, force_addr, 0),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void spapr_pci_vfio_scan(sPAPRPHBVFIOState *svphb, Error **errp)
> +{
> +    PCIHostState *phb = PCI_HOST_BRIDGE(svphb);
> +    char *iommupath;
> +    DIR *dirp;
> +    struct dirent *entry;
> +    Error *error = NULL;
> +
> +    if (!svphb->scan) {
> +        trace_spapr_pci("autoscan disabled for ", svphb->phb.dtbusname);
> +        return;
> +    }
> +
> +    iommupath = g_strdup_printf("/sys/kernel/iommu_groups/%d/devices/",
> +                                svphb->iommugroupid);
> +    if (!iommupath) {
> +        return;
> +    }
> +
> +    dirp = opendir(iommupath);
> +    if (!dirp) {
> +        error_report("spapr-vfio: vfio scan failed on opendir: %m");
> +        g_free(iommupath);
> +        return;
> +    }
> +
> +    while ((entry = readdir(dirp)) != NULL) {
> +        Error *err = NULL;
> +        char *tmp;
> +        FILE *deviceclassfile;
> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> +        char addr[32];
> +        DeviceState *dev;
> +
> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
> +                   &domainid, &busid, &devid, &fnid) != 4) {
> +            continue;
> +        }
> +
> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
> +        trace_spapr_pci("Reading device class from ", tmp);
> +
> +        deviceclassfile = fopen(tmp, "r");
> +        g_free(tmp); /* free the path before any early continue */
> +        if (deviceclassfile) {
> +            int ret = fscanf(deviceclassfile, "%x", &deviceclass);
> +            fclose(deviceclassfile);
> +            if (ret != 1) {
> +                continue;
> +            }
> +        }
> +
> +        if (!deviceclass) {
> +            continue;
> +        }
> +        if ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8)) {
> +            /* Skip bridges */
> +            continue;
> +        }
> +        trace_spapr_pci("Creating device from ", entry->d_name);
> +
> +        dev = qdev_create(&phb->bus->qbus, "vfio-pci");
> +        if (!dev) {
> +            trace_spapr_pci("failed to create vfio-pci", entry->d_name);
> +            continue;
> +        }
> +        object_property_parse(OBJECT(dev), entry->d_name, "host", &err);
> +        if (err != NULL) {
> +            object_unref(OBJECT(dev));
> +            continue;
> +        }
> +        if (svphb->force_addr) {
> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
> +            err = NULL;
> +            object_property_parse(OBJECT(dev), addr, "addr", &err);
> +            if (err != NULL) {
> +                object_unref(OBJECT(dev));
> +                continue;
> +            }
> +        }
> +        if (svphb->enable_multifunction) {
> +            qdev_prop_set_bit(dev, "multifunction", 1);
> +        }
> +        object_property_set_bool(OBJECT(dev), true, "realized", &error);
> +        if (error) {
> +            object_unref(OBJECT(dev));
> +            error_propagate(errp, error);
> +            break;
> +        }
> +    }
> +    closedir(dirp);
> +    g_free(iommupath);
> +}
> +
> +static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
> +{
> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> +    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
> +    int ret;
> +    Error *error = NULL;
> +
> +    if (svphb->iommugroupid == -1) {
> +        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
> +        return;
> +    }
> +
> +    svphb->phb.tcet = spapr_vfio_new_table(DEVICE(sphb), svphb->phb.dma_liobn);
> +
> +    if (!svphb->phb.tcet) {
> +        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
> +        return;
> +    }
> +
> +    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
> +                       sphb->dtbusname);
> +
> +    ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
> +                                        sphb->dma_liobn, svphb->iommugroupid,
> +                                        &info);
> +    if (ret) {
> +        error_setg_errno(errp, -ret,
> +                         "spapr-vfio: get info from container failed");
> +        return;
> +    }
> +
> +    svphb->phb.dma_window_start = info.dma32_window_start;
> +    svphb->phb.dma_window_size = info.dma32_window_size;
> +
> +    spapr_pci_vfio_scan(svphb, &error);
> +    if (error) {
> +        error_propagate(errp, error);
> +    }
> +}
> +
> +static void spapr_phb_vfio_reset(DeviceState *qdev)
> +{
> +    /* Do nothing */
> +}
> +
> +static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
> +
> +    dc->props = spapr_phb_vfio_properties;
> +    dc->reset = spapr_phb_vfio_reset;
> +    spc->finish_realize = spapr_phb_vfio_finish_realize;
> +}
> +
> +static const TypeInfo spapr_phb_vfio_info = {
> +    .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
> +    .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
> +    .instance_size = sizeof(sPAPRPHBVFIOState),
> +    .class_init    = spapr_phb_vfio_class_init,
> +    .class_size    = sizeof(sPAPRPHBClass),
> +};
> +
> +static void spapr_pci_vfio_register_types(void)
> +{
> +    type_register_static(&spapr_phb_vfio_info);
> +}
> +
> +type_init(spapr_pci_vfio_register_types)
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 0f428a1..18acb67 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -30,10 +30,14 @@
>  #define SPAPR_MSIX_MAX_DEVS 32
>  
>  #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
> +#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
>  
>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>  
> +#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
> +    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
> +
>  #define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
>       OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass), TYPE_SPAPR_PCI_HOST_BRIDGE)
>  #define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
> @@ -41,6 +45,7 @@
>  
>  typedef struct sPAPRPHBClass sPAPRPHBClass;
>  typedef struct sPAPRPHBState sPAPRPHBState;
> +typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
>  
>  struct sPAPRPHBClass {
>      PCIHostBridgeClass parent_class;
> @@ -78,6 +83,14 @@ struct sPAPRPHBState {
>      QLIST_ENTRY(sPAPRPHBState) list;
>  };
>  
> +struct sPAPRPHBVFIOState {
> +    sPAPRPHBState phb;
> +
> +    struct VFIOContainer *container;
> +    int32_t iommugroupid;
> +    uint8_t scan, enable_multifunction, force_addr;
> +};
> +
>  #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
>  
>  #define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL

* Re: [Qemu-devel] [PATCH v5 06/11] vfio: Create VFIOAddressSpace objects as needed
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 06/11] vfio: Create VFIOAddressSpace objects as needed Alexey Kardashevskiy
@ 2014-03-19 19:57   ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2014-03-19 19:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson

On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
> 
> So far, VFIO has a notion of different logical DMA address spaces, but
> only ever uses one (system memory).  This patch extends this, creating
> new VFIOAddressSpace objects as necessary, according to the AddressSpace
> reported by the PCI subsystem for this device's DMAs.
> 
> This isn't enough yet to support guest side IOMMUs with VFIO, but it does
> mean we could now support VFIO devices on, for example, a guest side PCI
> host bridge which maps system memory at somewhere other than 0 in PCI
> space.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v6:
> * vfio_get_address_space() moved to vfio_connect_container()
> 
> v5:
> * vfio_get_group() now takes AddressSpace* instead of VFIOAddressSpace
> ---
>  hw/misc/vfio.c | 58 +++++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 41 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index c8236c3..038010b 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -141,14 +141,6 @@ typedef struct VFIOAddressSpace {
>  
>  QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
>  
> -static VFIOAddressSpace vfio_address_space_memory;
> -
> -static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as)
> -{
> -    space->as = as;
> -    QLIST_INIT(&space->containers);
> -}
> -
>  struct VFIOGroup;
>  
>  typedef struct VFIOType1 {
> @@ -3294,13 +3286,43 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
>  #endif
>  }
>  
> +static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
> +{
> +    VFIOAddressSpace *space;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        if (space->as == as) {
> +            return space;
> +        }
> +    }
> +
> +    /* No suitable VFIOAddressSpace, create a new one */
> +    space = g_malloc0(sizeof(*space));
> +    space->as = as;
> +    QLIST_INIT(&space->containers);
> +
> +    QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
> +
> +    return space;
> +}
> +
> +static void vfio_put_address_space(VFIOAddressSpace *space)
> +{
> +    if (!QLIST_EMPTY(&space->containers)) {
> +        return;
> +    }
> +
> +    QLIST_REMOVE(space, list);
> +    g_free(space);
> +}
> +

nit, I realize you're probably copying vfio_group_put, but this one
looks cleaner as:

if (QLIST_EMPTY(...)) {
    QLIST_REMOVE(...);
    g_free(...);
}
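
A standalone sketch of the get/put lifecycle with the positive-condition
form of put.  It uses a hand-rolled singly linked list instead of QEMU's
QLIST macros so that it compiles outside the tree; the Space type, the
ncontainers counter (standing in for QLIST_EMPTY(&space->containers))
and the space_* names are all hypothetical stand-ins, not code from the
patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for VFIOAddressSpace: one node per AddressSpace. */
typedef struct Space {
    const void *as;      /* key: the AddressSpace this node mirrors */
    int ncontainers;     /* 0 plays the role of an empty containers list */
    struct Space *next;
} Space;

static Space *spaces;    /* stands in for the vfio_address_spaces list */

/* Return the node for 'as', creating it on first use. */
static Space *space_get(const void *as)
{
    Space *s;

    for (s = spaces; s; s = s->next) {
        if (s->as == as) {
            return s;
        }
    }
    s = calloc(1, sizeof(*s));
    assert(s);
    s->as = as;
    s->next = spaces;
    spaces = s;
    return s;
}

/* Unlink and free the node only when no container uses it any more. */
static void space_put(Space *space)
{
    if (space->ncontainers == 0) {
        Space **p;

        for (p = &spaces; *p; p = &(*p)->next) {
            if (*p == space) {
                *p = space->next;
                break;
            }
        }
        free(space);
    }
}

/* Number of live nodes, useful for checking the lifecycle. */
static int space_count(void)
{
    int n = 0;
    Space *s;

    for (s = spaces; s; s = s->next) {
        n++;
    }
    return n;
}
```

The behaviour is the same as the early-return version; the only
difference is that the body reads as "if unused, tear down".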

>  static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>  {
>      VFIOContainer *container;
>      int ret, fd;
>      VFIOAddressSpace *space;
>  
> -    space = &vfio_address_space_memory;
> +    space = vfio_get_address_space(as);
>  
>      QLIST_FOREACH(container, &space->containers, next) {
>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> @@ -3313,7 +3335,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>      fd = qemu_open("/dev/vfio/vfio", O_RDWR);
>      if (fd < 0) {
>          error_report("vfio: failed to open /dev/vfio/vfio: %m");
> -        return -errno;
> +        ret = -errno;
> +        goto put_space_exit;
>      }
>  
>      ret = ioctl(fd, VFIO_GET_API_VERSION);
> @@ -3380,6 +3403,9 @@ free_container_exit:
>  close_fd_exit:
>      close(fd);
>  
> +put_space_exit:
> +    vfio_put_address_space(space);
> +
>      return ret;
>  }
>  
> @@ -3396,6 +3422,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
>      group->container = NULL;
>  
>      if (QLIST_EMPTY(&container->group_list)) {
> +        VFIOAddressSpace *space = container->space;
> +
>          if (container->iommu_data.release) {
>              container->iommu_data.release(container);
>          }
> @@ -3403,6 +3431,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
>          DPRINTF("vfio_disconnect_container: close container->fd\n");
>          close(container->fd);
>          g_free(container);
> +
> +        vfio_put_address_space(space);
>      }
>  }
>  
> @@ -3801,12 +3831,7 @@ static int vfio_initfn(PCIDevice *pdev)
>      DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
>              vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
>  
> -    if (pci_device_iommu_address_space(pdev) != &address_space_memory) {
> -        error_report("vfio: DMA address space must be system memory");
> -        return -EINVAL;
> -    }
> -
> -    group = vfio_get_group(groupid, &address_space_memory);
> +    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
>      if (!group) {
>          error_report("vfio: failed to get group %d", groupid);
>          return -ENOENT;
> @@ -4020,7 +4045,6 @@ static const TypeInfo vfio_pci_dev_info = {
>  
>  static void register_vfio_pci_dev_type(void)
>  {
> -    vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
>      type_register_static(&vfio_pci_dev_info);
>  }
>  

* Re: [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support Alexey Kardashevskiy
@ 2014-03-19 19:57   ` Alex Williamson
  2014-03-20  5:25     ` David Gibson
  2014-03-21  7:59     ` Alexey Kardashevskiy
  0 siblings, 2 replies; 42+ messages in thread
From: Alex Williamson @ 2014-03-19 19:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson

On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
> 
> This patch uses the new IOMMU notifiers to allow VFIO pass through devices
> to work with guest side IOMMUs, as long as the host-side VFIO iommu has
> sufficient capability and granularity to match the guest side. This works
> by tracking all map and unmap operations on the guest IOMMU using the
> notifiers, and mirroring them into VFIO.
> 
> There are a number of FIXMEs, and the scheme involves rather more notifier
> structures than I'd like, but it should make for a reasonable proof of
> concept.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> ---
> Changes:
> v4:
> * fixed list objects naming
> * vfio_listener_region_add() reworked to call memory_region_ref() from one
> place only, it is also easier to review the changes
> * fixes boundary check not to fail on sections == 2^64 bytes,
> the "vfio: Fix debug output for int128 values" patch is required;
> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
> for 128 bit memory section sizes" patch proposal
> ---
>  hw/misc/vfio.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 120 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index 038010b..4f6f5da 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -159,10 +159,18 @@ typedef struct VFIOContainer {
>          };
>          void (*release)(struct VFIOContainer *);
>      } iommu_data;
> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
>      QLIST_ENTRY(VFIOContainer) next;
>  } VFIOContainer;
>  
> +typedef struct VFIOGuestIOMMU {
> +    VFIOContainer *container;
> +    MemoryRegion *iommu;
> +    Notifier n;
> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> +} VFIOGuestIOMMU;
> +
>  /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
>  typedef struct VFIOMSIXInfo {
>      uint8_t table_bar;
> @@ -2241,8 +2249,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>  
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
> -    return !memory_region_is_ram(section->mr) ||
> -           /*
> +    return (!memory_region_is_ram(section->mr) &&
> +            !memory_region_is_iommu(section->mr)) ||
> +        /*

White space damage

>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>              * addresses in the upper part of the 64-bit address space.  These
>              * are never accessed by the CPU and beyond the address width of
> @@ -2251,6 +2260,61 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>             section->offset_within_address_space & (1ULL << 63);
>  }
>  
> +static void vfio_iommu_map_notify(Notifier *n, void *data)
> +{
> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> +    VFIOContainer *container = giommu->container;
> +    IOMMUTLBEntry *iotlb = data;
> +    MemoryRegion *mr;
> +    hwaddr xlat;
> +    hwaddr len = iotlb->addr_mask + 1;
> +    void *vaddr;
> +    int ret;
> +
> +    DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> +            iotlb->iova, iotlb->iova + iotlb->addr_mask);
> +
> +    /*
> +     * The IOMMU TLB entry we have just covers translation through
> +     * this IOMMU to its immediate target.  We need to translate
> +     * it the rest of the way through to memory.
> +     */
> +    mr = address_space_translate(&address_space_memory,
> +                                 iotlb->translated_addr,
> +                                 &xlat, &len, iotlb->perm & IOMMU_WO);

Write-only?  Is this supposed to be read-write to mask just 2 bits?
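
For reference while reading this hunk: the last argument to address_space_translate() is a boolean is_write, and IOMMU_WO is a single bit of the permission mask, so the expression is non-zero for write-only and read-write entries alike. A minimal sketch, using a local copy of the flag values (assumed to mirror QEMU's IOMMUAccessFlags) and a hypothetical helper name:

```c
#include <stdbool.h>

/* Local copy of the permission bits, assumed to match IOMMUAccessFlags. */
enum {
    IOMMU_NONE = 0,
    IOMMU_RO   = 1,
    IOMMU_WO   = 2,
    IOMMU_RW   = 3,
};

/* Hypothetical helper: true if the entry permits writes (WO or RW). */
static bool perm_allows_write(int perm)
{
    return (perm & IOMMU_WO) != 0;
}
```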

> +    if (!memory_region_is_ram(mr)) {
> +        DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
> +                xlat);
> +        return;
> +    }
> +    if (len & iotlb->addr_mask) {
> +        DPRINTF("iommu has granularity incompatible with target AS\n");

Is this possible?  Assuming len is initially a power-of-2, would the
translate function change it?  Maybe worth a comment to explain.

> +        return;
> +    }
> +
> +    vaddr = memory_region_get_ram_ptr(mr) + xlat;

This lookup isn't free and the unmap path doesn't need it; maybe move
the variable and lookup into the first branch below?

> +
> +    if (iotlb->perm != IOMMU_NONE) {
> +        ret = vfio_dma_map(container, iotlb->iova,
> +                           iotlb->addr_mask + 1, vaddr,
> +                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
> +        if (ret) {
> +            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> +                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
> +                         container, iotlb->iova,
> +                         iotlb->addr_mask + 1, vaddr, ret);
> +        }
> +    } else {
> +        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> +        if (ret) {
> +            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> +                         "0x%"HWADDR_PRIx") = %d (%m)",
> +                         container, iotlb->iova,
> +                         iotlb->addr_mask + 1, ret);
> +        }
> +    }
> +}
> +
>  static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
> @@ -2261,8 +2325,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      void *vaddr;
>      int ret;
>  
> -    assert(!memory_region_is_iommu(section->mr));
> -
>      if (vfio_listener_skipped_section(section)) {
>          DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
>                  section->offset_within_address_space,
> @@ -2286,15 +2348,47 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          return;
>      }
>  
> +    memory_region_ref(section->mr);
> +
> +    if (memory_region_is_iommu(section->mr)) {
> +        VFIOGuestIOMMU *giommu;
> +
> +        DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> +                iova, int128_get64(int128_sub(llend, int128_one())));
> +        /*
> +         * FIXME: We should do some checking to see if the
> +         * capabilities of the host VFIO IOMMU are adequate to model
> +         * the guest IOMMU
> +         *
> +         * FIXME: This assumes that the guest IOMMU is empty of
> +         * mappings at this point - we should either enforce this, or
> +         * loop through existing mappings to map them into VFIO.
> +         *
> +         * FIXME: For VFIO iommu types which have KVM acceleration to
> +         * avoid bouncing all map/unmaps through qemu this way, this
> +         * would be the right place to wire that up (tell the KVM
> +         * device emulation the VFIO iommu handles to use).
> +         */

That's a lot of FIXMEs...  The second one in particular looks like it
needs to expand a bit on why this is likely a valid assumption.  The
last one is more of a TODO than a FIXME.

> +        giommu = g_malloc0(sizeof(*giommu));
> +        giommu->iommu = section->mr;
> +        giommu->container = container;
> +        giommu->n.notify = vfio_iommu_map_notify;
> +        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> +        memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> +
> +        return;
> +    }
> +
> +    /* Here we assume that memory_region_is_ram(section->mr)==true */
> +
>      end = int128_get64(llend);
>      vaddr = memory_region_get_ram_ptr(section->mr) +
>              section->offset_within_region +
>              (iova - section->offset_within_address_space);
>  
> -    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
> +    DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
>              iova, end - 1, vaddr);
>  
> -    memory_region_ref(section->mr);
>      ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
>      if (ret) {
>          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> @@ -2338,6 +2432,26 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          return;
>      }
>  
> +    if (memory_region_is_iommu(section->mr)) {
> +        VFIOGuestIOMMU *giommu;
> +
> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> +            if (giommu->iommu == section->mr) {
> +                memory_region_unregister_iommu_notifier(&giommu->n);
> +                QLIST_REMOVE(giommu, giommu_next);
> +                g_free(giommu);
> +                break;
> +            }
> +        }
> +
> +        /*
> +         * FIXME: We assume the one big unmap below is adequate to
> +         * remove any individual page mappings in the IOMMU which
> +         * might have been copied into VFIO.  That may not be true for
> +         * all IOMMU types
> +         */

We assume this because the IOVA that gets unmapped is the same
regardless of whether a guest IOMMU is present?

> +    }
> +
>      iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>      end = (section->offset_within_address_space + int128_get64(section->size)) &
>            TARGET_PAGE_MASK;


* Re: [Qemu-devel] [PATCH v5 05/11] vfio: Introduce VFIO address spaces
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 05/11] vfio: Introduce VFIO address spaces Alexey Kardashevskiy
@ 2014-03-19 19:57   ` Alex Williamson
  2014-03-28  3:42     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2014-03-19 19:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson

On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
> 
> The only model so far supported for VFIO passthrough devices is the model
> usually used on x86, where all of the guest's RAM is mapped into the
> (host) IOMMU and there is no IOMMU visible in the guest.
> 
> This patch begins to relax this model, introducing the notion of a
> VFIOAddressSpace.  This represents a logical DMA address space which will
> be visible to one or more VFIO devices by appropriate mapping in the (host)
> IOMMU.  Thus the currently global list of containers becomes local to
> a VFIOAddressSpace, and we verify that we don't attempt to add a VFIO
> group to multiple address spaces.
> 
> For now, only one VFIOAddressSpace is created and used, corresponding to
> main system memory; that will change in future patches.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> ---
> Changes:
> v5:
> * vfio_get_group() now receives AddressSpace*
> 
> v4:
> * removed redundant checks and asserts
> * fixed some return error codes
> ---
>  hw/misc/vfio.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 40 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index 6a04c2a..c8236c3 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -133,6 +133,22 @@ enum {
>      VFIO_INT_MSIX = 3,
>  };
>  
> +typedef struct VFIOAddressSpace {
> +    AddressSpace *as;
> +    QLIST_HEAD(, VFIOContainer) containers;
> +    QLIST_ENTRY(VFIOAddressSpace) list;
> +} VFIOAddressSpace;
> +
> +QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;

Why isn't this static and initialized with QLIST_HEAD_INITIALIZER like
the qlist it replaces?
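
Something along these lines is what I have in mind. This is only a sketch using the BSD <sys/queue.h> LIST_* macros, which QEMU's QLIST_* macros are modelled on; the type and function names are made up for illustration and are not from the patch:

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for VFIOAddressSpace, reduced to an id. */
typedef struct VFIOAddressSpaceSketch {
    int id;
    LIST_ENTRY(VFIOAddressSpaceSketch) list;
} VFIOAddressSpaceSketch;

/* File-local head, valid without any run-time LIST_INIT() call. */
static LIST_HEAD(, VFIOAddressSpaceSketch) sketch_spaces =
    LIST_HEAD_INITIALIZER(sketch_spaces);

/* Link a new address space at the head of the list. */
static void sketch_add(VFIOAddressSpaceSketch *space)
{
    LIST_INSERT_HEAD(&sketch_spaces, space, list);
}

/* Return the entry with the given id, or NULL if there is none. */
static VFIOAddressSpaceSketch *sketch_find(int id)
{
    VFIOAddressSpaceSketch *space;

    LIST_FOREACH(space, &sketch_spaces, list) {
        if (space->id == id) {
            return space;
        }
    }
    return NULL;
}
```

Keeping the head static also keeps the symbol out of the global namespace, which matches how container_list and group_list are declared.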

> +
> +static VFIOAddressSpace vfio_address_space_memory;
> +
> +static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as)
> +{
> +    space->as = as;
> +    QLIST_INIT(&space->containers);
> +}
> +
>  struct VFIOGroup;
>  
>  typedef struct VFIOType1 {
> @@ -142,6 +158,7 @@ typedef struct VFIOType1 {
>  } VFIOType1;
>  
>  typedef struct VFIOContainer {
> +    VFIOAddressSpace *space;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      struct {
>          /* enable abstraction to support various iommu backends */
> @@ -234,9 +251,6 @@ static const VFIORomBlacklistEntry romblacklist[] = {
>  
>  #define MSIX_CAP_LENGTH 12
>  
> -static QLIST_HEAD(, VFIOContainer)
> -    container_list = QLIST_HEAD_INITIALIZER(container_list);
> -
>  static QLIST_HEAD(, VFIOGroup)
>      group_list = QLIST_HEAD_INITIALIZER(group_list);
>  
> @@ -3280,16 +3294,15 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
>  #endif
>  }
>  
> -static int vfio_connect_container(VFIOGroup *group)
> +static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>  {
>      VFIOContainer *container;
>      int ret, fd;
> +    VFIOAddressSpace *space;
>  
> -    if (group->container) {
> -        return 0;
> -    }
> +    space = &vfio_address_space_memory;
>  
> -    QLIST_FOREACH(container, &container_list, next) {
> +    QLIST_FOREACH(container, &space->containers, next) {
>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>              group->container = container;
>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> @@ -3312,6 +3325,7 @@ static int vfio_connect_container(VFIOGroup *group)
>      }
>  
>      container = g_malloc0(sizeof(*container));
> +    container->space = space;
>      container->fd = fd;
>  
>      if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
> @@ -3349,7 +3363,7 @@ static int vfio_connect_container(VFIOGroup *group)
>      }
>  
>      QLIST_INIT(&container->group_list);
> -    QLIST_INSERT_HEAD(&container_list, container, next);
> +    QLIST_INSERT_HEAD(&space->containers, container, next);
>  
>      group->container = container;
>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> @@ -3392,7 +3406,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>      }
>  }
>  
> -static VFIOGroup *vfio_get_group(int groupid)
> +static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as)
>  {
>      VFIOGroup *group;
>      char path[32];
> @@ -3400,7 +3414,14 @@ static VFIOGroup *vfio_get_group(int groupid)
>  
>      QLIST_FOREACH(group, &group_list, next) {
>          if (group->groupid == groupid) {
> -            return group;
> +            /* Found it.  Now is it already in the right context? */
> +            if (group->container->space->as == as) {
> +                return group;
> +            } else {
> +                error_report("vfio: group %d used in multiple address spaces",
> +                             group->groupid);
> +                return NULL;
> +            }
>          }
>      }
>  
> @@ -3428,7 +3449,7 @@ static VFIOGroup *vfio_get_group(int groupid)
>      group->groupid = groupid;
>      QLIST_INIT(&group->device_list);
>  
> -    if (vfio_connect_container(group)) {
> +    if (vfio_connect_container(group, as)) {
>          error_report("vfio: failed to setup container for group %d", groupid);
>          goto close_fd_exit;
>      }
> @@ -3780,7 +3801,12 @@ static int vfio_initfn(PCIDevice *pdev)
>      DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
>              vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
>  
> -    group = vfio_get_group(groupid);
> +    if (pci_device_iommu_address_space(pdev) != &address_space_memory) {
> +        error_report("vfio: DMA address space must be system memory");
> +        return -EINVAL;
> +    }
> +
> +    group = vfio_get_group(groupid, &address_space_memory);
>      if (!group) {
>          error_report("vfio: failed to get group %d", groupid);
>          return -ENOENT;
> @@ -3994,6 +4020,7 @@ static const TypeInfo vfio_pci_dev_info = {
>  
>  static void register_vfio_pci_dev_type(void)
>  {
> +    vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
>      type_register_static(&vfio_pci_dev_info);
>  }
>  


* Re: [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64
  2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
                   ` (10 preceding siblings ...)
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 11/11] spapr-vfio: enable for spapr Alexey Kardashevskiy
@ 2014-03-19 20:12 ` Alex Williamson
  11 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2014-03-19 20:12 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> Yet another try with VFIO on SPAPR (server PPC64).
> As the previous try was a long time ago, I did not bother with
> the change log much as all of this requires review again. Also,
> it depends on these 2 patchsets which I cannot get reviewed yet
> (keep pinging...):
> [PATCH] spapr-iommu: extend SPAPR_TCE_TABLE class
> [PATCH 0/4] spapr-pci: prepare for vfio
> 
> This does not include VFIO KVM device support as the host kernel
> part is not there yet because a bigger rework of the host VFIO driver
> is going to happen soon.
> 
> 
> Alex (Williamson), if you find it possible, please "ack" or "rb" as much
> as you can. Thanks!

It's probably not a good time to expect reviews while we're in the 2.0
freeze, and this is obviously post-2.0 material.  I would suggest that
you split this series up, target specific developers and maintainers,
and try to work changes in parallel (after the freeze) though.

I'm open to the possibility of accepting a number of the vfio patches
regardless of the spapr stuff you're blocked on, but the patch series
leads with an int128 change that at least needs to be ack'd by someone
like Paolo, and a memory change that may be nice but doesn't block
anything else and needs an ack from someone other than me.  A number of
the vfio patches have no dependency on these.  Patch 10 seems to start
getting into more optional stuff, but then you depend on patch 11.

Split, prioritize, figure out and reduce dependencies and work patches
in parallel.  Send patches that are actionable and you'll likely get
more response.  Anything that requires coordination with others or is
blocked by architecture specific stuff is likely to get as much
attention as an RFC.  Thanks,

Alex

> Changes:
> v5:
> * rebase on top of the current upstream
> 
> v4:
> * addressed all comments from Alex Williamson
> * moved spapr-pci-phb-vfio-phb to new file
> * split spapr-pci-phb-vfio to many smaller patches
> 
> 
> Alexey Kardashevskiy (7):
>   int128: add int128_exts64()
>   vfio: Fix 128 bit handling
>   vfio: rework to have error paths
>   spapr-iommu: add SPAPR VFIO IOMMU device
>   spapr vfio: add vfio_container_spapr_get_info()
>   spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio
>   spapr-vfio: enable for spapr
> 
> David Gibson (4):
>   memory: Sanity check that no listeners remain on a destroyed
>     AddressSpace
>   vfio: Introduce VFIO address spaces
>   vfio: Create VFIOAddressSpace objects as needed
>   vfio: Add guest side IOMMU support
> 
>  hw/misc/vfio.c              | 338 +++++++++++++++++++++++++++++++++++++-------
>  hw/ppc/Makefile.objs        |   2 +-
>  hw/ppc/spapr_iommu.c        |  97 +++++++++++++
>  hw/ppc/spapr_pci_vfio.c     | 206 +++++++++++++++++++++++++++
>  include/hw/misc/vfio.h      |  11 ++
>  include/hw/pci-host/spapr.h |  13 ++
>  include/hw/ppc/spapr.h      |   5 +
>  include/qemu/int128.h       |   5 +
>  memory.c                    |   7 +
>  9 files changed, 633 insertions(+), 51 deletions(-)
>  create mode 100644 hw/ppc/spapr_pci_vfio.c
>  create mode 100644 include/hw/misc/vfio.h
> 


* Re: [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
  2014-03-19 19:57   ` Alex Williamson
@ 2014-03-20  5:25     ` David Gibson
  2014-03-28  5:12       ` Alexey Kardashevskiy
  2014-03-21  7:59     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 42+ messages in thread
From: David Gibson @ 2014-03-20  5:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf


On Wed, Mar 19, 2014 at 01:57:41PM -0600, Alex Williamson wrote:
> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> > From: David Gibson <david@gibson.dropbear.id.au>
[snip]
> > +    if (!memory_region_is_ram(mr)) {
> > +        DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
> > +                xlat);
> > +        return;
> > +    }
> > +    if (len & iotlb->addr_mask) {
> > +        DPRINTF("iommu has granularity incompatible with target AS\n");
> 
> Is this possible?  Assuming len is initially a power-of-2, would the
> translate function change it?  Maybe worth a comment to explain.

translate can absolutely change the length.  It will generally
truncate it to the IOMMU page size, in fact.
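
To make the failure case concrete, here is a rough sketch. clamp_to_page() is a hypothetical stand-in for the truncation that happens during translation, not the real address_space_translate(); granularity_mismatch() is the same test the patch applies:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/*
 * Hypothetical stand-in for the clamping done during translation:
 * the returned length never crosses the end of the target page.
 */
static hwaddr clamp_to_page(hwaddr addr, hwaddr len, hwaddr page_size)
{
    hwaddr room = page_size - (addr & (page_size - 1));

    return len < room ? len : room;
}

/* Same test as in the patch: true when the clamped length is unusable. */
static bool granularity_mismatch(hwaddr len, hwaddr addr_mask)
{
    return (len & addr_mask) != 0;
}
```

With a 64K guest IOMMU page (addr_mask 0xffff) translated through a target that works in 4K pages, len comes back as 0x1000 and the check fires; if both sides use 64K pages, len stays 0x10000 and the check passes.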

[snip]
> > +        DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> > +                iova, int128_get64(int128_sub(llend, int128_one())));
> > +        /*
> > +         * FIXME: We should do some checking to see if the
> > +         * capabilities of the host VFIO IOMMU are adequate to model
> > +         * the guest IOMMU
> > +         *
> > +         * FIXME: This assumes that the guest IOMMU is empty of
> > +         * mappings at this point - we should either enforce this, or
> > +         * loop through existing mappings to map them into VFIO.
> > +         *
> > +         * FIXME: For VFIO iommu types which have KVM acceleration to
> > +         * avoid bouncing all map/unmaps through qemu this way, this
> > +         * would be the right place to wire that up (tell the KVM
> > +         * device emulation the VFIO iommu handles to use).
> > +         */
> 
> That's a lot of FIXMEs...  The second one in particular looks like it
> needs to expand a bit on why this is likely a valid assumption.  The
> last one is more of a TODO than a FIXME.

I think #2 isn't a valid assumption in general.  It was true for the
situation I was testing at the time, due to the order of pseries
initialization, so I left it to get a proof of concept reasonably
quickly.

But I think that one's a FIXME that actually needs to be fixed.

[snip]
> > +        /*
> > +         * FIXME: We assume the one big unmap below is adequate to
> > +         * remove any individual page mappings in the IOMMU which
> > +         * might have been copied into VFIO.  That may not be true for
> > +         * all IOMMU types
> > +         */
> 
> We assume this because the IOVA that gets unmapped is the same
> regardless of whether a guest IOMMU is present?

Uh.. no.  This assumption works for a page table based IOMMU where a
big unmap just flattens a large range of IO-PTEs.  It might not work
for some kind of extent or TLB based IOMMU, where operations are
expected to exactly match the addresses of map operations.

I don't know if IOMMUs that have trouble with this are a realistic
prospect, but they're at least a theoretical possibility, hence the
comment.
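
As a toy illustration of the difference (both models below are invented for this mail and do not correspond to any real IOMMU or VFIO backend): in the page-table model one big unmap clears every entry in the range, while in the extent model an unmap only succeeds if it exactly matches an earlier map.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES   16
#define NEXTENTS 8

/* Toy page-table IOMMU: one flag per page. */
typedef struct { bool mapped[NPAGES]; } PageTableIOMMU;

static void pt_set(PageTableIOMMU *t, size_t first, size_t count, bool val)
{
    for (size_t i = first; i < first + count && i < NPAGES; i++) {
        t->mapped[i] = val;
    }
}

static size_t pt_count(const PageTableIOMMU *t)
{
    size_t n = 0;

    for (size_t i = 0; i < NPAGES; i++) {
        n += t->mapped[i];
    }
    return n;
}

/* Toy extent-based IOMMU: each map is remembered as a whole extent. */
typedef struct { size_t first, count; bool used; } Extent;
typedef struct { Extent e[NEXTENTS]; } ExtentIOMMU;

static bool ext_map(ExtentIOMMU *t, size_t first, size_t count)
{
    for (size_t i = 0; i < NEXTENTS; i++) {
        if (!t->e[i].used) {
            t->e[i] = (Extent) { first, count, true };
            return true;
        }
    }
    return false; /* table full */
}

/* An unmap must match a previous map exactly, otherwise it fails. */
static bool ext_unmap(ExtentIOMMU *t, size_t first, size_t count)
{
    for (size_t i = 0; i < NEXTENTS; i++) {
        if (t->e[i].used && t->e[i].first == first &&
            t->e[i].count == count) {
            t->e[i].used = false;
            return true;
        }
    }
    return false;
}

static size_t ext_count(const ExtentIOMMU *t)
{
    size_t n = 0;

    for (size_t i = 0; i < NEXTENTS; i++) {
        n += t->e[i].used;
    }
    return n;
}
```

Map pages 0 and 5 individually and then unmap the whole range: the page-table model ends up empty, the extent model still holds both mappings.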

> 
> > +    }
> > +
> >      iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> >      end = (section->offset_within_address_space + int128_get64(section->size)) &
> >            TARGET_PAGE_MASK;
> 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v5 02/11] int128: add int128_exts64()
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 02/11] int128: add int128_exts64() Alexey Kardashevskiy
@ 2014-03-20 10:19   ` Paolo Bonzini
  0 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2014-03-20 10:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Alex Williamson
  Cc: qemu-ppc, qemu-devel, Alexander Graf

Il 12/03/2014 06:52, Alexey Kardashevskiy ha scritto:
> This adds a function to sign-extend a signed 64-bit value to a signed 128-bit value.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v2:
> * (.hi = (a >> 63) ? -1 : 0) changed to (.hi = (a < 0) ? -1 : 0)
> ---
>  include/qemu/int128.h | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/include/qemu/int128.h b/include/qemu/int128.h
> index 9ed47aa..ef87e5e 100644
> --- a/include/qemu/int128.h
> +++ b/include/qemu/int128.h
> @@ -38,6 +38,11 @@ static inline Int128 int128_2_64(void)
>      return (Int128) { 0, 1 };
>  }
>
> +static inline Int128 int128_exts64(int64_t a)
> +{
> +    return (Int128) { .lo = a, .hi = (a < 0) ? -1 : 0 };
> +}
> +
>  static inline Int128 int128_and(Int128 a, Int128 b)
>  {
>      return (Int128) { a.lo & b.lo, a.hi & b.hi };
>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

who just preferred a simple "a >> 63" but is not going to be picky.
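
For comparison, the two forms side by side. This is a sketch with a local stand-in for Int128 (assumed to be a {lo, hi} pair as in the hunk above) and made-up function names; note that the shift form relies on >> of a negative value being an arithmetic shift, which C leaves implementation-defined:

```c
#include <stdint.h>

/* Minimal stand-in for QEMU's Int128, assumed here to be {lo, hi}. */
typedef struct {
    uint64_t lo;
    int64_t hi;
} Int128;

/* Form used in the patch: explicit sign test. */
static Int128 exts64_cmp(int64_t a)
{
    return (Int128) { .lo = (uint64_t)a, .hi = (a < 0) ? -1 : 0 };
}

/* Shift form: correct wherever >> on a signed value is arithmetic. */
static Int128 exts64_shift(int64_t a)
{
    return (Int128) { .lo = (uint64_t)a, .hi = a >> 63 };
}
```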


* Re: [Qemu-devel] [PATCH v5 01/11] memory: Sanity check that no listeners remain on a destroyed AddressSpace
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 01/11] memory: Sanity check that no listeners remain on a destroyed AddressSpace Alexey Kardashevskiy
@ 2014-03-20 10:20   ` Paolo Bonzini
  2014-03-20 11:45     ` David Gibson
  2014-03-27  5:40     ` Alexey Kardashevskiy
  0 siblings, 2 replies; 42+ messages in thread
From: Paolo Bonzini @ 2014-03-20 10:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Alex Williamson
  Cc: David Gibson, qemu-ppc, Alexander Graf, qemu-devel

Il 12/03/2014 06:52, Alexey Kardashevskiy ha scritto:
> From: David Gibson <david@gibson.dropbear.id.au>
>
> At the moment, most AddressSpace objects last as long as the guest system
> in practice, but that could well change in future.  In addition, for VFIO
> we will be introducing some private per-AddressSpace information, which must
> be disposed of before the AddressSpace itself is destroyed.
>
> To reduce the chances of subtle bugs in this area, this patch adds
> assertions to ensure that when an AddressSpace is destroyed, there are no
> remaining MemoryListeners using that AS as a filter.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  memory.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/memory.c b/memory.c
> index 3f1df23..678661e 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1722,12 +1722,19 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
>
>  void address_space_destroy(AddressSpace *as)
>  {
> +    MemoryListener *listener;
> +
>      /* Flush out anything from MemoryListeners listening in on this */
>      memory_region_transaction_begin();
>      as->root = NULL;
>      memory_region_transaction_commit();
>      QTAILQ_REMOVE(&address_spaces, as, address_spaces_link);
>      address_space_destroy_dispatch(as);
> +
> +    QTAILQ_FOREACH(listener, &memory_listeners, link) {
> +        assert(listener->address_space_filter != as);
> +    }
> +
>      flatview_unref(as->current_map);
>      g_free(as->name);
>      g_free(as->ioeventfds);
>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

An alternative is to add a count of listeners to the address space and 
assert that it is 0.
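
Roughly like this. It is only a sketch of the idea: the structures are cut down to the relevant fields, and listener_count and the sketch_* functions are hypothetical, not existing memory API:

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the real structures. */
typedef struct AddressSpace {
    const char *name;
    unsigned listener_count;   /* hypothetical field, not in the patch */
} AddressSpace;

typedef struct MemoryListener {
    AddressSpace *address_space_filter;
} MemoryListener;

/* Bump the count when a listener starts filtering on an address space. */
static void sketch_listener_register(MemoryListener *l, AddressSpace *filter)
{
    l->address_space_filter = filter;
    if (filter) {
        filter->listener_count++;
    }
}

/* Drop the count again when the listener goes away. */
static void sketch_listener_unregister(MemoryListener *l)
{
    if (l->address_space_filter) {
        assert(l->address_space_filter->listener_count > 0);
        l->address_space_filter->listener_count--;
        l->address_space_filter = NULL;
    }
}

/* The check that destruction would perform instead of walking the list. */
static void sketch_address_space_check_destroy(const AddressSpace *as)
{
    assert(as->listener_count == 0);
}
```

The check at destroy time becomes O(1), at the cost of having to keep the count right on every register and unregister.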

Paolo


* Re: [Qemu-devel] [PATCH v5 03/11] vfio: Fix 128 bit handling
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 03/11] vfio: Fix 128 bit handling Alexey Kardashevskiy
@ 2014-03-20 10:20   ` Paolo Bonzini
  0 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2014-03-20 10:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Alex Williamson
  Cc: qemu-ppc, qemu-devel, Alexander Graf

Il 12/03/2014 06:52, Alexey Kardashevskiy ha scritto:
> Upcoming VFIO on SPAPR PPC64 support will initialize the IOMMU
> memory region with a size of UINT64_MAX, which the memory API treats
> as 2^64 bytes, so int128_get64() will assert.
>
> The patch takes care of this check. The existing type1 IOMMU code
> is not expected to map all 64 bits of RAM so the patch does not
> touch that part.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v3:
> * 64bit @end is calculated from 128-bit @llend instead of repeating
> the same calculation steps
>
> v2:
> * used new function int128_exts64()
> ---
>  hw/misc/vfio.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index c2c688c..029a100 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -2251,6 +2251,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      VFIOContainer *container = container_of(listener, VFIOContainer,
>                                              iommu_data.type1.listener);
>      hwaddr iova, end;
> +    Int128 llend;
>      void *vaddr;
>      int ret;
>
> @@ -2271,13 +2272,15 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>
>      iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> -    end = (section->offset_within_address_space + int128_get64(section->size)) &
> -          TARGET_PAGE_MASK;
> +    llend = int128_make64(section->offset_within_address_space);
> +    llend = int128_add(llend, section->size);
> +    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
>
> -    if (iova >= end) {
> +    if (int128_ge(int128_make64(iova), llend)) {
>          return;
>      }
>
> +    end = int128_get64(llend);
>      vaddr = memory_region_get_ram_ptr(section->mr) +
>              section->offset_within_region +
>              (iova - section->offset_within_address_space);
>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
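
A worked example of why the mask has to be sign-extended, as a sketch with a local two-word Int128 and made-up i128_* helpers (not the real int128.h), and assuming TARGET_PAGE_MASK is a negative signed value such as -4096 for 4K pages:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local two-word stand-in for Int128. */
typedef struct { uint64_t lo; int64_t hi; } Int128;

static Int128 i128_make64(uint64_t a) { return (Int128) { a, 0 }; }
static Int128 i128_2_64(void)         { return (Int128) { 0, 1 }; }

/* Sign-extend a 64-bit value into both words. */
static Int128 i128_exts64(int64_t a)
{
    return (Int128) { (uint64_t)a, (a < 0) ? -1 : 0 };
}

static Int128 i128_add(Int128 a, Int128 b)
{
    uint64_t lo = a.lo + b.lo;

    /* (lo < a.lo) is the carry out of the low word. */
    return (Int128) { lo,
        (int64_t)((uint64_t)a.hi + (uint64_t)b.hi + (lo < a.lo)) };
}

static Int128 i128_and(Int128 a, Int128 b)
{
    return (Int128) { a.lo & b.lo, a.hi & b.hi };
}

static bool i128_ge(Int128 a, Int128 b)
{
    return a.hi > b.hi || (a.hi == b.hi && a.lo >= b.lo);
}

/* Page-aligned end of a section, computed as in the hunk above. */
static Int128 section_llend(uint64_t offset, Int128 size, int64_t page_mask)
{
    Int128 llend = i128_add(i128_make64(offset), size);

    return i128_and(llend, i128_exts64(page_mask));
}
```

For offset 0 and a size of 2^64, llend stays at 2^64 (hi = 1, lo = 0) because the sign-extended mask has all ones in the high word. A zero-extended mask would clear the high word, llend would collapse to 0, and the "iova >= llend" test would wrongly skip the section.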


* Re: [Qemu-devel] [PATCH v5 01/11] memory: Sanity check that no listeners remain on a destroyed AddressSpace
  2014-03-20 10:20   ` Paolo Bonzini
@ 2014-03-20 11:45     ` David Gibson
  2014-03-27  5:40     ` Alexey Kardashevskiy
  1 sibling, 0 replies; 42+ messages in thread
From: David Gibson @ 2014-03-20 11:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	qemu-devel


On Thu, Mar 20, 2014 at 11:20:09AM +0100, Paolo Bonzini wrote:
> Il 12/03/2014 06:52, Alexey Kardashevskiy ha scritto:
> >From: David Gibson <david@gibson.dropbear.id.au>
> >
> >At the moment, most AddressSpace objects last as long as the guest system
> >in practice, but that could well change in future.  In addition, for VFIO
> >we will be introducing some private per-AddressSpace information, which must
> >be disposed of before the AddressSpace itself is destroyed.
> >
> >To reduce the chances of subtle bugs in this area, this patch adds
> >assertions to ensure that when an AddressSpace is destroyed, there are no
> >remaining MemoryListeners using that AS as a filter.
> >
> >Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> >Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >---
> > memory.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> >diff --git a/memory.c b/memory.c
> >index 3f1df23..678661e 100644
> >--- a/memory.c
> >+++ b/memory.c
> >@@ -1722,12 +1722,19 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
> >
> > void address_space_destroy(AddressSpace *as)
> > {
> >+    MemoryListener *listener;
> >+
> >     /* Flush out anything from MemoryListeners listening in on this */
> >     memory_region_transaction_begin();
> >     as->root = NULL;
> >     memory_region_transaction_commit();
> >     QTAILQ_REMOVE(&address_spaces, as, address_spaces_link);
> >     address_space_destroy_dispatch(as);
> >+
> >+    QTAILQ_FOREACH(listener, &memory_listeners, link) {
> >+        assert(listener->address_space_filter != as);
> >+    }
> >+
> >     flatview_unref(as->current_map);
> >     g_free(as->name);
> >     g_free(as->ioeventfds);
> >
> 
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> An alternative is to add a count of listeners to the address space
> and assert that it is 0.

Address space destruction should be pretty rare, so I don't think it's
worth the bother of adding a count, which just adds the possibility of
bugs updating it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
  2014-03-19 19:57   ` Alex Williamson
  2014-03-20  5:25     ` David Gibson
@ 2014-03-21  7:59     ` Alexey Kardashevskiy
  2014-03-21 14:17       ` Alex Williamson
  1 sibling, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-21  7:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexander Graf, Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson

On 03/20/2014 06:57 AM, Alex Williamson wrote:
> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
>> From: David Gibson <david@gibson.dropbear.id.au>
>>
>> This patch uses the new IOMMU notifiers to allow VFIO pass through devices
>> to work with guest side IOMMUs, as long as the host-side VFIO iommu has
>> sufficient capability and granularity to match the guest side. This works
>> by tracking all map and unmap operations on the guest IOMMU using the
>> notifiers, and mirroring them into VFIO.
>>
>> There are a number of FIXMEs, and the scheme involves rather more notifier
>> structures than I'd like, but it should make for a reasonable proof of
>> concept.
>>
>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>
>> ---
>> Changes:
>> v4:
>> * fixed list objects naming
>> * vfio_listener_region_add() reworked to call memory_region_ref() from one
>> place only, it is also easier to review the changes
>> * fixes boundary check not to fail on sections == 2^64 bytes,
>> the "vfio: Fix debug output for int128 values" patch is required;
>> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
>> for 128 bit memory section sizes" patch proposal
>> ---
>>  hw/misc/vfio.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 120 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>> index 038010b..4f6f5da 100644
>> --- a/hw/misc/vfio.c
>> +++ b/hw/misc/vfio.c
>> @@ -159,10 +159,18 @@ typedef struct VFIOContainer {
>>          };
>>          void (*release)(struct VFIOContainer *);
>>      } iommu_data;
>> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>      QLIST_HEAD(, VFIOGroup) group_list;
>>      QLIST_ENTRY(VFIOContainer) next;
>>  } VFIOContainer;
>>  
>> +typedef struct VFIOGuestIOMMU {
>> +    VFIOContainer *container;
>> +    MemoryRegion *iommu;
>> +    Notifier n;
>> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>> +} VFIOGuestIOMMU;
>> +
>>  /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
>>  typedef struct VFIOMSIXInfo {
>>      uint8_t table_bar;
>> @@ -2241,8 +2249,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>  
>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>  {
>> -    return !memory_region_is_ram(section->mr) ||
>> -           /*
>> +    return (!memory_region_is_ram(section->mr) &&
>> +            !memory_region_is_iommu(section->mr)) ||
>> +        /*
> 
> White space damage
> 
>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>              * addresses in the upper part of the 64-bit address space.  These
>>              * are never accessed by the CPU and beyond the address width of
>> @@ -2251,6 +2260,61 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>             section->offset_within_address_space & (1ULL << 63);
>>  }
>>  
>> +static void vfio_iommu_map_notify(Notifier *n, void *data)
>> +{
>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>> +    VFIOContainer *container = giommu->container;
>> +    IOMMUTLBEntry *iotlb = data;
>> +    MemoryRegion *mr;
>> +    hwaddr xlat;
>> +    hwaddr len = iotlb->addr_mask + 1;
>> +    void *vaddr;
>> +    int ret;
>> +
>> +    DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>> +            iotlb->iova, iotlb->iova + iotlb->addr_mask);
>> +
>> +    /*
>> +     * The IOMMU TLB entry we have just covers translation through
>> +     * this IOMMU to its immediate target.  We need to translate
>> +     * it the rest of the way through to memory.
>> +     */
>> +    mr = address_space_translate(&address_space_memory,
>> +                                 iotlb->translated_addr,
>> +                                 &xlat, &len, iotlb->perm & IOMMU_WO);
> 
> Write-only?  Is this supposed to be read-write to mask just 2 bits?


The last parameter of address_space_translate() is "bool is_write", so I do
not really understand the problem here.


>> +    if (!memory_region_is_ram(mr)) {
>> +        DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
>> +                xlat);
>> +        return;
>> +    }
>> +    if (len & iotlb->addr_mask) {
>> +        DPRINTF("iommu has granularity incompatible with target AS\n");
> 
> Is this possible?  Assuming len is initially a power-of-2, would the
> translate function change it?  Maybe worth a comment to explain.


Oh. address_space_translate() actually changes @len to min(len,
TARGET_PAGE_SIZE), and TARGET_PAGE_SIZE is hardcoded to 4K. So far that was
ok, but lately I have been implementing a huge DMA window (plus one
sPAPRTCETable and one VFIOGuestIOMMU object) which currently operates with
16MB pages (it can do 64K pages too), and now this "granularity
incompatible" error is happening.
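
To make the failure concrete, here is a minimal standalone sketch of the
check. It is not QEMU code: clamp_len() and granularity_ok() are
hypothetical helpers, PAGE_4K stands in for TARGET_PAGE_SIZE, and the
clamp is simplified to min(len, page size) as described above.

```c
#include <stdint.h>

#define PAGE_4K 0x1000ULL /* stand-in for TARGET_PAGE_SIZE (assumption) */

/* Simplified model of the clamp: never return more than one 4K page. */
static uint64_t clamp_len(uint64_t len)
{
    return len < PAGE_4K ? len : PAGE_4K;
}

/*
 * Model of the check in vfio_iommu_map_notify(): the translated length
 * must have no bits set below the IOMMU page size.
 */
static int granularity_ok(uint64_t len, uint64_t addr_mask)
{
    return (len & addr_mask) == 0;
}
```

With a 4K IOMMU page (addr_mask 0xfff) the clamped length 0x1000 passes.
With a 16MB page (addr_mask 0xffffff) the same clamped length has bit 12
set inside the mask, so the check fails although the unclamped 16MB
length would have passed.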

I disabled that check but I need to think of a better fix...

Adding Paolo to Cc, maybe he can pick up the context and give a good piece
of advice :)



> 
>> +        return;
>> +    }
>> +
>> +    vaddr = memory_region_get_ram_ptr(mr) + xlat;
> 
> This lookup isn't free and the unmap path doesn't need it, maybe move
> the variable and lookup into the first branch below?
> 
>> +
>> +    if (iotlb->perm != IOMMU_NONE) {
>> +        ret = vfio_dma_map(container, iotlb->iova,
>> +                           iotlb->addr_mask + 1, vaddr,
>> +                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
>> +        if (ret) {
>> +            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>> +                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
>> +                         container, iotlb->iova,
>> +                         iotlb->addr_mask + 1, vaddr, ret);
>> +        }
>> +    } else {
>> +        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
>> +        if (ret) {
>> +            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> +                         "0x%"HWADDR_PRIx") = %d (%m)",
>> +                         container, iotlb->iova,
>> +                         iotlb->addr_mask + 1, ret);
>> +        }
>> +    }
>> +}
>> +
>>  static void vfio_listener_region_add(MemoryListener *listener,
>>                                       MemoryRegionSection *section)
>>  {
>> @@ -2261,8 +2325,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>      void *vaddr;
>>      int ret;
>>  
>> -    assert(!memory_region_is_iommu(section->mr));
>> -
>>      if (vfio_listener_skipped_section(section)) {
>>          DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
>>                  section->offset_within_address_space,
>> @@ -2286,15 +2348,47 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>          return;
>>      }
>>  
>> +    memory_region_ref(section->mr);
>> +
>> +    if (memory_region_is_iommu(section->mr)) {
>> +        VFIOGuestIOMMU *giommu;
>> +
>> +        DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>> +                iova, int128_get64(int128_sub(llend, int128_one())));
>> +        /*
>> +         * FIXME: We should do some checking to see if the
>> +         * capabilities of the host VFIO IOMMU are adequate to model
>> +         * the guest IOMMU
>> +         *
>> +         * FIXME: This assumes that the guest IOMMU is empty of
>> +         * mappings at this point - we should either enforce this, or
>> +         * loop through existing mappings to map them into VFIO.
>> +         *
>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>> +         * avoid bouncing all map/unmaps through qemu this way, this
>> +         * would be the right place to wire that up (tell the KVM
>> +         * device emulation the VFIO iommu handles to use).
>> +         */
> 
> That's a lot of FIXMEs...  The second one in particular looks like it
> needs to expand a bit on why this is likely a valid assumption.  The
> last one is more of a TODO than a FIXME.
> 
>> +        giommu = g_malloc0(sizeof(*giommu));
>> +        giommu->iommu = section->mr;
>> +        giommu->container = container;
>> +        giommu->n.notify = vfio_iommu_map_notify;
>> +        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>> +        memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>> +
>> +        return;
>> +    }
>> +
>> +    /* Here we assume that memory_region_is_ram(section->mr)==true */
>> +
>>      end = int128_get64(llend);
>>      vaddr = memory_region_get_ram_ptr(section->mr) +
>>              section->offset_within_region +
>>              (iova - section->offset_within_address_space);
>>  
>> -    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
>> +    DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
>>              iova, end - 1, vaddr);
>>  
>> -    memory_region_ref(section->mr);
>>      ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
>>      if (ret) {
>>          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>> @@ -2338,6 +2432,26 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>          return;
>>      }
>>  
>> +    if (memory_region_is_iommu(section->mr)) {
>> +        VFIOGuestIOMMU *giommu;
>> +
>> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>> +            if (giommu->iommu == section->mr) {
>> +                memory_region_unregister_iommu_notifier(&giommu->n);
>> +                QLIST_REMOVE(giommu, giommu_next);
>> +                g_free(giommu);
>> +                break;
>> +            }
>> +        }
>> +
>> +        /*
>> +         * FIXME: We assume the one big unmap below is adequate to
>> +         * remove any individual page mappings in the IOMMU which
>> +         * might have been copied into VFIO.  That may not be true for
>> +         * all IOMMU types
>> +         */
> 
> We assume this because the IOVA that gets unmapped is the same
> regardless of whether a guest IOMMU is present?


What exactly is meant by "guest IOMMU is present"? Now that I am working on
the second DMA window, I am really confused about the terminology :(



>> +    }
>> +
>>      iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>>      end = (section->offset_within_address_space + int128_get64(section->size)) &
>>            TARGET_PAGE_MASK;
> 
> 
> 


-- 
Alexey

* Re: [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
  2014-03-21  7:59     ` Alexey Kardashevskiy
@ 2014-03-21 14:17       ` Alex Williamson
  2014-03-21 14:23         ` Paolo Bonzini
  2014-03-28  4:49         ` Alexey Kardashevskiy
  0 siblings, 2 replies; 42+ messages in thread
From: Alex Williamson @ 2014-03-21 14:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alexander Graf, Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson

On Fri, 2014-03-21 at 18:59 +1100, Alexey Kardashevskiy wrote:
> On 03/20/2014 06:57 AM, Alex Williamson wrote:
> > On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> >> From: David Gibson <david@gibson.dropbear.id.au>
> >>
> >> This patch uses the new IOMMU notifiers to allow VFIO pass through devices
> >> to work with guest side IOMMUs, as long as the host-side VFIO iommu has
> >> sufficient capability and granularity to match the guest side. This works
> >> by tracking all map and unmap operations on the guest IOMMU using the
> >> notifiers, and mirroring them into VFIO.
> >>
> >> There are a number of FIXMEs, and the scheme involves rather more notifier
> >> structures than I'd like, but it should make for a reasonable proof of
> >> concept.
> >>
> >> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>
> >> ---
> >> Changes:
> >> v4:
> >> * fixed list objects naming
> >> * vfio_listener_region_add() reworked to call memory_region_ref() from one
> >> place only, it is also easier to review the changes
> >> * fixes boundary check not to fail on sections == 2^64 bytes,
> >> the "vfio: Fix debug output for int128 values" patch is required;
> >> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
> >> for 128 bit memory section sizes" patch proposal
> >> ---
> >>  hw/misc/vfio.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>  1 file changed, 120 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> >> index 038010b..4f6f5da 100644
> >> --- a/hw/misc/vfio.c
> >> +++ b/hw/misc/vfio.c
> >> @@ -159,10 +159,18 @@ typedef struct VFIOContainer {
> >>          };
> >>          void (*release)(struct VFIOContainer *);
> >>      } iommu_data;
> >> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> >>      QLIST_HEAD(, VFIOGroup) group_list;
> >>      QLIST_ENTRY(VFIOContainer) next;
> >>  } VFIOContainer;
> >>  
> >> +typedef struct VFIOGuestIOMMU {
> >> +    VFIOContainer *container;
> >> +    MemoryRegion *iommu;
> >> +    Notifier n;
> >> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> >> +} VFIOGuestIOMMU;
> >> +
> >>  /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
> >>  typedef struct VFIOMSIXInfo {
> >>      uint8_t table_bar;
> >> @@ -2241,8 +2249,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >>  
> >>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>  {
> >> -    return !memory_region_is_ram(section->mr) ||
> >> -           /*
> >> +    return (!memory_region_is_ram(section->mr) &&
> >> +            !memory_region_is_iommu(section->mr)) ||
> >> +        /*
> > 
> > White space damage
> > 
> >>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
> >>              * addresses in the upper part of the 64-bit address space.  These
> >>              * are never accessed by the CPU and beyond the address width of
> >> @@ -2251,6 +2260,61 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>             section->offset_within_address_space & (1ULL << 63);
> >>  }
> >>  
> >> +static void vfio_iommu_map_notify(Notifier *n, void *data)
> >> +{
> >> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> >> +    VFIOContainer *container = giommu->container;
> >> +    IOMMUTLBEntry *iotlb = data;
> >> +    MemoryRegion *mr;
> >> +    hwaddr xlat;
> >> +    hwaddr len = iotlb->addr_mask + 1;
> >> +    void *vaddr;
> >> +    int ret;
> >> +
> >> +    DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> >> +            iotlb->iova, iotlb->iova + iotlb->addr_mask);
> >> +
> >> +    /*
> >> +     * The IOMMU TLB entry we have just covers translation through
> >> +     * this IOMMU to its immediate target.  We need to translate
> >> +     * it the rest of the way through to memory.
> >> +     */
> >> +    mr = address_space_translate(&address_space_memory,
> >> +                                 iotlb->translated_addr,
> >> +                                 &xlat, &len, iotlb->perm & IOMMU_WO);
> > 
> > Write-only?  Is this supposed to be read-write to mask just 2 bits?
> 
> 
> The last parameter of address_space_translate() is "bool is_write", so I do
> not really understand the problem here.

Oops, my bad, I didn't look at what address_space_translate() used that
for.  Ok.

> >> +    if (!memory_region_is_ram(mr)) {
> >> +        DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
> >> +                xlat);
> >> +        return;
> >> +    }
> >> +    if (len & iotlb->addr_mask) {
> >> +        DPRINTF("iommu has granularity incompatible with target AS\n");
> > 
> > Is this possible?  Assuming len is initially a power-of-2, would the
> > translate function change it?  Maybe worth a comment to explain.
> 
> 
> Oh. address_space_translate() actually changes @len to min(len,
> TARGET_PAGE_SIZE), and TARGET_PAGE_SIZE is hardcoded to 4K. So far that was
> ok, but lately I have been implementing a huge DMA window (plus one
> sPAPRTCETable and one VFIOGuestIOMMU object) which currently operates with
> 16MB pages (it can do 64K pages too), and now this "granularity
> incompatible" error is happening.
> 
> I disabled that check but I need to think of a better fix...
> 
> Adding Paolo to Cc, maybe he can pick up the context and give a good piece
> of advice :)
> 
> 
> 
> > 
> >> +        return;
> >> +    }
> >> +
> >> +    vaddr = memory_region_get_ram_ptr(mr) + xlat;
> > 
> > This lookup isn't free and the unmap path doesn't need it, maybe move
> > the variable and lookup into the first branch below?
> > 
> >> +
> >> +    if (iotlb->perm != IOMMU_NONE) {
> >> +        ret = vfio_dma_map(container, iotlb->iova,
> >> +                           iotlb->addr_mask + 1, vaddr,
> >> +                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
> >> +        if (ret) {
> >> +            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >> +                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
> >> +                         container, iotlb->iova,
> >> +                         iotlb->addr_mask + 1, vaddr, ret);
> >> +        }
> >> +    } else {
> >> +        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> >> +        if (ret) {
> >> +            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >> +                         "0x%"HWADDR_PRIx") = %d (%m)",
> >> +                         container, iotlb->iova,
> >> +                         iotlb->addr_mask + 1, ret);
> >> +        }
> >> +    }
> >> +}
> >> +
> >>  static void vfio_listener_region_add(MemoryListener *listener,
> >>                                       MemoryRegionSection *section)
> >>  {
> >> @@ -2261,8 +2325,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>      void *vaddr;
> >>      int ret;
> >>  
> >> -    assert(!memory_region_is_iommu(section->mr));
> >> -
> >>      if (vfio_listener_skipped_section(section)) {
> >>          DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
> >>                  section->offset_within_address_space,
> >> @@ -2286,15 +2348,47 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>          return;
> >>      }
> >>  
> >> +    memory_region_ref(section->mr);
> >> +
> >> +    if (memory_region_is_iommu(section->mr)) {
> >> +        VFIOGuestIOMMU *giommu;
> >> +
> >> +        DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> >> +                iova, int128_get64(int128_sub(llend, int128_one())));
> >> +        /*
> >> +         * FIXME: We should do some checking to see if the
> >> +         * capabilities of the host VFIO IOMMU are adequate to model
> >> +         * the guest IOMMU
> >> +         *
> >> +         * FIXME: This assumes that the guest IOMMU is empty of
> >> +         * mappings at this point - we should either enforce this, or
> >> +         * loop through existing mappings to map them into VFIO.
> >> +         *
> >> +         * FIXME: For VFIO iommu types which have KVM acceleration to
> >> +         * avoid bouncing all map/unmaps through qemu this way, this
> >> +         * would be the right place to wire that up (tell the KVM
> >> +         * device emulation the VFIO iommu handles to use).
> >> +         */
> > 
> > That's a lot of FIXMEs...  The second one in particular looks like it
> > needs to expand a bit on why this is likely a valid assumption.  The
> > last one is more of a TODO than a FIXME.
> > 
> >> +        giommu = g_malloc0(sizeof(*giommu));
> >> +        giommu->iommu = section->mr;
> >> +        giommu->container = container;
> >> +        giommu->n.notify = vfio_iommu_map_notify;
> >> +        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> >> +        memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> >> +
> >> +        return;
> >> +    }
> >> +
> >> +    /* Here we assume that memory_region_is_ram(section->mr)==true */
> >> +
> >>      end = int128_get64(llend);
> >>      vaddr = memory_region_get_ram_ptr(section->mr) +
> >>              section->offset_within_region +
> >>              (iova - section->offset_within_address_space);
> >>  
> >> -    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
> >> +    DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
> >>              iova, end - 1, vaddr);
> >>  
> >> -    memory_region_ref(section->mr);
> >>      ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
> >>      if (ret) {
> >>          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >> @@ -2338,6 +2432,26 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>          return;
> >>      }
> >>  
> >> +    if (memory_region_is_iommu(section->mr)) {
> >> +        VFIOGuestIOMMU *giommu;
> >> +
> >> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> >> +            if (giommu->iommu == section->mr) {
> >> +                memory_region_unregister_iommu_notifier(&giommu->n);
> >> +                QLIST_REMOVE(giommu, giommu_next);
> >> +                g_free(giommu);
> >> +                break;
> >> +            }
> >> +        }
> >> +
> >> +        /*
> >> +         * FIXME: We assume the one big unmap below is adequate to
> >> +         * remove any individual page mappings in the IOMMU which
> >> +         * might have been copied into VFIO.  That may not be true for
> >> +         * all IOMMU types
> >> +         */
> > 
> > We assume this because the IOVA that gets unmapped is the same
> > regardless of whether a guest IOMMU is present?
> 
> 
> What exactly is meant by "guest IOMMU is present"? Now that I am working on
> the second DMA window, I am really confused about the terminology :(

The confusion for me is that region_add initializes the giommu and all
the DMA mapping through VFIO is done in the notifier for the giommu.
It's therefore asymmetric that region_add doesn't vfio_dma_map anything,
but region_del does vfio_dma_unmap, which is the basis of my question.
I thought this comment was trying to address why that is, but apparently
it's something else entirely, so it would be nice to understand why the
IOMMU branch in region_del doesn't return, and to spell out a bit more
clearly what the FIXME is trying to say.  Thanks,

Alex

> >> +    }
> >> +
> >>      iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> >>      end = (section->offset_within_address_space + int128_get64(section->size)) &
> >>            TARGET_PAGE_MASK;
> > 
> > 
> > 
> 
> 

* Re: [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
  2014-03-21 14:17       ` Alex Williamson
@ 2014-03-21 14:23         ` Paolo Bonzini
  2014-03-28  4:49         ` Alexey Kardashevskiy
  1 sibling, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2014-03-21 14:23 UTC (permalink / raw)
  To: Alex Williamson, Alexey Kardashevskiy
  Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson

On 21/03/2014 15:17, Alex Williamson wrote:
>> > > Is this possible?  Assuming len is initially a power-of-2, would the
>> > > translate function change it?  Maybe worth a comment to explain.
>>
>>
>> Oh. address_space_translate() actually changes @len to min(len,
>> TARGET_PAGE_SIZE), and TARGET_PAGE_SIZE is hardcoded to 4K. So far that was
>> ok, but lately I have been implementing a huge DMA window (plus one
>> sPAPRTCETable and one VFIOGuestIOMMU object) which currently operates with
>> 16MB pages (it can do 64K pages too), and now this "granularity
>> incompatible" error is happening.
>>
>> I disabled that check but I need to think of a better fix...
>>
>> Adding Paolo to Cc, maybe he can pick up the context and give a good piece
>> of advice :)

I think that "if" in address_space_translate should be restricted to 
Xen, since that's what it was added for.
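
A standalone sketch of that idea, not a patch: the xen flag is passed in
as a parameter (a hypothetical stand-in for a real Xen check), and
translate_len() models only the length clamp, assuming it limits the
access to the end of the current 4K page.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K 0x1000ULL
#define PAGE_MASK_4K (~(PAGE_SIZE_4K - 1))

/*
 * Model of the length handling only: under Xen, clamp to the end of the
 * page containing addr; otherwise leave the caller's length alone.
 */
static uint64_t translate_len(uint64_t addr, uint64_t len, bool xen)
{
    if (xen) {
        uint64_t page = ((addr & PAGE_MASK_4K) + PAGE_SIZE_4K) - addr;

        if (page < len) {
            len = page;
        }
    }
    return len;
}
```

In this model a 16MB request stays 16MB when the Xen flag is off, so the
VFIO granularity check would see the full IOMMU page size.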

Paolo

* Re: [Qemu-devel] [PATCH v5 01/11] memory: Sanity check that no listeners remain on a destroyed AddressSpace
  2014-03-20 10:20   ` Paolo Bonzini
  2014-03-20 11:45     ` David Gibson
@ 2014-03-27  5:40     ` Alexey Kardashevskiy
  2014-03-27 12:15       ` Paolo Bonzini
  1 sibling, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-27  5:40 UTC (permalink / raw)
  To: Paolo Bonzini, Alex Williamson
  Cc: David Gibson, qemu-ppc, Alexander Graf, qemu-devel

On 03/20/2014 09:20 PM, Paolo Bonzini wrote:
> On 12/03/2014 06:52, Alexey Kardashevskiy wrote:
>> From: David Gibson <david@gibson.dropbear.id.au>
>>
>> At the moment, most AddressSpace objects last as long as the guest system
>> in practice, but that could well change in future.  In addition, for VFIO
>> we will be introducing some private per-AddressSpace information, which must
>> be disposed of before the AddressSpace itself is destroyed.
>>
>> To reduce the chances of subtle bugs in this area, this patch adds
>> assertions to ensure that when an AddressSpace is destroyed, there are no
>> remaining MemoryListeners using that AS as a filter.
>>
>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  memory.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/memory.c b/memory.c
>> index 3f1df23..678661e 100644
>> --- a/memory.c
>> +++ b/memory.c
>> @@ -1722,12 +1722,19 @@ void address_space_init(AddressSpace *as,
>> MemoryRegion *root, const char *name)
>>
>>  void address_space_destroy(AddressSpace *as)
>>  {
>> +    MemoryListener *listener;
>> +
>>      /* Flush out anything from MemoryListeners listening in on this */
>>      memory_region_transaction_begin();
>>      as->root = NULL;
>>      memory_region_transaction_commit();
>>      QTAILQ_REMOVE(&address_spaces, as, address_spaces_link);
>>      address_space_destroy_dispatch(as);
>> +
>> +    QTAILQ_FOREACH(listener, &memory_listeners, link) {
>> +        assert(listener->address_space_filter != as);
>> +    }
>> +
>>      flatview_unref(as->current_map);
>>      g_free(as->name);
>>      g_free(as->ioeventfds);
>>
> 
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


What happens next with this patch and the next one ("int128: add
int128_exts64()")? I mean, who do you expect to pull them? Alex Graf? :) Thanks.


> 
> An alternative is to add a count of listeners to the address space and
> assert that it is 0.
> 
> Paolo




-- 
Alexey

* Re: [Qemu-devel] [PATCH v5 01/11] memory: Sanity check that no listeners remain on a destroyed AddressSpace
  2014-03-27  5:40     ` Alexey Kardashevskiy
@ 2014-03-27 12:15       ` Paolo Bonzini
  0 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2014-03-27 12:15 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Alex Williamson
  Cc: David Gibson, qemu-ppc, Alexander Graf, qemu-devel

On 27/03/2014 06:40, Alexey Kardashevskiy wrote:
> On 03/20/2014 09:20 PM, Paolo Bonzini wrote:
>> On 12/03/2014 06:52, Alexey Kardashevskiy wrote:
>>> From: David Gibson <david@gibson.dropbear.id.au>
>>>
>>> At the moment, most AddressSpace objects last as long as the guest system
>>> in practice, but that could well change in future.  In addition, for VFIO
>>> we will be introducing some private per-AddressSpace information, which must
>>> be disposed of before the AddressSpace itself is destroyed.
>>>
>>> To reduce the chances of subtle bugs in this area, this patch adds
>>> assertions to ensure that when an AddressSpace is destroyed, there are no
>>> remaining MemoryListeners using that AS as a filter.
>>>
>>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>  memory.c | 7 +++++++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/memory.c b/memory.c
>>> index 3f1df23..678661e 100644
>>> --- a/memory.c
>>> +++ b/memory.c
>>> @@ -1722,12 +1722,19 @@ void address_space_init(AddressSpace *as,
>>> MemoryRegion *root, const char *name)
>>>
>>>  void address_space_destroy(AddressSpace *as)
>>>  {
>>> +    MemoryListener *listener;
>>> +
>>>      /* Flush out anything from MemoryListeners listening in on this */
>>>      memory_region_transaction_begin();
>>>      as->root = NULL;
>>>      memory_region_transaction_commit();
>>>      QTAILQ_REMOVE(&address_spaces, as, address_spaces_link);
>>>      address_space_destroy_dispatch(as);
>>> +
>>> +    QTAILQ_FOREACH(listener, &memory_listeners, link) {
>>> +        assert(listener->address_space_filter != as);
>>> +    }
>>> +
>>>      flatview_unref(as->current_map);
>>>      g_free(as->name);
>>>      g_free(as->ioeventfds);
>>>
>>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>
>
> What happens next with this patch and the next one ("int128: add
> int128_exts64()")? I mean, who do you expect to pull them? Alex Graf? :) Thanks.

Either him, or Alex Williamson.

Paolo

* Re: [Qemu-devel] [PATCH v5 05/11] vfio: Introduce VFIO address spaces
  2014-03-19 19:57   ` Alex Williamson
@ 2014-03-28  3:42     ` Alexey Kardashevskiy
  2014-03-31 19:14       ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-28  3:42 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 03/20/2014 06:57 AM, Alex Williamson wrote:
> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
>> From: David Gibson <david@gibson.dropbear.id.au>
>>
>> The only model so far supported for VFIO passthrough devices is the model
>> usually used on x86, where all of the guest's RAM is mapped into the
>> (host) IOMMU and there is no IOMMU visible in the guest.
>>
>> This patch begins to relax this model, introducing the notion of a
>> VFIOAddressSpace.  This represents a logical DMA address space which will
>> be visible to one or more VFIO devices by appropriate mapping in the (host)
>> IOMMU.  Thus the currently global list of containers becomes local to
>> a VFIOAddressSpace, and we verify that we don't attempt to add a VFIO
>> group to multiple address spaces.
>>
>> For now, only one VFIOAddressSpace is created and used, corresponding to
>> main system memory, that will change in future patches.
>>
>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>
>> ---
>> Changes:
>> v5:
>> * vfio_get_group() now receives AddressSpace*
>>
>> v4:
>> * removed redundant checks and asserts
>> * fixed some return error codes
>> ---
>>  hw/misc/vfio.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 40 insertions(+), 13 deletions(-)
>>
>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>> index 6a04c2a..c8236c3 100644
>> --- a/hw/misc/vfio.c
>> +++ b/hw/misc/vfio.c
>> @@ -133,6 +133,22 @@ enum {
>>      VFIO_INT_MSIX = 3,
>>  };
>>  
>> +typedef struct VFIOAddressSpace {
>> +    AddressSpace *as;
>> +    QLIST_HEAD(, VFIOContainer) containers;
>> +    QLIST_ENTRY(VFIOAddressSpace) list;
>> +} VFIOAddressSpace;
>> +
>> +QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
> 
> Why isn't this static and initialized with QLIST_HEAD_INITIALIZER like
> the qlist it replaces?


I'll use QLIST_HEAD_INITIALIZER, ok.


It is not static because otherwise it does not compile: the actual use of
vfio_address_spaces only happens in the next patches. Should I make it
static and move it to the next patch? Thanks.
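
For reference, a minimal standalone sketch of the static, initialized
form. It is written against the BSD-style <sys/queue.h> LIST_* macros
rather than QEMU's QLIST_*, and SpaceSketch, space_add() and
space_count() are made-up names for the example. The assumption about
the build failure is that an unused file-scope static triggers an
unused-variable warning which the build treats as an error, so the list
head needs at least one user in the same patch.

```c
#include <stddef.h>
#include <sys/queue.h>

typedef struct SpaceSketch {
    int id;
    LIST_ENTRY(SpaceSketch) list;
} SpaceSketch;

/* File-local list head, statically initialized to empty. */
static LIST_HEAD(, SpaceSketch) space_list = LIST_HEAD_INITIALIZER(space_list);

/* A user in the same file keeps the static list head referenced. */
static void space_add(SpaceSketch *space)
{
    LIST_INSERT_HEAD(&space_list, space, list);
}

/* Walk the list and count its entries. */
static int space_count(void)
{
    SpaceSketch *space;
    int n = 0;

    LIST_FOREACH(space, &space_list, list) {
        n++;
    }
    return n;
}
```

With the static initializer no separate init call is needed before the
first insert.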



>> +
>> +static VFIOAddressSpace vfio_address_space_memory;
>> +
>> +static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as)
>> +{
>> +    space->as = as;
>> +    QLIST_INIT(&space->containers);
>> +}
>> +
>>  struct VFIOGroup;
>>  
>>  typedef struct VFIOType1 {
>> @@ -142,6 +158,7 @@ typedef struct VFIOType1 {
>>  } VFIOType1;
>>  
>>  typedef struct VFIOContainer {
>> +    VFIOAddressSpace *space;
>>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>      struct {
>>          /* enable abstraction to support various iommu backends */
>> @@ -234,9 +251,6 @@ static const VFIORomBlacklistEntry romblacklist[] = {
>>  
>>  #define MSIX_CAP_LENGTH 12
>>  
>> -static QLIST_HEAD(, VFIOContainer)
>> -    container_list = QLIST_HEAD_INITIALIZER(container_list);
>> -
>>  static QLIST_HEAD(, VFIOGroup)
>>      group_list = QLIST_HEAD_INITIALIZER(group_list);
>>  
>> @@ -3280,16 +3294,15 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
>>  #endif
>>  }
>>  
>> -static int vfio_connect_container(VFIOGroup *group)
>> +static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>  {
>>      VFIOContainer *container;
>>      int ret, fd;
>> +    VFIOAddressSpace *space;
>>  
>> -    if (group->container) {
>> -        return 0;
>> -    }
>> +    space = &vfio_address_space_memory;
>>  
>> -    QLIST_FOREACH(container, &container_list, next) {
>> +    QLIST_FOREACH(container, &space->containers, next) {
>>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>>              group->container = container;
>>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> @@ -3312,6 +3325,7 @@ static int vfio_connect_container(VFIOGroup *group)
>>      }
>>  
>>      container = g_malloc0(sizeof(*container));
>> +    container->space = space;
>>      container->fd = fd;
>>  
>>      if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
>> @@ -3349,7 +3363,7 @@ static int vfio_connect_container(VFIOGroup *group)
>>      }
>>  
>>      QLIST_INIT(&container->group_list);
>> -    QLIST_INSERT_HEAD(&container_list, container, next);
>> +    QLIST_INSERT_HEAD(&space->containers, container, next);
>>  
>>      group->container = container;
>>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> @@ -3392,7 +3406,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>      }
>>  }
>>  
>> -static VFIOGroup *vfio_get_group(int groupid)
>> +static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as)
>>  {
>>      VFIOGroup *group;
>>      char path[32];
>> @@ -3400,7 +3414,14 @@ static VFIOGroup *vfio_get_group(int groupid)
>>  
>>      QLIST_FOREACH(group, &group_list, next) {
>>          if (group->groupid == groupid) {
>> -            return group;
>> +            /* Found it.  Now is it already in the right context? */
>> +            if (group->container->space->as == as) {
>> +                return group;
>> +            } else {
>> +                error_report("vfio: group %d used in multiple address spaces",
>> +                             group->groupid);
>> +                return NULL;
>> +            }
>>          }
>>      }
>>  
>> @@ -3428,7 +3449,7 @@ static VFIOGroup *vfio_get_group(int groupid)
>>      group->groupid = groupid;
>>      QLIST_INIT(&group->device_list);
>>  
>> -    if (vfio_connect_container(group)) {
>> +    if (vfio_connect_container(group, as)) {
>>          error_report("vfio: failed to setup container for group %d", groupid);
>>          goto close_fd_exit;
>>      }
>> @@ -3780,7 +3801,12 @@ static int vfio_initfn(PCIDevice *pdev)
>>      DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
>>              vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
>>  
>> -    group = vfio_get_group(groupid);
>> +    if (pci_device_iommu_address_space(pdev) != &address_space_memory) {
>> +        error_report("vfio: DMA address space must be system memory");
>> +        return -EINVAL;
>> +    }
>> +
>> +    group = vfio_get_group(groupid, &address_space_memory);
>>      if (!group) {
>>          error_report("vfio: failed to get group %d", groupid);
>>          return -ENOENT;
>> @@ -3994,6 +4020,7 @@ static const TypeInfo vfio_pci_dev_info = {
>>  
>>  static void register_vfio_pci_dev_type(void)
>>  {
>> +    vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
>>      type_register_static(&vfio_pci_dev_info);
>>  }
>>  
> 
> 
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
  2014-03-21 14:17       ` Alex Williamson
  2014-03-21 14:23         ` Paolo Bonzini
@ 2014-03-28  4:49         ` Alexey Kardashevskiy
  2014-03-31 19:54           ` Alex Williamson
  1 sibling, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-28  4:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexander Graf, Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson

On 03/22/2014 01:17 AM, Alex Williamson wrote:
> On Fri, 2014-03-21 at 18:59 +1100, Alexey Kardashevskiy wrote:
>> On 03/20/2014 06:57 AM, Alex Williamson wrote:
>>> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
>>>> From: David Gibson <david@gibson.dropbear.id.au>
>>>>
>>>> This patch uses the new IOMMU notifiers to allow VFIO pass through devices
>>>> to work with guest side IOMMUs, as long as the host-side VFIO iommu has
>>>> sufficient capability and granularity to match the guest side. This works
>>>> by tracking all map and unmap operations on the guest IOMMU using the
>>>> notifiers, and mirroring them into VFIO.
>>>>
>>>> There are a number of FIXMEs, and the scheme involves rather more notifier
>>>> structures than I'd like, but it should make for a reasonable proof of
>>>> concept.
>>>>
>>>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>
>>>> ---
>>>> Changes:
>>>> v4:
>>>> * fixed list objects naming
>>>> * vfio_listener_region_add() reworked to call memory_region_ref() from one
>>>> place only, it is also easier to review the changes
>>>> * fixes boundary check not to fail on sections == 2^64 bytes,
>>>> the "vfio: Fix debug output for int128 values" patch is required;
>>>> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
>>>> for 128 bit memory section sizes" patch proposal
>>>> ---
>>>>  hw/misc/vfio.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>  1 file changed, 120 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>>>> index 038010b..4f6f5da 100644
>>>> --- a/hw/misc/vfio.c
>>>> +++ b/hw/misc/vfio.c
>>>> @@ -159,10 +159,18 @@ typedef struct VFIOContainer {
>>>>          };
>>>>          void (*release)(struct VFIOContainer *);
>>>>      } iommu_data;
>>>> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>>>      QLIST_HEAD(, VFIOGroup) group_list;
>>>>      QLIST_ENTRY(VFIOContainer) next;
>>>>  } VFIOContainer;
>>>>  
>>>> +typedef struct VFIOGuestIOMMU {
>>>> +    VFIOContainer *container;
>>>> +    MemoryRegion *iommu;
>>>> +    Notifier n;
>>>> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>>>> +} VFIOGuestIOMMU;
>>>> +
>>>>  /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
>>>>  typedef struct VFIOMSIXInfo {
>>>>      uint8_t table_bar;
>>>> @@ -2241,8 +2249,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>>>  
>>>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>  {
>>>> -    return !memory_region_is_ram(section->mr) ||
>>>> -           /*
>>>> +    return (!memory_region_is_ram(section->mr) &&
>>>> +            !memory_region_is_iommu(section->mr)) ||
>>>> +        /*
>>>
>>> White space damage
>>>
>>>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>>>              * addresses in the upper part of the 64-bit address space.  These
>>>>              * are never accessed by the CPU and beyond the address width of
>>>> @@ -2251,6 +2260,61 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>             section->offset_within_address_space & (1ULL << 63);
>>>>  }
>>>>  
>>>> +static void vfio_iommu_map_notify(Notifier *n, void *data)
>>>> +{
>>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>> +    VFIOContainer *container = giommu->container;
>>>> +    IOMMUTLBEntry *iotlb = data;
>>>> +    MemoryRegion *mr;
>>>> +    hwaddr xlat;
>>>> +    hwaddr len = iotlb->addr_mask + 1;
>>>> +    void *vaddr;
>>>> +    int ret;
>>>> +
>>>> +    DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>>>> +            iotlb->iova, iotlb->iova + iotlb->addr_mask);
>>>> +
>>>> +    /*
>>>> +     * The IOMMU TLB entry we have just covers translation through
>>>> +     * this IOMMU to its immediate target.  We need to translate
>>>> +     * it the rest of the way through to memory.
>>>> +     */
>>>> +    mr = address_space_translate(&address_space_memory,
>>>> +                                 iotlb->translated_addr,
>>>> +                                 &xlat, &len, iotlb->perm & IOMMU_WO);
>>>
>>> Write-only?  Is this supposed to be read-write to mask just 2 bits?
>>
>>
>> The last parameter of address_space_translate() bool is_write. So I do not
>> really understand the problem here.
> 
> Oops, my bad, I didn't look at what address_space_translate() used that
> for.  Ok.
> 
>>>> +    if (!memory_region_is_ram(mr)) {
>>>> +        DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
>>>> +                xlat);
>>>> +        return;
>>>> +    }
>>>> +    if (len & iotlb->addr_mask) {
>>>> +        DPRINTF("iommu has granularity incompatible with target AS\n");
>>>
>>> Is this possible?  Assuming len is initially a power-of-2, would the
>>> translate function change it?  Maybe worth a comment to explain.
>>
>>
>> Oh. address_space_translate() actually changes @len to min(len,
>> TARGET_PAGE_SIZE) and TARGET_PAGE_SIZE is hardcoded to 4K. So far it was ok
>> but lately I have been implementing a huge DMA window (plus one
>> sPAPRTCETable and one VFIOGuestIOMMU objects) which currently operates with
>> 16MB pages (can do 64K pages too) and now this "granularity incompatible"
>> is happening.
>>
>> I disabled that check but I need to think of better fix...
>>
>> Adding Paolo to cc, may be he picks the context and gives good piece of
>> advise :)
>>
>>
>>
>>>
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    vaddr = memory_region_get_ram_ptr(mr) + xlat;
>>>
>>> This lookup isn't free and the unmap path doesn't need it, maybe move
>>> the variable and lookup into the first branch below?
>>>
>>>> +
>>>> +    if (iotlb->perm != IOMMU_NONE) {
>>>> +        ret = vfio_dma_map(container, iotlb->iova,
>>>> +                           iotlb->addr_mask + 1, vaddr,
>>>> +                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
>>>> +        if (ret) {
>>>> +            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>>>> +                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
>>>> +                         container, iotlb->iova,
>>>> +                         iotlb->addr_mask + 1, vaddr, ret);
>>>> +        }
>>>> +    } else {
>>>> +        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
>>>> +        if (ret) {
>>>> +            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>>>> +                         "0x%"HWADDR_PRIx") = %d (%m)",
>>>> +                         container, iotlb->iova,
>>>> +                         iotlb->addr_mask + 1, ret);
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>>  static void vfio_listener_region_add(MemoryListener *listener,
>>>>                                       MemoryRegionSection *section)
>>>>  {
>>>> @@ -2261,8 +2325,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>      void *vaddr;
>>>>      int ret;
>>>>  
>>>> -    assert(!memory_region_is_iommu(section->mr));
>>>> -
>>>>      if (vfio_listener_skipped_section(section)) {
>>>>          DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
>>>>                  section->offset_within_address_space,
>>>> @@ -2286,15 +2348,47 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>          return;
>>>>      }
>>>>  
>>>> +    memory_region_ref(section->mr);
>>>> +
>>>> +    if (memory_region_is_iommu(section->mr)) {
>>>> +        VFIOGuestIOMMU *giommu;
>>>> +
>>>> +        DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>>>> +                iova, int128_get64(int128_sub(llend, int128_one())));
>>>> +        /*
>>>> +         * FIXME: We should do some checking to see if the
>>>> +         * capabilities of the host VFIO IOMMU are adequate to model
>>>> +         * the guest IOMMU
>>>> +         *
>>>> +         * FIXME: This assumes that the guest IOMMU is empty of
>>>> +         * mappings at this point - we should either enforce this, or
>>>> +         * loop through existing mappings to map them into VFIO.
>>>> +         *
>>>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>>>> +         * avoid bouncing all map/unmaps through qemu this way, this
>>>> +         * would be the right place to wire that up (tell the KVM
>>>> +         * device emulation the VFIO iommu handles to use).
>>>> +         */
>>>
>>> That's a lot of FIXMEs...  The second one in particular looks like it
>>> needs to expand a bit on why this is likely a valid assumption.  The
>>> last one is more of a TODO than a FIXME.
>>>
>>>> +        giommu = g_malloc0(sizeof(*giommu));
>>>> +        giommu->iommu = section->mr;
>>>> +        giommu->container = container;
>>>> +        giommu->n.notify = vfio_iommu_map_notify;
>>>> +        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>>>> +        memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>>>> +
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /* Here we assume that memory_region_is_ram(section->mr)==true */
>>>> +
>>>>      end = int128_get64(llend);
>>>>      vaddr = memory_region_get_ram_ptr(section->mr) +
>>>>              section->offset_within_region +
>>>>              (iova - section->offset_within_address_space);
>>>>  
>>>> -    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
>>>> +    DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
>>>>              iova, end - 1, vaddr);
>>>>  
>>>> -    memory_region_ref(section->mr);
>>>>      ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
>>>>      if (ret) {
>>>>          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>>>> @@ -2338,6 +2432,26 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>>>          return;
>>>>      }
>>>>  
>>>> +    if (memory_region_is_iommu(section->mr)) {
>>>> +        VFIOGuestIOMMU *giommu;
>>>> +
>>>> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>>>> +            if (giommu->iommu == section->mr) {
>>>> +                memory_region_unregister_iommu_notifier(&giommu->n);
>>>> +                QLIST_REMOVE(giommu, giommu_next);
>>>> +                g_free(giommu);
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +
>>>> +        /*
>>>> +         * FIXME: We assume the one big unmap below is adequate to
>>>> +         * remove any individual page mappings in the IOMMU which
>>>> +         * might have been copied into VFIO.  That may not be true for
>>>> +         * all IOMMU types
>>>> +         */
>>>
>>> We assume this because the IOVA that gets unmapped is the same
>>> regardless of whether a guest IOMMU is present?
>>
>>
>> What exactly is meant by "guest IOMMU is present"? Doing the second DMA
>> window, now I am really confused about terminology :(
> 
> The confusion for me is that add_region initializes the giommu and all
> the DMA mapping through VFIO is done in the notifier for the giommu.
> It's therefore asymmetric that add_region doesn't vfio_dma_map anything,
> but region_del does vfio_dma_unmap, which is the basis of my question.
> I thought this comment was trying to address why that is, but apparently
> it's something else entirely, so it would be nice to understand why this
> doesn't return() and decode a bit more clearly what the FIXME is trying
> to say.  Thanks,

I do not mind extending that comment, but I need some assistance.

My understanding is:
===
region_del is called when the memory region is removed, so this particular
window is not going to be used anymore, and this is the only place where
such cleanup can be done.
===

The asymmetry is there, but it does not look terrible.
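
To make the asymmetry concrete, here is a minimal sketch of the flow as
discussed in this thread. It is a toy model, not the hw/misc/vfio.c code:
ToyContainer, the toy_* functions, and the 16-page window are all
hypothetical names and values chosen for illustration. region_add only
hooks up the notifier, individual mappings arrive through the notifier, and
region_del drops the notifier and then does one big unmap.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define WINDOW_PAGES 16           /* hypothetical DMA window size, in pages */

/* Toy stand-in for the host VFIO container: one flag per IOVA page. */
typedef struct ToyContainer {
    bool mapped[WINDOW_PAGES];
    bool notifier_registered;
} ToyContainer;

/* region_add for an IOMMU region: only hook up the notifier, map nothing. */
static void toy_region_add(ToyContainer *c)
{
    c->notifier_registered = true;
}

/* Notifier: mirrors one guest map/unmap into the container. */
static int toy_iommu_notify(ToyContainer *c, uint64_t iova, bool map)
{
    uint64_t page = iova >> PAGE_SHIFT;

    if (!c->notifier_registered || page >= WINDOW_PAGES) {
        return -1;
    }
    c->mapped[page] = map;
    return 0;
}

/* region_del: drop the notifier, then one big unmap over the whole window. */
static void toy_region_del(ToyContainer *c)
{
    c->notifier_registered = false;
    for (size_t i = 0; i < WINDOW_PAGES; i++) {
        c->mapped[i] = false;
    }
}

/* Number of pages still mapped; expected to be 0 after toy_region_del(). */
static size_t toy_mapped_pages(const ToyContainer *c)
{
    size_t n = 0;

    for (size_t i = 0; i < WINDOW_PAGES; i++) {
        n += c->mapped[i];
    }
    return n;
}
```

In this model the big unmap in toy_region_del() is the only point where
mappings left behind by the notifier can be cleaned up, which is the reason
region_del cannot simply return early for IOMMU regions.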


> Alex
> 
>>>> +    }
>>>> +
>>>>      iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>>>>      end = (section->offset_within_address_space + int128_get64(section->size)) &
>>>>            TARGET_PAGE_MASK;



-- 
Alexey


* Re: [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
  2014-03-20  5:25     ` David Gibson
@ 2014-03-28  5:12       ` Alexey Kardashevskiy
  2014-03-31 19:59         ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-28  5:12 UTC (permalink / raw)
  To: David Gibson, Alex Williamson; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 03/20/2014 04:25 PM, David Gibson wrote:
> On Wed, Mar 19, 2014 at 01:57:41PM -0600, Alex Williamson wrote:
>> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
>>> From: David Gibson <david@gibson.dropbear.id.au>
> [snip]
>>> +    if (!memory_region_is_ram(mr)) {
>>> +        DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
>>> +                xlat);
>>> +        return;
>>> +    }
>>> +    if (len & iotlb->addr_mask) {
>>> +        DPRINTF("iommu has granularity incompatible with target AS\n");
>>
>> Is this possible?  Assuming len is initially a power-of-2, would the
>> translate function change it?  Maybe worth a comment to explain.
> 
> translate can absolutely change the length.  It will generally
> truncate it to the IOMMU page size, in fact.
> 
> [snip]
>>> +        DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>>> +                iova, int128_get64(int128_sub(llend, int128_one())));
>>> +        /*
>>> +         * FIXME: We should do some checking to see if the
>>> +         * capabilities of the host VFIO IOMMU are adequate to model
>>> +         * the guest IOMMU
>>> +         *
>>> +         * FIXME: This assumes that the guest IOMMU is empty of
>>> +         * mappings at this point - we should either enforce this, or
>>> +         * loop through existing mappings to map them into VFIO.
>>> +         *
>>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>>> +         * avoid bouncing all map/unmaps through qemu this way, this
>>> +         * would be the right place to wire that up (tell the KVM
>>> +         * device emulation the VFIO iommu handles to use).
>>> +         */
>>
>> That's a lot of FIXMEs...  The second one in particular looks like it
>> needs to expand a bit on why this is likely a valid assumption.  The
>> last one is more of a TODO than a FIXME.
> 
> I think #2 isn't a valid assumption in general.  It was true for the
> situation I was testing at the time, due to the order of pseries
> initialization, so I left it to get a proof of concept reasonably
> quickly.
> 
> But I think that one's a FIXME that actually needs to be fixed.


But how?

At the moment, for SPAPR, the table gets cleaned when a group is attached to
a container (the VFIO_GROUP_SET_CONTAINER ioctl), which happens right before
the memory listener is registered, so we are fine (at least for SPAPR).
Is that true for x86, or do we need something more?

"loop through existing mappings to map them into VFIO" - this I do not
really understand. We do not track mappings in QEMU, so we cannot really loop
through them.
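
For reference, a rough sketch of what "loop through existing mappings"
could look like if the guest table contents were available to QEMU as a
flat array. That is an assumption made only for this sketch; as said above,
we do not have such tracking on the VFIO path today. ToyTce,
toy_replay_mappings() and the callback type are hypothetical names, not
existing QEMU API.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PERM_NONE  0u        /* entry not mapped */
#define TOY_PERM_RW    3u        /* read and write allowed */

/* Hypothetical guest table entry: translated address plus permission bits. */
typedef struct ToyTce {
    uint64_t translated_addr;
    unsigned perm;
} ToyTce;

/* Callback that would mirror one mapping into VFIO. */
typedef int (*toy_map_fn)(void *opaque, uint64_t iova, uint64_t addr,
                          unsigned perm);

/*
 * Walk the guest table and replay every valid entry through map().
 * Returns the number of entries replayed, or -1 if map() failed.
 */
static int toy_replay_mappings(const ToyTce *table, size_t nentries,
                               uint64_t window_start,
                               toy_map_fn map, void *opaque)
{
    int replayed = 0;

    for (size_t i = 0; i < nentries; i++) {
        uint64_t iova = window_start + ((uint64_t)i << TOY_PAGE_SHIFT);

        if (table[i].perm == TOY_PERM_NONE) {
            continue;
        }
        if (map(opaque, iova, table[i].translated_addr, table[i].perm)) {
            return -1;
        }
        replayed++;
    }
    return replayed;
}

/* Example callback: remembers the last IOVA it was asked to map. */
static int toy_record_last_iova(void *opaque, uint64_t iova, uint64_t addr,
                                unsigned perm)
{
    (void)addr;
    (void)perm;
    *(uint64_t *)opaque = iova;
    return 0;
}
```

Such a replay would have to run right after the notifier is registered, and
it only helps if the table can actually be read from QEMU, which is exactly
the open question here.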



> [snip]
>>> +        /*
>>> +         * FIXME: We assume the one big unmap below is adequate to
>>> +         * remove any individual page mappings in the IOMMU which
>>> +         * might have been copied into VFIO.  That may not be true for
>>> +         * all IOMMU types
>>> +         */
>>
>> We assume this because the IOVA that gets unmapped is the same
>> regardless of whether a guest IOMMU is present?
> 
> Uh.. no.  This assumption works for a page table based IOMMU where a
> big unmap just flattens a large range of IO-PTEs. 


Sorry for my poor English, but how exactly is that different from what Alex
said? An IOVA is a PCI bus address between dma_window_start and
dma_window_start+dma_window_size.
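
To spell out what I mean by that range, a small sketch of the window check.
ToyDmaWindow and toy_iova_in_window() are hypothetical names used only
here; the two fields mirror dma_window_start and dma_window_size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one DMA window on the PCI bus. */
typedef struct ToyDmaWindow {
    uint64_t start;   /* dma_window_start */
    uint64_t size;    /* dma_window_size  */
} ToyDmaWindow;

/* True if [iova, iova + len) lies entirely inside the window. */
static bool toy_iova_in_window(const ToyDmaWindow *w, uint64_t iova,
                               uint64_t len)
{
    if (len == 0 || iova < w->start) {
        return false;
    }
    /* Compare offsets rather than end addresses to avoid overflow. */
    uint64_t off = iova - w->start;
    return off < w->size && len <= w->size - off;
}
```

The check works on offsets into the window so that a window ending at the
top of the 64-bit space does not overflow.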


> It might not work
> for some kind of extent or TLB based IOMMU, where operations are
> expected to exactly match the addresses of map operations.
> 
> I don't know if IOMMUs that have trouble with this are a realistic
> prospect, but they're at least a theoretical possibility, hence the
> comment.
> 
>>
>>> +    }
>>> +
>>>      iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>>>      end = (section->offset_within_address_space + int128_get64(section->size)) &
>>>            TARGET_PAGE_MASK;
>>
>>
>>
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio
  2014-03-19 19:57   ` [Qemu-devel] [PATCH v5 10/11] " Alex Williamson
@ 2014-03-28  6:01     ` Alexey Kardashevskiy
  2014-03-31 20:09       ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-03-28  6:01 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 03/20/2014 06:57 AM, Alex Williamson wrote:
> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
>> The patch adds a spapr-pci-vfio-host-bridge device type
>> which is a PCI Host Bridge with VFIO support. The new device
>> inherits from the spapr-pci-host-bridge device and adds
>> the following properties:
>> 	iommu - IOMMU group ID which represents a Partitionable
>> 	 	Endpoint, QEMU/ppc64 uses a separate PHB for
>> 		an IOMMU group so the guest kernel has to have
>> 		PCI Domain support enabled.
>> 	forceaddr (optional, 0 by default) - forces QEMU to copy
>> 		device:function from the host address as
>> 		certain guest drivers expect devices to appear in
>> 		particular locations;
>> 	mf (optional, 0 by default) - forces multifunction bit for
>> 		the function #0 of a found device, only makes sense
>> 		for multifunction devices and only with the forceaddr
>> 		property set. It would not be required if there
>> 		was a way to know in advance whether a device is
>> 		multifunctional or not.
>> 	scan (optional, 1 by default) - if non-zero, the new PHB walks
>> 		through all non-bridge devices in the group and tries
>> 		adding them to the PHB; if zero, all devices in the group
>> 		have to be configured manually via the QEMU command line.
>>
>> Examples of use:
>> 1) Scan and add all devices from IOMMU group with ID=1 to QEMU's PHB #6:
>> 	-device spapr-pci-vfio-host-bridge,id=DEVICENAME,iommu=1,index=6
>>
>> 2) Configure and Add 3 functions of a multifunctional device to QEMU:
>> (the NEC PCI USB card is used as an example here):
>> 	-device spapr-pci-vfio-host-bridge,id=USB,iommu=4,scan=0,index=7 \
>> 	-device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true
>> 	-device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB
>> 	-device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB
> 
> I won't pretend to predict the reaction of QEMU device architects to
> this, but it seems like the assembly we expect from config files or
> outside utilities, ex. libvirt.  I don't doubt this makes qemu
> commandline usage more palatable, but it seems contrary to some of the
> other things we do in QEMU with devices.  If this is something we need,
> why is it specific to spapr?  IOMMU group can contain multiple devices
> on any platform.  On x86 we could do something similar with a p2p
> bridge, switch, or root port.


At least at the moment, devices are assigned to groups statically on SPAPR
and cannot be moved between groups at all, so it seems like a reasonable
assumption that the user will not mind putting them all into the same QEMU
instance.

I should probably disable "scan" by default though; that would make more
sense for libvirt.
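
For anyone who wants to do the scan outside of QEMU instead, the two checks
it relies on are small. This is a standalone sketch, not the code from the
patch: ToyPciAddr and the toy_* helpers are hypothetical names. It parses a
device name as found under /sys/kernel/iommu_groups/<N>/devices/ and
decides from the sysfs "class" value whether the device is a bridge.

```c
#include <stdbool.h>
#include <stdio.h>

/* Base class code for PCI bridges (top byte of the 24-bit class code). */
#define TOY_PCI_BASE_CLASS_BRIDGE 0x06u

typedef struct ToyPciAddr {
    unsigned domain, bus, dev, fn;
} ToyPciAddr;

/* Parse a sysfs device name such as "0004:00:01.0". */
static bool toy_parse_pci_addr(const char *name, ToyPciAddr *out)
{
    return sscanf(name, "%x:%x:%x.%x",
                  &out->domain, &out->bus, &out->dev, &out->fn) == 4;
}

/* sysfs "class" holds a 24-bit value, e.g. 0x060400 for a PCI-PCI bridge. */
static bool toy_class_is_bridge(unsigned class_code)
{
    return (class_code >> 16) == TOY_PCI_BASE_CLASS_BRIDGE;
}
```

A management tool could use these to build the explicit vfio-pci device
list and start the PHB with scan=0.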


> BTW, the code skips bridges, but doesn't that mean you'll have a hard
> time with forceaddr as you potentially try to overlay devfn from
> multiple buses onto a single bus?
> It also makes the value of this a bit
> more questionable since it seems to fall apart so easily.  Thanks,

These "forceaddr" and "mf" are really debug options. At the very beginning
I had a strong impression that the NEC USB PCI device (2xOHCI and
1xEHCI) does not work properly if it is not multifunctional, but I was
wrong; I have just checked. So I will simply remove them, as they no longer
help even me. Thanks!


> 
> Alex
> 
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v5:
>> * added handling of possible failure of spapr_vfio_new_table()
>>
>> v4:
>> * moved IOMMU changes to separate patches
>> * moved spapr-pci-vfio-host-bridge to new file
>> ---
>>  hw/ppc/Makefile.objs        |   2 +-
>>  hw/ppc/spapr_pci_vfio.c     | 206 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/pci-host/spapr.h |  13 +++
>>  3 files changed, 220 insertions(+), 1 deletion(-)
>>  create mode 100644 hw/ppc/spapr_pci_vfio.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index ea747f0..2239192 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -3,7 +3,7 @@ obj-y += ppc.o ppc_booke.o
>>  # IBM pSeries (sPAPR)
>>  obj-$(CONFIG_PSERIES) += spapr.o spapr_vio.o spapr_events.o
>>  obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
>> -obj-$(CONFIG_PSERIES) += spapr_pci.o
>> +obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_pci_vfio.o
>>  # PowerPC 4xx boards
>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>  obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>> new file mode 100644
>> index 0000000..40f6673
>> --- /dev/null
>> +++ b/hw/ppc/spapr_pci_vfio.c
>> @@ -0,0 +1,206 @@
>> +/*
>> + * QEMU sPAPR PCI host for VFIO
>> + *
>> + * Copyright (c) 2011 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>> + * THE SOFTWARE.
>> + */
>> +#include <sys/types.h>
>> +#include <dirent.h>
>> +
>> +#include "hw/hw.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "hw/misc/vfio.h"
>> +#include "hw/pci/pci_bus.h"
>> +#include "trace.h"
>> +#include "qemu/error-report.h"
>> +
>> +/* sPAPR VFIO */
>> +static Property spapr_phb_vfio_properties[] = {
>> +    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
>> +    DEFINE_PROP_UINT8("scan", sPAPRPHBVFIOState, scan, 1),
>> +    DEFINE_PROP_UINT8("mf", sPAPRPHBVFIOState, enable_multifunction, 0),
>> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBVFIOState, force_addr, 0),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void spapr_pci_vfio_scan(sPAPRPHBVFIOState *svphb, Error **errp)
>> +{
>> +    PCIHostState *phb = PCI_HOST_BRIDGE(svphb);
>> +    char *iommupath;
>> +    DIR *dirp;
>> +    struct dirent *entry;
>> +    Error *error = NULL;
>> +
>> +    if (!svphb->scan) {
>> +        trace_spapr_pci("autoscan disabled for ", svphb->phb.dtbusname);
>> +        return;
>> +    }
>> +
>> +    iommupath = g_strdup_printf("/sys/kernel/iommu_groups/%d/devices/",
>> +                                svphb->iommugroupid);
>> +    if (!iommupath) {
>> +        return;
>> +    }
>> +
>> +    dirp = opendir(iommupath);
>> +    if (!dirp) {
>> +        error_report("spapr-vfio: vfio scan failed on opendir: %m");
>> +        g_free(iommupath);
>> +        return;
>> +    }
>> +
>> +    while ((entry = readdir(dirp)) != NULL) {
>> +        Error *err = NULL;
>> +        char *tmp;
>> +        FILE *deviceclassfile;
>> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
>> +        char addr[32];
>> +        DeviceState *dev;
>> +
>> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
>> +                   &domainid, &busid, &devid, &fnid) != 4) {
>> +            continue;
>> +        }
>> +
>> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
>> +        trace_spapr_pci("Reading device class from ", tmp);
>> +
>> +        deviceclassfile = fopen(tmp, "r");
>> +        if (deviceclassfile) {
>> +            int ret = fscanf(deviceclassfile, "%x", &deviceclass);
>> +            fclose(deviceclassfile);
>> +            if (ret != 1) {
>> +                continue;
>> +            }
>> +        }
>> +        g_free(tmp);
>> +
>> +        if (!deviceclass) {
>> +            continue;
>> +        }
>> +        if ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8)) {
>> +            /* Skip bridges */
>> +            continue;
>> +        }
>> +        trace_spapr_pci("Creating device from ", entry->d_name);
>> +
>> +        dev = qdev_create(&phb->bus->qbus, "vfio-pci");
>> +        if (!dev) {
>> +            trace_spapr_pci("failed to create vfio-pci", entry->d_name);
>> +            continue;
>> +        }
>> +        object_property_parse(OBJECT(dev), entry->d_name, "host", &err);
>> +        if (err != NULL) {
>> +            object_unref(OBJECT(dev));
>> +            continue;
>> +        }
>> +        if (svphb->force_addr) {
>> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
>> +            err = NULL;
>> +            object_property_parse(OBJECT(dev), addr, "addr", &err);
>> +            if (err != NULL) {
>> +                object_unref(OBJECT(dev));
>> +                continue;
>> +            }
>> +        }
>> +        if (svphb->enable_multifunction) {
>> +            qdev_prop_set_bit(dev, "multifunction", 1);
>> +        }
>> +        object_property_set_bool(OBJECT(dev), true, "realized", &error);
>> +        if (error) {
>> +            object_unref(OBJECT(dev));
>> +            error_propagate(errp, error);
>> +            break;
>> +        }
>> +    }
>> +    closedir(dirp);
>> +    g_free(iommupath);
>> +}
>> +
>> +static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>> +{
>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>> +    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
>> +    int ret;
>> +    Error *error = NULL;
>> +
>> +    if (svphb->iommugroupid == -1) {
>> +        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
>> +        return;
>> +    }
>> +
>> +    svphb->phb.tcet = spapr_vfio_new_table(DEVICE(sphb), svphb->phb.dma_liobn);
>> +
>> +    if (!svphb->phb.tcet) {
>> +        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
>> +        return;
>> +    }
>> +
>> +    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
>> +                       sphb->dtbusname);
>> +
>> +    ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
>> +                                        sphb->dma_liobn, svphb->iommugroupid,
>> +                                        &info);
>> +    if (ret) {
>> +        error_setg_errno(errp, -ret,
>> +                         "spapr-vfio: get info from container failed");
>> +        return;
>> +    }
>> +
>> +    svphb->phb.dma_window_start = info.dma32_window_start;
>> +    svphb->phb.dma_window_size = info.dma32_window_size;
>> +
>> +    spapr_pci_vfio_scan(svphb, &error);
>> +    if (error) {
>> +        error_propagate(errp, error);
>> +    }
>> +}
>> +
>> +static void spapr_phb_vfio_reset(DeviceState *qdev)
>> +{
>> +    /* Do nothing */
>> +}
>> +
>> +static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
>> +
>> +    dc->props = spapr_phb_vfio_properties;
>> +    dc->reset = spapr_phb_vfio_reset;
>> +    spc->finish_realize = spapr_phb_vfio_finish_realize;
>> +}
>> +
>> +static const TypeInfo spapr_phb_vfio_info = {
>> +    .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
>> +    .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
>> +    .instance_size = sizeof(sPAPRPHBVFIOState),
>> +    .class_init    = spapr_phb_vfio_class_init,
>> +    .class_size    = sizeof(sPAPRPHBClass),
>> +};
>> +
>> +static void spapr_pci_vfio_register_types(void)
>> +{
>> +    type_register_static(&spapr_phb_vfio_info);
>> +}
>> +
>> +type_init(spapr_pci_vfio_register_types)
>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>> index 0f428a1..18acb67 100644
>> --- a/include/hw/pci-host/spapr.h
>> +++ b/include/hw/pci-host/spapr.h
>> @@ -30,10 +30,14 @@
>>  #define SPAPR_MSIX_MAX_DEVS 32
>>  
>>  #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
>> +#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
>>  
>>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>>  
>> +#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
>> +    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
>> +
>>  #define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
>>       OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass), TYPE_SPAPR_PCI_HOST_BRIDGE)
>>  #define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
>> @@ -41,6 +45,7 @@
>>  
>>  typedef struct sPAPRPHBClass sPAPRPHBClass;
>>  typedef struct sPAPRPHBState sPAPRPHBState;
>> +typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
>>  
>>  struct sPAPRPHBClass {
>>      PCIHostBridgeClass parent_class;
>> @@ -78,6 +83,14 @@ struct sPAPRPHBState {
>>      QLIST_ENTRY(sPAPRPHBState) list;
>>  };
>>  
>> +struct sPAPRPHBVFIOState {
>> +    sPAPRPHBState phb;
>> +
>> +    struct VFIOContainer *container;
>> +    int32_t iommugroupid;
>> +    uint8_t scan, enable_multifunction, force_addr;
>> +};
>> +
>>  #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
>>  
>>  #define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
> 
> 
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH v5 05/11] vfio: Introduce VFIO address spaces
  2014-03-28  3:42     ` Alexey Kardashevskiy
@ 2014-03-31 19:14       ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2014-03-31 19:14 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On Fri, 2014-03-28 at 14:42 +1100, Alexey Kardashevskiy wrote:
> On 03/20/2014 06:57 AM, Alex Williamson wrote:
> > On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> >> From: David Gibson <david@gibson.dropbear.id.au>
> >>
> >> The only model so far supported for VFIO passthrough devices is the model
> >> usually used on x86, where all of the guest's RAM is mapped into the
> >> (host) IOMMU and there is no IOMMU visible in the guest.
> >>
> >> This patch begins to relax this model, introducing the notion of a
> >> VFIOAddressSpace.  This represents a logical DMA address space which will
> >> be visible to one or more VFIO devices by appropriate mapping in the (host)
> >> IOMMU.  Thus the currently global list of containers becomes local to
> >> a VFIOAddressSpace, and we verify that we don't attempt to add a VFIO
> >> group to multiple address spaces.
> >>
> >> For now, only one VFIOAddressSpace is created and used, corresponding to
> >> main system memory, that will change in future patches.
> >>
> >> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>
> >> ---
> >> Changes:
> >> v5:
> >> * vfio_get_group() now receives AddressSpace*
> >>
> >> v4:
> >> * removed redundant checks and asserts
> >> * fixed some return error codes
> >> ---
> >>  hw/misc/vfio.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
> >>  1 file changed, 40 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> >> index 6a04c2a..c8236c3 100644
> >> --- a/hw/misc/vfio.c
> >> +++ b/hw/misc/vfio.c
> >> @@ -133,6 +133,22 @@ enum {
> >>      VFIO_INT_MSIX = 3,
> >>  };
> >>  
> >> +typedef struct VFIOAddressSpace {
> >> +    AddressSpace *as;
> >> +    QLIST_HEAD(, VFIOContainer) containers;
> >> +    QLIST_ENTRY(VFIOAddressSpace) list;
> >> +} VFIOAddressSpace;
> >> +
> >> +QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
> > 
> > Why isn't this static and initialized with QLIST_HEAD_INITIALIZER like
> > the qlist it replaces?
> 
> 
> I'll use QLIST_HEAD_INITIALIZER, ok.
> 
> 
> It is not static because otherwise it does not compile - the actual use of
> vfio_address_spaces happens in the next patches. Should I make it static
> and move to the next patch? Thanks.

So you're working around an unused variable compile failure by making it
non-static?  Why not just defer declaring it until the next patch?
Thanks,

Alex
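
A minimal sketch of the pattern being asked for above, assuming the BSD
<sys/queue.h> list macros (from which QEMU's QLIST_* macros are derived)
stand in for QEMU's own header.  AddressSpaceNode and the two helpers are
hypothetical names for illustration, not part of the patch.  The head is
file-local and initialized at its definition, and it is introduced
together with its first users, so a build that treats unused static
variables as errors should have nothing to complain about:

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for VFIOAddressSpace; only the list linkage matters. */
typedef struct AddressSpaceNode {
    int id;
    LIST_ENTRY(AddressSpaceNode) list;
} AddressSpaceNode;

/* File-local and initialized at definition time, so no runtime init call. */
static LIST_HEAD(AddressSpaceList, AddressSpaceNode) address_spaces =
    LIST_HEAD_INITIALIZER(address_spaces);

/* First user of the list: without one, the static head would be unused. */
static void address_space_register(AddressSpaceNode *node)
{
    LIST_INSERT_HEAD(&address_spaces, node, list);
}

/* Walk the list and count its entries. */
static size_t address_space_count(void)
{
    AddressSpaceNode *node;
    size_t n = 0;

    LIST_FOREACH(node, &address_spaces, list) {
        n++;
    }
    return n;
}
```

In QEMU itself the same shape would use QLIST_HEAD and
QLIST_HEAD_INITIALIZER, as the removed container_list did.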

> >> +
> >> +static VFIOAddressSpace vfio_address_space_memory;
> >> +
> >> +static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as)
> >> +{
> >> +    space->as = as;
> >> +    QLIST_INIT(&space->containers);
> >> +}
> >> +
> >>  struct VFIOGroup;
> >>  
> >>  typedef struct VFIOType1 {
> >> @@ -142,6 +158,7 @@ typedef struct VFIOType1 {
> >>  } VFIOType1;
> >>  
> >>  typedef struct VFIOContainer {
> >> +    VFIOAddressSpace *space;
> >>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> >>      struct {
> >>          /* enable abstraction to support various iommu backends */
> >> @@ -234,9 +251,6 @@ static const VFIORomBlacklistEntry romblacklist[] = {
> >>  
> >>  #define MSIX_CAP_LENGTH 12
> >>  
> >> -static QLIST_HEAD(, VFIOContainer)
> >> -    container_list = QLIST_HEAD_INITIALIZER(container_list);
> >> -
> >>  static QLIST_HEAD(, VFIOGroup)
> >>      group_list = QLIST_HEAD_INITIALIZER(group_list);
> >>  
> >> @@ -3280,16 +3294,15 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
> >>  #endif
> >>  }
> >>  
> >> -static int vfio_connect_container(VFIOGroup *group)
> >> +static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>  {
> >>      VFIOContainer *container;
> >>      int ret, fd;
> >> +    VFIOAddressSpace *space;
> >>  
> >> -    if (group->container) {
> >> -        return 0;
> >> -    }
> >> +    space = &vfio_address_space_memory;
> >>  
> >> -    QLIST_FOREACH(container, &container_list, next) {
> >> +    QLIST_FOREACH(container, &space->containers, next) {
> >>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> >>              group->container = container;
> >>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> >> @@ -3312,6 +3325,7 @@ static int vfio_connect_container(VFIOGroup *group)
> >>      }
> >>  
> >>      container = g_malloc0(sizeof(*container));
> >> +    container->space = space;
> >>      container->fd = fd;
> >>  
> >>      if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
> >> @@ -3349,7 +3363,7 @@ static int vfio_connect_container(VFIOGroup *group)
> >>      }
> >>  
> >>      QLIST_INIT(&container->group_list);
> >> -    QLIST_INSERT_HEAD(&container_list, container, next);
> >> +    QLIST_INSERT_HEAD(&space->containers, container, next);
> >>  
> >>      group->container = container;
> >>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> >> @@ -3392,7 +3406,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
> >>      }
> >>  }
> >>  
> >> -static VFIOGroup *vfio_get_group(int groupid)
> >> +static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as)
> >>  {
> >>      VFIOGroup *group;
> >>      char path[32];
> >> @@ -3400,7 +3414,14 @@ static VFIOGroup *vfio_get_group(int groupid)
> >>  
> >>      QLIST_FOREACH(group, &group_list, next) {
> >>          if (group->groupid == groupid) {
> >> -            return group;
> >> +            /* Found it.  Now is it already in the right context? */
> >> +            if (group->container->space->as == as) {
> >> +                return group;
> >> +            } else {
> >> +                error_report("vfio: group %d used in multiple address spaces",
> >> +                             group->groupid);
> >> +                return NULL;
> >> +            }
> >>          }
> >>      }
> >>  
> >> @@ -3428,7 +3449,7 @@ static VFIOGroup *vfio_get_group(int groupid)
> >>      group->groupid = groupid;
> >>      QLIST_INIT(&group->device_list);
> >>  
> >> -    if (vfio_connect_container(group)) {
> >> +    if (vfio_connect_container(group, as)) {
> >>          error_report("vfio: failed to setup container for group %d", groupid);
> >>          goto close_fd_exit;
> >>      }
> >> @@ -3780,7 +3801,12 @@ static int vfio_initfn(PCIDevice *pdev)
> >>      DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
> >>              vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
> >>  
> >> -    group = vfio_get_group(groupid);
> >> +    if (pci_device_iommu_address_space(pdev) != &address_space_memory) {
> >> +        error_report("vfio: DMA address space must be system memory");
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    group = vfio_get_group(groupid, &address_space_memory);
> >>      if (!group) {
> >>          error_report("vfio: failed to get group %d", groupid);
> >>          return -ENOENT;
> >> @@ -3994,6 +4020,7 @@ static const TypeInfo vfio_pci_dev_info = {
> >>  
> >>  static void register_vfio_pci_dev_type(void)
> >>  {
> >> +    vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
> >>      type_register_static(&vfio_pci_dev_info);
> >>  }
> >>  
> > 
> > 
> > 
> 
> 


* Re: [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
  2014-03-28  4:49         ` Alexey Kardashevskiy
@ 2014-03-31 19:54           ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2014-03-31 19:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alexander Graf, Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson

On Fri, 2014-03-28 at 15:49 +1100, Alexey Kardashevskiy wrote:
> On 03/22/2014 01:17 AM, Alex Williamson wrote:
> > On Fri, 2014-03-21 at 18:59 +1100, Alexey Kardashevskiy wrote:
> >> On 03/20/2014 06:57 AM, Alex Williamson wrote:
> >>> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> >>>> From: David Gibson <david@gibson.dropbear.id.au>
> >>>>
> >>>> This patch uses the new IOMMU notifiers to allow VFIO pass through devices
> >>>> to work with guest side IOMMUs, as long as the host-side VFIO iommu has
> >>>> sufficient capability and granularity to match the guest side. This works
> >>>> by tracking all map and unmap operations on the guest IOMMU using the
> >>>> notifiers, and mirroring them into VFIO.
> >>>>
> >>>> There are a number of FIXMEs, and the scheme involves rather more notifier
> >>>> structures than I'd like, but it should make for a reasonable proof of
> >>>> concept.
> >>>>
> >>>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>
> >>>> ---
> >>>> Changes:
> >>>> v4:
> >>>> * fixed list objects naming
> >>>> * vfio_listener_region_add() reworked to call memory_region_ref() from one
> >>>> place only, it is also easier to review the changes
> >>>> * fixes boundary check not to fail on sections == 2^64 bytes,
> >>>> the "vfio: Fix debug output for int128 values" patch is required;
> >>>> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
> >>>> for 128 bit memory section sizes" patch proposal
> >>>> ---
> >>>>  hw/misc/vfio.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>>>  1 file changed, 120 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> >>>> index 038010b..4f6f5da 100644
> >>>> --- a/hw/misc/vfio.c
> >>>> +++ b/hw/misc/vfio.c
> >>>> @@ -159,10 +159,18 @@ typedef struct VFIOContainer {
> >>>>          };
> >>>>          void (*release)(struct VFIOContainer *);
> >>>>      } iommu_data;
> >>>> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> >>>>      QLIST_HEAD(, VFIOGroup) group_list;
> >>>>      QLIST_ENTRY(VFIOContainer) next;
> >>>>  } VFIOContainer;
> >>>>  
> >>>> +typedef struct VFIOGuestIOMMU {
> >>>> +    VFIOContainer *container;
> >>>> +    MemoryRegion *iommu;
> >>>> +    Notifier n;
> >>>> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> >>>> +} VFIOGuestIOMMU;
> >>>> +
> >>>>  /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
> >>>>  typedef struct VFIOMSIXInfo {
> >>>>      uint8_t table_bar;
> >>>> @@ -2241,8 +2249,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >>>>  
> >>>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>>>  {
> >>>> -    return !memory_region_is_ram(section->mr) ||
> >>>> -           /*
> >>>> +    return (!memory_region_is_ram(section->mr) &&
> >>>> +            !memory_region_is_iommu(section->mr)) ||
> >>>> +        /*
> >>>
> >>> White space damage
> >>>
> >>>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
> >>>>              * addresses in the upper part of the 64-bit address space.  These
> >>>>              * are never accessed by the CPU and beyond the address width of
> >>>> @@ -2251,6 +2260,61 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>>>             section->offset_within_address_space & (1ULL << 63);
> >>>>  }
> >>>>  
> >>>> +static void vfio_iommu_map_notify(Notifier *n, void *data)
> >>>> +{
> >>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> >>>> +    VFIOContainer *container = giommu->container;
> >>>> +    IOMMUTLBEntry *iotlb = data;
> >>>> +    MemoryRegion *mr;
> >>>> +    hwaddr xlat;
> >>>> +    hwaddr len = iotlb->addr_mask + 1;
> >>>> +    void *vaddr;
> >>>> +    int ret;
> >>>> +
> >>>> +    DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> >>>> +            iotlb->iova, iotlb->iova + iotlb->addr_mask);
> >>>> +
> >>>> +    /*
> >>>> +     * The IOMMU TLB entry we have just covers translation through
> >>>> +     * this IOMMU to its immediate target.  We need to translate
> >>>> +     * it the rest of the way through to memory.
> >>>> +     */
> >>>> +    mr = address_space_translate(&address_space_memory,
> >>>> +                                 iotlb->translated_addr,
> >>>> +                                 &xlat, &len, iotlb->perm & IOMMU_WO);
> >>>
> >>> Write-only?  Is this supposed to be read-write to mask just 2 bits?
> >>
> >>
> >> The last parameter of address_space_translate() bool is_write. So I do not
> >> really understand the problem here.
> > 
> > Oops, my bad, I didn't look at what address_space_translate() used that
> > for.  Ok.
> > 
> >>>> +    if (!memory_region_is_ram(mr)) {
> >>>> +        DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
> >>>> +                xlat);
> >>>> +        return;
> >>>> +    }
> >>>> +    if (len & iotlb->addr_mask) {
> >>>> +        DPRINTF("iommu has granularity incompatible with target AS\n");
> >>>
> >>> Is this possible?  Assuming len is initially a power-of-2, would the
> >>> translate function change it?  Maybe worth a comment to explain.
> >>
> >>
> >> Oh. address_space_translate() actually changes @len to min(len,
> >> TARGET_PAGE_SIZE) and TARGET_PAGE_SIZE is hardcoded to 4K. So far it was ok
> >> but lately I have been implementing a huge DMA window (plus one
> >> sPAPRTCETable and one VFIOGuestIOMMU objects) which currently operates with
> >> 16MB pages (can do 64K pages too) and now this "granularity incompatible"
> >> is happening.
> >>
> >> I disabled that check but I need to think of better fix...
> >>
> >> Adding Paolo to cc, may be he picks the context and gives good piece of
> >> advise :)
> >>
> >>
> >>
> >>>
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >>>
> >>> This lookup isn't free and the unmap path doesn't need it, maybe move
> >>> the variable and lookup into the first branch below?
> >>>
> >>>> +
> >>>> +    if (iotlb->perm != IOMMU_NONE) {
> >>>> +        ret = vfio_dma_map(container, iotlb->iova,
> >>>> +                           iotlb->addr_mask + 1, vaddr,
> >>>> +                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
> >>>> +        if (ret) {
> >>>> +            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >>>> +                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
> >>>> +                         container, iotlb->iova,
> >>>> +                         iotlb->addr_mask + 1, vaddr, ret);
> >>>> +        }
> >>>> +    } else {
> >>>> +        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> >>>> +        if (ret) {
> >>>> +            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >>>> +                         "0x%"HWADDR_PRIx") = %d (%m)",
> >>>> +                         container, iotlb->iova,
> >>>> +                         iotlb->addr_mask + 1, ret);
> >>>> +        }
> >>>> +    }
> >>>> +}
> >>>> +
> >>>>  static void vfio_listener_region_add(MemoryListener *listener,
> >>>>                                       MemoryRegionSection *section)
> >>>>  {
> >>>> @@ -2261,8 +2325,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>>>      void *vaddr;
> >>>>      int ret;
> >>>>  
> >>>> -    assert(!memory_region_is_iommu(section->mr));
> >>>> -
> >>>>      if (vfio_listener_skipped_section(section)) {
> >>>>          DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
> >>>>                  section->offset_within_address_space,
> >>>> @@ -2286,15 +2348,47 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>>>          return;
> >>>>      }
> >>>>  
> >>>> +    memory_region_ref(section->mr);
> >>>> +
> >>>> +    if (memory_region_is_iommu(section->mr)) {
> >>>> +        VFIOGuestIOMMU *giommu;
> >>>> +
> >>>> +        DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> >>>> +                iova, int128_get64(int128_sub(llend, int128_one())));
> >>>> +        /*
> >>>> +         * FIXME: We should do some checking to see if the
> >>>> +         * capabilities of the host VFIO IOMMU are adequate to model
> >>>> +         * the guest IOMMU
> >>>> +         *
> >>>> +         * FIXME: This assumes that the guest IOMMU is empty of
> >>>> +         * mappings at this point - we should either enforce this, or
> >>>> +         * loop through existing mappings to map them into VFIO.
> >>>> +         *
> >>>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
> >>>> +         * avoid bouncing all map/unmaps through qemu this way, this
> >>>> +         * would be the right place to wire that up (tell the KVM
> >>>> +         * device emulation the VFIO iommu handles to use).
> >>>> +         */
> >>>
> >>> That's a lot of FIXMEs...  The second one in particular looks like it
> >>> needs to expand a bit on why this is likely a valid assumption.  The
> >>> last one is more of a TODO than a FIXME.
> >>>
> >>>> +        giommu = g_malloc0(sizeof(*giommu));
> >>>> +        giommu->iommu = section->mr;
> >>>> +        giommu->container = container;
> >>>> +        giommu->n.notify = vfio_iommu_map_notify;
> >>>> +        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> >>>> +        memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> >>>> +
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    /* Here we assume that memory_region_is_ram(section->mr)==true */
> >>>> +
> >>>>      end = int128_get64(llend);
> >>>>      vaddr = memory_region_get_ram_ptr(section->mr) +
> >>>>              section->offset_within_region +
> >>>>              (iova - section->offset_within_address_space);
> >>>>  
> >>>> -    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
> >>>> +    DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
> >>>>              iova, end - 1, vaddr);
> >>>>  
> >>>> -    memory_region_ref(section->mr);
> >>>>      ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
> >>>>      if (ret) {
> >>>>          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >>>> @@ -2338,6 +2432,26 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>>>          return;
> >>>>      }
> >>>>  
> >>>> +    if (memory_region_is_iommu(section->mr)) {
> >>>> +        VFIOGuestIOMMU *giommu;
> >>>> +
> >>>> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> >>>> +            if (giommu->iommu == section->mr) {
> >>>> +                memory_region_unregister_iommu_notifier(&giommu->n);
> >>>> +                QLIST_REMOVE(giommu, giommu_next);
> >>>> +                g_free(giommu);
> >>>> +                break;
> >>>> +            }
> >>>> +        }
> >>>> +
> >>>> +        /*
> >>>> +         * FIXME: We assume the one big unmap below is adequate to
> >>>> +         * remove any individual page mappings in the IOMMU which
> >>>> +         * might have been copied into VFIO.  That may not be true for
> >>>> +         * all IOMMU types
> >>>> +         */
> >>>
> >>> We assume this because the IOVA that gets unmapped is the same
> >>> regardless of whether a guest IOMMU is present?
> >>
> >>
> >> What exactly is meant by "guest IOMMU is present"? Doing the second DMA
> >> window, now I am really confused about terminology :(
> > 
> > The confusion for me is that add_region initializes the giommu and all
> > the DMA mapping through VFIO is done in the notifier for the giommu.
> > It's therefore asymmetric that add_region doesn't vfio_dma_map anything,
> > but region_del does vfio_dma_unmap, which is the basis of my question.
> > I thought this comment was trying to address why that is, but apparently
> > it's something else entirely, so it would be nice to understand why this
> > doesn't return() and decode a bit more clearly what the FIXME is trying
> > to say.  Thanks,
> 
> I do not mind extending that comment but need some assistance.
> 
> My understanding is that:
> ===
> region_del is called on memory region removal so this particular window is
> not going to be used anymore and this is the only place where such cleanup
> could be done.
> ===
> 
> Asymmetry is here but it does not look terrible.

The asymmetry may well be fine if properly documented, and region_del is
a fine place to tear down and unmap a guest IOMMU that was instantiated
via region_add.  The problem is validating and clarifying the assumption
that the FIXME is making for future support.  The life cycle as I
understand it is:

region_add
 - create giommu and register notifier

notifier
 - adds and removes giommu mappings

region_del
 - unregister notifier, cleanup, and free giommu

As part of that cleanup we need to make sure there are no outstanding
mappings.  We do that by unmapping the entire MemoryRegionSection.  So
what exactly are the assumptions to be able to do this?  One seems to be
that the IOMMU will unmap any sub-mappings fully overlapped by a larger
range, which is what David identified.  Another seems to be that there's
no address translation of the IOVA and thus the range is the same
regardless of whether a guest IOMMU is present, which is what I tried to
identify but David nak'd.

Personally I think the FIXME here and the one where we create the giommu
and assume it has no mappings show some immaturity in the QEMU IOMMU
API.  I'd expect it to behave more like a MemoryListener, replaying all
mappings on register and removing all on unregister.  Thanks,

Alex
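
The MemoryListener-like behaviour suggested above (replay all mappings on
register, remove all on unregister) can be sketched with a toy model.
This is only an illustration under stated assumptions: every Toy* name
below is hypothetical, the IOMMU is a fixed array of pages with at most
one listener, and none of it is the QEMU IOMMU notifier API.  Page
indexes are assumed to be below TOY_IOMMU_PAGES.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_IOMMU_PAGES 8

/* Hypothetical event handed to a listener: one page mapped or unmapped. */
typedef struct ToyIOMMUEvent {
    size_t page;
    uint64_t target;    /* translated address; not meaningful on unmap */
    bool map;
} ToyIOMMUEvent;

typedef struct ToyIOMMUNotifier {
    void (*notify)(struct ToyIOMMUNotifier *n, const ToyIOMMUEvent *ev);
} ToyIOMMUNotifier;

typedef struct ToyIOMMU {
    bool valid[TOY_IOMMU_PAGES];
    uint64_t target[TOY_IOMMU_PAGES];
    ToyIOMMUNotifier *notifier;    /* at most one listener, for brevity */
} ToyIOMMU;

static void toy_iommu_emit(ToyIOMMU *iommu, size_t page, bool map)
{
    ToyIOMMUEvent ev = {
        .page = page, .target = iommu->target[page], .map = map,
    };

    if (iommu->notifier) {
        iommu->notifier->notify(iommu->notifier, &ev);
    }
}

/* Guest-side unmap: update the table, then tell the listener. */
static void toy_iommu_unmap(ToyIOMMU *iommu, size_t page)
{
    if (iommu->valid[page]) {
        iommu->valid[page] = false;
        toy_iommu_emit(iommu, page, false);
    }
}

/* Guest-side map: drop any old entry first so events stay balanced. */
static void toy_iommu_map(ToyIOMMU *iommu, size_t page, uint64_t target)
{
    toy_iommu_unmap(iommu, page);
    iommu->valid[page] = true;
    iommu->target[page] = target;
    toy_iommu_emit(iommu, page, true);
}

/* Register: replay every mapping that already exists. */
static void toy_iommu_register(ToyIOMMU *iommu, ToyIOMMUNotifier *n)
{
    iommu->notifier = n;
    for (size_t page = 0; page < TOY_IOMMU_PAGES; page++) {
        if (iommu->valid[page]) {
            toy_iommu_emit(iommu, page, true);
        }
    }
}

/* Unregister: report every live mapping as removed, then detach. */
static void toy_iommu_unregister(ToyIOMMU *iommu)
{
    for (size_t page = 0; page < TOY_IOMMU_PAGES; page++) {
        if (iommu->valid[page]) {
            toy_iommu_emit(iommu, page, false);
        }
    }
    iommu->notifier = NULL;
}

/* Example listener: counts how many pages it currently mirrors. */
typedef struct ToyMirror {
    ToyIOMMUNotifier n;    /* kept first so the cast below is valid */
    int mapped;
} ToyMirror;

static void toy_mirror_notify(ToyIOMMUNotifier *n, const ToyIOMMUEvent *ev)
{
    ToyMirror *m = (ToyMirror *)n;

    m->mapped += ev->map ? 1 : -1;
}
```

With this shape the listener never has to assume the table is empty when
it attaches, and it needs no "one big unmap" when it detaches, which are
the two assumptions the FIXMEs in the patch describe.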


* Re: [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
  2014-03-28  5:12       ` Alexey Kardashevskiy
@ 2014-03-31 19:59         ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2014-03-31 19:59 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson

On Fri, 2014-03-28 at 16:12 +1100, Alexey Kardashevskiy wrote:
> On 03/20/2014 04:25 PM, David Gibson wrote:
> > On Wed, Mar 19, 2014 at 01:57:41PM -0600, Alex Williamson wrote:
> >> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> >>> From: David Gibson <david@gibson.dropbear.id.au>
> > [snip]
> >>> +    if (!memory_region_is_ram(mr)) {
> >>> +        DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
> >>> +                xlat);
> >>> +        return;
> >>> +    }
> >>> +    if (len & iotlb->addr_mask) {
> >>> +        DPRINTF("iommu has granularity incompatible with target AS\n");
> >>
> >> Is this possible?  Assuming len is initially a power-of-2, would the
> >> translate function change it?  Maybe worth a comment to explain.
> > 
> > translate can absolutely change the length.  It will generally
> > truncate it to the IOMMU page size, in fact.
> > 
> > [snip]
> >>> +        DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> >>> +                iova, int128_get64(int128_sub(llend, int128_one())));
> >>> +        /*
> >>> +         * FIXME: We should do some checking to see if the
> >>> +         * capabilities of the host VFIO IOMMU are adequate to model
> >>> +         * the guest IOMMU
> >>> +         *
> >>> +         * FIXME: This assumes that the guest IOMMU is empty of
> >>> +         * mappings at this point - we should either enforce this, or
> >>> +         * loop through existing mappings to map them into VFIO.
> >>> +         *
> >>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
> >>> +         * avoid bouncing all map/unmaps through qemu this way, this
> >>> +         * would be the right place to wire that up (tell the KVM
> >>> +         * device emulation the VFIO iommu handles to use).
> >>> +         */
> >>
> >> That's a lot of FIXMEs...  The second one in particular looks like it
> >> needs to expand a bit on why this is likely a valid assumption.  The
> >> last one is more of a TODO than a FIXME.
> > 
> > I think #2 isn't a valid assumption in general.  It was true for the
> > situation I was testing at the time, due to the order of pseries
> > initialization, so I left it to get a proof of concept reasonably
> > quickly.
> > 
> > But I think that one's a FIXME that actually needs to be fixed.
> 
> 
> But how?
> 
> At the moment, for SPAPR, the table gets cleaned when group is attached to
> a container (VFIO_GROUP_SET_CONTAINER ioctl) which happens right before
> registering the memory listener so we are fine (at least for SPAPR).
> Is that true for x86 or we need something more?
> 
> "loop through existing mappings to map them into VFIO" - this I do not
> really understand. We do not track mapping in QEMU so we cannot really loop
> through them.

Making registration of a guest IOMMU more like a MemoryListener would
solve this: the infrastructure should replay existing mappings on
startup and clear them on shutdown.  Then we could also get rid of the
FIXME on the region_del path.

> > [snip]
> >>> +        /*
> >>> +         * FIXME: We assume the one big unmap below is adequate to
> >>> +         * remove any individual page mappings in the IOMMU which
> >>> +         * might have been copied into VFIO.  That may not be true for
> >>> +         * all IOMMU types
> >>> +         */
> >>
> >> We assume this because the IOVA that gets unmapped is the same
> >> regardless of whether a guest IOMMU is present?
> > 
> > Uh.. no.  This assumption works for a page table based IOMMU where a
> > big unmap just flattens a large range of IO-PTEs. 
> 
> 
> Sorry for my poor english but how exactly is that different from what Alex
> said? IOVA is a PCI bus address between dma_window_start and
> dma_window_start+dma_window_size.

I think David is trying to describe that the IOMMU must be able to unmap
a sparse sub-region of a larger unmap, while I'm trying to make sure
there's no IOVA translation that might interfere with using the
"standard" unmap path rather than the guest IOMMU unmap path.  Thanks,

Alex
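
To make the first of those two concerns concrete, here is a rough sketch
contrasting the two kinds of IOMMU David describes in the quoted text.
Both models and all names are hypothetical and heavily simplified; they
do not correspond to any real VFIO backend.  In the page-table model one
large unmap clears whatever sparse pages happen to be mapped inside the
range.  In the extent model an unmap must match an earlier map exactly,
so the same large unmap fails and leaves the mappings in place.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES   16
#define NEXTENTS 4

/* Page-table model: one valid bit per IO page. */
typedef struct PageTableIOMMU {
    bool valid[NPAGES];
} PageTableIOMMU;

static void pt_map(PageTableIOMMU *pt, size_t first, size_t count)
{
    for (size_t i = first; i < first + count && i < NPAGES; i++) {
        pt->valid[i] = true;
    }
}

/* A large unmap simply clears every entry in range, mapped or not. */
static int pt_unmap(PageTableIOMMU *pt, size_t first, size_t count)
{
    for (size_t i = first; i < first + count && i < NPAGES; i++) {
        pt->valid[i] = false;
    }
    return 0;
}

static size_t pt_mapped(const PageTableIOMMU *pt)
{
    size_t n = 0;

    for (size_t i = 0; i < NPAGES; i++) {
        n += pt->valid[i] ? 1 : 0;
    }
    return n;
}

/* Extent model: each map is recorded as a (first, count) pair. */
typedef struct Extent {
    size_t first, count;
    bool used;
} Extent;

typedef struct ExtentIOMMU {
    Extent e[NEXTENTS];
} ExtentIOMMU;

/* Record a new extent; returns -1 when the table is full. */
static int ext_map(ExtentIOMMU *x, size_t first, size_t count)
{
    for (size_t i = 0; i < NEXTENTS; i++) {
        if (!x->e[i].used) {
            x->e[i] = (Extent){ .first = first, .count = count, .used = true };
            return 0;
        }
    }
    return -1;
}

/* Unmap succeeds only if it exactly matches an earlier map. */
static int ext_unmap(ExtentIOMMU *x, size_t first, size_t count)
{
    for (size_t i = 0; i < NEXTENTS; i++) {
        if (x->e[i].used && x->e[i].first == first &&
            x->e[i].count == count) {
            x->e[i].used = false;
            return 0;
        }
    }
    return -1;
}

/* Total number of pages covered by live extents. */
static size_t ext_mapped(const ExtentIOMMU *x)
{
    size_t n = 0;

    for (size_t i = 0; i < NEXTENTS; i++) {
        n += x->e[i].used ? x->e[i].count : 0;
    }
    return n;
}
```

The region_del path in the patch relies on the first behaviour: a single
unmap of the whole MemoryRegionSection is expected to clean up every
page-sized mapping the notifier installed.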

> > It might not work
> > for some kind of extent or TLB based IOMMU, where operations are
> > expected to exactly match the addresses of map operations.
> > 
> > I don't know if IOMMUs that have trouble with this are a realistic
> > prospect, but they're at least a theoretical possibility, hence the
> > comment.
> > 
> >>
> >>> +    }
> >>> +
> >>>      iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> >>>      end = (section->offset_within_address_space + int128_get64(section->size)) &
> >>>            TARGET_PAGE_MASK;
> >>
> >>
> >>
> > 
> 
> 


* Re: [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio
  2014-03-28  6:01     ` Alexey Kardashevskiy
@ 2014-03-31 20:09       ` Alex Williamson
  2014-04-01  6:25         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2014-03-31 20:09 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On Fri, 2014-03-28 at 17:01 +1100, Alexey Kardashevskiy wrote:
> On 03/20/2014 06:57 AM, Alex Williamson wrote:
> > On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> >> The patch adds a spapr-pci-vfio-host-bridge device type
> >> which is a PCI Host Bridge with VFIO support. The new device
> >> inherits from the spapr-pci-host-bridge device and adds
> >> the following properties:
> >> 	iommu - IOMMU group ID which represents a Partitionable
> >> 	 	Endpoint, QEMU/ppc64 uses a separate PHB for
> >> 		an IOMMU group so the guest kernel has to have
> >> 		PCI Domain support enabled.
> >> 	forceaddr (optional, 0 by default) - forces QEMU to copy
> >> 		device:function from the host address as
> >> 		certain guest drivers expect devices to appear in
> >> 		particular locations;
> >> 	mf (optional, 0 by default) - forces multifunction bit for
> >> 		the function #0 of a found device, only makes sense
> >> 		for multifunction devices and only with the forceaddr
> >> 		property set. It would not be required if there
> >> 		was a way to know in advance whether a device is
> >> 		multifunctional or not.
> >> 	scan (optional, 1 by default) - if non-zero, the new PHB walks
> >> 		through all non-bridge devices in the group and tries
> >> 		adding them to the PHB; if zero, all devices in the group
> >> 		have to be configured manually via the QEMU command line.
> >>
> >> Examples of use:
> >> 1) Scan and add all devices from IOMMU group with ID=1 to QEMU's PHB #6:
> >> 	-device spapr-pci-vfio-host-bridge,id=DEVICENAME,iommu=1,index=6
> >>
> >> 2) Configure and Add 3 functions of a multifunctional device to QEMU:
> >> (the NEC PCI USB card is used as an example here):
> >> 	-device spapr-pci-vfio-host-bridge,id=USB,iommu=4,scan=0,index=7 \
> >> 	-device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true
> >> 	-device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB
> >> 	-device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB
> > 
> > I won't pretend to predict the reaction of QEMU device architects to
> > this, but it seems like the assembly we expect from config files or
> > outside utilities, ex. libvirt.  I don't doubt this makes qemu
> > commandline usage more palatable, but it seems contrary to some of the
> > other things we do in QEMU with devices.  If this is something we need,
> > why is it specific to spapr?  IOMMU group can contain multiple devices
> > on any platform.  On x86 we could do something similar with a p2p
> > bridge, switch, or root port.
> 
> 
> At least at the moment devices are assigned to groups statically on SPAPR,
> they cannot be moved between groups at all, so it seems like a reasonable
> assumption that the user will not mind putting them all to the same QEMU
> instance.
> 
> I should probably disable "scan" by default though, that would make more
> sense for libvirt.

That doesn't really address the concern.  An x86 chipset puts specific
devices at a specific address, yet this is not something that we hard
code into QEMU.  We have config files that define this and tools like
libvirt know where to put things.  Why is SPAPR special enough to have
QEMU auto-add an entire sub-hierarchy?  If this is a necessary feature,
why not make it generic and give x86 the capability as well?

> > BTW, the code skips bridges, but doesn't that mean you'll have a hard
> > time with forceaddr as you potentially try to overlay devfn from
> > multiple buses onto a single bus?
> > It also makes the value of this a bit
> > more questionable since it seems to fall apart so easily.  Thanks,
> 
> These "forceaddr" and "mf" are rather debug options - at the very beginning
> I used to have strong impression that USB NEC PCI device (2xOHCI and
> 1xEHCI) does not work properly if it is not multifunctional but I was
> wrong, just checked. So I'll just remove them as they do not help even me
> anymore and that's it. Thanks!

I'm still questioning the value of this code, it just seems to be a
convenience feature for someone running QEMU from the commandline, which
has never really been a concern for the rest of QEMU.  Suggest taking
this patch out of the series since it's not critical path.  Thanks,

Alex

> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v5:
> >> * added handling of possible failure of spapr_vfio_new_table()
> >>
> >> v4:
> >> * moved IOMMU changes to separate patches
> >> * moved spapr-pci-vfio-host-bridge to new file
> >> ---
> >>  hw/ppc/Makefile.objs        |   2 +-
> >>  hw/ppc/spapr_pci_vfio.c     | 206 ++++++++++++++++++++++++++++++++++++++++++++
> >>  include/hw/pci-host/spapr.h |  13 +++
> >>  3 files changed, 220 insertions(+), 1 deletion(-)
> >>  create mode 100644 hw/ppc/spapr_pci_vfio.c
> >>
> >> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >> index ea747f0..2239192 100644
> >> --- a/hw/ppc/Makefile.objs
> >> +++ b/hw/ppc/Makefile.objs
> >> @@ -3,7 +3,7 @@ obj-y += ppc.o ppc_booke.o
> >>  # IBM pSeries (sPAPR)
> >>  obj-$(CONFIG_PSERIES) += spapr.o spapr_vio.o spapr_events.o
> >>  obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
> >> -obj-$(CONFIG_PSERIES) += spapr_pci.o
> >> +obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_pci_vfio.o
> >>  # PowerPC 4xx boards
> >>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>  obj-y += ppc4xx_pci.o
> >> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> >> new file mode 100644
> >> index 0000000..40f6673
> >> --- /dev/null
> >> +++ b/hw/ppc/spapr_pci_vfio.c
> >> @@ -0,0 +1,206 @@
> >> +/*
> >> + * QEMU sPAPR PCI host for VFIO
> >> + *
> >> + * Copyright (c) 2011 Alexey Kardashevskiy, IBM Corporation.
> >> + *
> >> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> >> + * of this software and associated documentation files (the "Software"), to deal
> >> + * in the Software without restriction, including without limitation the rights
> >> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> >> + * copies of the Software, and to permit persons to whom the Software is
> >> + * furnished to do so, subject to the following conditions:
> >> + *
> >> + * The above copyright notice and this permission notice shall be included in
> >> + * all copies or substantial portions of the Software.
> >> + *
> >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> >> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> >> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> >> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> >> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> >> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> >> + * THE SOFTWARE.
> >> + */
> >> +#include <sys/types.h>
> >> +#include <dirent.h>
> >> +
> >> +#include "hw/hw.h"
> >> +#include "hw/ppc/spapr.h"
> >> +#include "hw/pci-host/spapr.h"
> >> +#include "hw/misc/vfio.h"
> >> +#include "hw/pci/pci_bus.h"
> >> +#include "trace.h"
> >> +#include "qemu/error-report.h"
> >> +
> >> +/* sPAPR VFIO */
> >> +static Property spapr_phb_vfio_properties[] = {
> >> +    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
> >> +    DEFINE_PROP_UINT8("scan", sPAPRPHBVFIOState, scan, 1),
> >> +    DEFINE_PROP_UINT8("mf", sPAPRPHBVFIOState, enable_multifunction, 0),
> >> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBVFIOState, force_addr, 0),
> >> +    DEFINE_PROP_END_OF_LIST(),
> >> +};
> >> +
> >> +static void spapr_pci_vfio_scan(sPAPRPHBVFIOState *svphb, Error **errp)
> >> +{
> >> +    PCIHostState *phb = PCI_HOST_BRIDGE(svphb);
> >> +    char *iommupath;
> >> +    DIR *dirp;
> >> +    struct dirent *entry;
> >> +    Error *error = NULL;
> >> +
> >> +    if (!svphb->scan) {
> >> +        trace_spapr_pci("autoscan disabled for ", svphb->phb.dtbusname);
> >> +        return;
> >> +    }
> >> +
> >> +    iommupath = g_strdup_printf("/sys/kernel/iommu_groups/%d/devices/",
> >> +                                svphb->iommugroupid);
> >> +    if (!iommupath) {
> >> +        return;
> >> +    }
> >> +
> >> +    dirp = opendir(iommupath);
> >> +    if (!dirp) {
> >> +        error_report("spapr-vfio: vfio scan failed on opendir: %m");
> >> +        g_free(iommupath);
> >> +        return;
> >> +    }
> >> +
> >> +    while ((entry = readdir(dirp)) != NULL) {
> >> +        Error *err = NULL;
> >> +        char *tmp;
> >> +        FILE *deviceclassfile;
> >> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> >> +        char addr[32];
> >> +        DeviceState *dev;
> >> +
> >> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
> >> +                   &domainid, &busid, &devid, &fnid) != 4) {
> >> +            continue;
> >> +        }
> >> +
> >> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
> >> +        trace_spapr_pci("Reading device class from ", tmp);
> >> +
> >> +        deviceclassfile = fopen(tmp, "r");
> >> +        if (deviceclassfile) {
> >> +            int ret = fscanf(deviceclassfile, "%x", &deviceclass);
> >> +            fclose(deviceclassfile);
> >> +            if (ret != 1) {
> >> +                continue;
> >> +            }
> >> +        }
> >> +        g_free(tmp);
> >> +
> >> +        if (!deviceclass) {
> >> +            continue;
> >> +        }
> >> +        if ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8)) {
> >> +            /* Skip bridges */
> >> +            continue;
> >> +        }
> >> +        trace_spapr_pci("Creating device from ", entry->d_name);
> >> +
> >> +        dev = qdev_create(&phb->bus->qbus, "vfio-pci");
> >> +        if (!dev) {
> >> +            trace_spapr_pci("failed to create vfio-pci", entry->d_name);
> >> +            continue;
> >> +        }
> >> +        object_property_parse(OBJECT(dev), entry->d_name, "host", &err);
> >> +        if (err != NULL) {
> >> +            object_unref(OBJECT(dev));
> >> +            continue;
> >> +        }
> >> +        if (svphb->force_addr) {
> >> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
> >> +            err = NULL;
> >> +            object_property_parse(OBJECT(dev), addr, "addr", &err);
> >> +            if (err != NULL) {
> >> +                object_unref(OBJECT(dev));
> >> +                continue;
> >> +            }
> >> +        }
> >> +        if (svphb->enable_multifunction) {
> >> +            qdev_prop_set_bit(dev, "multifunction", 1);
> >> +        }
> >> +        object_property_set_bool(OBJECT(dev), true, "realized", &error);
> >> +        if (error) {
> >> +            object_unref(OBJECT(dev));
> >> +            error_propagate(errp, error);
> >> +            break;
> >> +        }
> >> +    }
> >> +    closedir(dirp);
> >> +    g_free(iommupath);
> >> +}
> >> +
> >> +static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
> >> +{
> >> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> >> +    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
> >> +    int ret;
> >> +    Error *error = NULL;
> >> +
> >> +    if (svphb->iommugroupid == -1) {
> >> +        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
> >> +        return;
> >> +    }
> >> +
> >> +    svphb->phb.tcet = spapr_vfio_new_table(DEVICE(sphb), svphb->phb.dma_liobn);
> >> +
> >> +    if (!svphb->phb.tcet) {
> >> +        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
> >> +        return;
> >> +    }
> >> +
> >> +    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
> >> +                       sphb->dtbusname);
> >> +
> >> +    ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
> >> +                                        sphb->dma_liobn, svphb->iommugroupid,
> >> +                                        &info);
> >> +    if (ret) {
> >> +        error_setg_errno(errp, -ret,
> >> +                         "spapr-vfio: get info from container failed");
> >> +        return;
> >> +    }
> >> +
> >> +    svphb->phb.dma_window_start = info.dma32_window_start;
> >> +    svphb->phb.dma_window_size = info.dma32_window_size;
> >> +
> >> +    spapr_pci_vfio_scan(svphb, &error);
> >> +    if (error) {
> >> +        error_propagate(errp, error);
> >> +    }
> >> +}
> >> +
> >> +static void spapr_phb_vfio_reset(DeviceState *qdev)
> >> +{
> >> +    /* Do nothing */
> >> +}
> >> +
> >> +static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
> >> +{
> >> +    DeviceClass *dc = DEVICE_CLASS(klass);
> >> +    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
> >> +
> >> +    dc->props = spapr_phb_vfio_properties;
> >> +    dc->reset = spapr_phb_vfio_reset;
> >> +    spc->finish_realize = spapr_phb_vfio_finish_realize;
> >> +}
> >> +
> >> +static const TypeInfo spapr_phb_vfio_info = {
> >> +    .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
> >> +    .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
> >> +    .instance_size = sizeof(sPAPRPHBVFIOState),
> >> +    .class_init    = spapr_phb_vfio_class_init,
> >> +    .class_size    = sizeof(sPAPRPHBClass),
> >> +};
> >> +
> >> +static void spapr_pci_vfio_register_types(void)
> >> +{
> >> +    type_register_static(&spapr_phb_vfio_info);
> >> +}
> >> +
> >> +type_init(spapr_pci_vfio_register_types)
> >> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> >> index 0f428a1..18acb67 100644
> >> --- a/include/hw/pci-host/spapr.h
> >> +++ b/include/hw/pci-host/spapr.h
> >> @@ -30,10 +30,14 @@
> >>  #define SPAPR_MSIX_MAX_DEVS 32
> >>  
> >>  #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
> >> +#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
> >>  
> >>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
> >>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
> >>  
> >> +#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
> >> +    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
> >> +
> >>  #define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
> >>       OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass), TYPE_SPAPR_PCI_HOST_BRIDGE)
> >>  #define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
> >> @@ -41,6 +45,7 @@
> >>  
> >>  typedef struct sPAPRPHBClass sPAPRPHBClass;
> >>  typedef struct sPAPRPHBState sPAPRPHBState;
> >> +typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
> >>  
> >>  struct sPAPRPHBClass {
> >>      PCIHostBridgeClass parent_class;
> >> @@ -78,6 +83,14 @@ struct sPAPRPHBState {
> >>      QLIST_ENTRY(sPAPRPHBState) list;
> >>  };
> >>  
> >> +struct sPAPRPHBVFIOState {
> >> +    sPAPRPHBState phb;
> >> +
> >> +    struct VFIOContainer *container;
> >> +    int32_t iommugroupid;
> >> +    uint8_t scan, enable_multifunction, force_addr;
> >> +};
> >> +
> >>  #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
> >>  
> >>  #define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
> > 
> > 
> > 
> 
> 


* Re: [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio
  2014-03-31 20:09       ` Alex Williamson
@ 2014-04-01  6:25         ` Alexey Kardashevskiy
  2014-04-01 18:21           ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-04-01  6:25 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 04/01/2014 07:09 AM, Alex Williamson wrote:
> On Fri, 2014-03-28 at 17:01 +1100, Alexey Kardashevskiy wrote:
>> On 03/20/2014 06:57 AM, Alex Williamson wrote:
>>> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
>>>> The patch adds a spapr-pci-vfio-host-bridge device type
>>>> which is a PCI Host Bridge with VFIO support. The new device
>>>> inherits from the spapr-pci-host-bridge device and adds
>>>> the following properties:
>>>> 	iommu - IOMMU group ID which represents a Partitionable
>>>> 	 	Endpoint, QEMU/ppc64 uses a separate PHB for
>>>> 		an IOMMU group so the guest kernel has to have
>>>> 		PCI Domain support enabled.
>>>> 	forceaddr (optional, 0 by default) - forces QEMU to copy
>>>> 		device:function from the host address as
>>>> 		certain guest drivers expect devices to appear in
>>>> 		particular locations;
>>>> 	mf (optional, 0 by default) - forces multifunction bit for
>>>> 		the function #0 of a found device, only makes sense
>>>> 		for multifunction devices and only with the forceaddr
>>>> 		property set. It would not be required if there
>>>> 		was a way to know in advance whether a device is
>>>> 		multifunctional or not.
>>>> 	scan (optional, 1 by default) - if non-zero, the new PHB walks
>>>> 		through all non-bridge devices in the group and tries
>>>> 		adding them to the PHB; if zero, all devices in the group
>>>> 		have to be configured manually via the QEMU command line.
>>>>
>>>> Examples of use:
>>>> 1) Scan and add all devices from IOMMU group with ID=1 to QEMU's PHB #6:
>>>> 	-device spapr-pci-vfio-host-bridge,id=DEVICENAME,iommu=1,index=6
>>>>
>>>> 2) Configure and Add 3 functions of a multifunctional device to QEMU:
>>>> (the NEC PCI USB card is used as an example here):
>>>> 	-device spapr-pci-vfio-host-bridge,id=USB,iommu=4,scan=0,index=7 \
>>>> 	-device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true
>>>> 	-device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB
>>>> 	-device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB
>>>
>>> I won't pretend to predict the reaction of QEMU device architects to
>>> this, but it seems like the assembly we expect from config files or
>>> outside utilities, ex. libvirt.  I don't doubt this makes qemu
>>> commandline usage more palatable, but it seems contrary to some of the
>>> other things we do in QEMU with devices.  If this is something we need,
>>> why is it specific to spapr?  IOMMU group can contain multiple devices
>>> on any platform.  On x86 we could do something similar with a p2p
>>> bridge, switch, or root port.
>>
>>
>> At least at the moment devices are assigned to groups statically on SPAPR,
>> they cannot be moved between groups at all, so it seems like a reasonable
>> assumption that the user will not mind putting them all to the same QEMU
>> instance.
>>
>> I should probably disable "scan" by default though, that would make more
>> sense for libvirt.
> 
> That doesn't really address the concern.  An x86 chipset puts specific
> devices at a specific address, yet this is not something that we hard
> code into QEMU.  We have config files that define this and tools like
> libvirt know where to put things.  Why is SPAPR special enough to have
> QEMU auto-add an entire sub-hierarchy?  If this is a necessary feature,
> why not make it generic and give x86 the capability as well?

Ok. You are right. spapr_pci_vfio_scan() has to go :)



>>> BTW, the code skips bridges, but doesn't that mean you'll have a hard
>>> time with forceaddr as you potentially try to overlay devfn from
>>> multiple buses onto a single bus?
>>> It also makes the value of this a bit
>>> more questionable since it seems to fall apart so easily.  Thanks,
>>
>> These "forceaddr" and "mf" are rather debug options - at the very beginning
>> I used to have strong impression that USB NEC PCI device (2xOHCI and
>> 1xEHCI) does not work properly if it is not multifunctional but I was
>> wrong, just checked. So I'll just remove them as they do not help even me
>> anymore and that's it. Thanks!
> 
> I'm still questioning the value of this code, it just seems to be a
> convenience feature for someone running QEMU from the commandline, which
> has never really been a concern for the rest of QEMU.  Suggest taking
> this patch out of the series since it's not critical path.  Thanks,


I can move spapr_pci_vfio_scan() to a separate patch or get rid of it
altogether, but the rest is still needed to set up the correct IOMMU device.
Is that what you really meant?


> 
> Alex
> 
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v5:
>>>> * added handling of possible failure of spapr_vfio_new_table()
>>>>
>>>> v4:
>>>> * moved IOMMU changes to separate patches
>>>> * moved spapr-pci-vfio-host-bridge to new file
>>>> ---
>>>>  hw/ppc/Makefile.objs        |   2 +-
>>>>  hw/ppc/spapr_pci_vfio.c     | 206 ++++++++++++++++++++++++++++++++++++++++++++
>>>>  include/hw/pci-host/spapr.h |  13 +++
>>>>  3 files changed, 220 insertions(+), 1 deletion(-)
>>>>  create mode 100644 hw/ppc/spapr_pci_vfio.c
>>>>
>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>> index ea747f0..2239192 100644
>>>> --- a/hw/ppc/Makefile.objs
>>>> +++ b/hw/ppc/Makefile.objs
>>>> @@ -3,7 +3,7 @@ obj-y += ppc.o ppc_booke.o
>>>>  # IBM pSeries (sPAPR)
>>>>  obj-$(CONFIG_PSERIES) += spapr.o spapr_vio.o spapr_events.o
>>>>  obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
>>>> -obj-$(CONFIG_PSERIES) += spapr_pci.o
>>>> +obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_pci_vfio.o
>>>>  # PowerPC 4xx boards
>>>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>>>  obj-y += ppc4xx_pci.o
>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>> new file mode 100644
>>>> index 0000000..40f6673
>>>> --- /dev/null
>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>> @@ -0,0 +1,206 @@
>>>> +/*
>>>> + * QEMU sPAPR PCI host for VFIO
>>>> + *
>>>> + * Copyright (c) 2011 Alexey Kardashevskiy, IBM Corporation.
>>>> + *
>>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>>>> + * of this software and associated documentation files (the "Software"), to deal
>>>> + * in the Software without restriction, including without limitation the rights
>>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>>>> + * copies of the Software, and to permit persons to whom the Software is
>>>> + * furnished to do so, subject to the following conditions:
>>>> + *
>>>> + * The above copyright notice and this permission notice shall be included in
>>>> + * all copies or substantial portions of the Software.
>>>> + *
>>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>>> + * THE SOFTWARE.
>>>> + */
>>>> +#include <sys/types.h>
>>>> +#include <dirent.h>
>>>> +
>>>> +#include "hw/hw.h"
>>>> +#include "hw/ppc/spapr.h"
>>>> +#include "hw/pci-host/spapr.h"
>>>> +#include "hw/misc/vfio.h"
>>>> +#include "hw/pci/pci_bus.h"
>>>> +#include "trace.h"
>>>> +#include "qemu/error-report.h"
>>>> +
>>>> +/* sPAPR VFIO */
>>>> +static Property spapr_phb_vfio_properties[] = {
>>>> +    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
>>>> +    DEFINE_PROP_UINT8("scan", sPAPRPHBVFIOState, scan, 1),
>>>> +    DEFINE_PROP_UINT8("mf", sPAPRPHBVFIOState, enable_multifunction, 0),
>>>> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBVFIOState, force_addr, 0),
>>>> +    DEFINE_PROP_END_OF_LIST(),
>>>> +};
>>>> +
>>>> +static void spapr_pci_vfio_scan(sPAPRPHBVFIOState *svphb, Error **errp)
>>>> +{
>>>> +    PCIHostState *phb = PCI_HOST_BRIDGE(svphb);
>>>> +    char *iommupath;
>>>> +    DIR *dirp;
>>>> +    struct dirent *entry;
>>>> +    Error *error = NULL;
>>>> +
>>>> +    if (!svphb->scan) {
>>>> +        trace_spapr_pci("autoscan disabled for ", svphb->phb.dtbusname);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    iommupath = g_strdup_printf("/sys/kernel/iommu_groups/%d/devices/",
>>>> +                                svphb->iommugroupid);
>>>> +    if (!iommupath) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    dirp = opendir(iommupath);
>>>> +    if (!dirp) {
>>>> +        error_report("spapr-vfio: vfio scan failed on opendir: %m");
>>>> +        g_free(iommupath);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    while ((entry = readdir(dirp)) != NULL) {
>>>> +        Error *err = NULL;
>>>> +        char *tmp;
>>>> +        FILE *deviceclassfile;
>>>> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
>>>> +        char addr[32];
>>>> +        DeviceState *dev;
>>>> +
>>>> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
>>>> +                   &domainid, &busid, &devid, &fnid) != 4) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
>>>> +        trace_spapr_pci("Reading device class from ", tmp);
>>>> +
>>>> +        deviceclassfile = fopen(tmp, "r");
>>>> +        if (deviceclassfile) {
>>>> +            int ret = fscanf(deviceclassfile, "%x", &deviceclass);
>>>> +            fclose(deviceclassfile);
>>>> +            if (ret != 1) {
>>>> +                continue;
>>>> +            }
>>>> +        }
>>>> +        g_free(tmp);
>>>> +
>>>> +        if (!deviceclass) {
>>>> +            continue;
>>>> +        }
>>>> +        if ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8)) {
>>>> +            /* Skip bridges */
>>>> +            continue;
>>>> +        }
>>>> +        trace_spapr_pci("Creating device from ", entry->d_name);
>>>> +
>>>> +        dev = qdev_create(&phb->bus->qbus, "vfio-pci");
>>>> +        if (!dev) {
>>>> +            trace_spapr_pci("failed to create vfio-pci", entry->d_name);
>>>> +            continue;
>>>> +        }
>>>> +        object_property_parse(OBJECT(dev), entry->d_name, "host", &err);
>>>> +        if (err != NULL) {
>>>> +            object_unref(OBJECT(dev));
>>>> +            continue;
>>>> +        }
>>>> +        if (svphb->force_addr) {
>>>> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
>>>> +            err = NULL;
>>>> +            object_property_parse(OBJECT(dev), addr, "addr", &err);
>>>> +            if (err != NULL) {
>>>> +                object_unref(OBJECT(dev));
>>>> +                continue;
>>>> +            }
>>>> +        }
>>>> +        if (svphb->enable_multifunction) {
>>>> +            qdev_prop_set_bit(dev, "multifunction", 1);
>>>> +        }
>>>> +        object_property_set_bool(OBJECT(dev), true, "realized", &error);
>>>> +        if (error) {
>>>> +            object_unref(OBJECT(dev));
>>>> +            error_propagate(errp, error);
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +    closedir(dirp);
>>>> +    g_free(iommupath);
>>>> +}
>>>> +
>>>> +static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>> +{
>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>> +    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
>>>> +    int ret;
>>>> +    Error *error = NULL;
>>>> +
>>>> +    if (svphb->iommugroupid == -1) {
>>>> +        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    svphb->phb.tcet = spapr_vfio_new_table(DEVICE(sphb), svphb->phb.dma_liobn);
>>>> +
>>>> +    if (!svphb->phb.tcet) {
>>>> +        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
>>>> +                       sphb->dtbusname);
>>>> +
>>>> +    ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
>>>> +                                        sphb->dma_liobn, svphb->iommugroupid,
>>>> +                                        &info);
>>>> +    if (ret) {
>>>> +        error_setg_errno(errp, -ret,
>>>> +                         "spapr-vfio: get info from container failed");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    svphb->phb.dma_window_start = info.dma32_window_start;
>>>> +    svphb->phb.dma_window_size = info.dma32_window_size;
>>>> +
>>>> +    spapr_pci_vfio_scan(svphb, &error);
>>>> +    if (error) {
>>>> +        error_propagate(errp, error);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void spapr_phb_vfio_reset(DeviceState *qdev)
>>>> +{
>>>> +    /* Do nothing */
>>>> +}
>>>> +
>>>> +static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
>>>> +{
>>>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>>>> +    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
>>>> +
>>>> +    dc->props = spapr_phb_vfio_properties;
>>>> +    dc->reset = spapr_phb_vfio_reset;
>>>> +    spc->finish_realize = spapr_phb_vfio_finish_realize;
>>>> +}
>>>> +
>>>> +static const TypeInfo spapr_phb_vfio_info = {
>>>> +    .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
>>>> +    .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
>>>> +    .instance_size = sizeof(sPAPRPHBVFIOState),
>>>> +    .class_init    = spapr_phb_vfio_class_init,
>>>> +    .class_size    = sizeof(sPAPRPHBClass),
>>>> +};
>>>> +
>>>> +static void spapr_pci_vfio_register_types(void)
>>>> +{
>>>> +    type_register_static(&spapr_phb_vfio_info);
>>>> +}
>>>> +
>>>> +type_init(spapr_pci_vfio_register_types)
>>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>>>> index 0f428a1..18acb67 100644
>>>> --- a/include/hw/pci-host/spapr.h
>>>> +++ b/include/hw/pci-host/spapr.h
>>>> @@ -30,10 +30,14 @@
>>>>  #define SPAPR_MSIX_MAX_DEVS 32
>>>>  
>>>>  #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
>>>> +#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
>>>>  
>>>>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>>>>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>>>>  
>>>> +#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
>>>> +    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
>>>> +
>>>>  #define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
>>>>       OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass), TYPE_SPAPR_PCI_HOST_BRIDGE)
>>>>  #define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
>>>> @@ -41,6 +45,7 @@
>>>>  
>>>>  typedef struct sPAPRPHBClass sPAPRPHBClass;
>>>>  typedef struct sPAPRPHBState sPAPRPHBState;
>>>> +typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
>>>>  
>>>>  struct sPAPRPHBClass {
>>>>      PCIHostBridgeClass parent_class;
>>>> @@ -78,6 +83,14 @@ struct sPAPRPHBState {
>>>>      QLIST_ENTRY(sPAPRPHBState) list;
>>>>  };
>>>>  
>>>> +struct sPAPRPHBVFIOState {
>>>> +    sPAPRPHBState phb;
>>>> +
>>>> +    struct VFIOContainer *container;
>>>> +    int32_t iommugroupid;
>>>> +    uint8_t scan, enable_multifunction, force_addr;
>>>> +};
>>>> +
>>>>  #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
>>>>  
>>>>  #define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL


-- 
Alexey


* Re: [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio
  2014-04-01  6:25         ` Alexey Kardashevskiy
@ 2014-04-01 18:21           ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2014-04-01 18:21 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On Tue, 2014-04-01 at 17:25 +1100, Alexey Kardashevskiy wrote:
> On 04/01/2014 07:09 AM, Alex Williamson wrote:
> > On Fri, 2014-03-28 at 17:01 +1100, Alexey Kardashevskiy wrote:
> >> On 03/20/2014 06:57 AM, Alex Williamson wrote:
> >>> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> >>>> The patch adds a spapr-pci-vfio-host-bridge device type
> >>>> which is a PCI Host Bridge with VFIO support. The new device
> >>>> inherits from the spapr-pci-host-bridge device and adds
> >>>> the following properties:
> >>>> 	iommu - IOMMU group ID which represents a Partitionable
> >>>> 	 	Endpoint, QEMU/ppc64 uses a separate PHB for
> >>>> 		an IOMMU group so the guest kernel has to have
> >>>> 		PCI Domain support enabled.
> >>>> 	forceaddr (optional, 0 by default) - forces QEMU to copy
> >>>> 		device:function from the host address as
> >>>> 		certain guest drivers expect devices to appear in
> >>>> 		particular locations;
> >>>> 	mf (optional, 0 by default) - forces multifunction bit for
> >>>> 		the function #0 of a found device, only makes sense
> >>>> 		for multifunction devices and only with the forceaddr
> >>>> 		property set. It would not be required if there
> >>>> 		was a way to know in advance whether a device is
> >>>> 		multifunctional or not.
> >>>> 	scan (optional, 1 by default) - if non-zero, the new PHB walks
> >>>> 		through all non-bridge devices in the group and tries
> >>>> 		adding them to the PHB; if zero, all devices in the group
> >>>> 		have to be configured manually via the QEMU command line.
> >>>>
> >>>> Examples of use:
> >>>> 1) Scan and add all devices from IOMMU group with ID=1 to QEMU's PHB #6:
> >>>> 	-device spapr-pci-vfio-host-bridge,id=DEVICENAME,iommu=1,index=6
> >>>>
> >>>> 2) Configure and Add 3 functions of a multifunctional device to QEMU:
> >>>> (the NEC PCI USB card is used as an example here):
> >>>> 	-device spapr-pci-vfio-host-bridge,id=USB,iommu=4,scan=0,index=7 \
> >>>> 	-device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true
> >>>> 	-device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB
> >>>> 	-device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB
> >>>
> >>> I won't pretend to predict the reaction of QEMU device architects to
> >>> this, but it seems like the assembly we expect from config files or
> >>> outside utilities, ex. libvirt.  I don't doubt this makes qemu
> >>> commandline usage more palatable, but it seems contrary to some of the
> >>> other things we do in QEMU with devices.  If this is something we need,
> >>> why is it specific to spapr?  IOMMU group can contain multiple devices
> >>> on any platform.  On x86 we could do something similar with a p2p
> >>> bridge, switch, or root port.
> >>
> >>
> >> At least at the moment devices are assigned to groups statically on SPAPR,
> >> they cannot be moved between groups at all, so it seems like a reasonable
> >> assumption that the user will not mind putting them all to the same QEMU
> >> instance.
> >>
> >> I should probably disable "scan" by default though, that would make more
> >> sense for libvirt.
> > 
> > That doesn't really address the concern.  An x86 chipset puts specific
> > devices at a specific address, yet this is not something that we hard
> > code into QEMU.  We have config files that define this and tools like
> > libvirt know where to put things.  Why is SPAPR special enough to have
> > QEMU auto-add an entire sub-hierarchy?  If this is a necessary feature,
> > why not make it generic and give x86 the capability as well?
> 
> Ok. You are right. spapr_pci_vfio_scan() has to go :)
> 
> 
> 
> >>> BTW, the code skips bridges, but doesn't that mean you'll have a hard
> >>> time with forceaddr as you potentially try to overlay devfn from
> >>> multiple buses onto a single bus?
> >>> It also makes the value of this a bit
> >>> more questionable since it seems to fall apart so easily.  Thanks,
> >>
> >> These "forceaddr" and "mf" are rather debug options - at the very beginning
> >> I used to have strong impression that USB NEC PCI device (2xOHCI and
> >> 1xEHCI) does not work properly if it is not multifunctional but I was
> >> wrong, just checked. So I'll just remove them as they do not help even me
> >> anymore and that's it. Thanks!
> > 
> > I'm still questioning the value of this code, it just seems to be a
> > convenience feature for someone running QEMU from the commandline, which
> > has never really been a concern for the rest of QEMU.  Suggest taking
> > this patch out of the series since it's not critical path.  Thanks,
> 
> 
> I can move spapr_pci_vfio_scan() to a separate patch or get rid of it
> altogether, but the rest is still needed to set up the correct IOMMU device.
> Is that what you really meant?

Sure, you still need the PHB which hosts an IOMMU to create a bus for
the vfio-pci devices.  Thanks,

Alex

> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>> Changes:
> >>>> v5:
> >>>> * added handling of possible failure of spapr_vfio_new_table()
> >>>>
> >>>> v4:
> >>>> * moved IOMMU changes to separate patches
> >>>> * moved spapr-pci-vfio-host-bridge to new file
> >>>> ---
> >>>>  hw/ppc/Makefile.objs        |   2 +-
> >>>>  hw/ppc/spapr_pci_vfio.c     | 206 ++++++++++++++++++++++++++++++++++++++++++++
> >>>>  include/hw/pci-host/spapr.h |  13 +++
> >>>>  3 files changed, 220 insertions(+), 1 deletion(-)
> >>>>  create mode 100644 hw/ppc/spapr_pci_vfio.c
> >>>>
> >>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >>>> index ea747f0..2239192 100644
> >>>> --- a/hw/ppc/Makefile.objs
> >>>> +++ b/hw/ppc/Makefile.objs
> >>>> @@ -3,7 +3,7 @@ obj-y += ppc.o ppc_booke.o
> >>>>  # IBM pSeries (sPAPR)
> >>>>  obj-$(CONFIG_PSERIES) += spapr.o spapr_vio.o spapr_events.o
> >>>>  obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
> >>>> -obj-$(CONFIG_PSERIES) += spapr_pci.o
> >>>> +obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_pci_vfio.o
> >>>>  # PowerPC 4xx boards
> >>>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>>>  obj-y += ppc4xx_pci.o
> >>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> >>>> new file mode 100644
> >>>> index 0000000..40f6673
> >>>> --- /dev/null
> >>>> +++ b/hw/ppc/spapr_pci_vfio.c
> >>>> @@ -0,0 +1,206 @@
> >>>> +/*
> >>>> + * QEMU sPAPR PCI host for VFIO
> >>>> + *
> >>>> + * Copyright (c) 2011 Alexey Kardashevskiy, IBM Corporation.
> >>>> + *
> >>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> >>>> + * of this software and associated documentation files (the "Software"), to deal
> >>>> + * in the Software without restriction, including without limitation the rights
> >>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> >>>> + * copies of the Software, and to permit persons to whom the Software is
> >>>> + * furnished to do so, subject to the following conditions:
> >>>> + *
> >>>> + * The above copyright notice and this permission notice shall be included in
> >>>> + * all copies or substantial portions of the Software.
> >>>> + *
> >>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> >>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> >>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> >>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> >>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> >>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> >>>> + * THE SOFTWARE.
> >>>> + */
> >>>> +#include <sys/types.h>
> >>>> +#include <dirent.h>
> >>>> +
> >>>> +#include "hw/hw.h"
> >>>> +#include "hw/ppc/spapr.h"
> >>>> +#include "hw/pci-host/spapr.h"
> >>>> +#include "hw/misc/vfio.h"
> >>>> +#include "hw/pci/pci_bus.h"
> >>>> +#include "trace.h"
> >>>> +#include "qemu/error-report.h"
> >>>> +
> >>>> +/* sPAPR VFIO */
> >>>> +static Property spapr_phb_vfio_properties[] = {
> >>>> +    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
> >>>> +    DEFINE_PROP_UINT8("scan", sPAPRPHBVFIOState, scan, 1),
> >>>> +    DEFINE_PROP_UINT8("mf", sPAPRPHBVFIOState, enable_multifunction, 0),
> >>>> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBVFIOState, force_addr, 0),
> >>>> +    DEFINE_PROP_END_OF_LIST(),
> >>>> +};
> >>>> +
> >>>> +static void spapr_pci_vfio_scan(sPAPRPHBVFIOState *svphb, Error **errp)
> >>>> +{
> >>>> +    PCIHostState *phb = PCI_HOST_BRIDGE(svphb);
> >>>> +    char *iommupath;
> >>>> +    DIR *dirp;
> >>>> +    struct dirent *entry;
> >>>> +    Error *error = NULL;
> >>>> +
> >>>> +    if (!svphb->scan) {
> >>>> +        trace_spapr_pci("autoscan disabled for ", svphb->phb.dtbusname);
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    iommupath = g_strdup_printf("/sys/kernel/iommu_groups/%d/devices/",
> >>>> +                                svphb->iommugroupid);
> >>>> +    if (!iommupath) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    dirp = opendir(iommupath);
> >>>> +    if (!dirp) {
> >>>> +        error_report("spapr-vfio: vfio scan failed on opendir: %m");
> >>>> +        g_free(iommupath);
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    while ((entry = readdir(dirp)) != NULL) {
> >>>> +        Error *err = NULL;
> >>>> +        char *tmp;
> >>>> +        FILE *deviceclassfile;
> >>>> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> >>>> +        char addr[32];
> >>>> +        DeviceState *dev;
> >>>> +
> >>>> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
> >>>> +                   &domainid, &busid, &devid, &fnid) != 4) {
> >>>> +            continue;
> >>>> +        }
> >>>> +
> >>>> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
> >>>> +        trace_spapr_pci("Reading device class from ", tmp);
> >>>> +
> >>>> +        deviceclassfile = fopen(tmp, "r");
> >>>> +        if (deviceclassfile) {
> >>>> +            int ret = fscanf(deviceclassfile, "%x", &deviceclass);
> >>>> +            fclose(deviceclassfile);
> >>>> +            if (ret != 1) {
> >>>> +                continue;
> >>>> +            }
> >>>> +        }
> >>>> +        g_free(tmp);
> >>>> +
> >>>> +        if (!deviceclass) {
> >>>> +            continue;
> >>>> +        }
> >>>> +        if ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8)) {
> >>>> +            /* Skip bridges */
> >>>> +            continue;
> >>>> +        }
> >>>> +        trace_spapr_pci("Creating device from ", entry->d_name);
> >>>> +
> >>>> +        dev = qdev_create(&phb->bus->qbus, "vfio-pci");
> >>>> +        if (!dev) {
> >>>> +            trace_spapr_pci("failed to create vfio-pci", entry->d_name);
> >>>> +            continue;
> >>>> +        }
> >>>> +        object_property_parse(OBJECT(dev), entry->d_name, "host", &err);
> >>>> +        if (err != NULL) {
> >>>> +            object_unref(OBJECT(dev));
> >>>> +            continue;
> >>>> +        }
> >>>> +        if (svphb->force_addr) {
> >>>> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
> >>>> +            err = NULL;
> >>>> +            object_property_parse(OBJECT(dev), addr, "addr", &err);
> >>>> +            if (err != NULL) {
> >>>> +                object_unref(OBJECT(dev));
> >>>> +                continue;
> >>>> +            }
> >>>> +        }
> >>>> +        if (svphb->enable_multifunction) {
> >>>> +            qdev_prop_set_bit(dev, "multifunction", 1);
> >>>> +        }
> >>>> +        object_property_set_bool(OBJECT(dev), true, "realized", &error);
> >>>> +        if (error) {
> >>>> +            object_unref(OBJECT(dev));
> >>>> +            error_propagate(errp, error);
> >>>> +            break;
> >>>> +        }
> >>>> +    }
> >>>> +    closedir(dirp);
> >>>> +    g_free(iommupath);
> >>>> +}
> >>>> +
> >>>> +static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
> >>>> +{
> >>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> >>>> +    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
> >>>> +    int ret;
> >>>> +    Error *error = NULL;
> >>>> +
> >>>> +    if (svphb->iommugroupid == -1) {
> >>>> +        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    svphb->phb.tcet = spapr_vfio_new_table(DEVICE(sphb), svphb->phb.dma_liobn);
> >>>> +
> >>>> +    if (!svphb->phb.tcet) {
> >>>> +        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
> >>>> +                       sphb->dtbusname);
> >>>> +
> >>>> +    ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
> >>>> +                                        sphb->dma_liobn, svphb->iommugroupid,
> >>>> +                                        &info);
> >>>> +    if (ret) {
> >>>> +        error_setg_errno(errp, -ret,
> >>>> +                         "spapr-vfio: get info from container failed");
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    svphb->phb.dma_window_start = info.dma32_window_start;
> >>>> +    svphb->phb.dma_window_size = info.dma32_window_size;
> >>>> +
> >>>> +    spapr_pci_vfio_scan(svphb, &error);
> >>>> +    if (error) {
> >>>> +        error_propagate(errp, error);
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +static void spapr_phb_vfio_reset(DeviceState *qdev)
> >>>> +{
> >>>> +    /* Do nothing */
> >>>> +}
> >>>> +
> >>>> +static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
> >>>> +{
> >>>> +    DeviceClass *dc = DEVICE_CLASS(klass);
> >>>> +    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
> >>>> +
> >>>> +    dc->props = spapr_phb_vfio_properties;
> >>>> +    dc->reset = spapr_phb_vfio_reset;
> >>>> +    spc->finish_realize = spapr_phb_vfio_finish_realize;
> >>>> +}
> >>>> +
> >>>> +static const TypeInfo spapr_phb_vfio_info = {
> >>>> +    .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
> >>>> +    .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
> >>>> +    .instance_size = sizeof(sPAPRPHBVFIOState),
> >>>> +    .class_init    = spapr_phb_vfio_class_init,
> >>>> +    .class_size    = sizeof(sPAPRPHBClass),
> >>>> +};
> >>>> +
> >>>> +static void spapr_pci_vfio_register_types(void)
> >>>> +{
> >>>> +    type_register_static(&spapr_phb_vfio_info);
> >>>> +}
> >>>> +
> >>>> +type_init(spapr_pci_vfio_register_types)
> >>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> >>>> index 0f428a1..18acb67 100644
> >>>> --- a/include/hw/pci-host/spapr.h
> >>>> +++ b/include/hw/pci-host/spapr.h
> >>>> @@ -30,10 +30,14 @@
> >>>>  #define SPAPR_MSIX_MAX_DEVS 32
> >>>>  
> >>>>  #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
> >>>> +#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
> >>>>  
> >>>>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
> >>>>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
> >>>>  
> >>>> +#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
> >>>> +    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
> >>>> +
> >>>>  #define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
> >>>>       OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass), TYPE_SPAPR_PCI_HOST_BRIDGE)
> >>>>  #define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
> >>>> @@ -41,6 +45,7 @@
> >>>>  
> >>>>  typedef struct sPAPRPHBClass sPAPRPHBClass;
> >>>>  typedef struct sPAPRPHBState sPAPRPHBState;
> >>>> +typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
> >>>>  
> >>>>  struct sPAPRPHBClass {
> >>>>      PCIHostBridgeClass parent_class;
> >>>> @@ -78,6 +83,14 @@ struct sPAPRPHBState {
> >>>>      QLIST_ENTRY(sPAPRPHBState) list;
> >>>>  };
> >>>>  
> >>>> +struct sPAPRPHBVFIOState {
> >>>> +    sPAPRPHBState phb;
> >>>> +
> >>>> +    struct VFIOContainer *container;
> >>>> +    int32_t iommugroupid;
> >>>> +    uint8_t scan, enable_multifunction, force_addr;
> >>>> +};
> >>>> +
> >>>>  #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
> >>>>  
> >>>>  #define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
> 
> 
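The forceaddr concern raised above can be illustrated with a small standalone sketch. It is not code from the patch: the `demo_*` names and the `DemoPCIAddr` type are hypothetical. It parses sysfs-style device names the way the quoted scan loop does, checks the PCI base class (0x06 is the base class for bridges) in the top byte of the 24-bit class value, and shows that an address built only from device and function numbers is the same for devices that sit on different host buses.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* PCI base class for bridge devices (top byte of the 24-bit class code). */
#define DEMO_PCI_BASE_CLASS_BRIDGE 0x06u

/* Hypothetical holder for a parsed "domain:bus:dev.fn" sysfs name. */
typedef struct DemoPCIAddr {
    unsigned domain, bus, dev, fn;
} DemoPCIAddr;

/* Parse a name such as "0001:03:00.1"; returns false for ".", ".." etc. */
bool demo_parse_pci_addr(const char *name, DemoPCIAddr *a)
{
    return sscanf(name, "%x:%x:%x.%x",
                  &a->domain, &a->bus, &a->dev, &a->fn) == 4;
}

/* True if a 24-bit class code (as read from sysfs "class") is a bridge. */
bool demo_class_is_bridge(unsigned class_code)
{
    return ((class_code >> 16) & 0xffu) == DEMO_PCI_BASE_CLASS_BRIDGE;
}

/* Build the guest address the same way "forceaddr" would: dev.fn only. */
void demo_forced_addr(const DemoPCIAddr *a, char *buf, size_t len)
{
    snprintf(buf, len, "%x.%x", a->dev, a->fn);
}

/* True if two host devices would be forced onto the same guest slot. */
bool demo_forced_addr_collides(const DemoPCIAddr *a, const DemoPCIAddr *b)
{
    char addr_a[32], addr_b[32];

    demo_forced_addr(a, addr_a, sizeof(addr_a));
    demo_forced_addr(b, addr_b, sizeof(addr_b));
    return strcmp(addr_a, addr_b) == 0;
}
```

With these helpers, "0001:03:00.0" and "0001:04:00.0" both map to the forced address "0.0", because the bus number is dropped once bridges are skipped and everything lands on a single guest bus.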


* Re: [Qemu-devel] [PATCH v5 08/11] spapr-iommu: add SPAPR VFIO IOMMU device
  2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 08/11] spapr-iommu: add SPAPR VFIO IOMMU device Alexey Kardashevskiy
@ 2014-04-03 12:17   ` Alexander Graf
  2014-04-07  4:07     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 42+ messages in thread
From: Alexander Graf @ 2014-04-03 12:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Alex Williamson
  Cc: qemu-ppc, qemu-devel, Alexander Graf


On 12.03.14 06:52, Alexey Kardashevskiy wrote:
> This adds SPAPR VFIO IOMMU device in order to support DMA operations
> for VFIO devices.

Sorry if this has been mentioned before, but why exactly do you need a 
separate IOMMU for VFIO? Couldn't the existing IOMMU backend drive things?


Alex


* Re: [Qemu-devel] [PATCH v5 08/11] spapr-iommu: add SPAPR VFIO IOMMU device
  2014-04-03 12:17   ` Alexander Graf
@ 2014-04-07  4:07     ` Alexey Kardashevskiy
  2014-04-10 12:13       ` Alexander Graf
  0 siblings, 1 reply; 42+ messages in thread
From: Alexey Kardashevskiy @ 2014-04-07  4:07 UTC (permalink / raw)
  To: Alexander Graf, Alex Williamson; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 04/03/2014 11:17 PM, Alexander Graf wrote:
> 
> On 12.03.14 06:52, Alexey Kardashevskiy wrote:
>> This adds SPAPR VFIO IOMMU device in order to support DMA operations
>> for VFIO devices.
> 
> Sorry if this has been mentioned before, but why exactly do you need a
> separate IOMMU for VFIO? Couldn't the existing IOMMU backend drive things?

Well... Since I started VFIO on SPAPR, the emulated and VFIO IOMMU became
almost the same thing and I'll rework that too before I post things again.

However one difference still remains - IOMMU for emulated PCI and VIO keeps
a TCE table (allocated in QEMU or mmap'ed from the host kernel) and VFIO
IOMMU works with the table which is allocated and owned by the host kernel.

Since TCE tables are used only by devices, the IOMMU translation callback
is never called by VFIO devices and that's ok and I checked - it works.

So I need either a property in the IOMMU device to tell it whether the TCE
table and MemoryRegionIOMMUOps::translate() are required, or a new IOMMU
device class. Which to choose?
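
As a rough illustration of the first option (a property on one IOMMU device instead of a second class), here is a minimal standalone sketch. It is not code from this series: `DemoTCETable`, `kernel_owned`, `demo_tce_translate()` and the simplified entry layout (permission bits in the low two bits, page address above them) are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: the low two bits of an entry carry read/write permission. */
#define DEMO_TCE_PERM_MASK 0x3ULL

/* Hypothetical stand-in for a TCE table device, not QEMU's real type. */
typedef struct DemoTCETable {
    bool kernel_owned;    /* hypothetical property: table is owned by the host */
    uint64_t *table;      /* QEMU-visible entries, NULL when kernel_owned */
    uint32_t nb_entries;
    uint32_t page_shift;  /* e.g. 12 for 4K IOMMU pages */
} DemoTCETable;

/*
 * Translate an I/O bus address. Returns true and stores the result in *out
 * on success. Fails when the table is kernel-owned (nothing to walk here),
 * when the address is outside the window, or when the entry grants no
 * access.
 */
bool demo_tce_translate(const DemoTCETable *t, uint64_t ioba, uint64_t *out)
{
    assert(t != NULL && out != NULL);

    uint64_t page_mask = (1ULL << t->page_shift) - 1;
    uint64_t index = ioba >> t->page_shift;

    if (t->kernel_owned || t->table == NULL || index >= t->nb_entries) {
        return false;
    }

    uint64_t tce = t->table[index];
    if ((tce & DEMO_TCE_PERM_MASK) == 0) {
        return false;
    }

    *out = (tce & ~page_mask) | (ioba & page_mask);
    return true;
}
```

In this shape the emulated and VFIO cases share one type, and only the flag decides whether a translate callback has anything to walk.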

Oh. btw. There is H_GET_TCE now which I have to implement for VFIO :( This
will never ever end.


-- 
Alexey


* Re: [Qemu-devel] [PATCH v5 08/11] spapr-iommu: add SPAPR VFIO IOMMU device
  2014-04-07  4:07     ` Alexey Kardashevskiy
@ 2014-04-10 12:13       ` Alexander Graf
  0 siblings, 0 replies; 42+ messages in thread
From: Alexander Graf @ 2014-04-10 12:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Alexander Graf, Alex Williamson
  Cc: qemu-ppc, qemu-devel


On 07.04.14 06:07, Alexey Kardashevskiy wrote:
> On 04/03/2014 11:17 PM, Alexander Graf wrote:
>> On 12.03.14 06:52, Alexey Kardashevskiy wrote:
>>> This adds SPAPR VFIO IOMMU device in order to support DMA operations
>>> for VFIO devices.
>> Sorry if this has been mentioned before, but why exactly do you need a
>> separate IOMMU for VFIO? Couldn't the existing IOMMU backend drive things?
> Well... Since I started VFIO on SPAPR, the emulated and VFIO IOMMU became
> almost the same thing and I'll rework that too before I post things again.
>
> However one difference still remains - IOMMU for emulated PCI and VIO keeps
> a TCE table (allocated in QEMU or mmap'ed from the host kernel) and VFIO
> IOMMU works with the table which is allocated and owned by the host kernel.
>
> Since TCE tables are used only by devices, the IOMMU translation callback
> is never called by VFIO devices and that's ok and I checked - it works.
>
> So I need either a property in the IOMMU device to tell it whether the TCE
> table and MemoryRegionIOMMUOps::translate() are required, or a new IOMMU
> device class. Which to choose?

We need to handle in-kernel TCE tables with the emulated device IOMMU as 
well, so I'd

> Oh. btw. There is H_GET_TCE now which I have to implement for VFIO :( This
> will never ever end.

... which means you get H_GET_TCE for free as well ;).


Alex


Thread overview: 42+ messages
2014-03-12  5:52 [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alexey Kardashevskiy
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 01/11] memory: Sanity check that no listeners remain on a destroyed AddressSpace Alexey Kardashevskiy
2014-03-20 10:20   ` Paolo Bonzini
2014-03-20 11:45     ` David Gibson
2014-03-27  5:40     ` Alexey Kardashevskiy
2014-03-27 12:15       ` Paolo Bonzini
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 02/11] int128: add int128_exts64() Alexey Kardashevskiy
2014-03-20 10:19   ` Paolo Bonzini
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 03/11] vfio: Fix 128 bit handling Alexey Kardashevskiy
2014-03-20 10:20   ` Paolo Bonzini
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 04/11] vfio: rework to have error paths Alexey Kardashevskiy
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 05/11] vfio: Introduce VFIO address spaces Alexey Kardashevskiy
2014-03-19 19:57   ` Alex Williamson
2014-03-28  3:42     ` Alexey Kardashevskiy
2014-03-31 19:14       ` Alex Williamson
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 06/11] vfio: Create VFIOAddressSpace objects as needed Alexey Kardashevskiy
2014-03-19 19:57   ` Alex Williamson
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support Alexey Kardashevskiy
2014-03-19 19:57   ` Alex Williamson
2014-03-20  5:25     ` David Gibson
2014-03-28  5:12       ` Alexey Kardashevskiy
2014-03-31 19:59         ` Alex Williamson
2014-03-21  7:59     ` Alexey Kardashevskiy
2014-03-21 14:17       ` Alex Williamson
2014-03-21 14:23         ` Paolo Bonzini
2014-03-28  4:49         ` Alexey Kardashevskiy
2014-03-31 19:54           ` Alex Williamson
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 08/11] spapr-iommu: add SPAPR VFIO IOMMU device Alexey Kardashevskiy
2014-04-03 12:17   ` Alexander Graf
2014-04-07  4:07     ` Alexey Kardashevskiy
2014-04-10 12:13       ` Alexander Graf
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 09/11] spapr vfio: add vfio_container_spapr_get_info() Alexey Kardashevskiy
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 10/11] spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
2014-03-13  8:12   ` [Qemu-devel] [PATCH v6] " Alexey Kardashevskiy
2014-03-19 19:57   ` [Qemu-devel] [PATCH v5 10/11] " Alex Williamson
2014-03-28  6:01     ` Alexey Kardashevskiy
2014-03-31 20:09       ` Alex Williamson
2014-04-01  6:25         ` Alexey Kardashevskiy
2014-04-01 18:21           ` Alex Williamson
2014-03-12  5:52 ` [Qemu-devel] [PATCH v5 11/11] spapr-vfio: enable for spapr Alexey Kardashevskiy
2014-03-19 19:57   ` Alex Williamson
2014-03-19 20:12 ` [Qemu-devel] [PATCH v5 00/11] vfio on spapr-ppc64 Alex Williamson
