* [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
@ 2012-07-10  5:51 Alexey Kardashevskiy
  2012-07-10  5:51 ` [Qemu-devel] [PATCH 1/2] pseries pci: spapr_finalize_pci_setup introduced Alexey Kardashevskiy
                   ` (6 more replies)
  0 siblings, 7 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-10  5:51 UTC (permalink / raw)
  Cc: Alexey Kardashevskiy, Alexander Graf, qemu-devel,
	Alex Williamson, qemu-ppc, David Gibson

The two patches in this set are supposed to add VFIO support for POWER.

The first one adds one more step to the initialization sequence; I am not
sure it is correct.

The second patch adds the actual VFIO support. It is not ready to be merged,
but it is ready for discussion. I would like to get rid of all the
#ifdef TARGET_PPC64 blocks in patch #2, and I wonder whether there is any
plan to implement generic EOI support code, etc.


Alexey Kardashevskiy (2):
  pseries pci: spapr_finalize_pci_setup introduced
  vfio-powerpc: added VFIO support

 hw/ppc/Makefile.objs |    3 ++
 hw/spapr.c           |    7 ++++
 hw/spapr.h           |    4 +++
 hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/spapr_pci.c       |   36 ++++++++++++++++++---
 hw/spapr_pci.h       |    4 +++
 hw/vfio_pci.c        |   76 +++++++++++++++++++++++++++++++++++++++++--
 hw/vfio_pci.h        |    2 ++
 8 files changed, 212 insertions(+), 7 deletions(-)

-- 
1.7.10


* [Qemu-devel] [PATCH 1/2] pseries pci: spapr_finalize_pci_setup introduced
  2012-07-10  5:51 [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alexey Kardashevskiy
@ 2012-07-10  5:51 ` Alexey Kardashevskiy
  2012-07-10  5:51 ` [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-10  5:51 UTC (permalink / raw)
  Cc: Alexey Kardashevskiy, Alexander Graf, qemu-devel,
	Alex Williamson, qemu-ppc, David Gibson

Previously PCI bus setup was done in 3 steps:
1) create a PCI bus, configure DMA
2) create PCI devices on the bus
3) populate a PCI bus node in the Device Tree

Some bus parameters can be configured only after some or all of the
devices have been attached to the bus and initialized, so this patch
introduces spapr_finalize_pci_setup(), which runs after the devices are
created and before the PCI bus node is populated in the Device Tree.

As an example, such a handler can set up DMA window parameters taken from
an IOMMU file descriptor available from a VFIO PCI device.
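
The ordering can be sketched in isolation. This is a minimal stand-alone
model, not the QEMU code: struct phb_model, phb_attach_device() and
phb_finalize() are hypothetical names, and the window reported by the
assigned device is made up for illustration. It only shows why the DMA
window is chosen after device creation: the finalize step can look at
what is attached to the bus.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI host bridge; this is not sPAPRPHBState. */
struct phb_model {
    int      ndevices;          /* devices attached so far */
    bool     has_assigned_dev;  /* a device supplied its own DMA window */
    uint64_t assigned_start;
    uint64_t assigned_size;
    uint64_t dma_window_start;  /* filled in by phb_finalize() */
    uint64_t dma_window_size;
    bool     finalized;
};

/* Device creation step: attach a device, remembering a window if it has one. */
static void phb_attach_device(struct phb_model *phb, bool assigned,
                              uint64_t win_start, uint64_t win_size)
{
    phb->ndevices++;
    if (assigned) {
        phb->has_assigned_dev = true;
        phb->assigned_start = win_start;
        phb->assigned_size = win_size;
    }
}

/* Finalize step: runs once, after all devices exist and before the
 * device tree node would be built. Returns 0 on success, -1 on misuse. */
static int phb_finalize(struct phb_model *phb)
{
    if (phb->finalized) {
        return -1;
    }
    if (phb->has_assigned_dev) {
        /* Window comes from the attached device (illustrative values). */
        phb->dma_window_start = phb->assigned_start;
        phb->dma_window_size = phb->assigned_size;
    } else {
        /* Same defaults the patch uses for the emulated case. */
        phb->dma_window_start = 0;
        phb->dma_window_size = 0x40000000;
    }
    phb->finalized = true;
    return 0;
}
```

A caller would attach every device first and call phb_finalize() exactly
once per bridge; a second call is rejected so the window cannot change
after the device tree has been generated.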

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/spapr.c     |    7 +++++++
 hw/spapr_pci.c |   13 ++++++++++---
 hw/spapr_pci.h |    2 ++
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/hw/spapr.c b/hw/spapr.c
index b83f83b..688a135 100644
--- a/hw/spapr.c
+++ b/hw/spapr.c
@@ -516,7 +516,14 @@ static void spapr_finalize_fdt(sPAPREnvironment *spapr,
     }
 
     QLIST_FOREACH(phb, &spapr->phbs, list) {
+        ret = spapr_finalize_pci_setup(phb);
+        if (ret < 0) {
+            break;
+        }
         ret = spapr_populate_pci_dt(phb, PHANDLE_XICP, fdt);
+        if (ret < 0) {
+            break;
+        }
     }
 
     if (ret < 0) {
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 014297b..5f89003 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -573,9 +573,6 @@ static int spapr_phb_init(SysBusDevice *s)
     phb->host_state.bus = bus;
 
     phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
-    phb->dma_window_start = 0;
-    phb->dma_window_size = 0x40000000;
-    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
     pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
 
     QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
@@ -639,6 +636,16 @@ void spapr_create_phb(sPAPREnvironment *spapr,
     qdev_init_nofail(dev);
 }
 
+/* Finalize PCI setup, called when all devices are already created */
+int spapr_finalize_pci_setup(sPAPRPHBState *phb)
+{
+    phb->dma_window_start = 0;
+    phb->dma_window_size = 0x40000000;
+    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
+                                         phb->dma_window_size);
+    return 0;
+}
+
 /* Macros to operate with address in OF binding to PCI */
 #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
 #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index 145071c..3aae273 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -68,6 +68,8 @@ void spapr_create_phb(sPAPREnvironment *spapr,
                       uint64_t mem_win_addr, uint64_t mem_win_size,
                       uint64_t io_win_addr, uint64_t msi_win_addr);
 
+int spapr_finalize_pci_setup(sPAPRPHBState *phb);
+
 int spapr_populate_pci_dt(sPAPRPHBState *phb,
                           uint32_t xics_phandle,
                           void *fdt);
-- 
1.7.10


* [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-10  5:51 [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alexey Kardashevskiy
  2012-07-10  5:51 ` [Qemu-devel] [PATCH 1/2] pseries pci: spapr_finalize_pci_setup introduced Alexey Kardashevskiy
@ 2012-07-10  5:51 ` Alexey Kardashevskiy
  2012-07-10 16:55   ` Alex Williamson
  2012-07-10 22:26   ` Scott Wood
  2012-07-10 16:57 ` [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alex Williamson
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-10  5:51 UTC (permalink / raw)
  Cc: Alexey Kardashevskiy, Alexander Graf, qemu-devel,
	Alex Williamson, qemu-ppc, David Gibson

The patch enables VFIO on POWER.

It literally does the following:

1. POWERPC IOMMU support (the kernel counterpart is required)

2. Added #ifdef TARGET_PPC64 for EOI handlers initialisation.

3. Added vfio_get_container_fd() to VFIO in order to initialize 1).

4. Makefile fixed and "is_vfio" flag added into sPAPR PHB - required to
distinguish VFIO's DMA context from the emulated one.
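
The way H_PUT_TCE requests are routed between the emulated and the VFIO
tables can be sketched on its own. The convention is the one the patch
relies on: a backend returns a negative value on error, zero when it
handled the request, and a positive value when the LIOBN is not its own,
in which case the next backend is tried. Everything else below is an
assumption made for illustration: the result codes, the LIOBN values and
the alignment check are hypothetical and are not the PAPR constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes: <0 error, 0 handled, >0 "not my LIOBN". */
enum { TCE_ERROR = -1, TCE_HANDLED = 0, TCE_NOT_MINE = 1 };

typedef int (*put_tce_fn)(uint32_t liobn, uint64_t ioba, uint64_t tce);

/* Pretend backend for emulated devices; owns LIOBN 0x80000000. */
static int put_tce_emulated(uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    (void)ioba;
    (void)tce;
    return liobn == 0x80000000u ? TCE_HANDLED : TCE_NOT_MINE;
}

/* Pretend backend for assigned devices; owns LIOBN 0x80010000 and
 * rejects I/O bus addresses that are not 4K aligned. */
static int put_tce_assigned(uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    (void)tce;
    if (liobn != 0x80010000u) {
        return TCE_NOT_MINE;
    }
    return (ioba & 0xfff) ? TCE_ERROR : TCE_HANDLED;
}

/* Try each backend in turn; stop at the first one that claims the LIOBN. */
static int dispatch_put_tce(uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    static const put_tce_fn backends[] = {
        put_tce_emulated,
        put_tce_assigned,
    };
    size_t i;

    for (i = 0; i < sizeof(backends) / sizeof(backends[0]); i++) {
        int ret = backends[i](liobn, ioba, tce);
        if (ret <= 0) {
            return ret;     /* handled, or failed while handling */
        }
    }
    return TCE_ERROR;       /* nobody owns this LIOBN */
}
```

Keeping "not mine" distinct from "error" is what lets a new backend be
appended to the chain without touching the existing ones.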

With the patches posted a bit earlier today, this patch fully supports
VFIO, including MSI-X.


Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/Makefile.objs |    3 ++
 hw/spapr.h           |    4 +++
 hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/spapr_pci.c       |   23 ++++++++++++-
 hw/spapr_pci.h       |    2 ++
 hw/vfio_pci.c        |   76 +++++++++++++++++++++++++++++++++++++++++--
 hw/vfio_pci.h        |    2 ++
 7 files changed, 193 insertions(+), 4 deletions(-)

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index f573a95..c46a049 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
 # Xilinx PPC peripherals
 obj-y += xilinx_ethlite.o
 
+# VFIO PCI device assignment
+obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
+
 obj-y := $(addprefix ../,$(obj-y))
diff --git a/hw/spapr.h b/hw/spapr.h
index b37f337..9dca704 100644
--- a/hw/spapr.h
+++ b/hw/spapr.h
@@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
 int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
                       DMAContext *dma);
 
+void spapr_vfio_init_dma(int fd, uint32_t liobn,
+                         uint64_t *dma32_window_start,
+                         uint64_t *dma32_window_size);
+
 #endif /* !defined (__HW_SPAPR_H__) */
diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
index 50c288d..0a194e8 100644
--- a/hw/spapr_iommu.c
+++ b/hw/spapr_iommu.c
@@ -16,6 +16,8 @@
  * You should have received a copy of the GNU Lesser General Public
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
+#include <sys/ioctl.h>
+
 #include "hw.h"
 #include "kvm.h"
 #include "qdev.h"
@@ -23,6 +25,7 @@
 #include "dma.h"
 
 #include "hw/spapr.h"
+#include "hw/linux-vfio.h"
 
 #include <libfdt.h>
 
@@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
     return 0;
 }
 
+/* -------- API for POWERPC IOMMU -------- */
+
+#define POWERPC_IOMMU           2
+
+struct tce_iommu_info {
+    __u32 argsz;
+    __u32 dma32_window_start;
+    __u32 dma32_window_size;
+};
+
+#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+struct tce_iommu_dma_map {
+    __u32 argsz;
+    __u64 va;
+    __u64 dmaaddr;
+};
+
+#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
+#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
+
+typedef struct sPAPRVFIOTable {
+    int fd;
+    uint32_t liobn;
+    QLIST_ENTRY(sPAPRVFIOTable) list;
+} sPAPRVFIOTable;
+
+QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
+
+void spapr_vfio_init_dma(int fd, uint32_t liobn,
+                         uint64_t *dma32_window_start,
+                         uint64_t *dma32_window_size)
+{
+    sPAPRVFIOTable *t;
+    struct tce_iommu_info info = { .argsz = sizeof(info) };
+
+    if (ioctl(fd, POWERPC_IOMMU_GET_INFO, &info)) {
+        fprintf(stderr, "POWERPC_IOMMU_GET_INFO failed %d\n", errno);
+        return;
+    }
+    *dma32_window_start = info.dma32_window_start;
+    *dma32_window_size = info.dma32_window_size;
+
+    t = g_malloc0(sizeof(*t));
+    t->fd = fd;
+    t->liobn = liobn;
+
+    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
+}
+
+static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
+{
+    sPAPRVFIOTable *t;
+    struct tce_iommu_dma_map map = {
+        .argsz = sizeof(map),
+        .va = 0,
+        .dmaaddr = ioba,
+    };
+
+    QLIST_FOREACH(t, &vfio_tce_tables, list) {
+        if (t->liobn != liobn) {
+            continue;
+        }
+        if (tce) {
+            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
+            if (ioctl(t->fd, POWERPC_IOMMU_MAP_DMA, &map)) {
+                fprintf(stderr, "TCE_MAP_DMA: %d\n", errno);
+                return H_PARAMETER;
+            }
+        } else {
+            if (ioctl(t->fd, POWERPC_IOMMU_UNMAP_DMA, &map)) {
+                fprintf(stderr, "TCE_UNMAP_DMA: %d\n", errno);
+                return H_PARAMETER;
+            }
+        }
+        return H_SUCCESS;
+    }
+    return H_CONTINUE; /* positive non-zero value */
+}
+
 static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
                               target_ulong opcode, target_ulong *args)
 {
@@ -203,6 +286,10 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
     if (0 >= ret) {
         return ret ? H_PARAMETER : H_SUCCESS;
     }
+    ret = put_tce_vfio(liobn, ioba, tce);
+    if (0 >= ret) {
+        return ret ? H_PARAMETER : H_SUCCESS;
+    }
 #ifdef DEBUG_TCE
     fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
             "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx " ret=%d\n",
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 5f89003..3375c3f 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -29,6 +29,7 @@
 #include "pci_host.h"
 #include "hw/spapr.h"
 #include "hw/spapr_pci.h"
+#include "hw/vfio_pci.h"
 #include "exec-memory.h"
 #include <libfdt.h>
 #include "trace.h"
@@ -440,6 +441,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
                  level);
 }
 
+static int pci_spapr_get_irq(void *opaque, int irq_num)
+{
+    sPAPRPHBState *phb = opaque;
+    return phb->lsi_table[irq_num].dt_irq;
+}
+
 static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
                               unsigned size)
 {
@@ -567,7 +574,8 @@ static int spapr_phb_init(SysBusDevice *s)
 
     bus = pci_register_bus(&phb->host_state.busdev.qdev,
                            phb->busname ? phb->busname : phb->dtbusname,
-                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
+                           pci_spapr_set_irq, pci_spapr_get_irq,
+                           pci_spapr_map_irq, phb,
                            &phb->memspace, &phb->iospace,
                            PCI_DEVFN(0, 0), PCI_NUM_PINS);
     phb->host_state.bus = bus;
@@ -596,6 +604,7 @@ static Property spapr_phb_properties[] = {
     DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
     DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
     DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
+    DEFINE_PROP_UINT8("vfio", sPAPRPHBState, is_vfio, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -639,6 +648,18 @@ void spapr_create_phb(sPAPREnvironment *spapr,
 /* Finalize PCI setup, called when all devices are already created */
 int spapr_finalize_pci_setup(sPAPRPHBState *phb)
 {
+    if (phb->is_vfio) {
+        int fd = vfio_get_container_fd(phb->host_state.bus);
+
+        if (fd < 0) {
+            return fd;
+        }
+        spapr_vfio_init_dma(fd, phb->dma_liobn,
+                            &phb->dma_window_start,
+                            &phb->dma_window_size);
+        return 0;
+    }
+
     phb->dma_window_start = 0;
     phb->dma_window_size = 0x40000000;
     phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index 3aae273..a4f031b 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -57,6 +57,8 @@ typedef struct sPAPRPHBState {
         int nvec;
     } msi_table[SPAPR_MSIX_MAX_DEVS];
 
+    uint8_t is_vfio;
+
     QLIST_ENTRY(sPAPRPHBState) list;
 } sPAPRPHBState;
 
diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 1ac287f..cc0b974 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -21,7 +21,6 @@
 #include <dirent.h>
 #include <stdio.h>
 #include <unistd.h>
-#include <sys/io.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
 #include <sys/types.h>
@@ -44,6 +43,17 @@
 #include "vfio_pci.h"
 #include "linux-vfio.h"
 
+#ifndef TARGET_PPC64
+#include <sys/io.h>
+#define VFIO_IOMMU_EXTENSION    VFIO_X86_IOMMU
+#else
+#include "hw/pci_internals.h"
+#include "hw/xics.h"
+#include "hw/spapr.h"
+#define POWERPC_IOMMU           2
+#define VFIO_IOMMU_EXTENSION    POWERPC_IOMMU
+#endif
+
 //#define DEBUG_VFIO
 #ifdef DEBUG_VFIO
 #define DPRINTF(fmt, ...) \
@@ -235,6 +245,7 @@ struct vfio_irq_set_fd {
 
 static void vfio_enable_intx_kvm(VFIODevice *vdev)
 {
+#ifndef TARGET_PPC64
 #ifdef CONFIG_KVM
     struct vfio_irq_set_fd irq_set_fd = {
 	.irq_set = {
@@ -298,10 +309,12 @@ fail:
     qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
     vfio_unmask_intx(vdev);
 #endif
+#endif
 }
 
 static void vfio_disable_intx_kvm(VFIODevice *vdev)
 {
+#ifndef TARGET_PPC64
 #ifdef CONFIG_KVM
     struct vfio_irq_set_fd irq_set_fd = {
 	.irq_set = {
@@ -350,8 +363,10 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev)
     DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel disabled\n", __FUNCTION__,
             vdev->host.seg, vdev->host.bus, vdev->host.dev, vdev->host.func);
 #endif
+#endif
 }
 
+#ifndef TARGET_PPC64
 static void vfio_update_irq(Notifier *notify, void *data)
 {
     VFIODevice *vdev = container_of(notify, VFIODevice, intx.update_irq);
@@ -381,6 +396,7 @@ static void vfio_update_irq(Notifier *notify, void *data)
     /* Re-enable the interrupt in cased we missed an EOI */
     vfio_eoi(&vdev->intx.eoi, NULL);
 }
+#endif
 
 static int vfio_enable_intx(VFIODevice *vdev)
 {
@@ -404,10 +420,14 @@ static int vfio_enable_intx(VFIODevice *vdev)
     vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
     vdev->intx.irq = pci_get_irq(&vdev->pdev, vdev->intx.pin);
     vdev->intx.eoi.notify = vfio_eoi;
+#ifndef TARGET_PPC64
     ioapic_add_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
 
     vdev->intx.update_irq.notify = vfio_update_irq;
     pci_add_irq_update_notifier(&vdev->pdev, &vdev->intx.update_irq);
+#else
+    xics_add_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
+#endif
 
     if (event_notifier_init(&vdev->intx.interrupt, 0)) {
         error_report("vfio: Error: event_notifier_init failed\n");
@@ -440,8 +460,12 @@ static void vfio_disable_intx(VFIODevice *vdev)
     vfio_disable_intx_kvm(vdev);
     vfio_disable_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX);
 
+#ifndef TARGET_PPC64
     pci_remove_irq_update_notifier(&vdev->intx.update_irq);
     ioapic_remove_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
+#else
+    xics_remove_eoi_notifier(&vdev->intx.eoi);
+#endif
 
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, NULL, NULL, vdev);
@@ -543,7 +567,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
     }
 
     fd = event_notifier_get_fd(&vdev->msi_vectors[vector].interrupt);
-
+#ifndef TARGET_PPC64
     vdev->msi_vectors[vector].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
     if (vdev->msi_vectors[vector].virq < 0 || 
         kvm_irqchip_add_irqfd(kvm_state, fd,
@@ -551,7 +575,11 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
         qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
                             &vdev->msi_vectors[vector]);
     }
-
+#else
+    vdev->msi_vectors[vector].virq = -1;
+    qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
+                        &vdev->msi_vectors[vector]);
+#endif
     if (vdev->nr_vectors < vector + 1) {
         int i;
 
@@ -692,6 +720,7 @@ retry:
         fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
 
         msg = msi_get_msg(&vdev->pdev, i);
+#ifndef TARGET_PPC64
         vdev->msi_vectors[i].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
         if (vdev->msi_vectors[i].virq < 0 || 
             kvm_irqchip_add_irqfd(kvm_state, fd,
@@ -699,6 +728,12 @@ retry:
             qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
                                 &vdev->msi_vectors[i]);
         }
+#else
+        vdev->msi_vectors[i].virq = -1;
+        qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
+                            &vdev->msi_vectors[i]);
+        msg = msg;
+#endif
     }
     
     ret = vfio_enable_vectors(vdev, false);
@@ -1581,6 +1616,25 @@ static int vfio_connect_container(VFIOGroup *group)
 
         memory_listener_register(&container->listener, get_system_memory());
 
+#define POWERPC_IOMMU           2
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, POWERPC_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, POWERPC_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
     } else {
         error_report("vfio: No available IOMMU models\n");
         g_free(container);
@@ -2005,3 +2059,19 @@ static void register_vfio_pci_dev_type(void)
 }
 
 type_init(register_vfio_pci_dev_type)
+
+int vfio_get_container_fd(struct PCIBus *pbus)
+{
+    BusChild *kid1st = QTAILQ_FIRST(&pbus->qbus.children);
+    VFIODevice *vdev1st;
+
+    if (!kid1st) {
+        printf("No device registered on PCI bus \"%s\", no DMA enabled\n",
+               pbus->qbus.name);
+        return -1;
+    }
+    vdev1st = container_of(kid1st->child, VFIODevice, pdev.qdev);
+
+    return vdev1st->group->container->fd;
+}
+
diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
index 226607c..0d13341 100644
--- a/hw/vfio_pci.h
+++ b/hw/vfio_pci.h
@@ -105,4 +105,6 @@ typedef struct VFIOGroup {
 #define VFIO_FLAG_IOMMU_SHARED_BIT 0
 #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
 
+int vfio_get_container_fd(struct PCIBus *pbus);
+
 #endif /* __VFIO_H__ */
-- 
1.7.10


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-10  5:51 ` [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support Alexey Kardashevskiy
@ 2012-07-10 16:55   ` Alex Williamson
  2012-07-10 21:32     ` Benjamin Herrenschmidt
  2012-07-11  2:54     ` Alexey Kardashevskiy
  2012-07-10 22:26   ` Scott Wood
  1 sibling, 2 replies; 52+ messages in thread
From: Alex Williamson @ 2012-07-10 16:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
> The patch enables VFIO on POWER.
> 
> It literally does the following:
> 
> 1. POWERPC IOMMU support (the kernel counterpart is required)
> 
> 2. Added #ifdef TARGET_PPC64 for EOI handlers initialisation.
> 
> 3. Added vfio_get_container_fd() to VFIO in order to initialize 1).
> 
> 4. Makefile fixed and "is_vfio" flag added into sPAPR PHB - required to
> distinguish VFIO's DMA context from the emulated one.
> 
> With the patches posted a bit earlier today, this patch fully supports
> VFIO, including MSI-X.
> 
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/Makefile.objs |    3 ++
>  hw/spapr.h           |    4 +++
>  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/spapr_pci.c       |   23 ++++++++++++-
>  hw/spapr_pci.h       |    2 ++
>  hw/vfio_pci.c        |   76 +++++++++++++++++++++++++++++++++++++++++--
>  hw/vfio_pci.h        |    2 ++
>  7 files changed, 193 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index f573a95..c46a049 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>  # Xilinx PPC peripherals
>  obj-y += xilinx_ethlite.o
>  
> +# VFIO PCI device assignment
> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
> +
>  obj-y := $(addprefix ../,$(obj-y))
> diff --git a/hw/spapr.h b/hw/spapr.h
> index b37f337..9dca704 100644
> --- a/hw/spapr.h
> +++ b/hw/spapr.h
> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>                        DMAContext *dma);
>  
> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
> +                         uint64_t *dma32_window_start,
> +                         uint64_t *dma32_window_size);
> +
>  #endif /* !defined (__HW_SPAPR_H__) */
> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> index 50c288d..0a194e8 100644
> --- a/hw/spapr_iommu.c
> +++ b/hw/spapr_iommu.c
> @@ -16,6 +16,8 @@
>   * You should have received a copy of the GNU Lesser General Public
>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>   */
> +#include <sys/ioctl.h>
> +
>  #include "hw.h"
>  #include "kvm.h"
>  #include "qdev.h"
> @@ -23,6 +25,7 @@
>  #include "dma.h"
>  
>  #include "hw/spapr.h"
> +#include "hw/linux-vfio.h"

I really need to move this into linux-headers.

>  
>  #include <libfdt.h>
>  
> @@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>      return 0;
>  }
>  
> +/* -------- API for POWERPC IOMMU -------- */
> +
> +#define POWERPC_IOMMU           2
> +
> +struct tce_iommu_info {
> +    __u32 argsz;
> +    __u32 dma32_window_start;
> +    __u32 dma32_window_size;
> +};
> +
> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> +
> +struct tce_iommu_dma_map {
> +    __u32 argsz;
> +    __u64 va;
> +    __u64 dmaaddr;
> +};
> +
> +#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
> +#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)

I assume this would eventually go into the kernel vfio.h with a VFIO_
prefix.  Add a flags field to the structures or it'll be hard to extend
them later.
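
To illustrate the flags suggestion, here is a rough sketch of what an
extensible argument structure and its validation could look like. The
struct name, the flag bits and check_map_args() are hypothetical and are
not part of any existing header; the only idea taken from this thread is
pairing argsz with a flags word. A 32-bit flags field after argsz also
fills the gap that some ABIs would otherwise leave before the 64-bit va.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; the structure in the patch has no flags field. */
struct tce_dma_map_sketch {
    uint32_t argsz;    /* size the caller believes the struct has */
    uint32_t flags;    /* bits select optional behaviour or fields */
    uint64_t va;
    uint64_t dmaaddr;
};

/* Hypothetical flag bits known to this version of the interface. */
#define SKETCH_FLAG_READ    (1u << 0)
#define SKETCH_FLAG_WRITE   (1u << 1)
#define SKETCH_KNOWN_FLAGS  (SKETCH_FLAG_READ | SKETCH_FLAG_WRITE)

/* Receiver-side validation: returns 0 if acceptable, -1 otherwise. */
static int check_map_args(const struct tce_dma_map_sketch *map)
{
    /* The caller must supply at least the fields this version reads. */
    if (map->argsz < offsetof(struct tce_dma_map_sketch, dmaaddr) +
                     sizeof(map->dmaaddr)) {
        return -1;
    }
    /* Unknown bits mean the caller expects a feature we do not have. */
    if (map->flags & ~SKETCH_KNOWN_FLAGS) {
        return -1;
    }
    return 0;
}
```

With this shape, a later revision can append fields and announce them
through a new flag bit: old callers keep working because they never set
the bit, and old receivers reject requests that do.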

> +typedef struct sPAPRVFIOTable {
> +    int fd;
> +    uint32_t liobn;
> +    QLIST_ENTRY(sPAPRVFIOTable) list;
> +} sPAPRVFIOTable;
> +
> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
> +
> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
> +                         uint64_t *dma32_window_start,
> +                         uint64_t *dma32_window_size)
> +{
> +    sPAPRVFIOTable *t;
> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
> +
> +    if (ioctl(fd, POWERPC_IOMMU_GET_INFO, &info)) {
> +        fprintf(stderr, "POWERPC_IOMMU_GET_INFO failed %d\n", errno);
> +        return;
> +    }
> +    *dma32_window_start = info.dma32_window_start;
> +    *dma32_window_size = info.dma32_window_size;
> +
> +    t = g_malloc0(sizeof(*t));
> +    t->fd = fd;
> +    t->liobn = liobn;
> +
> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
> +}
> +
> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
> +{
> +    sPAPRVFIOTable *t;
> +    struct tce_iommu_dma_map map = {
> +        .argsz = sizeof(map),
> +        .va = 0,
> +        .dmaaddr = ioba,
> +    };
> +
> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
> +        if (t->liobn != liobn) {
> +            continue;
> +        }
> +        if (tce) {
> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
> +            if (ioctl(t->fd, POWERPC_IOMMU_MAP_DMA, &map)) {
> +                fprintf(stderr, "TCE_MAP_DMA: %d\n", errno);
> +                return H_PARAMETER;
> +            }
> +        } else {
> +            if (ioctl(t->fd, POWERPC_IOMMU_UNMAP_DMA, &map)) {
> +                fprintf(stderr, "TCE_UNMAP_DMA: %d\n", errno);
> +                return H_PARAMETER;
> +            }
> +        }
> +        return H_SUCCESS;
> +    }
> +    return H_CONTINUE; /* positive non-zero value */
> +}
> +

I wish you could do this through a MemoryListener like we do on x86.

>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>                                target_ulong opcode, target_ulong *args)
>  {
> @@ -203,6 +286,10 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>      if (0 >= ret) {
>          return ret ? H_PARAMETER : H_SUCCESS;
>      }
> +    ret = put_tce_vfio(liobn, ioba, tce);
> +    if (0 >= ret) {
> +        return ret ? H_PARAMETER : H_SUCCESS;
> +    }
>  #ifdef DEBUG_TCE
>      fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
>              "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx " ret=%d\n",
> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> index 5f89003..3375c3f 100644
> --- a/hw/spapr_pci.c
> +++ b/hw/spapr_pci.c
> @@ -29,6 +29,7 @@
>  #include "pci_host.h"
>  #include "hw/spapr.h"
>  #include "hw/spapr_pci.h"
> +#include "hw/vfio_pci.h"
>  #include "exec-memory.h"
>  #include <libfdt.h>
>  #include "trace.h"
> @@ -440,6 +441,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>                   level);
>  }
>  
> +static int pci_spapr_get_irq(void *opaque, int irq_num)
> +{
> +    sPAPRPHBState *phb = opaque;
> +    return phb->lsi_table[irq_num].dt_irq;
> +}
> +
>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>                                unsigned size)
>  {
> @@ -567,7 +574,8 @@ static int spapr_phb_init(SysBusDevice *s)
>  
>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>                             phb->busname ? phb->busname : phb->dtbusname,
> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
> +                           pci_spapr_set_irq, pci_spapr_get_irq,
> +                           pci_spapr_map_irq, phb,
>                             &phb->memspace, &phb->iospace,
>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>      phb->host_state.bus = bus;
> @@ -596,6 +604,7 @@ static Property spapr_phb_properties[] = {
>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
> +    DEFINE_PROP_UINT8("vfio", sPAPRPHBState, is_vfio, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -639,6 +648,18 @@ void spapr_create_phb(sPAPREnvironment *spapr,
>  /* Finalize PCI setup, called when all devices are already created */
>  int spapr_finalize_pci_setup(sPAPRPHBState *phb)
>  {
> +    if (phb->is_vfio) {
> +        int fd = vfio_get_container_fd(phb->host_state.bus);
> +
> +        if (fd < 0) {
> +            return fd;
> +        }
> +        spapr_vfio_init_dma(fd, phb->dma_liobn,
> +                            &phb->dma_window_start,
> +                            &phb->dma_window_size);
> +        return 0;
> +    }
> +
>      phb->dma_window_start = 0;
>      phb->dma_window_size = 0x40000000;
>      phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
> index 3aae273..a4f031b 100644
> --- a/hw/spapr_pci.h
> +++ b/hw/spapr_pci.h
> @@ -57,6 +57,8 @@ typedef struct sPAPRPHBState {
>          int nvec;
>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>  
> +    uint8_t is_vfio;
> +
>      QLIST_ENTRY(sPAPRPHBState) list;
>  } sPAPRPHBState;
>  
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> index 1ac287f..cc0b974 100644
> --- a/hw/vfio_pci.c
> +++ b/hw/vfio_pci.c
> @@ -21,7 +21,6 @@
>  #include <dirent.h>
>  #include <stdio.h>
>  #include <unistd.h>
> -#include <sys/io.h>
>  #include <sys/ioctl.h>
>  #include <sys/mman.h>
>  #include <sys/types.h>
> @@ -44,6 +43,17 @@
>  #include "vfio_pci.h"
>  #include "linux-vfio.h"
>  
> +#ifndef TARGET_PPC64
> +#include <sys/io.h>
> +#define VFIO_IOMMU_EXTENSION    VFIO_X86_IOMMU
> +#else
> +#include "hw/pci_internals.h"
> +#include "hw/xics.h"
> +#include "hw/spapr.h"
> +#define POWERPC_IOMMU           2
> +#define VFIO_IOMMU_EXTENSION    POWERPC_IOMMU
> +#endif
> +

VFIO_IOMMU_EXTENSION never gets used, and POWERPC_IOMMU is redefined below.

>  //#define DEBUG_VFIO
>  #ifdef DEBUG_VFIO
>  #define DPRINTF(fmt, ...) \
> @@ -235,6 +245,7 @@ struct vfio_irq_set_fd {
>  
>  static void vfio_enable_intx_kvm(VFIODevice *vdev)
>  {
> +#ifndef TARGET_PPC64

Why do you need this? Aren't the extension checks sufficient to make this
a nop for you?
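
A rough sketch of the idea, with made-up names: if the function probes
for what it needs at run time and returns early when something is
missing, it is already a nop on hosts that lack the feature, and no
target #ifdef is required. The capability bits, struct host_caps and
enable_intx_accel() are hypothetical and do not correspond to actual
KVM extension numbers or vfio_pci.c functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability bits reported by the host. */
enum {
    CAP_IRQFD    = 1 << 0,
    CAP_INTX_EOI = 1 << 1,
};

struct host_caps {
    unsigned available;     /* OR of the CAP_* bits the host supports */
};

struct intx_state {
    bool kvm_accel;         /* true once the fast path is enabled */
};

/* True only if every bit in 'needed' is available. */
static bool host_has(const struct host_caps *caps, unsigned needed)
{
    return (caps->available & needed) == needed;
}

/* Enable the accelerated path only when all prerequisites are present.
 * On a host without them this returns with no side effects. */
static void enable_intx_accel(const struct host_caps *caps,
                              struct intx_state *st)
{
    if (!host_has(caps, CAP_IRQFD | CAP_INTX_EOI)) {
        return;
    }
    st->kvm_accel = true;
}
```

The same source then builds for every target, and the decision moves
from compile time to whatever the running host reports.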

>  #ifdef CONFIG_KVM
>      struct vfio_irq_set_fd irq_set_fd = {
>  	.irq_set = {
> @@ -298,10 +309,12 @@ fail:
>      qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
>      vfio_unmask_intx(vdev);
>  #endif
> +#endif
>  }
>  
>  static void vfio_disable_intx_kvm(VFIODevice *vdev)
>  {
> +#ifndef TARGET_PPC64

Same

>  #ifdef CONFIG_KVM
>      struct vfio_irq_set_fd irq_set_fd = {
>  	.irq_set = {
> @@ -350,8 +363,10 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev)
>      DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel disabled\n", __FUNCTION__,
>              vdev->host.seg, vdev->host.bus, vdev->host.dev, vdev->host.func);
>  #endif
> +#endif
>  }
>  
> +#ifndef TARGET_PPC64
>  static void vfio_update_irq(Notifier *notify, void *data)
>  {
>      VFIODevice *vdev = container_of(notify, VFIODevice, intx.update_irq);
> @@ -381,6 +396,7 @@ static void vfio_update_irq(Notifier *notify, void *data)
>      /* Re-enable the interrupt in cased we missed an EOI */
>      vfio_eoi(&vdev->intx.eoi, NULL);
>  }
> +#endif
>  
>  static int vfio_enable_intx(VFIODevice *vdev)
>  {
> @@ -404,10 +420,14 @@ static int vfio_enable_intx(VFIODevice *vdev)
>      vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
>      vdev->intx.irq = pci_get_irq(&vdev->pdev, vdev->intx.pin);
>      vdev->intx.eoi.notify = vfio_eoi;
> +#ifndef TARGET_PPC64
>      ioapic_add_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);

This is really only a placeholder for x86 too; I don't think my eoi
notifier as written is acceptable upstream.  We really need some common
infrastructure here.  I'm hoping to get the kvm acceleration in place,
which would make vfio usable on x86 with kvm (the common case), and then
work towards a generic eoi notifier.
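
As a strawman for what controller-neutral infrastructure might look
like, here is a small sketch of an EOI notifier registry keyed by irq
number. All names (struct eoi_registry, eoi_add(), eoi_remove(),
eoi_signal(), count_eoi()) are hypothetical and the fixed-size table is
only there to keep the example short; this is not a proposal for the
actual API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EOI_NOTIFIERS 8     /* arbitrary limit for the sketch */

typedef void (*eoi_fn)(void *opaque);

struct eoi_notifier {
    int     irq;
    eoi_fn  fn;
    void   *opaque;
};

struct eoi_registry {
    struct eoi_notifier slots[MAX_EOI_NOTIFIERS];
    size_t used;
};

/* Register a callback for 'irq'. Returns 0, or -1 if the table is full. */
static int eoi_add(struct eoi_registry *r, int irq, eoi_fn fn, void *opaque)
{
    if (r->used == MAX_EOI_NOTIFIERS) {
        return -1;
    }
    r->slots[r->used++] = (struct eoi_notifier){ irq, fn, opaque };
    return 0;
}

/* Remove the first matching registration, if any. */
static void eoi_remove(struct eoi_registry *r, int irq, eoi_fn fn,
                       void *opaque)
{
    size_t i;

    for (i = 0; i < r->used; i++) {
        struct eoi_notifier *n = &r->slots[i];
        if (n->irq == irq && n->fn == fn && n->opaque == opaque) {
            *n = r->slots[--r->used];   /* fill the hole with the last entry */
            return;
        }
    }
}

/* Called by an interrupt controller model when the guest signals EOI.
 * Returns the number of callbacks invoked. */
static int eoi_signal(struct eoi_registry *r, int irq)
{
    int called = 0;
    size_t i;

    for (i = 0; i < r->used; i++) {
        if (r->slots[i].irq == irq) {
            r->slots[i].fn(r->slots[i].opaque);
            called++;
        }
    }
    return called;
}

/* Example consumer: counts the EOIs it has seen. */
static void count_eoi(void *opaque)
{
    ++*(int *)opaque;
}
```

Each interrupt controller model would call eoi_signal() from its own EOI
path, so a device such as vfio would register once and not need to know
whether an ioapic or xics sits underneath.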

>  
>      vdev->intx.update_irq.notify = vfio_update_irq;
>      pci_add_irq_update_notifier(&vdev->pdev, &vdev->intx.update_irq);

Can't you stub this out to make it safe to do on POWER too?

> +#else
> +    xics_add_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
> +#endif
>  
>      if (event_notifier_init(&vdev->intx.interrupt, 0)) {
>          error_report("vfio: Error: event_notifier_init failed\n");
> @@ -440,8 +460,12 @@ static void vfio_disable_intx(VFIODevice *vdev)
>      vfio_disable_intx_kvm(vdev);
>      vfio_disable_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX);
>  
> +#ifndef TARGET_PPC64
>      pci_remove_irq_update_notifier(&vdev->intx.update_irq);
>      ioapic_remove_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
> +#else
> +    xics_remove_eoi_notifier(&vdev->intx.eoi);
> +#endif
>  
>      fd = event_notifier_get_fd(&vdev->intx.interrupt);
>      qemu_set_fd_handler(fd, NULL, NULL, vdev);
> @@ -543,7 +567,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
>      }
>  
>      fd = event_notifier_get_fd(&vdev->msi_vectors[vector].interrupt);
> -
> +#ifndef TARGET_PPC64
>      vdev->msi_vectors[vector].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
>      if (vdev->msi_vectors[vector].virq < 0 || 
>          kvm_irqchip_add_irqfd(kvm_state, fd,
> @@ -551,7 +575,11 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
>          qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
>                              &vdev->msi_vectors[vector]);
>      }
> -
> +#else
> +    vdev->msi_vectors[vector].virq = -1;
> +    qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
> +                        &vdev->msi_vectors[vector]);
> +#endif

This shouldn't be necessary once the abort is removed from
kvm_irqchip_add_msi_route.  It'll be merged next time the kvm uq tree
merges into qemu.

>      if (vdev->nr_vectors < vector + 1) {
>          int i;
>  
> @@ -692,6 +720,7 @@ retry:
>          fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
>  
>          msg = msi_get_msg(&vdev->pdev, i);
> +#ifndef TARGET_PPC64
>          vdev->msi_vectors[i].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
>          if (vdev->msi_vectors[i].virq < 0 || 
>              kvm_irqchip_add_irqfd(kvm_state, fd,
> @@ -699,6 +728,12 @@ retry:
>              qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
>                                  &vdev->msi_vectors[i]);
>          }
> +#else
> +        vdev->msi_vectors[i].virq = -1;
> +        qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
> +                            &vdev->msi_vectors[i]);
> +        msg = msg;
> +#endif

Same here

>      }
>      
>      ret = vfio_enable_vectors(vdev, false);
> @@ -1581,6 +1616,25 @@ static int vfio_connect_container(VFIOGroup *group)
>  
>          memory_listener_register(&container->listener, get_system_memory());
>  
> +#define POWERPC_IOMMU           2

Assume this will go in the kernel vfio.h at some point.  You may want to
pick a different name if there's a possibility of other powerpc iommu
implementations... thus the crappy type1 name for x86.

> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, POWERPC_IOMMU)) {
> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> +        if (ret) {
> +            error_report("vfio: failed to set group container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
> +
> +        ret = ioctl(fd, VFIO_SET_IOMMU, POWERPC_IOMMU);
> +        if (ret) {
> +            error_report("vfio: failed to set iommu for container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models\n");
>          g_free(container);
> @@ -2005,3 +2059,19 @@ static void register_vfio_pci_dev_type(void)
>  }
>  
>  type_init(register_vfio_pci_dev_type)
> +
> +int vfio_get_container_fd(struct PCIBus *pbus)
> +{
> +    BusChild *kid1st = QTAILQ_FIRST(&pbus->qbus.children);
> +    VFIODevice *vdev1st;
> +
> +    if (!kid1st) {
> +        printf("No device registered on PCI bus \"%s\", no DMA enabled\n",
> +               pbus->qbus.name);
> +        return -1;
> +    }
> +    vdev1st = container_of(kid1st->child, VFIODevice, pdev.qdev);
> +
> +    return vdev1st->group->container->fd;
> +}
> +

This is not a generic implementation.  x86 won't have all devices on a
bus be vfio devices and even if it did, there's no guarantee they all
belong to the same container.  This should probably at least take a
PCIDevice and some kind of POWER specific code will need to know that
the container is the same for the whole bus.  Thanks,

Alex
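
To illustrate the point about taking a device rather than a bus, here is a
minimal self-contained sketch. All Mock* types and mock_* functions are
hypothetical stand-ins invented for this example, not the real QEMU or vfio
structures. The generic helper only answers for one device; the "whole bus
shares one container" assumption lives in a separate, POWER-specific helper
that verifies it instead of trusting the first child.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the vfio container/group/device structures. */
typedef struct MockContainer { int fd; } MockContainer;
typedef struct MockGroup { MockContainer *container; } MockGroup;
typedef struct MockPCIDevice {
    int is_vfio;        /* 0 for an emulated device on the same bus */
    MockGroup *group;   /* only meaningful when is_vfio is set */
} MockPCIDevice;

/* Generic part: container fd of one device, or -1 if it is not a vfio device. */
static int mock_vfio_get_container_fd(const MockPCIDevice *pdev)
{
    if (!pdev || !pdev->is_vfio || !pdev->group || !pdev->group->container) {
        return -1;
    }
    return pdev->group->container->fd;
}

/*
 * POWER-specific part: returns the shared container fd only if every device
 * behind the PHB is a vfio device in the same container, otherwise -1.
 */
static int mock_phb_container_fd(MockPCIDevice *const *devs, size_t n)
{
    int fd = -1;

    for (size_t i = 0; i < n; i++) {
        int dev_fd = mock_vfio_get_container_fd(devs[i]);

        if (dev_fd < 0 || (fd >= 0 && dev_fd != fd)) {
            return -1;
        }
        fd = dev_fd;
    }
    return fd;
}
```

An empty bus also yields -1 here, which matches the "no device registered"
case in the patch.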

> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
> index 226607c..0d13341 100644
> --- a/hw/vfio_pci.h
> +++ b/hw/vfio_pci.h
> @@ -105,4 +105,6 @@ typedef struct VFIOGroup {
>  #define VFIO_FLAG_IOMMU_SHARED_BIT 0
>  #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
>  
> +int vfio_get_container_fd(struct PCIBus *pbus);
> +
>  #endif /* __VFIO_H__ */


* Re: [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
  2012-07-10  5:51 [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alexey Kardashevskiy
  2012-07-10  5:51 ` [Qemu-devel] [PATCH 1/2] pseries pci: spapr_finalize_pci_setup introduced Alexey Kardashevskiy
  2012-07-10  5:51 ` [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support Alexey Kardashevskiy
@ 2012-07-10 16:57 ` Alex Williamson
  2012-07-11  2:25   ` Alexey Kardashevskiy
  2012-07-12  8:52 ` [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v2) Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2012-07-10 16:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
> The two patches in this set are supposed to add VFIO support for POWER.
> 
> The first one adds one more step in the initalizaion sequence which I am not
> sure is correct.
> 
> The second patch adds actual VFIO support. It is not ready to submit but
> ready to discuss. I would like to get rid of all #ifdef TARGET_PPC64 in patch #2
> and I wonder if there is any plan to implement some generic EOI support code, etc.

A generic EOI notifier is on my todo list, but I have no idea what it's
going to look like.  As you know, I've got an ioapic specific notifier
in my tree, you add a spapr specific one.  I welcome ideas on how to
create something generic that has a chance of being accepted.  Thanks,

Alex

> Alexey Kardashevskiy (2):
>   pseries pci: spapr_finalize_pci_setup introduced
>   vfio-powerpc: added VFIO support
> 
>  hw/ppc/Makefile.objs |    3 ++
>  hw/spapr.c           |    7 ++++
>  hw/spapr.h           |    4 +++
>  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/spapr_pci.c       |   36 ++++++++++++++++++---
>  hw/spapr_pci.h       |    4 +++
>  hw/vfio_pci.c        |   76 +++++++++++++++++++++++++++++++++++++++++--
>  hw/vfio_pci.h        |    2 ++
>  8 files changed, 212 insertions(+), 7 deletions(-)
> 


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-10 16:55   ` Alex Williamson
@ 2012-07-10 21:32     ` Benjamin Herrenschmidt
  2012-07-10 21:48       ` Alex Williamson
  2012-07-11  2:54     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 52+ messages in thread
From: Benjamin Herrenschmidt @ 2012-07-10 21:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Tue, 2012-07-10 at 10:55 -0600, Alex Williamson wrote:
> 
> I wish you could do this through a MemoryListener like we do on x86.
> 

Can you elaborate ? TCE (iommu) manipulation on PAPR is done via
specific hypervisor calls, not sure what a MemoryListener would do
here ...

Cheers,
Ben.


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-10 21:32     ` Benjamin Herrenschmidt
@ 2012-07-10 21:48       ` Alex Williamson
  2012-07-10 21:53         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2012-07-10 21:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Wed, 2012-07-11 at 07:32 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2012-07-10 at 10:55 -0600, Alex Williamson wrote:
> > 
> > I wish you could do this through a MemoryListener like we do on x86.
> > 
> 
> Can you elaborate ? TCE (iommu) manipulation on PAPR is done via
> specific hypervisor calls, not sure what a MemoryListener would do
> here ...

Hmm, the guest directed iommu updates via hypercalls may not really fit
the MemoryListener model.  I'm just trying to think of ways to avoid
having an offshoot of vfio in the power code base by making use of
common abstraction layers.  If we got a region_add/del callback we could
potentially move the spapr map and unmap into vfio like we do for x86.
Thanks,

Alex


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-10 21:48       ` Alex Williamson
@ 2012-07-10 21:53         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 52+ messages in thread
From: Benjamin Herrenschmidt @ 2012-07-10 21:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Tue, 2012-07-10 at 15:48 -0600, Alex Williamson wrote:
> > specific hypervisor calls, not sure what a MemoryListener would do
> > here ...
> 
> Hmm, the guest directed iommu updates via hypercalls may not really fit
> the MemoryListener model.  I'm just trying to think of ways to avoid
> having an offshoot of vfio in the power code base by making use of
> common abstraction layers.  If we got a region_add/del callback we could
> potentially move the spapr map and unmap into vfio like we do for x86.
> Thanks, 

In the end we don't really want to use that anyway. map and unmap are
*extremely* performance sensitive in practice, so the plan is to implement
the hypercall directly in the kernel KVM at a level where it won't even
go near generic code :-)

Basically, when the hypercall gets in, we take control in what we call
"real mode" on powerpc (MMU off, translation disabled), we have a window
to implement critical stuff like this before we context switch the MMU
to the host context (which on P7 is quite expensive).

This is where I want to go directly whack the TCE table as used by the
HW (provided the page has a good PTE of course), pretty much like we do
for populating the main MMU hash table.

So the map/unmap path will be entirely in arch specific code. The one
you see in Alexey's code is basically only ever going to be used by
something like qemu full emulation...

Cheers,
Ben.


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-10  5:51 ` [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support Alexey Kardashevskiy
  2012-07-10 16:55   ` Alex Williamson
@ 2012-07-10 22:26   ` Scott Wood
  2012-07-10 23:55     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 52+ messages in thread
From: Scott Wood @ 2012-07-10 22:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, David Gibson, qemu-ppc, Alexander Graf, qemu-devel

On 07/10/2012 12:51 AM, Alexey Kardashevskiy wrote:
> The patch enables VFIO on POWER.
> 
> It literally does the following:
> 
> 1. POWERPC IOMMU support (the kernel counterpart is required)
[snip]
> +/* -------- API for POWERPC IOMMU -------- */
> +
> +#define POWERPC_IOMMU           2
> +
> +struct tce_iommu_info {
> +    __u32 argsz;
> +    __u32 dma32_window_start;
> +    __u32 dma32_window_size;
> +};
> +
> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)

Is there a more specific name that could be used for this?  Not all
PowerPC chips have the same kind of IOMMU.

-Scott


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-10 22:26   ` Scott Wood
@ 2012-07-10 23:55     ` Alexey Kardashevskiy
  2012-07-11  0:04       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-10 23:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexander Graf, qemu-devel, Alex Williamson, qemu-ppc,
	Scott Wood, David Gibson

On 11/07/12 08:26, Scott Wood wrote:
> On 07/10/2012 12:51 AM, Alexey Kardashevskiy wrote:
>> The patch enables VFIO on POWER.
>>
>> It literally does the following:
>>
>> 1. POWERPC IOMMU support (the kernel counterpart is required)
> [snip]
>> +/* -------- API for POWERPC IOMMU -------- */
>> +
>> +#define POWERPC_IOMMU           2
>> +
>> +struct tce_iommu_info {
>> +    __u32 argsz;
>> +    __u32 dma32_window_start;
>> +    __u32 dma32_window_size;
>> +};
>> +
>> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> 
> Is there a more specific name that could be used for this?  Not all
> PowerPC chips have the same kind of IOMMU.


Ben, is it SPAPR? BOOK3S?


-- 
Alexey


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-10 23:55     ` Alexey Kardashevskiy
@ 2012-07-11  0:04       ` Benjamin Herrenschmidt
  2012-07-11  0:17         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 52+ messages in thread
From: Benjamin Herrenschmidt @ 2012-07-11  0:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alexander Graf, qemu-devel, Alex Williamson, qemu-ppc,
	Scott Wood, David Gibson

On Wed, 2012-07-11 at 09:55 +1000, Alexey Kardashevskiy wrote:
> On 11/07/12 08:26, Scott Wood wrote:
> > On 07/10/2012 12:51 AM, Alexey Kardashevskiy wrote:
> >> The patch enables VFIO on POWER.
> >>
> >> It literally does the following:
> >>
> >> 1. POWERPC IOMMU support (the kernel counterpart is required)
> > [snip]
> >> +/* -------- API for POWERPC IOMMU -------- */
> >> +
> >> +#define POWERPC_IOMMU           2
> >> +
> >> +struct tce_iommu_info {
> >> +    __u32 argsz;
> >> +    __u32 dma32_window_start;
> >> +    __u32 dma32_window_size;
> >> +};
> >> +
> >> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> > 
> > Is there a more specific name that could be used for this?  Not all
> > PowerPC chips have the same kind of IOMMU.

> 
> Ben, is is SPAPR? BOOK3S?

So we have varieties of different iommus in the kernel indeed, we
probably want the info ioctl to reflect that. I would call this one
spapr_tce.

Also we will want a few other things, dunno if that's reflected here,
or whether the ioctl is easily extendable, but in the long run we will
need:

 - Ways to tell KVM about association between a liobn (logical bus
number as used in H_PUT_TCE) and an iommu so we can implement the real
mode H_PUT_TCE properly.

 - We will need some conduit to implement the "DDW" APIs (part of PAPR
allowing the guest to control the DMA windows, ie, by creating new
windows in 64-bit DMA space, with different page sizes etc....). So you
may want to make it clear that the above provides information about the
"base window" specifically.

Cheers,
Ben.


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-11  0:04       ` Benjamin Herrenschmidt
@ 2012-07-11  0:17         ` Alexey Kardashevskiy
  2012-07-11  0:26           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-11  0:17 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexander Graf, qemu-devel, Alex Williamson, qemu-ppc,
	Scott Wood, David Gibson

On 11/07/12 10:04, Benjamin Herrenschmidt wrote:
> On Wed, 2012-07-11 at 09:55 +1000, Alexey Kardashevskiy wrote:
>> On 11/07/12 08:26, Scott Wood wrote:
>>> On 07/10/2012 12:51 AM, Alexey Kardashevskiy wrote:
>>>> The patch enables VFIO on POWER.
>>>>
>>>> It literally does the following:
>>>>
>>>> 1. POWERPC IOMMU support (the kernel counterpart is required)
>>> [snip]
>>>> +/* -------- API for POWERPC IOMMU -------- */
>>>> +
>>>> +#define POWERPC_IOMMU           2
>>>> +
>>>> +struct tce_iommu_info {
>>>> +    __u32 argsz;
>>>> +    __u32 dma32_window_start;
>>>> +    __u32 dma32_window_size;
>>>> +};
>>>> +
>>>> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>>>
>>> Is there a more specific name that could be used for this?  Not all
>>> PowerPC chips have the same kind of IOMMU.
> 
>>
>> Ben, is is SPAPR? BOOK3S?
> 
> So we have varieties of different iommus in the kernel indeed, we
> probably want the info ioctl to reflect that. I would call this one
> spapr_tce.

VFIO actually provides such an ioctl already, and it is the one that returns
POWERPC_IOMMU as the IOMMU type.

ok. SPAPR_TCE.


> Also we will want a few other things, dunno if that's reflected here,
> or whether the ioctl is easily extendable, but in the long run we will
> need:
> 
>  - Ways to tell KVM about association between a liobn (logical bus
> number as used in H_PUT_TCE) and an iommu so we can implement the real
> mode H_PUT_TCE properly.
> 
>  - We will need some conduit to implement the "DDW" APIs (part of PAPR
> allowing the guest to control the DMA windows, ie, by creating new
> windows in 64-bit DMA space, with different page sizes etc....). So you
> may want to make it clear that the above provides information about the
> "base window" specifically.

So the current one would be SPAPR_TCE_32?


-- 
Alexey


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-11  0:17         ` Alexey Kardashevskiy
@ 2012-07-11  0:26           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 52+ messages in thread
From: Benjamin Herrenschmidt @ 2012-07-11  0:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alexander Graf, qemu-devel, Alex Williamson, qemu-ppc,
	Scott Wood, David Gibson

On Wed, 2012-07-11 at 10:17 +1000, Alexey Kardashevskiy wrote:
> So the current one would be SPAPR_TCE_32?

No, the iommu type is SPAPR_TCE, but the *window* info you get here is
the 32-bit window. My thinking is to add some versioning and a bunch of
reserved fields to that info struct so we can stick a few more things in,
or at the very least add a flags field, so we can report if/when we
support DDW etc... on that iommu.
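
As a rough sketch of that idea only: the struct name, the flag name, the
number of reserved words and the fill helper below are all assumptions made
up for illustration, not the actual kernel ABI. The point is that argsz lets
the callee reject a buffer that is too small, flags can advertise optional
features such as DDW, and the reserved words leave room to grow without a
new ioctl.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical capability bit: the iommu also supports DDW. */
#define SPAPR_TCE_INFO_FLAG_DDW (1u << 0)

/* Hypothetical extensible layout; describes the base 32-bit window only. */
struct spapr_tce_iommu_info {
    uint32_t argsz;               /* size of the buffer the caller passed */
    uint32_t flags;               /* SPAPR_TCE_INFO_FLAG_* capability bits */
    uint32_t dma32_window_start;  /* start of the base window */
    uint32_t dma32_window_size;   /* size of the base window */
    uint32_t reserved[4];         /* zeroed now, available for later fields */
};

/*
 * Fill the info struct the way a kernel-side handler might.
 * Returns 0 on success, -1 if the caller's buffer is too small.
 */
static int spapr_tce_fill_info(struct spapr_tce_iommu_info *info,
                               uint32_t start, uint32_t size, int has_ddw)
{
    if (info->argsz < sizeof(*info)) {
        return -1;
    }
    memset(info, 0, sizeof(*info));
    info->argsz = sizeof(*info);
    info->flags = has_ddw ? SPAPR_TCE_INFO_FLAG_DDW : 0;
    info->dma32_window_start = start;
    info->dma32_window_size = size;
    return 0;
}
```

A caller that knows nothing about DDW simply ignores the flag, and one that
does can test it before trying any window-creation call.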

Cheers,
Ben.
 


* Re: [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
  2012-07-10 16:57 ` [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alex Williamson
@ 2012-07-11  2:25   ` Alexey Kardashevskiy
  2012-07-12  2:54     ` Alex Williamson
  0 siblings, 1 reply; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-11  2:25 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On 11/07/12 02:57, Alex Williamson wrote:
> On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
>> The two patches in this set are supposed to add VFIO support for POWER.
>>
>> The first one adds one more step in the initalizaion sequence which I am not
>> sure is correct.
>>
>> The second patch adds actual VFIO support. It is not ready to submit but
>> ready to discuss. I would like to get rid of all #ifdef TARGET_PPC64 in patch #2
>> and I wonder if there is any plan to implement some generic EOI support code, etc.
> 
> A generic EOI notifier is on my todo list, but I have no idea what it's
> going to look like.  As you know, I've got an ioapic specific notifier
> in my tree, you add a spapr specific one.  I welcome ideas on how to
> create something generic that has a chance of being accepted.  Thanks,


So far the only platform specific call is xxxx_add_gsi_eoi_notifier. The
xxxx_remove_gsi_eoi_notifier only calls notifier_remove, so you've got to fix
your ioapic_remove_gsi_eoi_notifier() as it does too much :)


The only place I can see right now for an "add_eoi" callback is QEMUMachine, as
there is no unified machine interrupt controller - IOAPIC has its own type
TYPE_IOAPIC_COMMON and XICS is not even a SysBusDevice. And the callback is not
specific to any kind of bus, so it cannot go into PCIBus.

Does it sound reasonable?
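
Roughly what I have in mind, as a self-contained sketch. Every name below
(MockMachine, EOINotifier, the mock_xics_* functions) is made up for this
example and none of it is existing QEMU API. The machine description carries
one optional "add EOI notifier" hook, the platform interrupt controller
supplies the implementation, and device code only ever talks to the hook.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical notifier: called when the guest EOIs the given irq. */
typedef struct EOINotifier EOINotifier;
struct EOINotifier {
    void (*notify)(EOINotifier *n, int irq);
    int irq;
    EOINotifier *next;
};

/* Hypothetical machine description with a per-platform hook. */
typedef struct MockMachine {
    const char *name;
    void (*add_eoi_notifier)(EOINotifier *n, int irq);
} MockMachine;

/* Platform side: a toy interrupt controller keeping a list of notifiers. */
static EOINotifier *mock_xics_eoi_list;

static void mock_xics_add_eoi_notifier(EOINotifier *n, int irq)
{
    assert(n && n->notify);
    n->irq = irq;
    n->next = mock_xics_eoi_list;
    mock_xics_eoi_list = n;
}

/* Called by the toy controller when the guest signals EOI for irq. */
static void mock_xics_eoi(int irq)
{
    for (EOINotifier *n = mock_xics_eoi_list; n; n = n->next) {
        if (n->irq == irq) {
            n->notify(n, irq);
        }
    }
}

static MockMachine mock_pseries = { "pseries", mock_xics_add_eoi_notifier };

/* Device side: no #ifdef, just ask the machine. -1 means "not supported". */
static int machine_add_eoi_notifier(const MockMachine *m, EOINotifier *n,
                                    int irq)
{
    if (!m->add_eoi_notifier) {
        return -1;
    }
    m->add_eoi_notifier(n, irq);
    return 0;
}

/* Example callback that just counts the EOIs it sees. */
static int eoi_seen;

static void count_eoi(EOINotifier *n, int irq)
{
    (void)n;
    (void)irq;
    eoi_seen++;
}
```

A machine without such a hook returns -1, so the caller can fall back or
refuse to enable INTx instead of being compiled out.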


> 
> Alex
> 
>> Alexey Kardashevskiy (2):
>>   pseries pci: spapr_finalize_pci_setup introduced
>>   vfio-powerpc: added VFIO support
>>
>>  hw/ppc/Makefile.objs |    3 ++
>>  hw/spapr.c           |    7 ++++
>>  hw/spapr.h           |    4 +++
>>  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  hw/spapr_pci.c       |   36 ++++++++++++++++++---
>>  hw/spapr_pci.h       |    4 +++
>>  hw/vfio_pci.c        |   76 +++++++++++++++++++++++++++++++++++++++++--
>>  hw/vfio_pci.h        |    2 ++
>>  8 files changed, 212 insertions(+), 7 deletions(-)
>>
> 
> 
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-10 16:55   ` Alex Williamson
  2012-07-10 21:32     ` Benjamin Herrenschmidt
@ 2012-07-11  2:54     ` Alexey Kardashevskiy
  2012-07-11  3:10       ` Benjamin Herrenschmidt
  2012-07-12  3:11       ` Alex Williamson
  1 sibling, 2 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-11  2:54 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On 11/07/12 02:55, Alex Williamson wrote:
> On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
>> The patch enables VFIO on POWER.
>>
>> It literally does the following:
>>
>> 1. POWERPC IOMMU support (the kernel counterpart is required)
>>
>> 2. Added #ifdef TARGET_PPC64 for EOI handlers initialisation.
>>
>> 3. Added vfio_get_container_fd() to VFIO in order to initialize 1).
>>
>> 4. Makefile fixed and "is_vfio" flag added into sPAPR PHB - required to
>> distinguish VFIO's DMA context from the emulated one.
>>
>> WIth the pathes posted today a bit earlier, this patch fully supports
>> VFIO what includes MSIX as well,
>>
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  hw/ppc/Makefile.objs |    3 ++
>>  hw/spapr.h           |    4 +++
>>  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  hw/spapr_pci.c       |   23 ++++++++++++-
>>  hw/spapr_pci.h       |    2 ++
>>  hw/vfio_pci.c        |   76 +++++++++++++++++++++++++++++++++++++++++--
>>  hw/vfio_pci.h        |    2 ++
>>  7 files changed, 193 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index f573a95..c46a049 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>>  # Xilinx PPC peripherals
>>  obj-y += xilinx_ethlite.o
>>  
>> +# VFIO PCI device assignment
>> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
>> +
>>  obj-y := $(addprefix ../,$(obj-y))
>> diff --git a/hw/spapr.h b/hw/spapr.h
>> index b37f337..9dca704 100644
>> --- a/hw/spapr.h
>> +++ b/hw/spapr.h
>> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>                        DMAContext *dma);
>>  
>> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
>> +                         uint64_t *dma32_window_start,
>> +                         uint64_t *dma32_window_size);
>> +
>>  #endif /* !defined (__HW_SPAPR_H__) */
>> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
>> index 50c288d..0a194e8 100644
>> --- a/hw/spapr_iommu.c
>> +++ b/hw/spapr_iommu.c
>> @@ -16,6 +16,8 @@
>>   * You should have received a copy of the GNU Lesser General Public
>>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>>   */
>> +#include <sys/ioctl.h>
>> +
>>  #include "hw.h"
>>  #include "kvm.h"
>>  #include "qdev.h"
>> @@ -23,6 +25,7 @@
>>  #include "dma.h"
>>  
>>  #include "hw/spapr.h"
>> +#include "hw/linux-vfio.h"
> 
> I really need to move this into linux-headers.
> 
>>  
>>  #include <libfdt.h>
>>  
>> @@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>>      return 0;
>>  }
>>  
>> +/* -------- API for POWERPC IOMMU -------- */
>> +
>> +#define POWERPC_IOMMU           2
>> +
>> +struct tce_iommu_info {
>> +    __u32 argsz;
>> +    __u32 dma32_window_start;
>> +    __u32 dma32_window_size;
>> +};
>> +
>> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>> +
>> +struct tce_iommu_dma_map {
>> +    __u32 argsz;
>> +    __u64 va;
>> +    __u64 dmaaddr;
>> +};
>> +
>> +#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
>> +#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
> 
> I assume this would eventually go into the kernel vfio.h with a VFIO_
> prefix.  Add a flags field to the structures or it'll be hard to extend
> them later.


We can always define another type of IOMMU :) But yes, I'll extend both map and info structures.
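
Something along these lines for the map structure, as a sketch only: the
struct name, the flag names and the validation helper are assumptions for
illustration, not the final interface. On ABIs that align 64-bit fields to
8 bytes, the current layout has a 4-byte hole after argsz, so a flags word
can go there and the layout becomes the same everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission flags for a mapping. */
#define SPAPR_TCE_DMA_MAP_FLAG_READ  (1u << 0)
#define SPAPR_TCE_DMA_MAP_FLAG_WRITE (1u << 1)

/* Hypothetical extended map request. */
struct spapr_tce_dma_map {
    uint32_t argsz;    /* size of the buffer the caller passed */
    uint32_t flags;    /* SPAPR_TCE_DMA_MAP_FLAG_* bits */
    uint64_t va;       /* process virtual address */
    uint64_t dmaaddr;  /* IO bus address */
};

/* The layout has no implicit padding, so it is identical across ABIs. */
static_assert(offsetof(struct spapr_tce_dma_map, va) == 8,
              "flags must sit directly before va");
static_assert(sizeof(struct spapr_tce_dma_map) == 24,
              "no hidden padding expected");

/* Returns 1 if the request is well formed, 0 otherwise. */
static int spapr_tce_dma_map_valid(const struct spapr_tce_dma_map *map)
{
    const uint32_t known = SPAPR_TCE_DMA_MAP_FLAG_READ |
                           SPAPR_TCE_DMA_MAP_FLAG_WRITE;

    if (map->argsz < sizeof(*map)) {
        return 0;   /* caller's buffer is too small */
    }
    if (map->flags & ~known) {
        return 0;   /* unknown flag bits are rejected, not ignored */
    }
    return 1;
}
```

Rejecting unknown bits is what keeps the field usable later: a bit can be
given a meaning only if nobody was allowed to pass it before.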



>> +typedef struct sPAPRVFIOTable {
>> +    int fd;
>> +    uint32_t liobn;
>> +    QLIST_ENTRY(sPAPRVFIOTable) list;
>> +} sPAPRVFIOTable;
>> +
>> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
>> +
>> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
>> +                         uint64_t *dma32_window_start,
>> +                         uint64_t *dma32_window_size)
>> +{
>> +    sPAPRVFIOTable *t;
>> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
>> +
>> +    if (ioctl(fd, POWERPC_IOMMU_GET_INFO, &info)) {
>> +        fprintf(stderr, "POWERPC_IOMMU_GET_INFO failed %d\n", errno);
>> +        return;
>> +    }
>> +    *dma32_window_start = info.dma32_window_start;
>> +    *dma32_window_size = info.dma32_window_size;
>> +
>> +    t = g_malloc0(sizeof(*t));
>> +    t->fd = fd;
>> +    t->liobn = liobn;
>> +
>> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
>> +}
>> +
>> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
>> +{
>> +    sPAPRVFIOTable *t;
>> +    struct tce_iommu_dma_map map = {
>> +        .argsz = sizeof(map),
>> +        .va = 0,
>> +        .dmaaddr = ioba,
>> +    };
>> +
>> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
>> +        if (t->liobn != liobn) {
>> +            continue;
>> +        }
>> +        if (tce) {
>> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
>> +            if (ioctl(t->fd, POWERPC_IOMMU_MAP_DMA, &map)) {
>> +                fprintf(stderr, "TCE_MAP_DMA: %d\n", errno);
>> +                return H_PARAMETER;
>> +            }
>> +        } else {
>> +            if (ioctl(t->fd, POWERPC_IOMMU_UNMAP_DMA, &map)) {
>> +                fprintf(stderr, "TCE_UNMAP_DMA: %d\n", errno);
>> +                return H_PARAMETER;
>> +            }
>> +        }
>> +        return H_SUCCESS;
>> +    }
>> +    return H_CONTINUE; /* positive non-zero value */
>> +}
>> +
> 
> I wish you could do this through a MemoryListener like we do on x86.


What is the point? Map the entire RAM to the guest? And it will still use our own IOMMU ioctls, as it
is completely our own IOMMU implementation.


>>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>                                target_ulong opcode, target_ulong *args)
>>  {
>> @@ -203,6 +286,10 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>      if (0 >= ret) {
>>          return ret ? H_PARAMETER : H_SUCCESS;
>>      }
>> +    ret = put_tce_vfio(liobn, ioba, tce);
>> +    if (0 >= ret) {
>> +        return ret ? H_PARAMETER : H_SUCCESS;
>> +    }
>>  #ifdef DEBUG_TCE
>>      fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
>>              "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx " ret=%d\n",
>> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
>> index 5f89003..3375c3f 100644
>> --- a/hw/spapr_pci.c
>> +++ b/hw/spapr_pci.c
>> @@ -29,6 +29,7 @@
>>  #include "pci_host.h"
>>  #include "hw/spapr.h"
>>  #include "hw/spapr_pci.h"
>> +#include "hw/vfio_pci.h"
>>  #include "exec-memory.h"
>>  #include <libfdt.h>
>>  #include "trace.h"
>> @@ -440,6 +441,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>>                   level);
>>  }
>>  
>> +static int pci_spapr_get_irq(void *opaque, int irq_num)
>> +{
>> +    sPAPRPHBState *phb = opaque;
>> +    return phb->lsi_table[irq_num].dt_irq;
>> +}
>> +
>>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>>                                unsigned size)
>>  {
>> @@ -567,7 +574,8 @@ static int spapr_phb_init(SysBusDevice *s)
>>  
>>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>>                             phb->busname ? phb->busname : phb->dtbusname,
>> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
>> +                           pci_spapr_set_irq, pci_spapr_get_irq,
>> +                           pci_spapr_map_irq, phb,
>>                             &phb->memspace, &phb->iospace,
>>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>>      phb->host_state.bus = bus;
>> @@ -596,6 +604,7 @@ static Property spapr_phb_properties[] = {
>>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
>> +    DEFINE_PROP_UINT8("vfio", sPAPRPHBState, is_vfio, 0),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> @@ -639,6 +648,18 @@ void spapr_create_phb(sPAPREnvironment *spapr,
>>  /* Finalize PCI setup, called when all devices are already created */
>>  int spapr_finalize_pci_setup(sPAPRPHBState *phb)
>>  {
>> +    if (phb->is_vfio) {
>> +        int fd = vfio_get_container_fd(phb->host_state.bus);
>> +
>> +        if (fd < 0) {
>> +            return fd;
>> +        }
>> +        spapr_vfio_init_dma(fd, phb->dma_liobn,
>> +                            &phb->dma_window_start,
>> +                            &phb->dma_window_size);
>> +        return 0;
>> +    }
>> +
>>      phb->dma_window_start = 0;
>>      phb->dma_window_size = 0x40000000;
>>      phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
>> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
>> index 3aae273..a4f031b 100644
>> --- a/hw/spapr_pci.h
>> +++ b/hw/spapr_pci.h
>> @@ -57,6 +57,8 @@ typedef struct sPAPRPHBState {
>>          int nvec;
>>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>>  
>> +    uint8_t is_vfio;
>> +
>>      QLIST_ENTRY(sPAPRPHBState) list;
>>  } sPAPRPHBState;
>>  
>> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
>> index 1ac287f..cc0b974 100644
>> --- a/hw/vfio_pci.c
>> +++ b/hw/vfio_pci.c
>> @@ -21,7 +21,6 @@
>>  #include <dirent.h>
>>  #include <stdio.h>
>>  #include <unistd.h>
>> -#include <sys/io.h>
>>  #include <sys/ioctl.h>
>>  #include <sys/mman.h>
>>  #include <sys/types.h>
>> @@ -44,6 +43,17 @@
>>  #include "vfio_pci.h"
>>  #include "linux-vfio.h"
>>  
>> +#ifndef TARGET_PPC64
>> +#include <sys/io.h>
>> +#define VFIO_IOMMU_EXTENSION    VFIO_X86_IOMMU
>> +#else
>> +#include "hw/pci_internals.h"
>> +#include "hw/xics.h"
>> +#include "hw/spapr.h"
>> +#define POWERPC_IOMMU           2
>> +#define VFIO_IOMMU_EXTENSION    POWERPC_IOMMU
>> +#endif
>> +
> 
> VFIO_IOMMU_EXTENSION never gets used, POWER_IOMMU is redefined below.


Yes, a bit messy. Was not sure about the name so I postponed it.


>>  //#define DEBUG_VFIO
>>  #ifdef DEBUG_VFIO
>>  #define DPRINTF(fmt, ...) \
>> @@ -235,6 +245,7 @@ struct vfio_irq_set_fd {
>>  
>>  static void vfio_enable_intx_kvm(VFIODevice *vdev)
>>  {
>> +#ifndef TARGET_PPC64
> 
> Why do you need this, aren't the extension checks sufficient for this to
> be a nop for you?


It uses ioapic_remove_gsi_eoi_notifier() so it needs some #ifdef anyway. And as we do not support
kvm_irqchip_in_kernel(), there is no point in fixing it and I disabled it all.
Once we make EOI notifiers platform independent, then yes, it will be a nop.


>>  #ifdef CONFIG_KVM
>>      struct vfio_irq_set_fd irq_set_fd = {
>>  	.irq_set = {
>> @@ -298,10 +309,12 @@ fail:
>>      qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
>>      vfio_unmask_intx(vdev);
>>  #endif
>> +#endif
>>  }
>>  
>>  static void vfio_disable_intx_kvm(VFIODevice *vdev)
>>  {
>> +#ifndef TARGET_PPC64
> 
> Same

Same :)

> 
>>  #ifdef CONFIG_KVM
>>      struct vfio_irq_set_fd irq_set_fd = {
>>  	.irq_set = {
>> @@ -350,8 +363,10 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev)
>>      DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel disabled\n", __FUNCTION__,
>>              vdev->host.seg, vdev->host.bus, vdev->host.dev, vdev->host.func);
>>  #endif
>> +#endif
>>  }
>>  
>> +#ifndef TARGET_PPC64
>>  static void vfio_update_irq(Notifier *notify, void *data)
>>  {
>>      VFIODevice *vdev = container_of(notify, VFIODevice, intx.update_irq);
>> @@ -381,6 +396,7 @@ static void vfio_update_irq(Notifier *notify, void *data)
>>      /* Re-enable the interrupt in cased we missed an EOI */
>>      vfio_eoi(&vdev->intx.eoi, NULL);
>>  }
>> +#endif
>>  
>>  static int vfio_enable_intx(VFIODevice *vdev)
>>  {
>> @@ -404,10 +420,14 @@ static int vfio_enable_intx(VFIODevice *vdev)
>>      vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
>>      vdev->intx.irq = pci_get_irq(&vdev->pdev, vdev->intx.pin);
>>      vdev->intx.eoi.notify = vfio_eoi;
>> +#ifndef TARGET_PPC64
>>      ioapic_add_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
> 
> This is really only a place holder for x86 too, I don't think my eoi
> notifier as written is acceptable upstream.  We really need some common
> infrastructure here.  I'm hoping to get the kvm acceleration in place
> which would make vfio usable on x86 with kvm (the common case), then
> work towards a generic eoi notifier.
> 
>>  
>>      vdev->intx.update_irq.notify = vfio_update_irq;
>>      pci_add_irq_update_notifier(&vdev->pdev, &vdev->intx.update_irq);
> 
> Can't you stub this out to make it safe to do on POWER too?


I could even simply enable it (though I am not sure it will ever be called) once we
get unified EOI notifiers.
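
For reference, the usual way to keep call sites free of #ifdef is to give the platform hook a no-op fallback. The following is only a minimal sketch of that pattern: PLATFORM_HAS_IRQ_UPDATE, irq_update_notifier_add(), irq_update_notifier_remove() and enable_intx_sketch() are all made-up names for illustration, not existing QEMU symbols, and the Notifier type is a simplified stand-in.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a notifier; not the real QEMU definition. */
typedef struct Notifier Notifier;
struct Notifier {
    void (*notify)(Notifier *n, void *data);
};

/*
 * PLATFORM_HAS_IRQ_UPDATE is a hypothetical build flag. When it is not
 * defined, the registration helpers collapse to no-ops, so callers
 * compile unchanged on platforms without the feature.
 */
#ifdef PLATFORM_HAS_IRQ_UPDATE
int irq_update_notifier_add(Notifier *n);
int irq_update_notifier_remove(Notifier *n);
#else
static inline int irq_update_notifier_add(Notifier *n)
{
    (void)n;
    return 0;   /* nothing to register on this platform */
}

static inline int irq_update_notifier_remove(Notifier *n)
{
    (void)n;
    return 0;   /* nothing was registered, so nothing to remove */
}
#endif

/* Caller side: no #ifdef is needed around the registration call. */
static int enable_intx_sketch(Notifier *update_irq)
{
    assert(update_irq != NULL);
    return irq_update_notifier_add(update_irq);
}
```

With the flag undefined, enable_intx_sketch() simply returns 0, which is the "nop" behaviour discussed above.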


>> +#else
>> +    xics_add_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
>> +#endif
>>  
>>      if (event_notifier_init(&vdev->intx.interrupt, 0)) {
>>          error_report("vfio: Error: event_notifier_init failed\n");
>> @@ -440,8 +460,12 @@ static void vfio_disable_intx(VFIODevice *vdev)
>>      vfio_disable_intx_kvm(vdev);
>>      vfio_disable_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX);
>>  
>> +#ifndef TARGET_PPC64
>>      pci_remove_irq_update_notifier(&vdev->intx.update_irq);
>>      ioapic_remove_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
>> +#else
>> +    xics_remove_eoi_notifier(&vdev->intx.eoi);
>> +#endif
>>  
>>      fd = event_notifier_get_fd(&vdev->intx.interrupt);
>>      qemu_set_fd_handler(fd, NULL, NULL, vdev);
>> @@ -543,7 +567,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
>>      }
>>  
>>      fd = event_notifier_get_fd(&vdev->msi_vectors[vector].interrupt);
>> -
>> +#ifndef TARGET_PPC64
>>      vdev->msi_vectors[vector].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
>>      if (vdev->msi_vectors[vector].virq < 0 || 
>>          kvm_irqchip_add_irqfd(kvm_state, fd,
>> @@ -551,7 +575,11 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
>>          qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
>>                              &vdev->msi_vectors[vector]);
>>      }
>> -
>> +#else
>> +    vdev->msi_vectors[vector].virq = -1;
>> +    qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
>> +                        &vdev->msi_vectors[vector]);
>> +#endif
> 
> This shouldn't be necessary once the abort is removed from
> kvm_irqchip_add_msi_route.  It'll be merged next time the kvm uq tree
> merges into qemu.


True, I just did not pick up your very last changes. Updating is always painful, and now it is even
worse than usual as pci_get_irq has been renamed to something else :) Will do though.



>>      if (vdev->nr_vectors < vector + 1) {
>>          int i;
>>  
>> @@ -692,6 +720,7 @@ retry:
>>          fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
>>  
>>          msg = msi_get_msg(&vdev->pdev, i);
>> +#ifndef TARGET_PPC64
>>          vdev->msi_vectors[i].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
>>          if (vdev->msi_vectors[i].virq < 0 || 
>>              kvm_irqchip_add_irqfd(kvm_state, fd,
>> @@ -699,6 +728,12 @@ retry:
>>              qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
>>                                  &vdev->msi_vectors[i]);
>>          }
>> +#else
>> +        vdev->msi_vectors[i].virq = -1;
>> +        qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
>> +                            &vdev->msi_vectors[i]);
>> +        msg = msg;
>> +#endif
> 
> Same here
> 
>>      }
>>      
>>      ret = vfio_enable_vectors(vdev, false);
>> @@ -1581,6 +1616,25 @@ static int vfio_connect_container(VFIOGroup *group)
>>  
>>          memory_listener_register(&container->listener, get_system_memory());
>>  
>> +#define POWERPC_IOMMU           2
> 
> Assume this will go in the kernel vfio.h at some point.  You may want to
> pick a different name if there's a possibility of other powerpc iommu
> implementations... thus the crappy type1 name for x86.
> 
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, POWERPC_IOMMU)) {
>> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>> +        if (ret) {
>> +            error_report("vfio: failed to set group container: %s\n",
>> +                         strerror(errno));
>> +            g_free(container);
>> +            close(fd);
>> +            return -1;
>> +        }
>> +
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, POWERPC_IOMMU);
>> +        if (ret) {
>> +            error_report("vfio: failed to set iommu for container: %s\n",
>> +                         strerror(errno));
>> +            g_free(container);
>> +            close(fd);
>> +            return -1;
>> +        }
>>      } else {
>>          error_report("vfio: No available IOMMU models\n");
>>          g_free(container);
>> @@ -2005,3 +2059,19 @@ static void register_vfio_pci_dev_type(void)
>>  }
>>  
>>  type_init(register_vfio_pci_dev_type)
>> +
>> +int vfio_get_container_fd(struct PCIBus *pbus)
>> +{
>> +    BusChild *kid1st = QTAILQ_FIRST(&pbus->qbus.children);
>> +    VFIODevice *vdev1st;
>> +
>> +    if (!kid1st) {
>> +        printf("No device registered on PCI bus \"%s\", no DMA enabled\n",
>> +               pbus->qbus.name);
>> +        return -1;
>> +    }
>> +    vdev1st = container_of(kid1st->child, VFIODevice, pdev.qdev);
>> +
>> +    return vdev1st->group->container->fd;
>> +}
>> +
> 
> This is not a generic implementation.  x86 won't have all devices on a
> bus be vfio devices and even if it did, there's no guarantee they all
> belong to the same container.  This should probably at least take a
> PCIDevice and some kind of POWER specific code will need to know that
> the container is the same for the whole bus.  Thanks,


This is a workaround, true. x86 does not need this call at all. And on powerpc, VFIO devices won't
share a PCI bus with emulated devices. I just need some API to get this fd.

Well, I could probably add a MemoryListener for the DMA window and move all the POWER-specific map/unmap
code to VFIO, but that does not look much better. I would rather separate the IOMMU code from vfio_pci
somehow (more or less as it is now for powerpc). While doing that, we could think about an API to get
this fd, which we need anyway in order to set up the DMA window, which is per group (something QEMU
does not understand) rather than per device.
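
To make the discussion concrete, here is a rough sketch of what a per-device lookup plus a platform-side consistency check could look like. The Container, Group and Device structures below are simplified, hypothetical stand-ins for the ones in the patch, and device_container_fd() and bus_container_fd() are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the structures in the patch. */
typedef struct Container { int fd; } Container;
typedef struct Group { Container *container; } Group;
typedef struct Device {
    int is_vfio;    /* nonzero for a passed-through device */
    Group *group;   /* NULL until the device is attached to a group */
} Device;

/* Generic part: container fd of one device, or -1 if it has none. */
static int device_container_fd(const Device *dev)
{
    if (!dev || !dev->is_vfio || !dev->group || !dev->group->container) {
        return -1;
    }
    return dev->group->container->fd;
}

/*
 * Platform part: succeed only if every device on the bus is a VFIO
 * device and all of them share one container. An empty bus yields -1.
 */
static int bus_container_fd(Device *const *devs, size_t ndevs)
{
    int fd = -1;
    size_t i;

    assert(devs != NULL || ndevs == 0);
    for (i = 0; i < ndevs; i++) {
        int cur = device_container_fd(devs[i]);
        if (cur < 0 || (fd >= 0 && cur != fd)) {
            return -1;
        }
        fd = cur;
    }
    return fd;
}
```

The split keeps the assumption "one container per bus" in the platform code, where it can be checked, instead of in the generic lookup.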



> Alex
> 
>> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
>> index 226607c..0d13341 100644
>> --- a/hw/vfio_pci.h
>> +++ b/hw/vfio_pci.h
>> @@ -105,4 +105,6 @@ typedef struct VFIOGroup {
>>  #define VFIO_FLAG_IOMMU_SHARED_BIT 0
>>  #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
>>  
>> +int vfio_get_container_fd(struct PCIBus *pbus);
>> +
>>  #endif /* __VFIO_H__ */
> 
> 
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-11  2:54     ` Alexey Kardashevskiy
@ 2012-07-11  3:10       ` Benjamin Herrenschmidt
  2012-07-12  3:11       ` Alex Williamson
  1 sibling, 0 replies; 52+ messages in thread
From: Benjamin Herrenschmidt @ 2012-07-11  3:10 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alex Williamson, qemu-ppc, Alexander Graf, David Gibson

On Wed, 2012-07-11 at 12:54 +1000, Alexey Kardashevskiy wrote:
> > Why do you need this, aren't the extension checks sufficient for this to
> > be a nop for you?
> 
> 
> It uses ioapic_remove_gsi_eoi_notifier() so it needs some #ifdef anyway. And as we do not support
> kvm_irqchip_in_kernel(), there is no point in fixing it and I disabled it all.
> Once we make EOI notifiers platform independent, then yes, it will be a nop.

In fact we have an internal experimental patch to move our
PIC emulation into the kernel but so far I have not managed
to make it fit in the existing irqchip stuff.

That irqchip interface is nasty. It's completely x86 centric,
and has tendrils all over the place, into the msi code, into
devices (virtio-pci.c) etc... in ways that are essentially unusable for
anything that looks a bit different.

I.e. it is in urgent need of refactoring.

> >> -
> >> +#ifndef TARGET_PPC64
> >>      vdev->msi_vectors[vector].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
> >>      if (vdev->msi_vectors[vector].virq < 0 || 
> >>          kvm_irqchip_add_irqfd(kvm_state, fd,
> >> @@ -551,7 +575,11 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
> >>          qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
> >>                              &vdev->msi_vectors[vector]);
> >>      }
> >> -
> >> +#else
> >> +    vdev->msi_vectors[vector].virq = -1;
> >> +    qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
> >> +                        &vdev->msi_vectors[vector]);
> >> +#endif
> > 
> > This shouldn't be necessary once the abort is removed from
> > kvm_irqchip_add_msi_route.  It'll be merged next time the kvm uq tree
> > merges into qemu.

It must also return an irq number, so what should it choose in that case? I.e. the
whole irqchip API is a trainwreck if you ask me :-) Very poorly thought
out.

> True, I just did not pick up your very last changes. Updating is always painful, and now it is even
> worse than usual as pci_get_irq has been renamed to something else :) Will do though.

 .../...

> Well, I could probably add a MemoryListener for the DMA window and move all the POWER-specific map/unmap
> code to VFIO, but that does not look much better. I would rather separate the IOMMU code from vfio_pci
> somehow (more or less as it is now for powerpc). While doing that, we could think about an API to get
> this fd, which we need anyway in order to set up the DMA window, which is per group (something QEMU
> does not understand) rather than per device.

Right. What we really want is our own private iommu interface based on
the iommu type. Each iommu will have its own "quirks" in that regard
anyway.
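
One way to picture that is a small ops table selected by iommu type. This is only a sketch under stated assumptions: the SKETCH_* constants, the IOMMUOps structure and all function names are invented for illustration, the type values are placeholders, and the "one 4K page per map call" rule stands in for whatever quirks a real sPAPR backend would have. Real map/unmap functions would issue the type-specific ioctls instead of only validating arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder type IDs; not values from any kernel header. */
enum { SKETCH_IOMMU_X86 = 1, SKETCH_IOMMU_SPAPR = 2 };

/* Assumed page size covered by one map call on the sPAPR-like backend. */
#define SKETCH_TCE_PAGE_SIZE 4096u

typedef struct IOMMUOps {
    int type;
    int (*map)(int fd, uint64_t iova, uint64_t vaddr, uint64_t size);
    int (*unmap)(int fd, uint64_t iova, uint64_t size);
} IOMMUOps;

/* sPAPR-like quirk (assumed): exactly one aligned page per call. */
static int spapr_map(int fd, uint64_t iova, uint64_t vaddr, uint64_t size)
{
    (void)fd; (void)vaddr;
    if (size != SKETCH_TCE_PAGE_SIZE || (iova % SKETCH_TCE_PAGE_SIZE)) {
        return -1;
    }
    return 0;   /* a real version would issue its own map ioctl here */
}

static int spapr_unmap(int fd, uint64_t iova, uint64_t size)
{
    (void)fd;
    return (size == SKETCH_TCE_PAGE_SIZE &&
            !(iova % SKETCH_TCE_PAGE_SIZE)) ? 0 : -1;
}

/* x86-like backend: any nonzero range is accepted in this sketch. */
static int x86_map(int fd, uint64_t iova, uint64_t vaddr, uint64_t size)
{
    (void)fd; (void)iova; (void)vaddr;
    return size ? 0 : -1;
}

static int x86_unmap(int fd, uint64_t iova, uint64_t size)
{
    (void)fd; (void)iova;
    return size ? 0 : -1;
}

static const IOMMUOps iommu_ops_table[] = {
    { SKETCH_IOMMU_X86,   x86_map,   x86_unmap },
    { SKETCH_IOMMU_SPAPR, spapr_map, spapr_unmap },
};

/* Pick the backend for a given iommu type, or NULL if unsupported. */
static const IOMMUOps *iommu_ops_lookup(int type)
{
    size_t i;

    for (i = 0; i < sizeof(iommu_ops_table) / sizeof(iommu_ops_table[0]); i++) {
        if (iommu_ops_table[i].type == type) {
            return &iommu_ops_table[i];
        }
    }
    return NULL;
}

/* Generic entry point: callers never see the per-type quirks. */
static int iommu_map(const IOMMUOps *ops, int fd, uint64_t iova,
                     uint64_t vaddr, uint64_t size)
{
    assert(ops != NULL && ops->map != NULL);
    return ops->map(fd, iova, vaddr, size);
}
```

The generic code would look up the ops once, after the container's iommu type is known, and keep the pointer next to the container fd.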

Cheers,
Ben.

> 
> 
> > Alex
> > 
> >> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
> >> index 226607c..0d13341 100644
> >> --- a/hw/vfio_pci.h
> >> +++ b/hw/vfio_pci.h
> >> @@ -105,4 +105,6 @@ typedef struct VFIOGroup {
> >>  #define VFIO_FLAG_IOMMU_SHARED_BIT 0
> >>  #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
> >>  
> >> +int vfio_get_container_fd(struct PCIBus *pbus);
> >> +
> >>  #endif /* __VFIO_H__ */
> > 
> > 
> > 
> 
> 
> -- 
> Alexey
> 
> 
> 


* Re: [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
  2012-07-11  2:25   ` Alexey Kardashevskiy
@ 2012-07-12  2:54     ` Alex Williamson
  2012-07-12  4:16       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2012-07-12  2:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Wed, 2012-07-11 at 12:25 +1000, Alexey Kardashevskiy wrote:
> On 11/07/12 02:57, Alex Williamson wrote:
> > On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
> >> The two patches in this set are supposed to add VFIO support for POWER.
> >>
> >> The first one adds one more step in the initialization sequence which I am not
> >> sure is correct.
> >>
> >> The second patch adds actual VFIO support. It is not ready to submit but
> >> ready to discuss. I would like to get rid of all #ifdef TARGET_PPC64 in patch #2
> >> and I wonder if there is any plan to implement some generic EOI support code, etc.
> > 
> > A generic EOI notifier is on my todo list, but I have no idea what it's
> > going to look like.  As you know, I've got an ioapic specific notifier
> > in my tree, you add a spapr specific one.  I welcome ideas on how to
> > create something generic that has a chance of being accepted.  Thanks,
> 
> 
> So far the only platform-specific call is xxxx_add_gsi_eoi_notifier. The
> xxxx_remove_gsi_eoi_notifier only calls notifier_remove; you've got to fix your
> ioapic_remove_gsi_eoi_notifier() as it does too much :)
> 
> 
> The only place I can see right now for an "add_eoi" callback is QEMUMachine, as there is no
> unified machine interrupt controller - IOAPIC has its own type TYPE_IOAPIC_COMMON and XICS is not
> even a SysBusDevice. And the callback is not specific to any kind of bus, so it cannot go into PCIBus.
> 
> Does it sound reasonable?

I suspect we'd need to somehow tie it into qemu_irq, where both handlers
and notifiers are allocated, so we don't really care about the underlying
implementation.  Something like qemu_add_irq_eoi_notifier(qemu_irq
irq, ...).  It's another mess, like the PCIBus interrupt-line-to-gsi
effort, though.  Thanks,

Alex
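
As a rough illustration of the qemu_add_irq_eoi_notifier() idea above, the sketch below hangs a list of EOI listeners off an interrupt-line object, so a device model registers against the line and not against a specific controller. Everything here is hypothetical: IRQLine, EOINotifier, CountingListener and the irq_* functions are invented names, and the real qemu_irq type is not modelled.

```c
#include <assert.h>
#include <stddef.h>

typedef struct EOINotifier EOINotifier;
struct EOINotifier {
    void (*notify)(EOINotifier *n, int irq);
    EOINotifier *next;
};

/* Hypothetical per-line object; a stand-in, not the real qemu_irq. */
typedef struct IRQLine {
    int n;                       /* line number within the controller */
    EOINotifier *eoi_notifiers;  /* singly linked list of listeners */
} IRQLine;

/* Register a listener on one line, whatever controller backs it. */
static void irq_add_eoi_notifier(IRQLine *irq, EOINotifier *n)
{
    assert(irq != NULL && n != NULL && n->notify != NULL);
    n->next = irq->eoi_notifiers;
    irq->eoi_notifiers = n;
}

/* Unlink a listener; does nothing if it was never registered. */
static void irq_remove_eoi_notifier(IRQLine *irq, EOINotifier *n)
{
    EOINotifier **p;

    for (p = &irq->eoi_notifiers; *p; p = &(*p)->next) {
        if (*p == n) {
            *p = n->next;
            n->next = NULL;
            return;
        }
    }
}

/* Called by the controller model (IOAPIC, XICS, ...) on a guest EOI. */
static void irq_eoi(IRQLine *irq)
{
    EOINotifier *n, *next;

    for (n = irq->eoi_notifiers; n; n = next) {
        next = n->next;          /* lets a callback remove itself */
        n->notify(n, irq->n);
    }
}

/* Example listener that just counts EOIs; the notifier must be first. */
typedef struct CountingListener {
    EOINotifier notifier;
    int count;
    int last_irq;
} CountingListener;

static void counting_notify(EOINotifier *n, int irq)
{
    CountingListener *c = (CountingListener *)n;

    c->count++;
    c->last_irq = irq;
}
```

In this shape the only controller-specific piece is the single irq_eoi() call in each interrupt controller model; add and remove are plain list operations.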


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-11  2:54     ` Alexey Kardashevskiy
  2012-07-11  3:10       ` Benjamin Herrenschmidt
@ 2012-07-12  3:11       ` Alex Williamson
  2012-07-12  8:47         ` Alexey Kardashevskiy
  1 sibling, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2012-07-12  3:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Wed, 2012-07-11 at 12:54 +1000, Alexey Kardashevskiy wrote:
> On 11/07/12 02:55, Alex Williamson wrote:
> > On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
> >> The patch enables VFIO on POWER.
> >>
> >> It literally does the following:
> >>
> >> 1. POWERPC IOMMU support (the kernel counterpart is required)
> >>
> >> 2. Added #ifdef TARGET_PPC64 for EOI handlers initialisation.
> >>
> >> 3. Added vfio_get_container_fd() to VFIO in order to initialize 1).
> >>
> >> 4. Makefile fixed and "is_vfio" flag added into sPAPR PHB - required to
> >> distinguish VFIO's DMA context from the emulated one.
> >>
> > >> With the patches posted a bit earlier today, this patch fully supports
> > >> VFIO, including MSI-X.
> >>
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  hw/ppc/Makefile.objs |    3 ++
> >>  hw/spapr.h           |    4 +++
> >>  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  hw/spapr_pci.c       |   23 ++++++++++++-
> >>  hw/spapr_pci.h       |    2 ++
> >>  hw/vfio_pci.c        |   76 +++++++++++++++++++++++++++++++++++++++++--
> >>  hw/vfio_pci.h        |    2 ++
> >>  7 files changed, 193 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >> index f573a95..c46a049 100644
> >> --- a/hw/ppc/Makefile.objs
> >> +++ b/hw/ppc/Makefile.objs
> >> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
> >>  # Xilinx PPC peripherals
> >>  obj-y += xilinx_ethlite.o
> >>  
> >> +# VFIO PCI device assignment
> >> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
> >> +
> >>  obj-y := $(addprefix ../,$(obj-y))
> >> diff --git a/hw/spapr.h b/hw/spapr.h
> >> index b37f337..9dca704 100644
> >> --- a/hw/spapr.h
> >> +++ b/hw/spapr.h
> >> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> >>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
> >>                        DMAContext *dma);
> >>  
> >> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
> >> +                         uint64_t *dma32_window_start,
> >> +                         uint64_t *dma32_window_size);
> >> +
> >>  #endif /* !defined (__HW_SPAPR_H__) */
> >> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> >> index 50c288d..0a194e8 100644
> >> --- a/hw/spapr_iommu.c
> >> +++ b/hw/spapr_iommu.c
> >> @@ -16,6 +16,8 @@
> >>   * You should have received a copy of the GNU Lesser General Public
> >>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> >>   */
> >> +#include <sys/ioctl.h>
> >> +
> >>  #include "hw.h"
> >>  #include "kvm.h"
> >>  #include "qdev.h"
> >> @@ -23,6 +25,7 @@
> >>  #include "dma.h"
> >>  
> >>  #include "hw/spapr.h"
> >> +#include "hw/linux-vfio.h"
> > 
> > I really need to move this into linux-headers.
> > 
> >>  
> >>  #include <libfdt.h>
> >>  
> >> @@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
> >>      return 0;
> >>  }
> >>  
> >> +/* -------- API for POWERPC IOMMU -------- */
> >> +
> >> +#define POWERPC_IOMMU           2
> >> +
> >> +struct tce_iommu_info {
> >> +    __u32 argsz;
> >> +    __u32 dma32_window_start;
> >> +    __u32 dma32_window_size;
> >> +};
> >> +
> >> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> >> +
> >> +struct tce_iommu_dma_map {
> >> +    __u32 argsz;
> >> +    __u64 va;
> >> +    __u64 dmaaddr;
> >> +};
> >> +
> >> +#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
> >> +#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
> > 
> > I assume this would eventually go into the kernel vfio.h with a VFIO_
> > prefix.  Add a flags field to the structures or it'll be hard to extend
> > them later.
> 
> 
> We can always define another type of IOMMU :) But yes, I'll extend both map and info structures.
> 
> 
> 
> >> +typedef struct sPAPRVFIOTable {
> >> +    int fd;
> >> +    uint32_t liobn;
> >> +    QLIST_ENTRY(sPAPRVFIOTable) list;
> >> +} sPAPRVFIOTable;
> >> +
> >> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
> >> +
> >> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
> >> +                         uint64_t *dma32_window_start,
> >> +                         uint64_t *dma32_window_size)
> >> +{
> >> +    sPAPRVFIOTable *t;
> >> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
> >> +
> >> +    if (ioctl(fd, POWERPC_IOMMU_GET_INFO, &info)) {
> >> +        fprintf(stderr, "POWERPC_IOMMU_GET_INFO failed %d\n", errno);
> >> +        return;
> >> +    }
> >> +    *dma32_window_start = info.dma32_window_start;
> >> +    *dma32_window_size = info.dma32_window_size;
> >> +
> >> +    t = g_malloc0(sizeof(*t));
> >> +    t->fd = fd;
> >> +    t->liobn = liobn;
> >> +
> >> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
> >> +}
> >> +
> >> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
> >> +{
> >> +    sPAPRVFIOTable *t;
> >> +    struct tce_iommu_dma_map map = {
> >> +        .argsz = sizeof(map),
> >> +        .va = 0,
> >> +        .dmaaddr = ioba,
> >> +    };
> >> +
> >> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
> >> +        if (t->liobn != liobn) {
> >> +            continue;
> >> +        }
> >> +        if (tce) {
> >> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
> >> +            if (ioctl(t->fd, POWERPC_IOMMU_MAP_DMA, &map)) {
> >> +                fprintf(stderr, "TCE_MAP_DMA: %d\n", errno);
> >> +                return H_PARAMETER;
> >> +            }
> >> +        } else {
> >> +            if (ioctl(t->fd, POWERPC_IOMMU_UNMAP_DMA, &map)) {
> >> +                fprintf(stderr, "TCE_UNMAP_DMA: %d\n", errno);
> >> +                return H_PARAMETER;
> >> +            }
> >> +        }
> >> +        return H_SUCCESS;
> >> +    }
> >> +    return H_CONTINUE; /* positive non-zero value */
> >> +}
> >> +
> > 
> > I wish you could do this through a MemoryListener like we do on x86.
> 
> 
> > What is the point? Map the entire RAM to the guest? And it will still use our own IOMMU ioctls, as it
> > is entirely our own IOMMU implementation.

Yeah, with Ben's explanation it's probably not worth the effort.  We
might want to consider putting stuff like this in logical vfio-arch
files though (vfio-spapr, vfio-x86, vfio-x86-kvm, etc).

> >>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
> >>                                target_ulong opcode, target_ulong *args)
> >>  {
> >> @@ -203,6 +286,10 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
> >>      if (0 >= ret) {
> >>          return ret ? H_PARAMETER : H_SUCCESS;
> >>      }
> >> +    ret = put_tce_vfio(liobn, ioba, tce);
> >> +    if (0 >= ret) {
> >> +        return ret ? H_PARAMETER : H_SUCCESS;
> >> +    }
> >>  #ifdef DEBUG_TCE
> >>      fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
> >>              "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx " ret=%d\n",
> >> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> >> index 5f89003..3375c3f 100644
> >> --- a/hw/spapr_pci.c
> >> +++ b/hw/spapr_pci.c
> >> @@ -29,6 +29,7 @@
> >>  #include "pci_host.h"
> >>  #include "hw/spapr.h"
> >>  #include "hw/spapr_pci.h"
> >> +#include "hw/vfio_pci.h"
> >>  #include "exec-memory.h"
> >>  #include <libfdt.h>
> >>  #include "trace.h"
> >> @@ -440,6 +441,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
> >>                   level);
> >>  }
> >>  
> >> +static int pci_spapr_get_irq(void *opaque, int irq_num)
> >> +{
> >> +    sPAPRPHBState *phb = opaque;
> >> +    return phb->lsi_table[irq_num].dt_irq;
> >> +}
> >> +
> >>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
> >>                                unsigned size)
> >>  {
> >> @@ -567,7 +574,8 @@ static int spapr_phb_init(SysBusDevice *s)
> >>  
> >>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
> >>                             phb->busname ? phb->busname : phb->dtbusname,
> >> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
> >> +                           pci_spapr_set_irq, pci_spapr_get_irq,
> >> +                           pci_spapr_map_irq, phb,
> >>                             &phb->memspace, &phb->iospace,
> >>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
> >>      phb->host_state.bus = bus;
> >> @@ -596,6 +604,7 @@ static Property spapr_phb_properties[] = {
> >>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
> >>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
> >>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
> >> +    DEFINE_PROP_UINT8("vfio", sPAPRPHBState, is_vfio, 0),
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>  
> >> @@ -639,6 +648,18 @@ void spapr_create_phb(sPAPREnvironment *spapr,
> >>  /* Finalize PCI setup, called when all devices are already created */
> >>  int spapr_finalize_pci_setup(sPAPRPHBState *phb)
> >>  {
> >> +    if (phb->is_vfio) {
> >> +        int fd = vfio_get_container_fd(phb->host_state.bus);
> >> +
> >> +        if (fd < 0) {
> >> +            return fd;
> >> +        }
> >> +        spapr_vfio_init_dma(fd, phb->dma_liobn,
> >> +                            &phb->dma_window_start,
> >> +                            &phb->dma_window_size);
> >> +        return 0;
> >> +    }
> >> +
> >>      phb->dma_window_start = 0;
> >>      phb->dma_window_size = 0x40000000;
> >>      phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
> >> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
> >> index 3aae273..a4f031b 100644
> >> --- a/hw/spapr_pci.h
> >> +++ b/hw/spapr_pci.h
> >> @@ -57,6 +57,8 @@ typedef struct sPAPRPHBState {
> >>          int nvec;
> >>      } msi_table[SPAPR_MSIX_MAX_DEVS];
> >>  
> >> +    uint8_t is_vfio;
> >> +
> >>      QLIST_ENTRY(sPAPRPHBState) list;
> >>  } sPAPRPHBState;
> >>  
> >> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> >> index 1ac287f..cc0b974 100644
> >> --- a/hw/vfio_pci.c
> >> +++ b/hw/vfio_pci.c
> >> @@ -21,7 +21,6 @@
> >>  #include <dirent.h>
> >>  #include <stdio.h>
> >>  #include <unistd.h>
> >> -#include <sys/io.h>
> >>  #include <sys/ioctl.h>
> >>  #include <sys/mman.h>
> >>  #include <sys/types.h>
> >> @@ -44,6 +43,17 @@
> >>  #include "vfio_pci.h"
> >>  #include "linux-vfio.h"
> >>  
> >> +#ifndef TARGET_PPC64
> >> +#include <sys/io.h>
> >> +#define VFIO_IOMMU_EXTENSION    VFIO_X86_IOMMU
> >> +#else
> >> +#include "hw/pci_internals.h"
> >> +#include "hw/xics.h"
> >> +#include "hw/spapr.h"
> >> +#define POWERPC_IOMMU           2
> >> +#define VFIO_IOMMU_EXTENSION    POWERPC_IOMMU
> >> +#endif
> >> +
> > 
> > > VFIO_IOMMU_EXTENSION never gets used, and POWERPC_IOMMU is redefined below.
> 
> 
> Yes, a bit messy. Was not sure about the name so I postponed it.
> 
> 
> >>  //#define DEBUG_VFIO
> >>  #ifdef DEBUG_VFIO
> >>  #define DPRINTF(fmt, ...) \
> >> @@ -235,6 +245,7 @@ struct vfio_irq_set_fd {
> >>  
> >>  static void vfio_enable_intx_kvm(VFIODevice *vdev)
> >>  {
> >> +#ifndef TARGET_PPC64
> > 
> > Why do you need this, aren't the extension checks sufficient for this to
> > be a nop for you?
> 
> 
> It uses ioapic_remove_gsi_eoi_notifier() so it needs some #ifdef anyway. And as we do not support
> kvm_irqchip_in_kernel(), there is no point in fixing it and I disabled it all.
> > Once we make EOI notifiers platform independent, then yes, it will be a nop.

Ah right, forgot you won't even build ioapic_*.

> >>  #ifdef CONFIG_KVM
> >>      struct vfio_irq_set_fd irq_set_fd = {
> >>  	.irq_set = {
> >> @@ -298,10 +309,12 @@ fail:
> >>      qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
> >>      vfio_unmask_intx(vdev);
> >>  #endif
> >> +#endif
> >>  }
> >>  
> >>  static void vfio_disable_intx_kvm(VFIODevice *vdev)
> >>  {
> >> +#ifndef TARGET_PPC64
> > 
> > Same
> 
> Same :)
> 
> > 
> >>  #ifdef CONFIG_KVM
> >>      struct vfio_irq_set_fd irq_set_fd = {
> >>  	.irq_set = {
> >> @@ -350,8 +363,10 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev)
> >>      DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel disabled\n", __FUNCTION__,
> >>              vdev->host.seg, vdev->host.bus, vdev->host.dev, vdev->host.func);
> >>  #endif
> >> +#endif
> >>  }
> >>  
> >> +#ifndef TARGET_PPC64
> >>  static void vfio_update_irq(Notifier *notify, void *data)
> >>  {
> >>      VFIODevice *vdev = container_of(notify, VFIODevice, intx.update_irq);
> >> @@ -381,6 +396,7 @@ static void vfio_update_irq(Notifier *notify, void *data)
> >>      /* Re-enable the interrupt in cased we missed an EOI */
> >>      vfio_eoi(&vdev->intx.eoi, NULL);
> >>  }
> >> +#endif
> >>  
> >>  static int vfio_enable_intx(VFIODevice *vdev)
> >>  {
> >> @@ -404,10 +420,14 @@ static int vfio_enable_intx(VFIODevice *vdev)
> >>      vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
> >>      vdev->intx.irq = pci_get_irq(&vdev->pdev, vdev->intx.pin);
> >>      vdev->intx.eoi.notify = vfio_eoi;
> >> +#ifndef TARGET_PPC64
> >>      ioapic_add_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
> > 
> > This is really only a place holder for x86 too, I don't think my eoi
> > notifier as written is acceptable upstream.  We really need some common
> > infrastructure here.  I'm hoping to get the kvm acceleration in place
> > which would make vfio usable on x86 with kvm (the common case), then
> > work towards a generic eoi notifier.
> > 
> >>  
> >>      vdev->intx.update_irq.notify = vfio_update_irq;
> >>      pci_add_irq_update_notifier(&vdev->pdev, &vdev->intx.update_irq);
> > 
> > Can't you stub this out to make it safe to do on POWER too?
> 
> 
> > I could even simply enable it (though I am not sure it will ever be called) once we
> > get unified EOI notifiers.

Right

> >> +#else
> >> +    xics_add_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
> >> +#endif
> >>  
> >>      if (event_notifier_init(&vdev->intx.interrupt, 0)) {
> >>          error_report("vfio: Error: event_notifier_init failed\n");
> >> @@ -440,8 +460,12 @@ static void vfio_disable_intx(VFIODevice *vdev)
> >>      vfio_disable_intx_kvm(vdev);
> >>      vfio_disable_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX);
> >>  
> >> +#ifndef TARGET_PPC64
> >>      pci_remove_irq_update_notifier(&vdev->intx.update_irq);
> >>      ioapic_remove_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
> >> +#else
> >> +    xics_remove_eoi_notifier(&vdev->intx.eoi);
> >> +#endif
> >>  
> >>      fd = event_notifier_get_fd(&vdev->intx.interrupt);
> >>      qemu_set_fd_handler(fd, NULL, NULL, vdev);
> >> @@ -543,7 +567,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
> >>      }
> >>  
> >>      fd = event_notifier_get_fd(&vdev->msi_vectors[vector].interrupt);
> >> -
> >> +#ifndef TARGET_PPC64
> >>      vdev->msi_vectors[vector].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
> >>      if (vdev->msi_vectors[vector].virq < 0 || 
> >>          kvm_irqchip_add_irqfd(kvm_state, fd,
> >> @@ -551,7 +575,11 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
> >>          qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
> >>                              &vdev->msi_vectors[vector]);
> >>      }
> >> -
> >> +#else
> >> +    vdev->msi_vectors[vector].virq = -1;
> >> +    qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
> >> +                        &vdev->msi_vectors[vector]);
> >> +#endif
> > 
> > This shouldn't be necessary once the abort is removed from
> > kvm_irqchip_add_msi_route.  It'll be merged next time the kvm uq tree
> > merges into qemu.
> 
> 
> True, I just did not pick up your very last changes. Updating is always painful, and now it is even
> > worse than usual as pci_get_irq has been renamed to something else :) Will do though.

Yep, I think once Michael is back from holiday and does a pull request
(and hopefully merges Jan's PCIBus irq routing patches) my tree will be
down to mostly just the vfio driver and I'll start managing it like the
kernel tree with a patch series that gets rebased.  I'm hoping that if I
can get an acceptable level irqfd/eoifd implementation for x86 kvm in
the kernel that I can rip out the ioapic eoi notifiers and submit the
code as functional only with kvm and work out the generic eoi notifiers
in qemu proper.

> >>      if (vdev->nr_vectors < vector + 1) {
> >>          int i;
> >>  
> >> @@ -692,6 +720,7 @@ retry:
> >>          fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
> >>  
> >>          msg = msi_get_msg(&vdev->pdev, i);
> >> +#ifndef TARGET_PPC64
> >>          vdev->msi_vectors[i].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
> >>          if (vdev->msi_vectors[i].virq < 0 || 
> >>              kvm_irqchip_add_irqfd(kvm_state, fd,
> >> @@ -699,6 +728,12 @@ retry:
> >>              qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
> >>                                  &vdev->msi_vectors[i]);
> >>          }
> >> +#else
> >> +        vdev->msi_vectors[i].virq = -1;
> >> +        qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
> >> +                            &vdev->msi_vectors[i]);
> >> +        msg = msg;
> >> +#endif
> > 
> > Same here
> > 
> >>      }
> >>      
> >>      ret = vfio_enable_vectors(vdev, false);
> >> @@ -1581,6 +1616,25 @@ static int vfio_connect_container(VFIOGroup *group)
> >>  
> >>          memory_listener_register(&container->listener, get_system_memory());
> >>  
> >> +#define POWERPC_IOMMU           2
> > 
> > Assume this will go in the kernel vfio.h at some point.  You may want to
> > pick a different name if there's a possibility of other powerpc iommu
> > implementations... thus the crappy type1 name for x86.
> > 
> >> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, POWERPC_IOMMU)) {
> >> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >> +        if (ret) {
> >> +            error_report("vfio: failed to set group container: %s\n",
> >> +                         strerror(errno));
> >> +            g_free(container);
> >> +            close(fd);
> >> +            return -1;
> >> +        }
> >> +
> >> +        ret = ioctl(fd, VFIO_SET_IOMMU, POWERPC_IOMMU);
> >> +        if (ret) {
> >> +            error_report("vfio: failed to set iommu for container: %s\n",
> >> +                         strerror(errno));
> >> +            g_free(container);
> >> +            close(fd);
> >> +            return -1;
> >> +        }
> >>      } else {
> >>          error_report("vfio: No available IOMMU models\n");
> >>          g_free(container);
> >> @@ -2005,3 +2059,19 @@ static void register_vfio_pci_dev_type(void)
> >>  }
> >>  
> >>  type_init(register_vfio_pci_dev_type)
> >> +
> >> +int vfio_get_container_fd(struct PCIBus *pbus)
> >> +{
> >> +    BusChild *kid1st = QTAILQ_FIRST(&pbus->qbus.children);
> >> +    VFIODevice *vdev1st;
> >> +
> >> +    if (!kid1st) {
> >> +        printf("No device registered on PCI bus \"%s\", no DMA enabled\n",
> >> +               pbus->qbus.name);
> >> +        return -1;
> >> +    }
> >> +    vdev1st = container_of(kid1st->child, VFIODevice, pdev.qdev);
> >> +
> >> +    return vdev1st->group->container->fd;
> >> +}
> >> +
> > 
> > This is not a generic implementation.  x86 won't have all devices on a
> > bus be vfio devices and even if it did, there's no guarantee they all
> > belong to the same container.  This should probably at least take a
> > PCIDevice and some kind of POWER specific code will need to know that
> > the container is the same for the whole bus.  Thanks,
> 
> 
> This is a workaround, true. x86 does not need this call at all. And on powerpc VFIO devices won't
> share PCI bus with emulated devices. I just need some API to get this fd.
> 
> Well I probably can add MemoryListener for the DMA window and move all power-specific map/unmap code
> to VFIO but it does not look much better. I would rather prefer separating IOMMU code from vfio_pci
> somehow (more or less as it is now for powerpc). While doing it, we could think of the API to get
> this fd which we need anyway in order to setup the DMA window which is per group (which QEMU does
> not understand) but not per device.

The MemoryListener probably doesn't make sense with a guest-driven iova
window.  It would be an abuse of the interface, I think.  At some level
in the power code you, or at least the user, need to know about groups
though.  That's how you end up with an emulated bridge in front of each
group, right?  So with that same knowledge, shouldn't the API simply be:

int vfio_get_container_fd(PCIDevice *dev)

where power code picks a device from the bus since you know they're all
in the same group?  Thanks,

Alex
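
A minimal, self-contained sketch of the per-device lookup suggested above.
The struct layouts are simplified stand-ins (hypothetical, not QEMU's real
definitions), and the is_vfio flag is an assumption standing in for whatever
type check the real code would use.  Only the shape of the API is the point:
the caller (e.g. POWER code that knows every device on the bus shares one
group) picks a device and asks for its container.

```c
#include <stddef.h>

/* Simplified stand-ins for the QEMU types; hypothetical layouts. */
typedef struct VFIOContainer { int fd; } VFIOContainer;
typedef struct VFIOGroup { VFIOContainer *container; } VFIOGroup;
typedef struct PCIDevice {
    int is_vfio;            /* assumption: stand-in for a real type check */
} PCIDevice;
typedef struct VFIODevice {
    PCIDevice pdev;         /* embedded, as in the patch under review */
    VFIOGroup *group;
} VFIODevice;

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Return the container fd backing this device, or -1 if there is none. */
int vfio_get_container_fd(PCIDevice *dev)
{
    VFIODevice *vdev;

    if (!dev || !dev->is_vfio) {
        return -1;          /* not a vfio device: nothing to hand out */
    }
    vdev = container_of(dev, VFIODevice, pdev);
    if (!vdev->group || !vdev->group->container) {
        return -1;
    }
    return vdev->group->container->fd;
}
```

Taking a PCIDevice keeps the bus-wide assumption out of vfio_pci: the
platform code decides which device is representative for the group.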

* Re: [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
  2012-07-12  2:54     ` Alex Williamson
@ 2012-07-12  4:16       ` Alexey Kardashevskiy
  2012-07-12  4:31         ` Alex Williamson
  0 siblings, 1 reply; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-12  4:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Alexander Graf, qemu-ppc, Jan Kiszka, David Gibson

On 12/07/12 12:54, Alex Williamson wrote:
> On Wed, 2012-07-11 at 12:25 +1000, Alexey Kardashevskiy wrote:
>> On 11/07/12 02:57, Alex Williamson wrote:
>>> On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
>>>> The two patches in this set are supposed to add VFIO support for POWER.
>>>>
>>>> The first one adds one more step in the initalizaion sequence which I am not
>>>> sure is correct.
>>>>
>>>> The second patch adds actual VFIO support. It is not ready to submit but
>>>> ready to discuss. I would like to get rid of all #ifdef TARGET_PPC64 in patch #2
>>>> and I wonder if there is any plan to implement some generic EOI support code, etc.
>>>
>>> A generic EOI notifier is on my todo list, but I have no idea what it's
>>> going to look like.  As you know, I've got an ioapic specific notifier
>>> in my tree, you add a spapr specific one.  I welcome ideas on how to
>>> create something generic that has a chance of being accepted.  Thanks,
>>
>>
>> So far the only platform specific call is xxxx_add_gsi_eoi_notifier. The
>> xxxx_remove_gsi_eoi_notifier only calls notifier_remove, you've got to fix yours
>> ioapic_remove_gsi_eoi_notifier() as it does too much :)
>>
>>
>> The only place for placing "add_eoi" callback I can see right now is QEMUMachine as there is no
>> unified machine interrupt controller - IOAPIC has its own type TYPE_IOAPIC_COMMON and XICS is not
>> even a SysBusDevice. And the callback is not specific for any kind of bus so it cannot go to PCIBus.
>>
>> Does it sound reasonable?
> 
> I suspect we'd need to somehow tie it into qemu_irq where both handlers
> and notifiers are allocated so we don't really care the underlying
> implementation.  Something like qemu_add_irq_eoi_notifier(qemu_irq
> irq, ...).  It's another mess like adding the PCIBus interrupt line to
> gsi effort though.  Thanks,


Tried it. I added an add_eoi_notifier() callback to qemu_irq and a new IRQ allocator:
qemu_irq *qemu_allocate_irqs2(qemu_irq_handler handler, void *opaque, int n,
                              qemu_eoi_add_notifier add_notifier);
and called it from the XICS initialization code.
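
For illustration, a rough sketch of what such an allocator could look like.
The IRQState layout, the sketch_* helpers and the no-op callback are
assumptions made up for this example, not the actual patch; only the
qemu_allocate_irqs2() prototype comes from the text above.

```c
#include <stdlib.h>

typedef struct Notifier Notifier;            /* opaque in this sketch */
typedef struct IRQState *qemu_irq;
typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
typedef int (*qemu_eoi_add_notifier)(void *opaque, int n, Notifier *notifier);

/* Hypothetical layout: the usual handler fields plus the new callback. */
struct IRQState {
    qemu_irq_handler handler;
    void *opaque;
    int n;
    qemu_eoi_add_notifier add_eoi_notifier;  /* NULL: no EOI support */
};

qemu_irq *qemu_allocate_irqs2(qemu_irq_handler handler, void *opaque, int n,
                              qemu_eoi_add_notifier add_notifier)
{
    qemu_irq *s;
    struct IRQState *p;
    int i;

    if (n <= 0) {
        return NULL;
    }
    s = malloc(sizeof(*s) * n);
    p = calloc(n, sizeof(*p));
    if (!s || !p) {
        free(s);
        free(p);
        return NULL;
    }
    for (i = 0; i < n; i++) {
        p[i].handler = handler;
        p[i].opaque = opaque;
        p[i].n = i;
        p[i].add_eoi_notifier = add_notifier;
        s[i] = &p[i];
    }
    return s;
}

/* Release an array returned by qemu_allocate_irqs2(). */
void sketch_free_irqs(qemu_irq *s)
{
    if (s) {
        free(s[0]);   /* the IRQState block was allocated as one chunk */
        free(s);
    }
}

/* Placeholder for a controller-specific hook (e.g. one for XICS). */
int sketch_add_eoi_notifier(void *opaque, int n, Notifier *notifier)
{
    (void)opaque;
    (void)n;
    (void)notifier;
    return 0;
}
```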

It could work if pci_get_irq() or pci_route_irq_fn() returned a qemu_irq, but they do not: they
return a global IRQ number (bare or embedded in a struct), and there is no common way to resolve a
qemu_irq (and then call add_eoi_notifier()) from that number within vfio_pci.

Maybe we could add the callback pointer to PCIINTxRoute?


-- 
Alexey

* Re: [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
  2012-07-12  4:16       ` Alexey Kardashevskiy
@ 2012-07-12  4:31         ` Alex Williamson
  2012-07-12  4:38           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2012-07-12  4:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alexander Graf, qemu-ppc, Jan Kiszka, David Gibson

On Thu, 2012-07-12 at 14:16 +1000, Alexey Kardashevskiy wrote:
> On 12/07/12 12:54, Alex Williamson wrote:
> > On Wed, 2012-07-11 at 12:25 +1000, Alexey Kardashevskiy wrote:
> >> On 11/07/12 02:57, Alex Williamson wrote:
> >>> On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
> >>>> The two patches in this set are supposed to add VFIO support for POWER.
> >>>>
> >>>> The first one adds one more step in the initalizaion sequence which I am not
> >>>> sure is correct.
> >>>>
> >>>> The second patch adds actual VFIO support. It is not ready to submit but
> >>>> ready to discuss. I would like to get rid of all #ifdef TARGET_PPC64 in patch #2
> >>>> and I wonder if there is any plan to implement some generic EOI support code, etc.
> >>>
> >>> A generic EOI notifier is on my todo list, but I have no idea what it's
> >>> going to look like.  As you know, I've got an ioapic specific notifier
> >>> in my tree, you add a spapr specific one.  I welcome ideas on how to
> >>> create something generic that has a chance of being accepted.  Thanks,
> >>
> >>
> >> So far the only platform specific call is xxxx_add_gsi_eoi_notifier. The
> >> xxxx_remove_gsi_eoi_notifier only calls notifier_remove, you've got to fix yours
> >> ioapic_remove_gsi_eoi_notifier() as it does too much :)
> >>
> >>
> >> The only place for placing "add_eoi" callback I can see right now is QEMUMachine as there is no
> >> unified machine interrupt controller - IOAPIC has its own type TYPE_IOAPIC_COMMON and XICS is not
> >> even a SysBusDevice. And the callback is not specific for any kind of bus so it cannot go to PCIBus.
> >>
> >> Does it sound reasonable?
> > 
> > I suspect we'd need to somehow tie it into qemu_irq where both handlers
> > and notifiers are allocated so we don't really care the underlying
> > implementation.  Something like qemu_add_irq_eoi_notifier(qemu_irq
> > irq, ...).  It's another mess like adding the PCIBus interrupt line to
> > gsi effort though.  Thanks,
> 
> 
> Tried. Added add_eoi_notifier() callback to qemu_irq, new IRQ allocator:
> qemu_irq *qemu_allocate_irqs2(qemu_irq_handler handler, void *opaque, int n,
>                               qemu_eoi_add_notifier add_notifier);
> and called it from the XICS initialization code.
> 
> It could work out if pci_get_irq() or pci_route_irq_fn() returned qemu_irq but no, they just return
> a global IRQ number (pure or embedded in a struct) and there is no common way to resolve qemu_irq
> (and then add_eoi_notifier()) from that number within vfio_pci.

Well GSI and qemu_irq are different address spaces.  We still need GSI
for any kind of qemu bypass case.

> May be we could add the callback pointer into PCIINTxRoute?

Maybe, but why is this PCI specific?  Can't we call it as
qemu_add_irq_eoi_notifier(pdev->irq[0], notifier)?  That would work much
like qemu_set_irq, extracting the irq number from the IRQState and
passing it through the add_notifier callback of each IRQState until it
reaches the ioapic/pic/xics.

int qemu_add_irq_eoi_notifier(qemu_irq irq, Notifier *notifier)
{
    /* qemu_irq is already a pointer to IRQState */
    if (!irq || !irq->add_eoi_notifier)
        return -1;

    return irq->add_eoi_notifier(irq->opaque, irq->n, notifier);
}
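
To make the pass-through idea concrete, here is a self-contained toy model.
Everything in it apart from the qemu_add_irq_eoi_notifier() idea itself is
an assumption for illustration: the Notifier and IRQState layouts are
simplified, and ToyController, ToyBridge and CountingNotifier are made-up
names.  An intermediate level simply forwards the registration to its parent
qemu_irq until a level that really tracks EOI is reached.

```c
#include <stddef.h>

/* Simplified stand-ins; not QEMU's real definitions. */
typedef struct Notifier Notifier;
struct Notifier {
    void (*notify)(Notifier *notifier, void *data);
    Notifier *next;
};

typedef struct IRQState *qemu_irq;
typedef int (*qemu_eoi_add_notifier)(void *opaque, int n, Notifier *notifier);

struct IRQState {
    void *opaque;
    int n;
    qemu_eoi_add_notifier add_eoi_notifier;  /* NULL: EOI not supported */
};

int qemu_add_irq_eoi_notifier(qemu_irq irq, Notifier *notifier)
{
    if (!irq || !irq->add_eoi_notifier) {
        return -1;
    }
    return irq->add_eoi_notifier(irq->opaque, irq->n, notifier);
}

/* End of the chain: a toy controller keeping one notifier list per input. */
#define TOY_NR_IRQS 4
typedef struct ToyController {
    Notifier *eoi_list[TOY_NR_IRQS];
} ToyController;

int ctrl_add_eoi_notifier(void *opaque, int n, Notifier *notifier)
{
    ToyController *c = opaque;

    if (n < 0 || n >= TOY_NR_IRQS) {
        return -1;
    }
    notifier->next = c->eoi_list[n];
    c->eoi_list[n] = notifier;
    return 0;
}

/* The toy controller calls this when the guest signals EOI for input n. */
void ctrl_eoi(ToyController *c, int n)
{
    Notifier *it;

    if (n < 0 || n >= TOY_NR_IRQS) {
        return;
    }
    for (it = c->eoi_list[n]; it; it = it->next) {
        it->notify(it, NULL);
    }
}

/* Intermediate level: maps its own pin to a parent irq and passes through. */
typedef struct ToyBridge {
    qemu_irq parent[TOY_NR_IRQS];
} ToyBridge;

int bridge_add_eoi_notifier(void *opaque, int n, Notifier *notifier)
{
    ToyBridge *b = opaque;

    if (n < 0 || n >= TOY_NR_IRQS) {
        return -1;
    }
    return qemu_add_irq_eoi_notifier(b->parent[n], notifier);
}

/* Example consumer: counts the EOIs it has seen. */
typedef struct CountingNotifier {
    Notifier notifier;      /* must stay the first member */
    int count;
} CountingNotifier;

void counting_notify(Notifier *notifier, void *data)
{
    (void)data;
    ((CountingNotifier *)notifier)->count++;
}
```

A device would register on its own pin; the registration lands on whichever
controller input the bridge maps that pin to, and a qemu_irq without the
callback anywhere along the chain makes the call fail with -1.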

* Re: [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
  2012-07-12  4:31         ` Alex Williamson
@ 2012-07-12  4:38           ` Alexey Kardashevskiy
  2012-07-12  4:43             ` Alex Williamson
  0 siblings, 1 reply; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-12  4:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Alexander Graf, qemu-ppc, Jan Kiszka, David Gibson

On 12/07/12 14:31, Alex Williamson wrote:
> On Thu, 2012-07-12 at 14:16 +1000, Alexey Kardashevskiy wrote:
>> On 12/07/12 12:54, Alex Williamson wrote:
>>> On Wed, 2012-07-11 at 12:25 +1000, Alexey Kardashevskiy wrote:
>>>> On 11/07/12 02:57, Alex Williamson wrote:
>>>>> On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
>>>>>> The two patches in this set are supposed to add VFIO support for POWER.
>>>>>>
>>>>>> The first one adds one more step in the initalizaion sequence which I am not
>>>>>> sure is correct.
>>>>>>
>>>>>> The second patch adds actual VFIO support. It is not ready to submit but
>>>>>> ready to discuss. I would like to get rid of all #ifdef TARGET_PPC64 in patch #2
>>>>>> and I wonder if there is any plan to implement some generic EOI support code, etc.
>>>>>
>>>>> A generic EOI notifier is on my todo list, but I have no idea what it's
>>>>> going to look like.  As you know, I've got an ioapic specific notifier
>>>>> in my tree, you add a spapr specific one.  I welcome ideas on how to
>>>>> create something generic that has a chance of being accepted.  Thanks,
>>>>
>>>>
>>>> So far the only platform specific call is xxxx_add_gsi_eoi_notifier. The
>>>> xxxx_remove_gsi_eoi_notifier only calls notifier_remove, you've got to fix yours
>>>> ioapic_remove_gsi_eoi_notifier() as it does too much :)
>>>>
>>>>
>>>> The only place for placing "add_eoi" callback I can see right now is QEMUMachine as there is no
>>>> unified machine interrupt controller - IOAPIC has its own type TYPE_IOAPIC_COMMON and XICS is not
>>>> even a SysBusDevice. And the callback is not specific for any kind of bus so it cannot go to PCIBus.
>>>>
>>>> Does it sound reasonable?
>>>
>>> I suspect we'd need to somehow tie it into qemu_irq where both handlers
>>> and notifiers are allocated so we don't really care the underlying
>>> implementation.  Something like qemu_add_irq_eoi_notifier(qemu_irq
>>> irq, ...).  It's another mess like adding the PCIBus interrupt line to
>>> gsi effort though.  Thanks,
>>
>>
>> Tried. Added add_eoi_notifier() callback to qemu_irq, new IRQ allocator:
>> qemu_irq *qemu_allocate_irqs2(qemu_irq_handler handler, void *opaque, int n,
>>                               qemu_eoi_add_notifier add_notifier);
>> and called it from the XICS initialization code.
>>
>> It could work out if pci_get_irq() or pci_route_irq_fn() returned qemu_irq but no, they just return
>> a global IRQ number (pure or embedded in a struct) and there is no common way to resolve qemu_irq
>> (and then add_eoi_notifier()) from that number within vfio_pci.
> 
> Well GSI and qemu_irq are different address spaces.  We still need GSI
> for any kind of qemu bypass case.

No, that is ok, we also need GSI because XICS and IOAPIC need it in the end.

>> May be we could add the callback pointer into PCIINTxRoute?
> 
> Maybe, but why is this PCI specific?  Can't we call it as
> qemu_add_irq_eoi_notifier(pdev->irq[0], Notifier)?  That would work much
> like qemu_set_irq, extracting the irq number from the IRQState and
> passing it through to the add_notifier callback for IRQState until it
> got to the ioapic/pic/xics.
> 
> int qemu_add_irq_eoi_notifier(qemu_irq *irq, Notifier notifier)
> {
>     if (!irq || !irq->add_eoi_notifier)
>         return -1;
> 
>    return irq->add_eoi_notifier(irq->opaque, irq->n, notifier);
> }
> 

Then we will have to replace qemu_allocate_irqs() with qemu_allocate_irqs2() entirely and pass a
non-NULL add_eoi_notifier() at every level, at least for PCI for now. I would like to avoid that if
possible - hard to get accepted :)


-- 
Alexey

* Re: [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
  2012-07-12  4:38           ` Alexey Kardashevskiy
@ 2012-07-12  4:43             ` Alex Williamson
  2012-07-12  4:58               ` Alexey Kardashevskiy
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2012-07-12  4:43 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alexander Graf, qemu-ppc, Jan Kiszka, David Gibson

On Thu, 2012-07-12 at 14:38 +1000, Alexey Kardashevskiy wrote:
> On 12/07/12 14:31, Alex Williamson wrote:
> > On Thu, 2012-07-12 at 14:16 +1000, Alexey Kardashevskiy wrote:
> >> On 12/07/12 12:54, Alex Williamson wrote:
> >>> On Wed, 2012-07-11 at 12:25 +1000, Alexey Kardashevskiy wrote:
> >>>> On 11/07/12 02:57, Alex Williamson wrote:
> >>>>> On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
> >>>>>> The two patches in this set are supposed to add VFIO support for POWER.
> >>>>>>
> >>>>>> The first one adds one more step in the initalizaion sequence which I am not
> >>>>>> sure is correct.
> >>>>>>
> >>>>>> The second patch adds actual VFIO support. It is not ready to submit but
> >>>>>> ready to discuss. I would like to get rid of all #ifdef TARGET_PPC64 in patch #2
> >>>>>> and I wonder if there is any plan to implement some generic EOI support code, etc.
> >>>>>
> >>>>> A generic EOI notifier is on my todo list, but I have no idea what it's
> >>>>> going to look like.  As you know, I've got an ioapic specific notifier
> >>>>> in my tree, you add a spapr specific one.  I welcome ideas on how to
> >>>>> create something generic that has a chance of being accepted.  Thanks,
> >>>>
> >>>>
> >>>> So far the only platform specific call is xxxx_add_gsi_eoi_notifier. The
> >>>> xxxx_remove_gsi_eoi_notifier only calls notifier_remove, you've got to fix yours
> >>>> ioapic_remove_gsi_eoi_notifier() as it does too much :)
> >>>>
> >>>>
> >>>> The only place for placing "add_eoi" callback I can see right now is QEMUMachine as there is no
> >>>> unified machine interrupt controller - IOAPIC has its own type TYPE_IOAPIC_COMMON and XICS is not
> >>>> even a SysBusDevice. And the callback is not specific for any kind of bus so it cannot go to PCIBus.
> >>>>
> >>>> Does it sound reasonable?
> >>>
> >>> I suspect we'd need to somehow tie it into qemu_irq where both handlers
> >>> and notifiers are allocated so we don't really care the underlying
> >>> implementation.  Something like qemu_add_irq_eoi_notifier(qemu_irq
> >>> irq, ...).  It's another mess like adding the PCIBus interrupt line to
> >>> gsi effort though.  Thanks,
> >>
> >>
> >> Tried. Added add_eoi_notifier() callback to qemu_irq, new IRQ allocator:
> >> qemu_irq *qemu_allocate_irqs2(qemu_irq_handler handler, void *opaque, int n,
> >>                               qemu_eoi_add_notifier add_notifier);
> >> and called it from the XICS initialization code.
> >>
> >> It could work out if pci_get_irq() or pci_route_irq_fn() returned qemu_irq but no, they just return
> >> a global IRQ number (pure or embedded in a struct) and there is no common way to resolve qemu_irq
> >> (and then add_eoi_notifier()) from that number within vfio_pci.
> > 
> > Well GSI and qemu_irq are different address spaces.  We still need GSI
> > for any kind of qemu bypass case.
> 
> No, that is ok, we also need GSI because XICS and IOAPIC need it in the end.
> 
> >> May be we could add the callback pointer into PCIINTxRoute?
> > 
> > Maybe, but why is this PCI specific?  Can't we call it as
> > qemu_add_irq_eoi_notifier(pdev->irq[0], Notifier)?  That would work much
> > like qemu_set_irq, extracting the irq number from the IRQState and
> > passing it through to the add_notifier callback for IRQState until it
> > got to the ioapic/pic/xics.
> > 
> > int qemu_add_irq_eoi_notifier(qemu_irq *irq, Notifier notifier)
> > {
> >     if (!irq || !irq->add_eoi_notifier)
> >         return -1;
> > 
> >    return irq->add_eoi_notifier(irq->opaque, irq->n, notifier);
> > }
> > 
> 
> Then we will have to entirely replace qemu_allocate_irqs() with qemu_allocate_irqs2() and pass some
> non-zero add_eoi_notifier() on every level, at least for PCI for now. I would like to avoid that if
> possible - hard to get accepted :)

Yep, that's why I said it was the same kind of mess as the PCIBus intx
routing.  It's intrusive, but qemu_irq is the common interrupt model so
we need to make use of it.  A callback on QEMUMachine seems completely
random.  Thanks,

Alex

* Re: [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
  2012-07-12  4:43             ` Alex Williamson
@ 2012-07-12  4:58               ` Alexey Kardashevskiy
  2012-07-12  5:29                 ` Alex Williamson
  0 siblings, 1 reply; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-12  4:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Alexander Graf, qemu-ppc, Jan Kiszka, David Gibson

On 12/07/12 14:43, Alex Williamson wrote:
> On Thu, 2012-07-12 at 14:38 +1000, Alexey Kardashevskiy wrote:
>> On 12/07/12 14:31, Alex Williamson wrote:
>>> On Thu, 2012-07-12 at 14:16 +1000, Alexey Kardashevskiy wrote:
>>>> On 12/07/12 12:54, Alex Williamson wrote:
>>>>> On Wed, 2012-07-11 at 12:25 +1000, Alexey Kardashevskiy wrote:
>>>>>> On 11/07/12 02:57, Alex Williamson wrote:
>>>>>>> On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
>>>>>>>> The two patches in this set are supposed to add VFIO support for POWER.
>>>>>>>>
>>>>>>>> The first one adds one more step in the initalizaion sequence which I am not
>>>>>>>> sure is correct.
>>>>>>>>
>>>>>>>> The second patch adds actual VFIO support. It is not ready to submit but
>>>>>>>> ready to discuss. I would like to get rid of all #ifdef TARGET_PPC64 in patch #2
>>>>>>>> and I wonder if there is any plan to implement some generic EOI support code, etc.
>>>>>>>
>>>>>>> A generic EOI notifier is on my todo list, but I have no idea what it's
>>>>>>> going to look like.  As you know, I've got an ioapic specific notifier
>>>>>>> in my tree, you add a spapr specific one.  I welcome ideas on how to
>>>>>>> create something generic that has a chance of being accepted.  Thanks,
>>>>>>
>>>>>>
>>>>>> So far the only platform specific call is xxxx_add_gsi_eoi_notifier. The
>>>>>> xxxx_remove_gsi_eoi_notifier only calls notifier_remove, you've got to fix yours
>>>>>> ioapic_remove_gsi_eoi_notifier() as it does too much :)
>>>>>>
>>>>>>
>>>>>> The only place for placing "add_eoi" callback I can see right now is QEMUMachine as there is no
>>>>>> unified machine interrupt controller - IOAPIC has its own type TYPE_IOAPIC_COMMON and XICS is not
>>>>>> even a SysBusDevice. And the callback is not specific for any kind of bus so it cannot go to PCIBus.
>>>>>>
>>>>>> Does it sound reasonable?
>>>>>
>>>>> I suspect we'd need to somehow tie it into qemu_irq where both handlers
>>>>> and notifiers are allocated so we don't really care the underlying
>>>>> implementation.  Something like qemu_add_irq_eoi_notifier(qemu_irq
>>>>> irq, ...).  It's another mess like adding the PCIBus interrupt line to
>>>>> gsi effort though.  Thanks,
>>>>
>>>>
>>>> Tried. Added add_eoi_notifier() callback to qemu_irq, new IRQ allocator:
>>>> qemu_irq *qemu_allocate_irqs2(qemu_irq_handler handler, void *opaque, int n,
>>>>                               qemu_eoi_add_notifier add_notifier);
>>>> and called it from the XICS initialization code.
>>>>
>>>> It could work out if pci_get_irq() or pci_route_irq_fn() returned qemu_irq but no, they just return
>>>> a global IRQ number (pure or embedded in a struct) and there is no common way to resolve qemu_irq
>>>> (and then add_eoi_notifier()) from that number within vfio_pci.
>>>
>>> Well GSI and qemu_irq are different address spaces.  We still need GSI
>>> for any kind of qemu bypass case.
>>
>> No, that is ok, we also need GSI because XICS and IOAPIC need it in the end.
>>
>>>> May be we could add the callback pointer into PCIINTxRoute?
>>>
>>> Maybe, but why is this PCI specific?  Can't we call it as
>>> qemu_add_irq_eoi_notifier(pdev->irq[0], Notifier)?  That would work much
>>> like qemu_set_irq, extracting the irq number from the IRQState and
>>> passing it through to the add_notifier callback for IRQState until it
>>> got to the ioapic/pic/xics.
>>>
>>> int qemu_add_irq_eoi_notifier(qemu_irq *irq, Notifier notifier)
>>> {
>>>     if (!irq || !irq->add_eoi_notifier)
>>>         return -1;
>>>
>>>    return irq->add_eoi_notifier(irq->opaque, irq->n, notifier);
>>> }
>>>
>>
>> Then we will have to entirely replace qemu_allocate_irqs() with qemu_allocate_irqs2() and pass some
>> non-zero add_eoi_notifier() on every level, at least for PCI for now. I would like to avoid that if
>> possible - hard to get accepted :)
> 
> Yep, that's why I said it was the same kind of mess as the PCIBus intx
> routing.  It's intrusive, but qemu_irq is the common interrupt model so
> we need to make use of it.

There are 2 levels of intrusion.

1. Fix PCIINTxRoute to return the GSI's qemu_irq as well.

2. Add add_eoi_notifier to all levels including PCI. As part of this, we will have to add this
callback to all pci_register_bus() calls to reach the global interrupts via the platform-specific PCI bus.

I would stay with 1). Is that bad?
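
A rough sketch of what option 1 could look like.  All names here
(SketchINTxRoute, SketchHostBridge, sketch_route_intx) are hypothetical and
the real PCIINTxRoute may differ; the (slot + pin) % 4 swizzle is just the
conventional INTx mapping used as a placeholder.  The route keeps the global
IRQ number and additionally carries the controller-level qemu_irq for the
same line.

```c
#include <stddef.h>

struct IRQState { int n; };                 /* simplified stand-in */
typedef struct IRQState *qemu_irq;

typedef enum {
    SKETCH_INTX_ENABLED,
    SKETCH_INTX_DISABLED
} SketchINTxMode;

/* Hypothetical route: GSI number plus the matching controller qemu_irq. */
typedef struct SketchINTxRoute {
    SketchINTxMode mode;
    int irq;            /* global IRQ number, still needed for bypass */
    qemu_irq gsi_irq;   /* controller-level qemu_irq for the same line */
} SketchINTxRoute;

#define SKETCH_NR_INTX 4

typedef struct SketchHostBridge {
    int gsi_base;                            /* first GSI used for INTx */
    struct IRQState gsi[SKETCH_NR_INTX];     /* controller inputs */
} SketchHostBridge;

/* Platform-specific routing hook: resolve (slot, pin) to a route. */
SketchINTxRoute sketch_route_intx(SketchHostBridge *hb, int slot, int pin)
{
    SketchINTxRoute route = { SKETCH_INTX_DISABLED, -1, NULL };
    int idx;

    if (!hb || slot < 0 || pin < 0 || pin >= SKETCH_NR_INTX) {
        return route;
    }
    idx = (slot + pin) % SKETCH_NR_INTX;     /* placeholder swizzle */
    route.mode = SKETCH_INTX_ENABLED;
    route.irq = hb->gsi_base + idx;
    route.gsi_irq = &hb->gsi[idx];
    return route;
}
```

With the qemu_irq in the route, vfio_pci would not need to map a bare IRQ
number back to a controller object before registering for EOI.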


> A callback on QEMUMachine seems completely
> random.  Thanks,

True :)



-- 
Alexey

* Re: [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
  2012-07-12  4:58               ` Alexey Kardashevskiy
@ 2012-07-12  5:29                 ` Alex Williamson
  2012-07-12  5:47                   ` Alexey Kardashevskiy
  2012-07-23  5:32                   ` [Qemu-devel] [PATCH 0/3] vfio-pci: reworking end-of-interrupt Alexey Kardashevskiy
  0 siblings, 2 replies; 52+ messages in thread
From: Alex Williamson @ 2012-07-12  5:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alexander Graf, qemu-ppc, Jan Kiszka, David Gibson

On Thu, 2012-07-12 at 14:58 +1000, Alexey Kardashevskiy wrote:
> On 12/07/12 14:43, Alex Williamson wrote:
> > On Thu, 2012-07-12 at 14:38 +1000, Alexey Kardashevskiy wrote:
> >> On 12/07/12 14:31, Alex Williamson wrote:
> >>> On Thu, 2012-07-12 at 14:16 +1000, Alexey Kardashevskiy wrote:
> >>>> On 12/07/12 12:54, Alex Williamson wrote:
> >>>>> On Wed, 2012-07-11 at 12:25 +1000, Alexey Kardashevskiy wrote:
> >>>>>> On 11/07/12 02:57, Alex Williamson wrote:
> >>>>>>> On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
> >>>>>>>> The two patches in this set are supposed to add VFIO support for POWER.
> >>>>>>>>
> >>>>>>>> The first one adds one more step in the initalizaion sequence which I am not
> >>>>>>>> sure is correct.
> >>>>>>>>
> >>>>>>>> The second patch adds actual VFIO support. It is not ready to submit but
> >>>>>>>> ready to discuss. I would like to get rid of all #ifdef TARGET_PPC64 in patch #2
> >>>>>>>> and I wonder if there is any plan to implement some generic EOI support code, etc.
> >>>>>>>
> >>>>>>> A generic EOI notifier is on my todo list, but I have no idea what it's
> >>>>>>> going to look like.  As you know, I've got an ioapic specific notifier
> >>>>>>> in my tree, you add a spapr specific one.  I welcome ideas on how to
> >>>>>>> create something generic that has a chance of being accepted.  Thanks,
> >>>>>>
> >>>>>>
> >>>>>> So far the only platform specific call is xxxx_add_gsi_eoi_notifier. The
> >>>>>> xxxx_remove_gsi_eoi_notifier only calls notifier_remove, you've got to fix yours
> >>>>>> ioapic_remove_gsi_eoi_notifier() as it does too much :)
> >>>>>>
> >>>>>>
> >>>>>> The only place for placing "add_eoi" callback I can see right now is QEMUMachine as there is no
> >>>>>> unified machine interrupt controller - IOAPIC has its own type TYPE_IOAPIC_COMMON and XICS is not
> >>>>>> even a SysBusDevice. And the callback is not specific for any kind of bus so it cannot go to PCIBus.
> >>>>>>
> >>>>>> Does it sound reasonable?
> >>>>>
> >>>>> I suspect we'd need to somehow tie it into qemu_irq where both handlers
> >>>>> and notifiers are allocated so we don't really care the underlying
> >>>>> implementation.  Something like qemu_add_irq_eoi_notifier(qemu_irq
> >>>>> irq, ...).  It's another mess like adding the PCIBus interrupt line to
> >>>>> gsi effort though.  Thanks,
> >>>>
> >>>>
> >>>> Tried. Added add_eoi_notifier() callback to qemu_irq, new IRQ allocator:
> >>>> qemu_irq *qemu_allocate_irqs2(qemu_irq_handler handler, void *opaque, int n,
> >>>>                               qemu_eoi_add_notifier add_notifier);
> >>>> and called it from the XICS initialization code.
> >>>>
> >>>> It could work out if pci_get_irq() or pci_route_irq_fn() returned qemu_irq but no, they just return
> >>>> a global IRQ number (pure or embedded in a struct) and there is no common way to resolve qemu_irq
> >>>> (and then add_eoi_notifier()) from that number within vfio_pci.
> >>>
> >>> Well GSI and qemu_irq are different address spaces.  We still need GSI
> >>> for any kind of qemu bypass case.
> >>
> >> No, that is ok, we also need GSI because XICS and IOAPIC need it in the end.
> >>
> >>>> May be we could add the callback pointer into PCIINTxRoute?
> >>>
> >>> Maybe, but why is this PCI specific?  Can't we call it as
> >>> qemu_add_irq_eoi_notifier(pdev->irq[0], Notifier)?  That would work much
> >>> like qemu_set_irq, extracting the irq number from the IRQState and
> >>> passing it through to the add_notifier callback for IRQState until it
> >>> got to the ioapic/pic/xics.
> >>>
> >>> int qemu_add_irq_eoi_notifier(qemu_irq *irq, Notifier notifier)
> >>> {
> >>>     if (!irq || !irq->add_eoi_notifier)
> >>>         return -1;
> >>>
> >>>    return irq->add_eoi_notifier(irq->opaque, irq->n, notifier);
> >>> }
> >>>
> >>
> >> Then we will have to entirely replace qemu_allocate_irqs() with qemu_allocate_irqs2() and pass some
> >> non-zero add_eoi_notifier() on every level, at least for PCI for now. I would like to avoid that if
> >> possible - hard to get accepted :)
> > 
> > Yep, that's why I said it was the same kind of mess as the PCIBus intx
> > routing.  It's intrusive, but qemu_irq is the common interrupt model so
> > we need to make use of it.
> 
> There are 2 level of intrusion.
> 
> 1. Fix PCIINTxRoute to return the GSI's qemu_irq as well.

Slightly confusing because pdev->irq[] is a qemu_irq, but you want the
actual ioapic/pic/xics qemu_irq w/o walking through the various devices,
correct?  I'm not sure what we do once we have it though.  Do we get to
call something like the function outlined above on these "special"
qemu_irqs?

> 2. Add add_eoi_notifier to all levels including PCI. As a part of this, we will have to add this
> callback to all pci_register_bus() calls to reach global interrupts via platform-specific PCI bus.

Just like the PCI INTx route callback, most of these can just be
passthrough.  We just need to get to the end qemu_irq that registered a
real add notifier.  That might make it possible to do it w/o interfering
too much with other callers, I hope.

> I would stay with 1). Is that bad?

It still seems to present a rather large incongruity, but if we're
planning to cache the qemu_irq there anyway, maybe it's a secondary use.
Thanks,

Alex

* Re: [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
  2012-07-12  5:29                 ` Alex Williamson
@ 2012-07-12  5:47                   ` Alexey Kardashevskiy
  2012-07-16  3:51                     ` Alexey Kardashevskiy
  2012-07-23  5:32                   ` [Qemu-devel] [PATCH 0/3] vfio-pci: reworking end-of-interrupt Alexey Kardashevskiy
  1 sibling, 1 reply; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-12  5:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Alexander Graf, qemu-ppc, Jan Kiszka, David Gibson

On 12/07/12 15:29, Alex Williamson wrote:
> On Thu, 2012-07-12 at 14:58 +1000, Alexey Kardashevskiy wrote:
>> On 12/07/12 14:43, Alex Williamson wrote:
>>> On Thu, 2012-07-12 at 14:38 +1000, Alexey Kardashevskiy wrote:
>>>> On 12/07/12 14:31, Alex Williamson wrote:
>>>>> On Thu, 2012-07-12 at 14:16 +1000, Alexey Kardashevskiy wrote:
>>>>>> On 12/07/12 12:54, Alex Williamson wrote:
>>>>>>> On Wed, 2012-07-11 at 12:25 +1000, Alexey Kardashevskiy wrote:
>>>>>>>> On 11/07/12 02:57, Alex Williamson wrote:
>>>>>>>>> On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
>>>>>>>>>> The two patches in this set are supposed to add VFIO support for POWER.
>>>>>>>>>>
>>>>>>>>>> The first one adds one more step in the initalizaion sequence which I am not
>>>>>>>>>> sure is correct.
>>>>>>>>>>
>>>>>>>>>> The second patch adds actual VFIO support. It is not ready to submit but
>>>>>>>>>> ready to discuss. I would like to get rid of all #ifdef TARGET_PPC64 in patch #2
>>>>>>>>>> and I wonder if there is any plan to implement some generic EOI support code, etc.
>>>>>>>>>
>>>>>>>>> A generic EOI notifier is on my todo list, but I have no idea what it's
>>>>>>>>> going to look like.  As you know, I've got an ioapic specific notifier
>>>>>>>>> in my tree, you add a spapr specific one.  I welcome ideas on how to
>>>>>>>>> create something generic that has a chance of being accepted.  Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>> So far the only platform specific call is xxxx_add_gsi_eoi_notifier. The
>>>>>>>> xxxx_remove_gsi_eoi_notifier only calls notifier_remove, you've got to fix yours
>>>>>>>> ioapic_remove_gsi_eoi_notifier() as it does too much :)
>>>>>>>>
>>>>>>>>
>>>>>>>> The only place for placing "add_eoi" callback I can see right now is QEMUMachine as there is no
>>>>>>>> unified machine interrupt controller - IOAPIC has its own type TYPE_IOAPIC_COMMON and XICS is not
>>>>>>>> even a SysBusDevice. And the callback is not specific for any kind of bus so it cannot go to PCIBus.
>>>>>>>>
>>>>>>>> Does it sound reasonable?
>>>>>>>
>>>>>>> I suspect we'd need to somehow tie it into qemu_irq where both handlers
>>>>>>> and notifiers are allocated so we don't really care the underlying
>>>>>>> implementation.  Something like qemu_add_irq_eoi_notifier(qemu_irq
>>>>>>> irq, ...).  It's another mess like adding the PCIBus interrupt line to
>>>>>>> gsi effort though.  Thanks,
>>>>>>
>>>>>>
>>>>>> Tried. Added add_eoi_notifier() callback to qemu_irq, new IRQ allocator:
>>>>>> qemu_irq *qemu_allocate_irqs2(qemu_irq_handler handler, void *opaque, int n,
>>>>>>                               qemu_eoi_add_notifier add_notifier);
>>>>>> and called it from the XICS initialization code.
>>>>>>
>>>>>> It could work out if pci_get_irq() or pci_route_irq_fn() returned qemu_irq but no, they just return
>>>>>> a global IRQ number (pure or embedded in a struct) and there is no common way to resolve qemu_irq
>>>>>> (and then add_eoi_notifier()) from that number within vfio_pci.
>>>>>
>>>>> Well GSI and qemu_irq are different address spaces.  We still need GSI
>>>>> for any kind of qemu bypass case.
>>>>
>>>> No, that is ok, we also need GSI because XICS and IOAPIC need it in the end.
>>>>
>>>>>> Maybe we could add the callback pointer into PCIINTxRoute?
>>>>>
>>>>> Maybe, but why is this PCI specific?  Can't we call it as
>>>>> qemu_add_irq_eoi_notifier(pdev->irq[0], Notifier)?  That would work much
>>>>> like qemu_set_irq, extracting the irq number from the IRQState and
>>>>> passing it through to the add_notifier callback for IRQState until it
>>>>> got to the ioapic/pic/xics.
>>>>>
>>>>> int qemu_add_irq_eoi_notifier(qemu_irq irq, Notifier *notifier)
>>>>> {
>>>>>     /* qemu_irq is already a pointer to IRQState */
>>>>>     if (!irq || !irq->add_eoi_notifier) {
>>>>>         return -1;
>>>>>     }
>>>>>
>>>>>     return irq->add_eoi_notifier(irq->opaque, irq->n, notifier);
>>>>> }
>>>>>
>>>>
>>>> Then we will have to entirely replace qemu_allocate_irqs() with qemu_allocate_irqs2() and pass some
>>>> non-zero add_eoi_notifier() on every level, at least for PCI for now. I would like to avoid that if
>>>> possible - hard to get accepted :)
>>>
>>> Yep, that's why I said it was the same kind of mess as the PCIBus intx
>>> routing.  It's intrusive, but qemu_irq is the common interrupt model so
>>> we need to make use of it.
>>
>> There are 2 levels of intrusion.
>>
>> 1. Fix PCIINTxRoute to return the GSI's qemu_irq as well.
> 
> Slightly confusing because pdev->irq[] is a qemu_irq, but you want the
> actual ioapic/pic/xics qemu_irq w/o walking through the various devices,
> correct?

Yes. The qemu_irq which corresponds to the GSI which pci_get_irq is returning.

>  I'm not sure what we do once we have it though.  Do we get to
> call something like the function outlined above on these "special"
> qemu_irqs?

They are not special but just "global". This is what hw/pc_piix.c allocates with qemu_allocate_irqs().

Assuming we have properly initialized add_eoi_notifier() callback in the qemu_irq struct, we can
easily add a notifier via this callback.
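
To illustrate the idea, here is a minimal sketch under assumptions. The types
below are toy stand-ins, not the real QEMU definitions: add_eoi_notifier is the
proposed field (qemu_irq does not have it today), and ToyIntc,
toy_add_eoi_notifier, toy_eoi and count_eoi are names made up for this example.

```c
#include <stddef.h>

/* Toy stand-in for QEMU's Notifier (hypothetical, simplified). */
typedef struct Notifier Notifier;
struct Notifier {
    void (*notify)(Notifier *notifier, void *data);
};

/* Proposed callback type: registers an EOI notifier for line n. */
typedef int (*eoi_add_notifier_fn)(void *opaque, int n, Notifier *notifier);

/* Toy stand-in for IRQState with the proposed extra field. */
typedef struct IRQState {
    void *opaque;                          /* the interrupt controller */
    int n;                                 /* global IRQ number */
    eoi_add_notifier_fn add_eoi_notifier;  /* NULL if not supported */
} IRQState;
typedef IRQState *qemu_irq;

/* Toy interrupt controller: one EOI notifier slot per global IRQ. */
#define TOY_NR_IRQS 4
typedef struct ToyIntc {
    Notifier *eoi[TOY_NR_IRQS];
} ToyIntc;

static int toy_add_eoi_notifier(void *opaque, int n, Notifier *notifier)
{
    ToyIntc *intc = opaque;

    if (n < 0 || n >= TOY_NR_IRQS) {
        return -1;
    }
    intc->eoi[n] = notifier;
    return 0;
}

/* Generic entry point: works on any qemu_irq that carries the callback. */
static int qemu_add_irq_eoi_notifier(qemu_irq irq, Notifier *notifier)
{
    if (!irq || !irq->add_eoi_notifier) {
        return -1;
    }
    return irq->add_eoi_notifier(irq->opaque, irq->n, notifier);
}

/* Called by the toy controller when the guest signals EOI on line n. */
static void toy_eoi(ToyIntc *intc, int n)
{
    if (n >= 0 && n < TOY_NR_IRQS && intc->eoi[n]) {
        intc->eoi[n]->notify(intc->eoi[n], NULL);
    }
}

/* Example device-side handler: just counts EOIs. */
static int eoi_count;

static void count_eoi(Notifier *notifier, void *data)
{
    (void)notifier;
    (void)data;
    eoi_count++;
}
```

With something like this, vfio_pci would only need the qemu_irq that
corresponds to the GSI and would not have to know whether an IOAPIC or XICS
sits behind it.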

Or I did not get the whole idea.

> 
>> 2. Add add_eoi_notifier to all levels including PCI. As a part of this, we will have to add this
>> callback to all pci_register_bus() calls to reach global interrupts via platform-specific PCI bus.
> 
> Just like the PCI INTx route callback, most of these can just be
> passthrough.  We just need to get to the end qemu_irq that registered a
> real add notifier.  That might make it possible to do it w/o interfering
> too much with other callers, I hope.

Yes. This is why I propose to extend the PCIINTxRoute struct.

Actually, even adding a callback to QEMUMachine is not that bad an idea.

If a pointer to struct QEMUMachine were passed into QEMUMachineInitFunc(), that would be the right
place to initialize such a callback: one per machine rather than one per qemu_irq, since it is the
same for the whole machine and will not change.
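
A minimal sketch of the per-machine variant, again under assumptions: ToyMachine
stands in for QEMUMachine, and toy_pseries_init, toy_xics_add_eoi and
machine_add_eoi_notifier are hypothetical names, not existing QEMU functions.

```c
#include <stddef.h>

/* Toy stand-in for QEMU's Notifier (hypothetical, simplified). */
typedef struct Notifier {
    void (*notify)(struct Notifier *notifier, void *data);
} Notifier;

/* Hypothetical per-machine hook: register an EOI notifier for a GSI. */
typedef int (*machine_add_eoi_fn)(int gsi, Notifier *notifier);

/* Toy stand-in for QEMUMachine with the proposed callback. */
typedef struct ToyMachine {
    const char *name;
    machine_add_eoi_fn add_eoi_notifier;   /* set once by the init function */
} ToyMachine;

/* Toy XICS-like backend: one notifier slot per global interrupt. */
#define TOY_NR_GSI 8
static Notifier *toy_gsi_eoi[TOY_NR_GSI];

static int toy_xics_add_eoi(int gsi, Notifier *notifier)
{
    if (gsi < 0 || gsi >= TOY_NR_GSI) {
        return -1;
    }
    toy_gsi_eoi[gsi] = notifier;
    return 0;
}

/* Machine init receives the machine pointer and installs the hook. */
static void toy_pseries_init(ToyMachine *machine)
{
    machine->add_eoi_notifier = toy_xics_add_eoi;
}

/* What a device such as vfio_pci would call with the GSI it resolved. */
static int machine_add_eoi_notifier(ToyMachine *machine, int gsi,
                                    Notifier *notifier)
{
    if (!machine || !machine->add_eoi_notifier) {
        return -1;
    }
    return machine->add_eoi_notifier(gsi, notifier);
}
```

The trade-off against the per-qemu_irq callback is that the hook is set in one
place per machine, but the caller has to resolve the GSI itself.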


>> I would stay with 1). Is that bad?
> 
> It still seems to present a rather large incongruity, but if we're
> planning to cache the qemu_irq there anyway, maybe it's a secondary use.

Cannot see how it is different from having pci_get_irq() or pci_route_irq_fn() though.


-- 
Alexey


* Re: [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support
  2012-07-12  3:11       ` Alex Williamson
@ 2012-07-12  8:47         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-12  8:47 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On 12/07/12 13:11, Alex Williamson wrote:
> On Wed, 2012-07-11 at 12:54 +1000, Alexey Kardashevskiy wrote:
>> On 11/07/12 02:55, Alex Williamson wrote:
>>> On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
>>>> The patch enables VFIO on POWER.
>>>>
>>>> It literally does the following:
>>>>
>>>> 1. POWERPC IOMMU support (the kernel counterpart is required)
>>>>
>>>> 2. Added #ifdef TARGET_PPC64 for EOI handlers initialisation.
>>>>
>>>> 3. Added vfio_get_container_fd() to VFIO in order to initialize 1).
>>>>
>>>> 4. Makefile fixed and "is_vfio" flag added into sPAPR PHB - required to
>>>> distinguish VFIO's DMA context from the emulated one.
>>>>
>>>> With the patches posted a bit earlier today, this patch fully supports
>>>> VFIO, including MSI-X.
>>>>
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>  hw/ppc/Makefile.objs |    3 ++
>>>>  hw/spapr.h           |    4 +++
>>>>  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  hw/spapr_pci.c       |   23 ++++++++++++-
>>>>  hw/spapr_pci.h       |    2 ++
>>>>  hw/vfio_pci.c        |   76 +++++++++++++++++++++++++++++++++++++++++--
>>>>  hw/vfio_pci.h        |    2 ++
>>>>  7 files changed, 193 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>> index f573a95..c46a049 100644
>>>> --- a/hw/ppc/Makefile.objs
>>>> +++ b/hw/ppc/Makefile.objs
>>>> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>>>>  # Xilinx PPC peripherals
>>>>  obj-y += xilinx_ethlite.o
>>>>  
>>>> +# VFIO PCI device assignment
>>>> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
>>>> +
>>>>  obj-y := $(addprefix ../,$(obj-y))
>>>> diff --git a/hw/spapr.h b/hw/spapr.h
>>>> index b37f337..9dca704 100644
>>>> --- a/hw/spapr.h
>>>> +++ b/hw/spapr.h
>>>> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>>>                        DMAContext *dma);
>>>>  
>>>> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
>>>> +                         uint64_t *dma32_window_start,
>>>> +                         uint64_t *dma32_window_size);
>>>> +
>>>>  #endif /* !defined (__HW_SPAPR_H__) */
>>>> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
>>>> index 50c288d..0a194e8 100644
>>>> --- a/hw/spapr_iommu.c
>>>> +++ b/hw/spapr_iommu.c
>>>> @@ -16,6 +16,8 @@
>>>>   * You should have received a copy of the GNU Lesser General Public
>>>>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>>>>   */
>>>> +#include <sys/ioctl.h>
>>>> +
>>>>  #include "hw.h"
>>>>  #include "kvm.h"
>>>>  #include "qdev.h"
>>>> @@ -23,6 +25,7 @@
>>>>  #include "dma.h"
>>>>  
>>>>  #include "hw/spapr.h"
>>>> +#include "hw/linux-vfio.h"
>>>
>>> I really need to move this into linux-headers.
>>>
>>>>  
>>>>  #include <libfdt.h>
>>>>  
>>>> @@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>>>>      return 0;
>>>>  }
>>>>  
>>>> +/* -------- API for POWERPC IOMMU -------- */
>>>> +
>>>> +#define POWERPC_IOMMU           2
>>>> +
>>>> +struct tce_iommu_info {
>>>> +    __u32 argsz;
>>>> +    __u32 dma32_window_start;
>>>> +    __u32 dma32_window_size;
>>>> +};
>>>> +
>>>> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>>>> +
>>>> +struct tce_iommu_dma_map {
>>>> +    __u32 argsz;
>>>> +    __u64 va;
>>>> +    __u64 dmaaddr;
>>>> +};
>>>> +
>>>> +#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
>>>> +#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
>>>
>>> I assume this would eventually go into the kernel vfio.h with a VFIO_
>>> prefix.  Add a flags field to the structures or it'll be hard to extend
>>> them later.
>>
>>
>> We can always define another type of IOMMU :) But yes, I'll extend both map and info structures.
>>
>>
>>
>>>> +typedef struct sPAPRVFIOTable {
>>>> +    int fd;
>>>> +    uint32_t liobn;
>>>> +    QLIST_ENTRY(sPAPRVFIOTable) list;
>>>> +} sPAPRVFIOTable;
>>>> +
>>>> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
>>>> +
>>>> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
>>>> +                         uint64_t *dma32_window_start,
>>>> +                         uint64_t *dma32_window_size)
>>>> +{
>>>> +    sPAPRVFIOTable *t;
>>>> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
>>>> +
>>>> +    if (ioctl(fd, POWERPC_IOMMU_GET_INFO, &info)) {
>>>> +        fprintf(stderr, "POWERPC_IOMMU_GET_INFO failed %d\n", errno);
>>>> +        return;
>>>> +    }
>>>> +    *dma32_window_start = info.dma32_window_start;
>>>> +    *dma32_window_size = info.dma32_window_size;
>>>> +
>>>> +    t = g_malloc0(sizeof(*t));
>>>> +    t->fd = fd;
>>>> +    t->liobn = liobn;
>>>> +
>>>> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
>>>> +}
>>>> +
>>>> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
>>>> +{
>>>> +    sPAPRVFIOTable *t;
>>>> +    struct tce_iommu_dma_map map = {
>>>> +        .argsz = sizeof(map),
>>>> +        .va = 0,
>>>> +        .dmaaddr = ioba,
>>>> +    };
>>>> +
>>>> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
>>>> +        if (t->liobn != liobn) {
>>>> +            continue;
>>>> +        }
>>>> +        if (tce) {
>>>> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
>>>> +            if (ioctl(t->fd, POWERPC_IOMMU_MAP_DMA, &map)) {
>>>> +                fprintf(stderr, "TCE_MAP_DMA: %d\n", errno);
>>>> +                return H_PARAMETER;
>>>> +            }
>>>> +        } else {
>>>> +            if (ioctl(t->fd, POWERPC_IOMMU_UNMAP_DMA, &map)) {
>>>> +                fprintf(stderr, "TCE_UNMAP_DMA: %d\n", errno);
>>>> +                return H_PARAMETER;
>>>> +            }
>>>> +        }
>>>> +        return H_SUCCESS;
>>>> +    }
>>>> +    return H_CONTINUE; /* positive non-zero value */
>>>> +}
>>>> +
>>>
>>> I wish you could do this through a MemoryListener like we do on x86.
>>
>>
>> What is the point? Map the entire RAM to the guest? And it will still use our own IOMMU ioctls as it
>> is completely our IOMMU implementation.
> 
> Yeah, with Ben's explanation it's probably not worth the effort.  We
> might want to consider putting stuff like this in logical vfio-arch
> files though (vfio-spapr, vfio-x86, vfio-x86-kvm, etc).
> 
>>>>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>>>                                target_ulong opcode, target_ulong *args)
>>>>  {
>>>> @@ -203,6 +286,10 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>>>      if (0 >= ret) {
>>>>          return ret ? H_PARAMETER : H_SUCCESS;
>>>>      }
>>>> +    ret = put_tce_vfio(liobn, ioba, tce);
>>>> +    if (0 >= ret) {
>>>> +        return ret ? H_PARAMETER : H_SUCCESS;
>>>> +    }
>>>>  #ifdef DEBUG_TCE
>>>>      fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
>>>>              "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx " ret=%d\n",
>>>> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
>>>> index 5f89003..3375c3f 100644
>>>> --- a/hw/spapr_pci.c
>>>> +++ b/hw/spapr_pci.c
>>>> @@ -29,6 +29,7 @@
>>>>  #include "pci_host.h"
>>>>  #include "hw/spapr.h"
>>>>  #include "hw/spapr_pci.h"
>>>> +#include "hw/vfio_pci.h"
>>>>  #include "exec-memory.h"
>>>>  #include <libfdt.h>
>>>>  #include "trace.h"
>>>> @@ -440,6 +441,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>>>>                   level);
>>>>  }
>>>>  
>>>> +static int pci_spapr_get_irq(void *opaque, int irq_num)
>>>> +{
>>>> +    sPAPRPHBState *phb = opaque;
>>>> +    return phb->lsi_table[irq_num].dt_irq;
>>>> +}
>>>> +
>>>>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>>>>                                unsigned size)
>>>>  {
>>>> @@ -567,7 +574,8 @@ static int spapr_phb_init(SysBusDevice *s)
>>>>  
>>>>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>>>>                             phb->busname ? phb->busname : phb->dtbusname,
>>>> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
>>>> +                           pci_spapr_set_irq, pci_spapr_get_irq,
>>>> +                           pci_spapr_map_irq, phb,
>>>>                             &phb->memspace, &phb->iospace,
>>>>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>>>>      phb->host_state.bus = bus;
>>>> @@ -596,6 +604,7 @@ static Property spapr_phb_properties[] = {
>>>>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>>>>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>>>>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
>>>> +    DEFINE_PROP_UINT8("vfio", sPAPRPHBState, is_vfio, 0),
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>>  
>>>> @@ -639,6 +648,18 @@ void spapr_create_phb(sPAPREnvironment *spapr,
>>>>  /* Finalize PCI setup, called when all devices are already created */
>>>>  int spapr_finalize_pci_setup(sPAPRPHBState *phb)
>>>>  {
>>>> +    if (phb->is_vfio) {
>>>> +        int fd = vfio_get_container_fd(phb->host_state.bus);
>>>> +
>>>> +        if (fd < 0) {
>>>> +            return fd;
>>>> +        }
>>>> +        spapr_vfio_init_dma(fd, phb->dma_liobn,
>>>> +                            &phb->dma_window_start,
>>>> +                            &phb->dma_window_size);
>>>> +        return 0;
>>>> +    }
>>>> +
>>>>      phb->dma_window_start = 0;
>>>>      phb->dma_window_size = 0x40000000;
>>>>      phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
>>>> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
>>>> index 3aae273..a4f031b 100644
>>>> --- a/hw/spapr_pci.h
>>>> +++ b/hw/spapr_pci.h
>>>> @@ -57,6 +57,8 @@ typedef struct sPAPRPHBState {
>>>>          int nvec;
>>>>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>>>>  
>>>> +    uint8_t is_vfio;
>>>> +
>>>>      QLIST_ENTRY(sPAPRPHBState) list;
>>>>  } sPAPRPHBState;
>>>>  
>>>> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
>>>> index 1ac287f..cc0b974 100644
>>>> --- a/hw/vfio_pci.c
>>>> +++ b/hw/vfio_pci.c
>>>> @@ -21,7 +21,6 @@
>>>>  #include <dirent.h>
>>>>  #include <stdio.h>
>>>>  #include <unistd.h>
>>>> -#include <sys/io.h>
>>>>  #include <sys/ioctl.h>
>>>>  #include <sys/mman.h>
>>>>  #include <sys/types.h>
>>>> @@ -44,6 +43,17 @@
>>>>  #include "vfio_pci.h"
>>>>  #include "linux-vfio.h"
>>>>  
>>>> +#ifndef TARGET_PPC64
>>>> +#include <sys/io.h>
>>>> +#define VFIO_IOMMU_EXTENSION    VFIO_X86_IOMMU
>>>> +#else
>>>> +#include "hw/pci_internals.h"
>>>> +#include "hw/xics.h"
>>>> +#include "hw/spapr.h"
>>>> +#define POWERPC_IOMMU           2
>>>> +#define VFIO_IOMMU_EXTENSION    POWERPC_IOMMU
>>>> +#endif
>>>> +
>>>
>>> VFIO_IOMMU_EXTENSION never gets used, POWERPC_IOMMU is redefined below.
>>
>>
>> Yes, a bit messy. Was not sure about the name so I postponed it.
>>
>>
>>>>  //#define DEBUG_VFIO
>>>>  #ifdef DEBUG_VFIO
>>>>  #define DPRINTF(fmt, ...) \
>>>> @@ -235,6 +245,7 @@ struct vfio_irq_set_fd {
>>>>  
>>>>  static void vfio_enable_intx_kvm(VFIODevice *vdev)
>>>>  {
>>>> +#ifndef TARGET_PPC64
>>>
>>> Why do you need this, aren't the extension checks sufficient for this to
>>> be a nop for you?
>>
>>
>> It uses ioapic_remove_gsi_eoi_notifier() so it needs some #ifdef anyway. And as we do not support
>> kvm_irqchip_in_kernel(), there is no point in fixing it and I disabled it all.
>> When we make eoi notifiers platform independent, then yes, it will be a nop.
> 
> Ah right, forgot you won't even build ioapic_*.
> 
>>>>  #ifdef CONFIG_KVM
>>>>      struct vfio_irq_set_fd irq_set_fd = {
>>>>  	.irq_set = {
>>>> @@ -298,10 +309,12 @@ fail:
>>>>      qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
>>>>      vfio_unmask_intx(vdev);
>>>>  #endif
>>>> +#endif
>>>>  }
>>>>  
>>>>  static void vfio_disable_intx_kvm(VFIODevice *vdev)
>>>>  {
>>>> +#ifndef TARGET_PPC64
>>>
>>> Same
>>
>> Same :)
>>
>>>
>>>>  #ifdef CONFIG_KVM
>>>>      struct vfio_irq_set_fd irq_set_fd = {
>>>>  	.irq_set = {
>>>> @@ -350,8 +363,10 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev)
>>>>      DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel disabled\n", __FUNCTION__,
>>>>              vdev->host.seg, vdev->host.bus, vdev->host.dev, vdev->host.func);
>>>>  #endif
>>>> +#endif
>>>>  }
>>>>  
>>>> +#ifndef TARGET_PPC64
>>>>  static void vfio_update_irq(Notifier *notify, void *data)
>>>>  {
>>>>      VFIODevice *vdev = container_of(notify, VFIODevice, intx.update_irq);
>>>> @@ -381,6 +396,7 @@ static void vfio_update_irq(Notifier *notify, void *data)
>>>>      /* Re-enable the interrupt in cased we missed an EOI */
>>>>      vfio_eoi(&vdev->intx.eoi, NULL);
>>>>  }
>>>> +#endif
>>>>  
>>>>  static int vfio_enable_intx(VFIODevice *vdev)
>>>>  {
>>>> @@ -404,10 +420,14 @@ static int vfio_enable_intx(VFIODevice *vdev)
>>>>      vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
>>>>      vdev->intx.irq = pci_get_irq(&vdev->pdev, vdev->intx.pin);
>>>>      vdev->intx.eoi.notify = vfio_eoi;
>>>> +#ifndef TARGET_PPC64
>>>>      ioapic_add_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
>>>
>>> This is really only a place holder for x86 too, I don't think my eoi
>>> notifier as written is acceptable upstream.  We really need some common
>>> infrastructure here.  I'm hoping to get the kvm acceleration in place
>>> which would make vfio usable on x86 with kvm (the common case), then
>>> work towards a generic eoi notifier.
>>>
>>>>  
>>>>      vdev->intx.update_irq.notify = vfio_update_irq;
>>>>      pci_add_irq_update_notifier(&vdev->pdev, &vdev->intx.update_irq);
>>>
>>> Can't you stub this out to make it safe to do on POWER too?
>>
>>
>> I could even simply enable it (not sure if it is going to be called ever though but anyway) once we
>> get unified eoi notifiers.
> 
> Right
> 
>>>> +#else
>>>> +    xics_add_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
>>>> +#endif
>>>>  
>>>>      if (event_notifier_init(&vdev->intx.interrupt, 0)) {
>>>>          error_report("vfio: Error: event_notifier_init failed\n");
>>>> @@ -440,8 +460,12 @@ static void vfio_disable_intx(VFIODevice *vdev)
>>>>      vfio_disable_intx_kvm(vdev);
>>>>      vfio_disable_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX);
>>>>  
>>>> +#ifndef TARGET_PPC64
>>>>      pci_remove_irq_update_notifier(&vdev->intx.update_irq);
>>>>      ioapic_remove_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
>>>> +#else
>>>> +    xics_remove_eoi_notifier(&vdev->intx.eoi);
>>>> +#endif
>>>>  
>>>>      fd = event_notifier_get_fd(&vdev->intx.interrupt);
>>>>      qemu_set_fd_handler(fd, NULL, NULL, vdev);
>>>> @@ -543,7 +567,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
>>>>      }
>>>>  
>>>>      fd = event_notifier_get_fd(&vdev->msi_vectors[vector].interrupt);
>>>> -
>>>> +#ifndef TARGET_PPC64
>>>>      vdev->msi_vectors[vector].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
>>>>      if (vdev->msi_vectors[vector].virq < 0 || 
>>>>          kvm_irqchip_add_irqfd(kvm_state, fd,
>>>> @@ -551,7 +575,11 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
>>>>          qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
>>>>                              &vdev->msi_vectors[vector]);
>>>>      }
>>>> -
>>>> +#else
>>>> +    vdev->msi_vectors[vector].virq = -1;
>>>> +    qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
>>>> +                        &vdev->msi_vectors[vector]);
>>>> +#endif
>>>
>>> This shouldn't be necessary once the abort is removed from
>>> kvm_irqchip_add_msi_route.  It'll be merged next time the kvm uq tree
>>> merges into qemu.
>>
>>
>> True, I just did not pick up your very last changes. Updating is always painful, and now it is even
>> worse than usual as pci_get_irq has been renamed to something else :) Will do though.
> 
> Yep, I think once Michael is back from holiday and does a pull request
> (and hopefully merges Jan's PCIBus irq routing patches) my tree will be
> down to mostly just the vfio driver and I'll start managing it like the
> kernel tree with a patch series that gets rebased.  I'm hoping that if I
> can get an acceptable level irqfd/eoifd implementation for x86 kvm in
> the kernel that I can rip out the ioapic eoi notifiers and submit the
> code as functional only with kvm and work out the generic eoi notifiers
> in qemu proper.
> 
>>>>      if (vdev->nr_vectors < vector + 1) {
>>>>          int i;
>>>>  
>>>> @@ -692,6 +720,7 @@ retry:
>>>>          fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
>>>>  
>>>>          msg = msi_get_msg(&vdev->pdev, i);
>>>> +#ifndef TARGET_PPC64
>>>>          vdev->msi_vectors[i].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
>>>>          if (vdev->msi_vectors[i].virq < 0 || 
>>>>              kvm_irqchip_add_irqfd(kvm_state, fd,
>>>> @@ -699,6 +728,12 @@ retry:
>>>>              qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
>>>>                                  &vdev->msi_vectors[i]);
>>>>          }
>>>> +#else
>>>> +        vdev->msi_vectors[i].virq = -1;
>>>> +        qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
>>>> +                            &vdev->msi_vectors[i]);
>>>> +        msg = msg;
>>>> +#endif
>>>
>>> Same here
>>>
>>>>      }
>>>>      
>>>>      ret = vfio_enable_vectors(vdev, false);
>>>> @@ -1581,6 +1616,25 @@ static int vfio_connect_container(VFIOGroup *group)
>>>>  
>>>>          memory_listener_register(&container->listener, get_system_memory());
>>>>  
>>>> +#define POWERPC_IOMMU           2
>>>
>>> Assume this will go in the kernel vfio.h at some point.  You may want to
>>> pick a different name if there's a possibility of other powerpc iommu
>>> implementations... thus the crappy type1 name for x86.
>>>
>>>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, POWERPC_IOMMU)) {
>>>> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>>> +        if (ret) {
>>>> +            error_report("vfio: failed to set group container: %s\n",
>>>> +                         strerror(errno));
>>>> +            g_free(container);
>>>> +            close(fd);
>>>> +            return -1;
>>>> +        }
>>>> +
>>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, POWERPC_IOMMU);
>>>> +        if (ret) {
>>>> +            error_report("vfio: failed to set iommu for container: %s\n",
>>>> +                         strerror(errno));
>>>> +            g_free(container);
>>>> +            close(fd);
>>>> +            return -1;
>>>> +        }
>>>>      } else {
>>>>          error_report("vfio: No available IOMMU models\n");
>>>>          g_free(container);
>>>> @@ -2005,3 +2059,19 @@ static void register_vfio_pci_dev_type(void)
>>>>  }
>>>>  
>>>>  type_init(register_vfio_pci_dev_type)
>>>> +
>>>> +int vfio_get_container_fd(struct PCIBus *pbus)
>>>> +{
>>>> +    BusChild *kid1st = QTAILQ_FIRST(&pbus->qbus.children);
>>>> +    VFIODevice *vdev1st;
>>>> +
>>>> +    if (!kid1st) {
>>>> +        printf("No device registered on PCI bus \"%s\", no DMA enabled\n",
>>>> +               pbus->qbus.name);
>>>> +        return -1;
>>>> +    }
>>>> +    vdev1st = container_of(kid1st->child, VFIODevice, pdev.qdev);
>>>> +
>>>> +    return vdev1st->group->container->fd;
>>>> +}
>>>> +
>>>
>>> This is not a generic implementation.  x86 won't have all devices on a
>>> bus be vfio devices and even if it did, there's no guarantee they all
>>> belong to the same container.  This should probably at least take a
>>> PCIDevice and some kind of POWER specific code will need to know that
>>> the container is the same for the whole bus.  Thanks,
>>
>>
>> This is a workaround, true. x86 does not need this call at all. And on powerpc VFIO devices won't
>> share PCI bus with emulated devices. I just need some API to get this fd.
>>
>> Well I probably can add MemoryListener for the DMA window and move all power-specific map/unmap code
>> to VFIO but it does not look much better. I would rather prefer separating IOMMU code from vfio_pci
>> somehow (more or less as it is now for powerpc). While doing it, we could think of the API to get
>> this fd which we need anyway in order to setup the DMA window which is per group (which QEMU does
>> not understand) but not per device.
> 
> The MemoryListener probably doesn't make sense with a guest driven iova
> window.  It would be an abuse of the interface I think.  At some level
> in the power code you, or at least the user, needs to know about groups
> though.  That's how you end up with an emulated bridge in front of each
> group, right?


No, this is just for interrupt swizzling. We put each group on a separate PCI bus and we do not care
about bridges in this matter.


> So with that same knowledge, shouldn't the API simply be:
> 
> int vfio_get_container_fd(PCIDevice *dev)
> 
> where power code picks a device from the bus since you know they're all
> in the same group?  Thanks,

Yep, but it is still a workaround.
Ah, screw this API, I came up with something different; I will post it a bit later :)



-- 
Alexey


* [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v2)
  2012-07-10  5:51 [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2012-07-10 16:57 ` [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alex Williamson
@ 2012-07-12  8:52 ` Alexey Kardashevskiy
  2012-07-12 20:54   ` [Qemu-devel] [Qemu-ppc] " Blue Swirl
                     ` (2 more replies)
  2012-07-13  7:26 ` [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3) Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  6 siblings, 3 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-12  8:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-devel, Alexander Graf, qemu-ppc, David Gibson

It literally does the following:

1. POWERPC IOMMU support (the kernel counterpart is required)

2. The patch assumes that IOAPIC calls are going to be replaced
with something generic. I have something in my local git but it's
too early, we need to extend PCIINTxRoute first.

3. vfio_get_group() made public. I want to open IOMMU group from
the sPAPR code to have everything I need for VFIO on sPAPR and
avoid ugly workarounds with finalizing PHB setup on sPAPR.

4. Change sPAPR PHB to scan the PCI bus which is used for
the IOMMU-VFIO group. Now it is enough to add the following to
the QEMU command line to get VFIO up with all the devices from
IOMMU group with id=3:
-device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
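
The scan in item 4 relies on the entries under
/sys/kernel/iommu_groups/<id>/devices/ being named domain:bus:dev.fn. A minimal
sketch of parsing one such name follows; PciHostAddr and parse_pci_host_addr
are hypothetical helpers written for this example, and the range checks and
the trailing-character check are additions that the patch itself does not make.

```c
#include <stdio.h>

/* Hypothetical holder for a host PCI address (domain:bus:dev.fn). */
typedef struct PciHostAddr {
    unsigned domain;
    unsigned bus;
    unsigned dev;
    unsigned fn;
} PciHostAddr;

/*
 * Parse a sysfs device name such as "0000:01:00.0".
 * Returns 0 on success and -1 if the name does not match, which also
 * filters out the "." and ".." entries returned by readdir().
 */
static int parse_pci_host_addr(const char *name, PciHostAddr *out)
{
    unsigned domain, bus, dev, fn;
    char trailing;

    /* Exactly four fields must convert; a fifth means trailing junk. */
    if (sscanf(name, "%x:%x:%x.%x%c",
               &domain, &bus, &dev, &fn, &trailing) != 4) {
        return -1;
    }
    /* PCI limits: 256 buses, 32 devices per bus, 8 functions per device. */
    if (bus > 0xff || dev > 0x1f || fn > 7) {
        return -1;
    }

    out->domain = domain;
    out->bus = bus;
    out->dev = dev;
    out->fn = fn;
    return 0;
}
```

Each name that parses successfully would then correspond to one device of the
group to be created on the PHB's bus.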

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/Makefile.objs |    3 ++
 hw/spapr.h           |    4 ++
 hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++
 hw/spapr_pci.c       |  115 +++++++++++++++++++++++++++++++++++++++++++++++---
 hw/spapr_pci.h       |    5 +++
 hw/vfio_pci.c        |   28 +++++++++++-
 hw/vfio_pci.h        |    2 +
 7 files changed, 237 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index f573a95..c46a049 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
 # Xilinx PPC peripherals
 obj-y += xilinx_ethlite.o
 
+# VFIO PCI device assignment
+obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
+
 obj-y := $(addprefix ../,$(obj-y))
diff --git a/hw/spapr.h b/hw/spapr.h
index b37f337..9dca704 100644
--- a/hw/spapr.h
+++ b/hw/spapr.h
@@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
 int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
                       DMAContext *dma);
 
+void spapr_vfio_init_dma(int fd, uint32_t liobn,
+                         uint64_t *dma32_window_start,
+                         uint64_t *dma32_window_size);
+
 #endif /* !defined (__HW_SPAPR_H__) */
diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
index 50c288d..0a194e8 100644
--- a/hw/spapr_iommu.c
+++ b/hw/spapr_iommu.c
@@ -16,6 +16,8 @@
  * You should have received a copy of the GNU Lesser General Public
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
+#include <sys/ioctl.h>
+
 #include "hw.h"
 #include "kvm.h"
 #include "qdev.h"
@@ -23,6 +25,7 @@
 #include "dma.h"
 
 #include "hw/spapr.h"
+#include "hw/linux-vfio.h"
 
 #include <libfdt.h>
 
@@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
     return 0;
 }
 
+/* -------- API for POWERPC IOMMU -------- */
+
+#define POWERPC_IOMMU           2
+
+struct tce_iommu_info {
+    __u32 argsz;
+    __u32 dma32_window_start;
+    __u32 dma32_window_size;
+};
+
+#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+struct tce_iommu_dma_map {
+    __u32 argsz;
+    __u64 va;
+    __u64 dmaaddr;
+};
+
+#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
+#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
+
+typedef struct sPAPRVFIOTable {
+    int fd;
+    uint32_t liobn;
+    QLIST_ENTRY(sPAPRVFIOTable) list;
+} sPAPRVFIOTable;
+
+QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
+
+void spapr_vfio_init_dma(int fd, uint32_t liobn,
+                         uint64_t *dma32_window_start,
+                         uint64_t *dma32_window_size)
+{
+    sPAPRVFIOTable *t;
+    struct tce_iommu_info info = { .argsz = sizeof(info) };
+
+    if (ioctl(fd, POWERPC_IOMMU_GET_INFO, &info)) {
+        fprintf(stderr, "POWERPC_IOMMU_GET_INFO failed %d\n", errno);
+        return;
+    }
+    *dma32_window_start = info.dma32_window_start;
+    *dma32_window_size = info.dma32_window_size;
+
+    t = g_malloc0(sizeof(*t));
+    t->fd = fd;
+    t->liobn = liobn;
+
+    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
+}
+
+static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
+{
+    sPAPRVFIOTable *t;
+    struct tce_iommu_dma_map map = {
+        .argsz = sizeof(map),
+        .va = 0,
+        .dmaaddr = ioba,
+    };
+
+    QLIST_FOREACH(t, &vfio_tce_tables, list) {
+        if (t->liobn != liobn) {
+            continue;
+        }
+        if (tce) {
+            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
+            if (ioctl(t->fd, POWERPC_IOMMU_MAP_DMA, &map)) {
+                fprintf(stderr, "TCE_MAP_DMA: %d\n", errno);
+                return H_PARAMETER;
+            }
+        } else {
+            if (ioctl(t->fd, POWERPC_IOMMU_UNMAP_DMA, &map)) {
+                fprintf(stderr, "TCE_UNMAP_DMA: %d\n", errno);
+                return H_PARAMETER;
+            }
+        }
+        return H_SUCCESS;
+    }
+    return H_CONTINUE; /* positive non-zero value */
+}
+
 static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
                               target_ulong opcode, target_ulong *args)
 {
@@ -203,6 +286,10 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
     if (0 >= ret) {
         return ret ? H_PARAMETER : H_SUCCESS;
     }
+    ret = put_tce_vfio(liobn, ioba, tce);
+    if (0 >= ret) {
+        return ret ? H_PARAMETER : H_SUCCESS;
+    }
 #ifdef DEBUG_TCE
     fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
             "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx " ret=%d\n",
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 014297b..92c48b6 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -22,6 +22,9 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
+#include <sys/types.h>
+#include <dirent.h>
+
 #include "hw.h"
 #include "pci.h"
 #include "msi.h"
@@ -29,10 +32,10 @@
 #include "pci_host.h"
 #include "hw/spapr.h"
 #include "hw/spapr_pci.h"
+#include "hw/vfio_pci.h"
 #include "exec-memory.h"
 #include <libfdt.h>
 #include "trace.h"
-
 #include "hw/pci_internals.h"
 
 /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
@@ -440,6 +443,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
                  level);
 }
 
+static int pci_spapr_get_irq(void *opaque, int irq_num)
+{
+    sPAPRPHBState *phb = opaque;
+    return phb->lsi_table[irq_num].dt_irq;
+}
+
 static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
                               unsigned size)
 {
@@ -515,6 +524,82 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
     return phb->dma;
 }
 
+static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
+{
+    char iommupath[256];
+    DIR *dirp;
+    struct dirent *entry;
+
+    phb->iommugroup = vfio_get_group(phb->iommugroupid);
+    if (!phb->iommugroup) {
+        fprintf(stderr, "Cannot open IOMMU group %d\n", phb->iommugroupid);
+        return -1;
+    }
+
+    if (!phb->scan) {
+        printf("Autoscan disabled\n");
+        return 0;
+    }
+
+    sprintf(iommupath, "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
+    dirp = opendir(iommupath);
+
+    while ((entry = readdir(dirp)) != NULL) {
+        char *tmp = alloca(strlen(iommupath) + strlen(entry->d_name) + 32);
+        FILE *deviceclassfile;
+        unsigned deviceclass = 0, domainid, busid, devid, fnid;
+        char addr[32];
+        DeviceState *dev;
+
+        if (4 != sscanf(entry->d_name, "%X:%X:%X.%x",
+                        &domainid, &busid, &devid, &fnid)) {
+            continue;
+        }
+
+        sprintf(tmp, "%s%s/class", iommupath, entry->d_name);
+        printf("Reading device class from %s\n", tmp);
+
+        deviceclassfile = fopen(tmp, "r");
+        if (deviceclassfile) {
+            fscanf(deviceclassfile, "%x", &deviceclass);
+            fclose(deviceclassfile);
+        }
+        if (!deviceclass) {
+            continue;
+        }
+#define PCI_BASE_CLASS_BRIDGE           0x06
+        if ((phb->scan < 2) && ((deviceclass >> 16) == PCI_BASE_CLASS_BRIDGE)) {
+            continue;
+        }
+        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
+            /* Tweak USB */
+            phb->force_addr = 1;
+            phb->enable_multifunction = 1;
+        }
+
+        printf("Creating device %X:%X:%X.%x class=0x%X\n",
+               domainid, busid, devid, fnid, deviceclass);
+
+        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
+        if (!dev) {
+            fprintf(stderr, "failed to create vfio-pci\n");
+            continue;
+        }
+        qdev_prop_parse(dev, "host", entry->d_name);
+        if (phb->force_addr) {
+            sprintf(addr, "%X.%X", devid, fnid);
+            qdev_prop_parse(dev, "addr", addr);
+        }
+        if (phb->enable_multifunction) {
+            qdev_prop_set_bit(dev, "multifunction", 1);
+        }
+        qdev_init_nofail(dev);
+    }
+    closedir(dirp);
+
+    return 0;
+}
+
 static int spapr_phb_init(SysBusDevice *s)
 {
     sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
@@ -567,15 +652,13 @@ static int spapr_phb_init(SysBusDevice *s)
 
     bus = pci_register_bus(&phb->host_state.busdev.qdev,
                            phb->busname ? phb->busname : phb->dtbusname,
-                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
+                           pci_spapr_set_irq, pci_spapr_get_irq,
+                           pci_spapr_map_irq, phb,
                            &phb->memspace, &phb->iospace,
                            PCI_DEVFN(0, 0), PCI_NUM_PINS);
     phb->host_state.bus = bus;
 
     phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
-    phb->dma_window_start = 0;
-    phb->dma_window_size = 0x40000000;
-    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
     pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
 
     QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
@@ -588,6 +671,24 @@ static int spapr_phb_init(SysBusDevice *s)
         }
     }
 
+    if (phb->iommugroupid >= 0) {
+        if (0 > spapr_pci_scan_vfio(phb)) {
+            return -1;
+        }
+        if (!phb->iommugroup || !phb->iommugroup->container) {
+            return -1;
+        }
+        spapr_vfio_init_dma(phb->iommugroup->container->fd, phb->dma_liobn,
+                            &phb->dma_window_start,
+                            &phb->dma_window_size);
+        return 0;
+    }
+
+    phb->dma_window_start = 0;
+    phb->dma_window_size = 0x40000000;
+    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
+                                         phb->dma_window_size);
+
     return 0;
 }
 
@@ -599,6 +700,10 @@ static Property spapr_phb_properties[] = {
     DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
     DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
     DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
+    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
+    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1), /* 0 don't 1 +devices 2 +buses */
+    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
+    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index 145071c..1953a74 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -57,6 +57,11 @@ typedef struct sPAPRPHBState {
         int nvec;
     } msi_table[SPAPR_MSIX_MAX_DEVS];
 
+    int32_t iommugroupid;
+    struct VFIOGroup *iommugroup;
+    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
+    uint8_t enable_multifunction, force_addr;
+
     QLIST_ENTRY(sPAPRPHBState) list;
 } sPAPRPHBState;
 
diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 1ac287f..73681fb 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -21,7 +21,6 @@
 #include <dirent.h>
 #include <stdio.h>
 #include <unistd.h>
-#include <sys/io.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
 #include <sys/types.h>
@@ -43,6 +42,12 @@
 #include "range.h"
 #include "vfio_pci.h"
 #include "linux-vfio.h"
+#ifndef TARGET_PPC64
+#include <sys/io.h>
+#else
+#include "hw/pci_internals.h"
+#include "hw/spapr.h"
+#endif
 
 //#define DEBUG_VFIO
 #ifdef DEBUG_VFIO
@@ -1581,6 +1586,25 @@ static int vfio_connect_container(VFIOGroup *group)
 
         memory_listener_register(&container->listener, get_system_memory());
 
+#define POWERPC_IOMMU           2
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, POWERPC_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, POWERPC_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
     } else {
         error_report("vfio: No available IOMMU models\n");
         g_free(container);
@@ -1620,7 +1644,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 }
 
-static VFIOGroup *vfio_get_group(int groupid)
+VFIOGroup *vfio_get_group(int groupid)
 {
     VFIOGroup *group;
     char path[32];
diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
index 226607c..d63dd63 100644
--- a/hw/vfio_pci.h
+++ b/hw/vfio_pci.h
@@ -105,4 +105,6 @@ typedef struct VFIOGroup {
 #define VFIO_FLAG_IOMMU_SHARED_BIT 0
 #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
 
+VFIOGroup *vfio_get_group(int groupid);
+
 #endif /* __VFIO_H__ */
-- 
1.7.10


* Re: [Qemu-devel] [Qemu-ppc] [PATCH] RFC: vfio-powerpc: added VFIO support (v2)
  2012-07-12  8:52 ` [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v2) Alexey Kardashevskiy
@ 2012-07-12 20:54   ` Blue Swirl
  2012-07-12 21:37     ` Alex Williamson
  2012-07-13  5:24     ` Alexey Kardashevskiy
  2012-07-12 22:35   ` Scott Wood
  2012-07-13  3:47   ` [Qemu-devel] " Alex Williamson
  2 siblings, 2 replies; 52+ messages in thread
From: Blue Swirl @ 2012-07-12 20:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel, David Gibson

On Thu, Jul 12, 2012 at 8:52 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> It literally does the following:
>
> 1. POWERPC IOMMU support (the kernel counterpart is required)
>
> 2. The patch assumes that IOAPIC calls are going to be replaced
> with something generic. I have something in my local git but it's
> too early, we need to extend PCIINTxRoute first.
>
> 3. vfio_get_group() made public. I want to open IOMMU group from
> the sPAPR code to have everything I need for VFIO on sPAPR and
> avoid ugly workarounds with finalizing PHB setup on sPAPR.
>
> 4. Change sPAPR PHB to scan the PCI bus which is used for
> the IOMMU-VFIO group. Now it is enough to add the following to
> the QEMU command line to get VFIO up with all the devices from
> IOMMU group with id=3:
> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/Makefile.objs |    3 ++
>  hw/spapr.h           |    4 ++
>  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++
>  hw/spapr_pci.c       |  115 +++++++++++++++++++++++++++++++++++++++++++++++---
>  hw/spapr_pci.h       |    5 +++
>  hw/vfio_pci.c        |   28 +++++++++++-
>  hw/vfio_pci.h        |    2 +
>  7 files changed, 237 insertions(+), 7 deletions(-)
>
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index f573a95..c46a049 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>  # Xilinx PPC peripherals
>  obj-y += xilinx_ethlite.o
>
> +# VFIO PCI device assignment
> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
> +
>  obj-y := $(addprefix ../,$(obj-y))
> diff --git a/hw/spapr.h b/hw/spapr.h
> index b37f337..9dca704 100644
> --- a/hw/spapr.h
> +++ b/hw/spapr.h
> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>                        DMAContext *dma);
>
> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
> +                         uint64_t *dma32_window_start,
> +                         uint64_t *dma32_window_size);
> +
>  #endif /* !defined (__HW_SPAPR_H__) */
> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> index 50c288d..0a194e8 100644
> --- a/hw/spapr_iommu.c
> +++ b/hw/spapr_iommu.c
> @@ -16,6 +16,8 @@
>   * You should have received a copy of the GNU Lesser General Public
>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>   */
> +#include <sys/ioctl.h>
> +
>  #include "hw.h"
>  #include "kvm.h"
>  #include "qdev.h"
> @@ -23,6 +25,7 @@
>  #include "dma.h"
>
>  #include "hw/spapr.h"
> +#include "hw/linux-vfio.h"
>
>  #include <libfdt.h>
>
> @@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>      return 0;
>  }
>
> +/* -------- API for POWERPC IOMMU -------- */
> +
> +#define POWERPC_IOMMU           2
> +
> +struct tce_iommu_info {

CamelCase.

> +    __u32 argsz;
> +    __u32 dma32_window_start;
> +    __u32 dma32_window_size;

Please use uint32_t.

> +};
> +
> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> +
> +struct tce_iommu_dma_map {
> +    __u32 argsz;

The structure may or may not be padded here since there's no
QEMU_PACKED attribute. If possible, just rearrange the fields.

> +    __u64 va;
> +    __u64 dmaaddr;
> +};
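
To make the padding remark on struct tce_iommu_dma_map concrete, here is a
minimal sketch in plain C. The struct names and the 32-bit "flags" filler are
hypothetical and not part of the posted patch; uint32_t/uint64_t stand in for
__u32/__u64. With a single 32-bit field in front of a 64-bit one, the offset
of va depends on how the ABI aligns 64-bit integers. Two 32-bit fields up
front give the same layout on the common ABIs without a packed attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as posted: one 32-bit field followed by 64-bit fields.
 * offsetof(va) is typically 8 on 64-bit ABIs but 4 on 32-bit x86. */
struct tce_map_as_posted {
    uint32_t argsz;
    uint64_t va;
    uint64_t dmaaddr;
};

/* Hypothetical rearrangement: a second 32-bit field fills the hole,
 * so no compiler-inserted padding is needed before va. */
struct tce_map_rearranged {
    uint32_t argsz;
    uint32_t flags;   /* assumption: illustrative filler field */
    uint64_t va;
    uint64_t dmaaddr;
};

/* Compile-time checks on the rearranged layout. */
static_assert(offsetof(struct tce_map_rearranged, va) == 8,
              "va must start right after the two 32-bit fields");
static_assert(sizeof(struct tce_map_rearranged) == 24,
              "no hidden padding expected");
```

The static_assert lines are only there to document the expected layout; they
make no claim about what the kernel side of the interface will look like.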
> +
> +#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
> +#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
> +
> +typedef struct sPAPRVFIOTable {
> +    int fd;
> +    uint32_t liobn;
> +    QLIST_ENTRY(sPAPRVFIOTable) list;
> +} sPAPRVFIOTable;
> +
> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
> +
> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
> +                         uint64_t *dma32_window_start,
> +                         uint64_t *dma32_window_size)
> +{
> +    sPAPRVFIOTable *t;
> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
> +
> +    if (ioctl(fd, POWERPC_IOMMU_GET_INFO, &info)) {
> +        fprintf(stderr, "POWERPC_IOMMU_GET_INFO failed %d\n", errno);
> +        return;
> +    }
> +    *dma32_window_start = info.dma32_window_start;
> +    *dma32_window_size = info.dma32_window_size;
> +
> +    t = g_malloc0(sizeof(*t));
> +    t->fd = fd;
> +    t->liobn = liobn;
> +
> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
> +}
> +
> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
> +{
> +    sPAPRVFIOTable *t;
> +    struct tce_iommu_dma_map map = {
> +        .argsz = sizeof(map),
> +        .va = 0,
> +        .dmaaddr = ioba,
> +    };
> +
> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
> +        if (t->liobn != liobn) {
> +            continue;
> +        }
> +        if (tce) {
> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
> +            if (ioctl(t->fd, POWERPC_IOMMU_MAP_DMA, &map)) {
> +                fprintf(stderr, "TCE_MAP_DMA: %d\n", errno);

perror()?

> +                return H_PARAMETER;
> +            }
> +        } else {
> +            if (ioctl(t->fd, POWERPC_IOMMU_UNMAP_DMA, &map)) {
> +                fprintf(stderr, "TCE_UNMAP_DMA: %d\n", errno);
> +                return H_PARAMETER;
> +            }
> +        }
> +        return H_SUCCESS;
> +    }
> +    return H_CONTINUE; /* positive non-zero value */
> +}
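
On the perror() question in put_tce_vfio() above: printing the raw errno
number is hard to read in logs. A minimal sketch of the alternative follows,
assuming a hypothetical helper format_os_error() that is not in the patch.
It formats the strerror() text, which is also what the error_report() calls
in vfio_pci.c already do.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "<what>: <strerror text>" into buf.
 * Returns the snprintf() result, i.e. the untruncated length. */
static int format_os_error(char *buf, size_t len, const char *what, int err)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s: %s", what, strerror(err));
}
```

A caller would save errno right after the failing ioctl() and pass it in, so
that later library calls cannot overwrite it before the message is built.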
> +
>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>                                target_ulong opcode, target_ulong *args)
>  {
> @@ -203,6 +286,10 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>      if (0 >= ret) {
>          return ret ? H_PARAMETER : H_SUCCESS;
>      }
> +    ret = put_tce_vfio(liobn, ioba, tce);
> +    if (0 >= ret) {

This operand order (constant first) is not common; please reverse it (ret <= 0).

> +        return ret ? H_PARAMETER : H_SUCCESS;
> +    }
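
For readers following the return convention here: each put_tce_*() backend
returns a negative value on error, zero when it handled the request, and a
positive value when the liobn is not its own, and h_put_tce() tries them in
turn. A minimal sketch of that chain is below. The enum values, the two
backends and the liobn numbers are all made up for illustration and are not
the hypercall codes from hw/spapr.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative result codes (assumption, not the real H_* values). */
enum { TCE_FAILED = -1, TCE_HANDLED = 0, TCE_NOT_MINE = 1 };

typedef int (*put_tce_fn)(uint32_t liobn, uint64_t ioba, uint64_t tce);

/* Hypothetical backend that owns liobn 0x1000 and always succeeds. */
static int put_tce_backend_a(uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    (void)ioba;
    (void)tce;
    return liobn == 0x1000 ? TCE_HANDLED : TCE_NOT_MINE;
}

/* Hypothetical backend that owns liobn 0x2000 and rejects one TCE value. */
static int put_tce_backend_b(uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    (void)ioba;
    if (liobn != 0x2000) {
        return TCE_NOT_MINE;
    }
    return tce == UINT64_MAX ? TCE_FAILED : TCE_HANDLED;
}

/* Try each backend; stop at the first one that handled or failed. */
static int put_tce_dispatch(uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    static const put_tce_fn backends[] = {
        put_tce_backend_a, put_tce_backend_b,
    };
    size_t i;

    for (i = 0; i < sizeof(backends) / sizeof(backends[0]); i++) {
        int ret = backends[i](liobn, ioba, tce);
        assert(ret <= 0 || ret == TCE_NOT_MINE);
        if (ret <= 0) {
            return ret;
        }
    }
    return TCE_NOT_MINE;
}
```

Written as a loop, adding another backend does not require another copy of
the "if (ret <= 0)" block in the hypercall handler.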
>  #ifdef DEBUG_TCE
>      fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
>              "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx " ret=%d\n",
> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> index 014297b..92c48b6 100644
> --- a/hw/spapr_pci.c
> +++ b/hw/spapr_pci.c
> @@ -22,6 +22,9 @@
>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>   * THE SOFTWARE.
>   */
> +#include <sys/types.h>
> +#include <dirent.h>
> +
>  #include "hw.h"
>  #include "pci.h"
>  #include "msi.h"
> @@ -29,10 +32,10 @@
>  #include "pci_host.h"
>  #include "hw/spapr.h"
>  #include "hw/spapr_pci.h"
> +#include "hw/vfio_pci.h"
>  #include "exec-memory.h"
>  #include <libfdt.h>
>  #include "trace.h"
> -
>  #include "hw/pci_internals.h"
>
>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
> @@ -440,6 +443,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>                   level);
>  }
>
> +static int pci_spapr_get_irq(void *opaque, int irq_num)
> +{
> +    sPAPRPHBState *phb = opaque;
> +    return phb->lsi_table[irq_num].dt_irq;
> +}
> +
>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>                                unsigned size)
>  {
> @@ -515,6 +524,82 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>      return phb->dma;
>  }
>
> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
> +{
> +    char iommupath[256];
> +    DIR *dirp;
> +    struct dirent *entry;
> +
> +    phb->iommugroup = vfio_get_group(phb->iommugroupid);
> +    if (!phb->iommugroup) {
> +        fprintf(stderr, "Cannot open IOMMU group %d\n", phb->iommugroupid);
> +        return -1;
> +    }
> +
> +    if (!phb->scan) {
> +        printf("Autoscan disabled\n");
> +        return 0;
> +    }
> +
> +    sprintf(iommupath, "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);

Please use snprintf() or g_strdup_printf().


> +    dirp = opendir(iommupath);
> +
> +    while ((entry = readdir(dirp)) != NULL) {
> +        char *tmp = alloca(strlen(iommupath) + strlen(entry->d_name) + 32);
> +        FILE *deviceclassfile;
> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> +        char addr[32];
> +        DeviceState *dev;
> +
> +        if (4 != sscanf(entry->d_name, "%X:%X:%X.%x",

Please put the constant last.

> +                        &domainid, &busid, &devid, &fnid)) {
> +            continue;
> +        }
> +
> +        sprintf(tmp, "%s%s/class", iommupath, entry->d_name);

Again, snprintf() or g_strdup_printf() (which avoids the alloca() too).

> +        printf("Reading device class from %s\n", tmp);

Leftover debugging?

> +
> +        deviceclassfile = fopen(tmp, "r");
> +        if (deviceclassfile) {
> +            fscanf(deviceclassfile, "%x", &deviceclass);
> +            fclose(deviceclassfile);
> +        }
> +        if (!deviceclass) {
> +            continue;
> +        }
> +#define PCI_BASE_CLASS_BRIDGE           0x06

This belongs in pci_ids.h.

> +        if ((phb->scan < 2) && ((deviceclass >> 16) == PCI_BASE_CLASS_BRIDGE)) {
> +            continue;
> +        }
> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
> +            /* Tweak USB */
> +            phb->force_addr = 1;
> +            phb->enable_multifunction = 1;
> +        }
> +
> +        printf("Creating device %X:%X:%X.%x class=0x%X\n",
> +               domainid, busid, devid, fnid, deviceclass);

Lower case hex, please.

> +
> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
> +        if (!dev) {
> +            fprintf(stderr, "failed to create vfio-pci\n");
> +            continue;
> +        }
> +        qdev_prop_parse(dev, "host", entry->d_name);
> +        if (phb->force_addr) {
> +            sprintf(addr, "%X.%X", devid, fnid);

snprintf, lower case hex.

> +            qdev_prop_parse(dev, "addr", addr);
> +        }
> +        if (phb->enable_multifunction) {
> +            qdev_prop_set_bit(dev, "multifunction", 1);
> +        }
> +        qdev_init_nofail(dev);
> +    }
> +    closedir(dirp);
> +
> +    return 0;
> +}
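
Pulling the remarks on spapr_pci_scan_vfio() together, here is a minimal
sketch of the three pieces of the scan loop with bounded formatting and
lower case hex. All helper names and the SKETCH_ macro are hypothetical;
the sysfs path layout and the class-code test are taken from the patch
above. g_strdup_printf() would work just as well for the path; snprintf()
is used here only to keep the example free of glib.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_BASE_CLASS_BRIDGE 0x06 /* PCI base class code for bridges */

struct pci_host_addr {
    unsigned domain, bus, dev, fn;
};

/* Parse a sysfs entry name such as "0000:01:00.0".
 * Returns 0 on success, -1 if the name is not a PCI address. */
static int parse_pci_host_addr(const char *name, struct pci_host_addr *a)
{
    assert(name != NULL && a != NULL);
    if (sscanf(name, "%x:%x:%x.%x",
               &a->domain, &a->bus, &a->dev, &a->fn) != 4) {
        return -1;
    }
    return 0;
}

/* Build ".../iommu_groups/<group>/devices/<name>/class" into buf.
 * Returns 0 on success, -1 on truncation or an unexpected name. */
static int build_class_path(char *buf, size_t len, int groupid,
                            const char *name)
{
    int n;

    if (strchr(name, '/')) {
        return -1; /* a directory entry name never contains a slash */
    }
    n = snprintf(buf, len, "/sys/kernel/iommu_groups/%d/devices/%s/class",
                 groupid, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* The sysfs "class" value is 24 bits: base class, subclass, prog-if. */
static int class_is_bridge(unsigned class_code)
{
    return (class_code >> 16) == SKETCH_BASE_CLASS_BRIDGE;
}
```

The checked return values let the caller skip an entry with "continue"
instead of writing past a fixed-size buffer.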
> +
>  static int spapr_phb_init(SysBusDevice *s)
>  {
>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
> @@ -567,15 +652,13 @@ static int spapr_phb_init(SysBusDevice *s)
>
>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>                             phb->busname ? phb->busname : phb->dtbusname,
> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
> +                           pci_spapr_set_irq, pci_spapr_get_irq,
> +                           pci_spapr_map_irq, phb,
>                             &phb->memspace, &phb->iospace,
>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>      phb->host_state.bus = bus;
>
>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
> -    phb->dma_window_start = 0;
> -    phb->dma_window_size = 0x40000000;
> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
>
>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
> @@ -588,6 +671,24 @@ static int spapr_phb_init(SysBusDevice *s)
>          }
>      }
>
> +    if (phb->iommugroupid >= 0) {
> +        if (0 > spapr_pci_scan_vfio(phb)) {

Operand order again; please put the constant last.

> +            return -1;
> +        }
> +        if (!phb->iommugroup || !phb->iommugroup->container) {
> +            return -1;
> +        }
> +        spapr_vfio_init_dma(phb->iommugroup->container->fd, phb->dma_liobn,
> +                            &phb->dma_window_start,
> +                            &phb->dma_window_size);
> +        return 0;
> +    }
> +
> +    phb->dma_window_start = 0;
> +    phb->dma_window_size = 0x40000000;
> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
> +                                         phb->dma_window_size);
> +
>      return 0;
>  }
>
> @@ -599,6 +700,10 @@ static Property spapr_phb_properties[] = {
>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1), /* 0 don't 1 +devices 2 +buses */
> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>
> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
> index 145071c..1953a74 100644
> --- a/hw/spapr_pci.h
> +++ b/hw/spapr_pci.h
> @@ -57,6 +57,11 @@ typedef struct sPAPRPHBState {
>          int nvec;
>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>
> +    int32_t iommugroupid;
> +    struct VFIOGroup *iommugroup;
> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
> +    uint8_t enable_multifunction, force_addr;

Use bool for both?

> +
>      QLIST_ENTRY(sPAPRPHBState) list;
>  } sPAPRPHBState;
>
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> index 1ac287f..73681fb 100644
> --- a/hw/vfio_pci.c
> +++ b/hw/vfio_pci.c
> @@ -21,7 +21,6 @@
>  #include <dirent.h>
>  #include <stdio.h>
>  #include <unistd.h>
> -#include <sys/io.h>
>  #include <sys/ioctl.h>
>  #include <sys/mman.h>
>  #include <sys/types.h>
> @@ -43,6 +42,12 @@
>  #include "range.h"
>  #include "vfio_pci.h"
>  #include "linux-vfio.h"
> +#ifndef TARGET_PPC64
> +#include <sys/io.h>
> +#else
> +#include "hw/pci_internals.h"
> +#include "hw/spapr.h"
> +#endif
>
>  //#define DEBUG_VFIO
>  #ifdef DEBUG_VFIO
> @@ -1581,6 +1586,25 @@ static int vfio_connect_container(VFIOGroup *group)
>
>          memory_listener_register(&container->listener, get_system_memory());
>
> +#define POWERPC_IOMMU           2
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, POWERPC_IOMMU)) {
> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> +        if (ret) {
> +            error_report("vfio: failed to set group container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
> +
> +        ret = ioctl(fd, VFIO_SET_IOMMU, POWERPC_IOMMU);
> +        if (ret) {
> +            error_report("vfio: failed to set iommu for container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models\n");
>          g_free(container);
> @@ -1620,7 +1644,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>      }
>  }
>
> -static VFIOGroup *vfio_get_group(int groupid)
> +VFIOGroup *vfio_get_group(int groupid)
>  {
>      VFIOGroup *group;
>      char path[32];
> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
> index 226607c..d63dd63 100644
> --- a/hw/vfio_pci.h
> +++ b/hw/vfio_pci.h
> @@ -105,4 +105,6 @@ typedef struct VFIOGroup {
>  #define VFIO_FLAG_IOMMU_SHARED_BIT 0
>  #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
>
> +VFIOGroup *vfio_get_group(int groupid);
> +
>  #endif /* __VFIO_H__ */
> --
> 1.7.10
>
>


* Re: [Qemu-devel] [Qemu-ppc] [PATCH] RFC: vfio-powerpc: added VFIO support (v2)
  2012-07-12 20:54   ` [Qemu-devel] [Qemu-ppc] " Blue Swirl
@ 2012-07-12 21:37     ` Alex Williamson
  2012-07-13  5:24     ` Alexey Kardashevskiy
  1 sibling, 0 replies; 52+ messages in thread
From: Alex Williamson @ 2012-07-12 21:37 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

On Thu, 2012-07-12 at 20:54 +0000, Blue Swirl wrote:
> On Thu, Jul 12, 2012 at 8:52 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > It literally does the following:
> >
> > 1. POWERPC IOMMU support (the kernel counterpart is required)
> >
> > 2. The patch assumes that IOAPIC calls are going to be replaced
> > with something generic. I have something in my local git but it's
> > too early, we need to extend PCIINTxRoute first.
> >
> > 3. vfio_get_group() made public. I want to open IOMMU group from
> > the sPAPR code to have everything I need for VFIO on sPAPR and
> > avoid ugly workarounds with finalizing PHB setup on sPAPR.
> >
> > 4. Change sPAPR PHB to scan the PCI bus which is used for
> > the IOMMU-VFIO group. Now it is enough to add the following to
> > the QEMU command line to get VFIO up with all the devices from
> > IOMMU group with id=3:
> > -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
> > mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
> >
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> >  hw/ppc/Makefile.objs |    3 ++
> >  hw/spapr.h           |    4 ++
> >  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++
> >  hw/spapr_pci.c       |  115 +++++++++++++++++++++++++++++++++++++++++++++++---
> >  hw/spapr_pci.h       |    5 +++
> >  hw/vfio_pci.c        |   28 +++++++++++-
> >  hw/vfio_pci.h        |    2 +
> >  7 files changed, 237 insertions(+), 7 deletions(-)
> >
> > diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> > index f573a95..c46a049 100644
> > --- a/hw/ppc/Makefile.objs
> > +++ b/hw/ppc/Makefile.objs
> > @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
> >  # Xilinx PPC peripherals
> >  obj-y += xilinx_ethlite.o
> >
> > +# VFIO PCI device assignment
> > +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
> > +
> >  obj-y := $(addprefix ../,$(obj-y))
> > diff --git a/hw/spapr.h b/hw/spapr.h
> > index b37f337..9dca704 100644
> > --- a/hw/spapr.h
> > +++ b/hw/spapr.h
> > @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> >  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
> >                        DMAContext *dma);
> >
> > +void spapr_vfio_init_dma(int fd, uint32_t liobn,
> > +                         uint64_t *dma32_window_start,
> > +                         uint64_t *dma32_window_size);
> > +
> >  #endif /* !defined (__HW_SPAPR_H__) */
> > diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> > index 50c288d..0a194e8 100644
> > --- a/hw/spapr_iommu.c
> > +++ b/hw/spapr_iommu.c
> > @@ -16,6 +16,8 @@
> >   * You should have received a copy of the GNU Lesser General Public
> >   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> >   */
> > +#include <sys/ioctl.h>
> > +
> >  #include "hw.h"
> >  #include "kvm.h"
> >  #include "qdev.h"
> > @@ -23,6 +25,7 @@
> >  #include "dma.h"
> >
> >  #include "hw/spapr.h"
> > +#include "hw/linux-vfio.h"
> >
> >  #include <libfdt.h>
> >
> > @@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
> >      return 0;
> >  }
> >
> > +/* -------- API for POWERPC IOMMU -------- */
> > +
> > +#define POWERPC_IOMMU           2
> > +
> > +struct tce_iommu_info {
> 
> CamelCase.
> 
> > +    __u32 argsz;
> > +    __u32 dma32_window_start;
> > +    __u32 dma32_window_size;
> 
> Please use uint32_t.

These should eventually be included from a kernel header file.  I assume
that's the reason for the non-qemu-isms.

> > +};
> > +
> > +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> > +
> > +struct tce_iommu_dma_map {
> > +    __u32 argsz;
> 
> The structure may or may not be padded here since there's no
> QEMU_PACKED attribute. If possible, just rearrange the fields.

I'm hoping Alexey adds a __u32 flags here, which solves that problem as
well.  Thanks,

Alex


* Re: [Qemu-devel] [Qemu-ppc] [PATCH] RFC: vfio-powerpc: added VFIO support (v2)
  2012-07-12  8:52 ` [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v2) Alexey Kardashevskiy
  2012-07-12 20:54   ` [Qemu-devel] [Qemu-ppc] " Blue Swirl
@ 2012-07-12 22:35   ` Scott Wood
  2012-07-13  5:31     ` Alexey Kardashevskiy
  2012-07-13  3:47   ` [Qemu-devel] " Alex Williamson
  2 siblings, 1 reply; 52+ messages in thread
From: Scott Wood @ 2012-07-12 22:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel, David Gibson

On 07/12/2012 03:52 AM, Alexey Kardashevskiy wrote:
> +/* -------- API for POWERPC IOMMU -------- */
> +
> +#define POWERPC_IOMMU           2
> +
> +struct tce_iommu_info {
> +    __u32 argsz;
> +    __u32 dma32_window_start;
> +    __u32 dma32_window_size;
> +};
> +
> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> +
> +struct tce_iommu_dma_map {
> +    __u32 argsz;
> +    __u64 va;
> +    __u64 dmaaddr;
> +};
> +
> +#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
> +#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)

I thought you were going to change the name to be less generic...

-Scott


* Re: [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v2)
  2012-07-12  8:52 ` [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v2) Alexey Kardashevskiy
  2012-07-12 20:54   ` [Qemu-devel] [Qemu-ppc] " Blue Swirl
  2012-07-12 22:35   ` Scott Wood
@ 2012-07-13  3:47   ` Alex Williamson
  2012-07-13  5:03     ` Alexey Kardashevskiy
  2 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2012-07-13  3:47 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Thu, 2012-07-12 at 18:52 +1000, Alexey Kardashevskiy wrote:
> It literally does the following:
> 
> 1. POWERPC IOMMU support (the kernel counterpart is required)
> 
> 2. The patch assumes that IOAPIC calls are going to be replaced
> with something generic. I have something in my local git but it's
> too early, we need to extend PCIINTxRoute first.
> 
> 3. vfio_get_group() made public. I want to open IOMMU group from
> the sPAPR code to have everything I need for VFIO on sPAPR and
> avoid ugly workarounds with finalizing PHB setup on sPAPR.
> 
> 4. Change sPAPR PHB to scan the PCI bus which is used for
> the IOMMU-VFIO group. Now it is enough to add the following to
> the QEMU command line to get VFIO up with all the devices from
> IOMMU group with id=3:
> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/Makefile.objs |    3 ++
>  hw/spapr.h           |    4 ++
>  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++
>  hw/spapr_pci.c       |  115 +++++++++++++++++++++++++++++++++++++++++++++++---
>  hw/spapr_pci.h       |    5 +++
>  hw/vfio_pci.c        |   28 +++++++++++-
>  hw/vfio_pci.h        |    2 +
>  7 files changed, 237 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index f573a95..c46a049 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>  # Xilinx PPC peripherals
>  obj-y += xilinx_ethlite.o
>  
> +# VFIO PCI device assignment
> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
> +
>  obj-y := $(addprefix ../,$(obj-y))
> diff --git a/hw/spapr.h b/hw/spapr.h
> index b37f337..9dca704 100644
> --- a/hw/spapr.h
> +++ b/hw/spapr.h
> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>                        DMAContext *dma);
>  
> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
> +                         uint64_t *dma32_window_start,
> +                         uint64_t *dma32_window_size);
> +
>  #endif /* !defined (__HW_SPAPR_H__) */
> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> index 50c288d..0a194e8 100644
> --- a/hw/spapr_iommu.c
> +++ b/hw/spapr_iommu.c
> @@ -16,6 +16,8 @@
>   * You should have received a copy of the GNU Lesser General Public
>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>   */
> +#include <sys/ioctl.h>
> +
>  #include "hw.h"
>  #include "kvm.h"
>  #include "qdev.h"
> @@ -23,6 +25,7 @@
>  #include "dma.h"
>  
>  #include "hw/spapr.h"
> +#include "hw/linux-vfio.h"
>  
>  #include <libfdt.h>
>  
> @@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>      return 0;
>  }
>  
> +/* -------- API for POWERPC IOMMU -------- */
> +
> +#define POWERPC_IOMMU           2
> +
> +struct tce_iommu_info {
> +    __u32 argsz;
> +    __u32 dma32_window_start;
> +    __u32 dma32_window_size;
> +};
> +
> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> +
> +struct tce_iommu_dma_map {
> +    __u32 argsz;
> +    __u64 va;
> +    __u64 dmaaddr;
> +};
> +
> +#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
> +#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
> +
> +typedef struct sPAPRVFIOTable {
> +    int fd;
> +    uint32_t liobn;
> +    QLIST_ENTRY(sPAPRVFIOTable) list;
> +} sPAPRVFIOTable;
> +
> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
> +
> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
> +                         uint64_t *dma32_window_start,
> +                         uint64_t *dma32_window_size)
> +{
> +    sPAPRVFIOTable *t;
> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
> +
> +    if (ioctl(fd, POWERPC_IOMMU_GET_INFO, &info)) {
> +        fprintf(stderr, "POWERPC_IOMMU_GET_INFO failed %d\n", errno);
> +        return;
> +    }
> +    *dma32_window_start = info.dma32_window_start;
> +    *dma32_window_size = info.dma32_window_size;
> +
> +    t = g_malloc0(sizeof(*t));
> +    t->fd = fd;
> +    t->liobn = liobn;
> +
> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
> +}
> +
> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
> +{
> +    sPAPRVFIOTable *t;
> +    struct tce_iommu_dma_map map = {
> +        .argsz = sizeof(map),
> +        .va = 0,
> +        .dmaaddr = ioba,
> +    };
> +
> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
> +        if (t->liobn != liobn) {
> +            continue;
> +        }
> +        if (tce) {
> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
> +            if (ioctl(t->fd, POWERPC_IOMMU_MAP_DMA, &map)) {
> +                fprintf(stderr, "TCE_MAP_DMA: %d\n", errno);
> +                return H_PARAMETER;
> +            }
> +        } else {
> +            if (ioctl(t->fd, POWERPC_IOMMU_UNMAP_DMA, &map)) {
> +                fprintf(stderr, "TCE_UNMAP_DMA: %d\n", errno);
> +                return H_PARAMETER;
> +            }
> +        }
> +        return H_SUCCESS;
> +    }
> +    return H_CONTINUE; /* positive non-zero value */
> +}
> +
>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>                                target_ulong opcode, target_ulong *args)
>  {
> @@ -203,6 +286,10 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>      if (0 >= ret) {
>          return ret ? H_PARAMETER : H_SUCCESS;
>      }
> +    ret = put_tce_vfio(liobn, ioba, tce);
> +    if (0 >= ret) {
> +        return ret ? H_PARAMETER : H_SUCCESS;
> +    }
>  #ifdef DEBUG_TCE
>      fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
>              "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx " ret=%d\n",
> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> index 014297b..92c48b6 100644
> --- a/hw/spapr_pci.c
> +++ b/hw/spapr_pci.c
> @@ -22,6 +22,9 @@
>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>   * THE SOFTWARE.
>   */
> +#include <sys/types.h>
> +#include <dirent.h>
> +
>  #include "hw.h"
>  #include "pci.h"
>  #include "msi.h"
> @@ -29,10 +32,10 @@
>  #include "pci_host.h"
>  #include "hw/spapr.h"
>  #include "hw/spapr_pci.h"
> +#include "hw/vfio_pci.h"

:-\

>  #include "exec-memory.h"
>  #include <libfdt.h>
>  #include "trace.h"
> -
>  #include "hw/pci_internals.h"
>  
>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
> @@ -440,6 +443,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>                   level);
>  }
>  
> +static int pci_spapr_get_irq(void *opaque, int irq_num)
> +{
> +    sPAPRPHBState *phb = opaque;
> +    return phb->lsi_table[irq_num].dt_irq;
> +}
> +
>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>                                unsigned size)
>  {
> @@ -515,6 +524,82 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>      return phb->dma;
>  }
>  
> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
> +{
> +    char iommupath[256];
> +    DIR *dirp;
> +    struct dirent *entry;
> +
> +    phb->iommugroup = vfio_get_group(phb->iommugroupid);
> +    if (!phb->iommugroup) {
> +        fprintf(stderr, "Cannot open IOMMU group %d\n", phb->iommugroupid);
> +        return -1;
> +    }
> +
> +    if (!phb->scan) {
> +        printf("Autoscan disabled\n");
> +        return 0;
> +    }
> +
> +    sprintf(iommupath, "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
> +    dirp = opendir(iommupath);
> +
> +    while ((entry = readdir(dirp)) != NULL) {
> +        char *tmp = alloca(strlen(iommupath) + strlen(entry->d_name) + 32);
> +        FILE *deviceclassfile;
> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> +        char addr[32];
> +        DeviceState *dev;
> +
> +        if (4 != sscanf(entry->d_name, "%X:%X:%X.%x",
> +                        &domainid, &busid, &devid, &fnid)) {
> +            continue;
> +        }
> +
> +        sprintf(tmp, "%s%s/class", iommupath, entry->d_name);
> +        printf("Reading device class from %s\n", tmp);
> +
> +        deviceclassfile = fopen(tmp, "r");
> +        if (deviceclassfile) {
> +            fscanf(deviceclassfile, "%x", &deviceclass);
> +            fclose(deviceclassfile);
> +        }
> +        if (!deviceclass) {
> +            continue;
> +        }
> +#define PCI_BASE_CLASS_BRIDGE           0x06
> +        if ((phb->scan < 2) && ((deviceclass >> 16) == PCI_BASE_CLASS_BRIDGE)) {
> +            continue;
> +        }
> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
> +            /* Tweak USB */
> +            phb->force_addr = 1;
> +            phb->enable_multifunction = 1;
> +        }
> +
> +        printf("Creating device %X:%X:%X.%x class=0x%X\n",
> +               domainid, busid, devid, fnid, deviceclass);
> +
> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
> +        if (!dev) {
> +            fprintf(stderr, "failed to create vfio-pci\n");
> +            continue;
> +        }
> +        qdev_prop_parse(dev, "host", entry->d_name);
> +        if (phb->force_addr) {
> +            sprintf(addr, "%X.%X", devid, fnid);
> +            qdev_prop_parse(dev, "addr", addr);
> +        }
> +        if (phb->enable_multifunction) {
> +            qdev_prop_set_bit(dev, "multifunction", 1);
> +        }
> +        qdev_init_nofail(dev);
> +    }
> +    closedir(dirp);
> +
> +    return 0;
> +}
> +
>  static int spapr_phb_init(SysBusDevice *s)
>  {
>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
> @@ -567,15 +652,13 @@ static int spapr_phb_init(SysBusDevice *s)
>  
>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>                             phb->busname ? phb->busname : phb->dtbusname,
> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
> +                           pci_spapr_set_irq, pci_spapr_get_irq,
> +                           pci_spapr_map_irq, phb,
>                             &phb->memspace, &phb->iospace,
>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>      phb->host_state.bus = bus;
>  
>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
> -    phb->dma_window_start = 0;
> -    phb->dma_window_size = 0x40000000;
> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
>  
>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
> @@ -588,6 +671,24 @@ static int spapr_phb_init(SysBusDevice *s)
>          }
>      }
>  
> +    if (phb->iommugroupid >= 0) {
> +        if (0 > spapr_pci_scan_vfio(phb)) {
> +            return -1;
> +        }
> +        if (!phb->iommugroup || !phb->iommugroup->container) {
> +            return -1;
> +        }
> +        spapr_vfio_init_dma(phb->iommugroup->container->fd, phb->dma_liobn,

I'm not really a fan of this approach; the vfio data structures are not
designed for external use.  Perhaps they should be pulled into
vfio_pci.c.

Could we instead just make a group service function, something like:

int vfio_group_iommu_ioctl(int iommu_group, int request, ...)

Then it could just be a passthrough and you don't have to keep an fd or
de-reference vfio structures.  There's a little overhead in that we have
to look up the group each time, but this is your slow method anyway, and
how many groups are going to be attached to a single guest?
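
A minimal sketch of such a passthrough, under stated assumptions: only the
function name and signature come from the suggestion above. DemoGroup,
demo_groups and demo_group_register() are hypothetical stand-ins for the
group list that vfio_pci.c keeps privately, and the lookup is a plain
linear walk over that list.

```c
#include <errno.h>
#include <stdarg.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Hypothetical stand-in for the private VFIO group bookkeeping. */
typedef struct DemoGroup {
    int groupid;             /* IOMMU group id */
    int container_fd;        /* fd the IOMMU ioctls are issued on */
    struct DemoGroup *next;
} DemoGroup;

static DemoGroup *demo_groups;

/* Hypothetical helper: make a group visible to the lookup below. */
static void demo_group_register(DemoGroup *g)
{
    g->next = demo_groups;
    demo_groups = g;
}

/*
 * Look up the group by id and forward the request to its container fd.
 * The caller is expected to pass exactly one pointer argument.
 * Returns the ioctl() result, or -1 with errno = ENODEV if the group
 * is not known.
 */
int vfio_group_iommu_ioctl(int iommu_group, int request, ...)
{
    va_list ap;
    void *arg;
    DemoGroup *g;

    va_start(ap, request);
    arg = va_arg(ap, void *);
    va_end(ap);

    for (g = demo_groups; g; g = g->next) {
        if (g->groupid == iommu_group) {
            /* Go through unsigned int so the request is not sign-extended. */
            return ioctl(g->container_fd,
                         (unsigned long)(unsigned int)request, arg);
        }
    }
    errno = ENODEV;
    return -1;
}
```

With something of this shape the caller would only need the group id and
the request number; the fd and the group structure stay inside the vfio
code.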

> +                            &phb->dma_window_start,
> +                            &phb->dma_window_size);
> +        return 0;
> +    }
> +
> +    phb->dma_window_start = 0;
> +    phb->dma_window_size = 0x40000000;
> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
> +                                         phb->dma_window_size);
> +
>      return 0;
>  }
>  
> @@ -599,6 +700,10 @@ static Property spapr_phb_properties[] = {
>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1), /* 0 don't 1 +devices 2 +buses */
> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
> index 145071c..1953a74 100644
> --- a/hw/spapr_pci.h
> +++ b/hw/spapr_pci.h
> @@ -57,6 +57,11 @@ typedef struct sPAPRPHBState {
>          int nvec;
>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>  
> +    int32_t iommugroupid;
> +    struct VFIOGroup *iommugroup;
> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
> +    uint8_t enable_multifunction, force_addr;
> +
>      QLIST_ENTRY(sPAPRPHBState) list;
>  } sPAPRPHBState;
>  
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> index 1ac287f..73681fb 100644
> --- a/hw/vfio_pci.c
> +++ b/hw/vfio_pci.c
> @@ -21,7 +21,6 @@
>  #include <dirent.h>
>  #include <stdio.h>
>  #include <unistd.h>
> -#include <sys/io.h>
>  #include <sys/ioctl.h>
>  #include <sys/mman.h>
>  #include <sys/types.h>
> @@ -43,6 +42,12 @@
>  #include "range.h"
>  #include "vfio_pci.h"
>  #include "linux-vfio.h"
> +#ifndef TARGET_PPC64
> +#include <sys/io.h>
> +#else
> +#include "hw/pci_internals.h"
> +#include "hw/spapr.h"
> +#endif
>  
>  //#define DEBUG_VFIO
>  #ifdef DEBUG_VFIO
> @@ -1581,6 +1586,25 @@ static int vfio_connect_container(VFIOGroup *group)
>  
>          memory_listener_register(&container->listener, get_system_memory());
>  
> +#define POWERPC_IOMMU           2
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, POWERPC_IOMMU)) {
> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> +        if (ret) {
> +            error_report("vfio: failed to set group container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
> +
> +        ret = ioctl(fd, VFIO_SET_IOMMU, POWERPC_IOMMU);
> +        if (ret) {
> +            error_report("vfio: failed to set iommu for container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models\n");
>          g_free(container);
> @@ -1620,7 +1644,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>      }
>  }
>  
> -static VFIOGroup *vfio_get_group(int groupid)
> +VFIOGroup *vfio_get_group(int groupid)
>  {
>      VFIOGroup *group;
>      char path[32];
> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
> index 226607c..d63dd63 100644
> --- a/hw/vfio_pci.h
> +++ b/hw/vfio_pci.h
> @@ -105,4 +105,6 @@ typedef struct VFIOGroup {
>  #define VFIO_FLAG_IOMMU_SHARED_BIT 0
>  #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
>  
> +VFIOGroup *vfio_get_group(int groupid);
> +
>  #endif /* __VFIO_H__ */


* Re: [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v2)
  2012-07-13  3:47   ` [Qemu-devel] " Alex Williamson
@ 2012-07-13  5:03     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-13  5:03 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On 13/07/12 13:47, Alex Williamson wrote:
> On Thu, 2012-07-12 at 18:52 +1000, Alexey Kardashevskiy wrote:
>> It literally does the following:
>>
>> 1. POWERPC IOMMU support (the kernel counterpart is required)
>>
>> 2. The patch assumes that IOAPIC calls are going to be replaced
>> with something generic. I have something in my local git but it's
>> too early, we need to extend PCIINTxRoute first.
>>
>> 3. vfio_get_group() made public. I want to open IOMMU group from
>> the sPAPR code to have everything I need for VFIO on sPAPR and
>> avoid ugly workarounds with finalizing PHB setup on sPAPR.
>>
>> 4. Change sPAPR PHB to scan the PCI bus which is used for
>> the IOMMU-VFIO group. Now it is enough to add the following to
>> the QEMU command line to get VFIO up with all the devices from
>> IOMMU group with id=3:
>> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
>> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  hw/ppc/Makefile.objs |    3 ++
>>  hw/spapr.h           |    4 ++
>>  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++
>>  hw/spapr_pci.c       |  115 +++++++++++++++++++++++++++++++++++++++++++++++---
>>  hw/spapr_pci.h       |    5 +++
>>  hw/vfio_pci.c        |   28 +++++++++++-
>>  hw/vfio_pci.h        |    2 +
>>  7 files changed, 237 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index f573a95..c46a049 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>>  # Xilinx PPC peripherals
>>  obj-y += xilinx_ethlite.o
>>  
>> +# VFIO PCI device assignment
>> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
>> +
>>  obj-y := $(addprefix ../,$(obj-y))
>> diff --git a/hw/spapr.h b/hw/spapr.h
>> index b37f337..9dca704 100644
>> --- a/hw/spapr.h
>> +++ b/hw/spapr.h
>> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>                        DMAContext *dma);
>>  
>> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
>> +                         uint64_t *dma32_window_start,
>> +                         uint64_t *dma32_window_size);
>> +
>>  #endif /* !defined (__HW_SPAPR_H__) */
>> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
>> index 50c288d..0a194e8 100644
>> --- a/hw/spapr_iommu.c
>> +++ b/hw/spapr_iommu.c
>> @@ -16,6 +16,8 @@
>>   * You should have received a copy of the GNU Lesser General Public
>>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>>   */
>> +#include <sys/ioctl.h>
>> +
>>  #include "hw.h"
>>  #include "kvm.h"
>>  #include "qdev.h"
>> @@ -23,6 +25,7 @@
>>  #include "dma.h"
>>  
>>  #include "hw/spapr.h"
>> +#include "hw/linux-vfio.h"
>>  
>>  #include <libfdt.h>
>>  
>> @@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>>      return 0;
>>  }
>>  
>> +/* -------- API for POWERPC IOMMU -------- */
>> +
>> +#define POWERPC_IOMMU           2
>> +
>> +struct tce_iommu_info {
>> +    __u32 argsz;
>> +    __u32 dma32_window_start;
>> +    __u32 dma32_window_size;
>> +};
>> +
>> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>> +
>> +struct tce_iommu_dma_map {
>> +    __u32 argsz;
>> +    __u64 va;
>> +    __u64 dmaaddr;
>> +};
>> +
>> +#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
>> +#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
>> +
>> +typedef struct sPAPRVFIOTable {
>> +    int fd;
>> +    uint32_t liobn;
>> +    QLIST_ENTRY(sPAPRVFIOTable) list;
>> +} sPAPRVFIOTable;
>> +
>> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
>> +
>> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
>> +                         uint64_t *dma32_window_start,
>> +                         uint64_t *dma32_window_size)
>> +{
>> +    sPAPRVFIOTable *t;
>> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
>> +
>> +    if (ioctl(fd, POWERPC_IOMMU_GET_INFO, &info)) {
>> +        fprintf(stderr, "POWERPC_IOMMU_GET_INFO failed %d\n", errno);
>> +        return;
>> +    }
>> +    *dma32_window_start = info.dma32_window_start;
>> +    *dma32_window_size = info.dma32_window_size;
>> +
>> +    t = g_malloc0(sizeof(*t));
>> +    t->fd = fd;
>> +    t->liobn = liobn;
>> +
>> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
>> +}
>> +
>> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
>> +{
>> +    sPAPRVFIOTable *t;
>> +    struct tce_iommu_dma_map map = {
>> +        .argsz = sizeof(map),
>> +        .va = 0,
>> +        .dmaaddr = ioba,
>> +    };
>> +
>> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
>> +        if (t->liobn != liobn) {
>> +            continue;
>> +        }
>> +        if (tce) {
>> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
>> +            if (ioctl(t->fd, POWERPC_IOMMU_MAP_DMA, &map)) {
>> +                fprintf(stderr, "TCE_MAP_DMA: %d\n", errno);
>> +                return H_PARAMETER;
>> +            }
>> +        } else {
>> +            if (ioctl(t->fd, POWERPC_IOMMU_UNMAP_DMA, &map)) {
>> +                fprintf(stderr, "TCE_UNMAP_DMA: %d\n", errno);
>> +                return H_PARAMETER;
>> +            }
>> +        }
>> +        return H_SUCCESS;
>> +    }
>> +    return H_CONTINUE; /* positive non-zero value */
>> +}
>> +
>>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>                                target_ulong opcode, target_ulong *args)
>>  {
>> @@ -203,6 +286,10 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>      if (0 >= ret) {
>>          return ret ? H_PARAMETER : H_SUCCESS;
>>      }
>> +    ret = put_tce_vfio(liobn, ioba, tce);
>> +    if (0 >= ret) {
>> +        return ret ? H_PARAMETER : H_SUCCESS;
>> +    }
>>  #ifdef DEBUG_TCE
>>      fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
>>              "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx " ret=%d\n",
>> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
>> index 014297b..92c48b6 100644
>> --- a/hw/spapr_pci.c
>> +++ b/hw/spapr_pci.c
>> @@ -22,6 +22,9 @@
>>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>   * THE SOFTWARE.
>>   */
>> +#include <sys/types.h>
>> +#include <dirent.h>
>> +
>>  #include "hw.h"
>>  #include "pci.h"
>>  #include "msi.h"
>> @@ -29,10 +32,10 @@
>>  #include "pci_host.h"
>>  #include "hw/spapr.h"
>>  #include "hw/spapr_pci.h"
>> +#include "hw/vfio_pci.h"
> 
> :-\
> 
>>  #include "exec-memory.h"
>>  #include <libfdt.h>
>>  #include "trace.h"
>> -
>>  #include "hw/pci_internals.h"
>>  
>>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
>> @@ -440,6 +443,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>>                   level);
>>  }
>>  
>> +static int pci_spapr_get_irq(void *opaque, int irq_num)
>> +{
>> +    sPAPRPHBState *phb = opaque;
>> +    return phb->lsi_table[irq_num].dt_irq;
>> +}
>> +
>>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>>                                unsigned size)
>>  {
>> @@ -515,6 +524,82 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>>      return phb->dma;
>>  }
>>  
>> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
>> +{
>> +    char iommupath[256];
>> +    DIR *dirp;
>> +    struct dirent *entry;
>> +
>> +    phb->iommugroup = vfio_get_group(phb->iommugroupid);
>> +    if (!phb->iommugroup) {
>> +        fprintf(stderr, "Cannot open IOMMU group %d\n", phb->iommugroupid);
>> +        return -1;
>> +    }
>> +
>> +    if (!phb->scan) {
>> +        printf("Autoscan disabled\n");
>> +        return 0;
>> +    }
>> +
>> +    sprintf(iommupath, "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
>> +    dirp = opendir(iommupath);
>> +
>> +    while ((entry = readdir(dirp)) != NULL) {
>> +        char *tmp = alloca(strlen(iommupath) + strlen(entry->d_name) + 32);
>> +        FILE *deviceclassfile;
>> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
>> +        char addr[32];
>> +        DeviceState *dev;
>> +
>> +        if (4 != sscanf(entry->d_name, "%X:%X:%X.%x",
>> +                        &domainid, &busid, &devid, &fnid)) {
>> +            continue;
>> +        }
>> +
>> +        sprintf(tmp, "%s%s/class", iommupath, entry->d_name);
>> +        printf("Reading device class from %s\n", tmp);
>> +
>> +        deviceclassfile = fopen(tmp, "r");
>> +        if (deviceclassfile) {
>> +            fscanf(deviceclassfile, "%x", &deviceclass);
>> +            fclose(deviceclassfile);
>> +        }
>> +        if (!deviceclass) {
>> +            continue;
>> +        }
>> +#define PCI_BASE_CLASS_BRIDGE           0x06
>> +        if ((phb->scan < 2) && ((deviceclass >> 16) == PCI_BASE_CLASS_BRIDGE)) {
>> +            continue;
>> +        }
>> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
>> +            /* Tweak USB */
>> +            phb->force_addr = 1;
>> +            phb->enable_multifunction = 1;
>> +        }
>> +
>> +        printf("Creating device %X:%X:%X.%x class=0x%X\n",
>> +               domainid, busid, devid, fnid, deviceclass);
>> +
>> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
>> +        if (!dev) {
>> +            fprintf(stderr, "failed to create vfio-pci\n");
>> +            continue;
>> +        }
>> +        qdev_prop_parse(dev, "host", entry->d_name);
>> +        if (phb->force_addr) {
>> +            sprintf(addr, "%X.%X", devid, fnid);
>> +            qdev_prop_parse(dev, "addr", addr);
>> +        }
>> +        if (phb->enable_multifunction) {
>> +            qdev_prop_set_bit(dev, "multifunction", 1);
>> +        }
>> +        qdev_init_nofail(dev);
>> +    }
>> +    closedir(dirp);
>> +
>> +    return 0;
>> +}
>> +
>>  static int spapr_phb_init(SysBusDevice *s)
>>  {
>>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
>> @@ -567,15 +652,13 @@ static int spapr_phb_init(SysBusDevice *s)
>>  
>>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>>                             phb->busname ? phb->busname : phb->dtbusname,
>> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
>> +                           pci_spapr_set_irq, pci_spapr_get_irq,
>> +                           pci_spapr_map_irq, phb,
>>                             &phb->memspace, &phb->iospace,
>>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>>      phb->host_state.bus = bus;
>>  
>>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
>> -    phb->dma_window_start = 0;
>> -    phb->dma_window_size = 0x40000000;
>> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
>>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
>>  
>>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
>> @@ -588,6 +671,24 @@ static int spapr_phb_init(SysBusDevice *s)
>>          }
>>      }
>>  
>> +    if (phb->iommugroupid >= 0) {
>> +        if (0 > spapr_pci_scan_vfio(phb)) {
>> +            return -1;
>> +        }
>> +        if (!phb->iommugroup || !phb->iommugroup->container) {
>> +            return -1;
>> +        }
>> +        spapr_vfio_init_dma(phb->iommugroup->container->fd, phb->dma_liobn,
> 
> I'm not really a fan of this approach, the vfio data structures are not
> designed for external use.  Perhaps they should be pulled into
> vfio_pci.c.

Right. I do not really want to know what is inside.


> Could we instead just make a group service function, something like:
> 
> int vfio_group_iommu_ioctl(int iommu_group, int request, ...)
> 
> Then it could just be a passthrough and you don't have to keep an fd or
> de-reference vfio structures.  There's a little overhead that we have to
> lookup the group each time, but this is your slow method anyway and how
> many groups are going to be attached to a single guest.


This solves the problem of including vfio_pci.h but does not solve the problem of what gets created
first, the VFIO PCI device or the IOMMU group. I also need to know the IOMMU group id to use the API
you proposed.

We could implement a QEMU IOMMU device which would take an IOMMU group id and do what
vfio_get_group() does.

The main idea is that on powerpc we know in advance which device belongs to which group, and we are
going to pass the whole group (maybe except bridges) to QEMU, so it is more logical to start by
creating an IOMMU group rather than a VFIO PCI device.
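
As an illustration of "start from the group", here is a small sketch of the
per-device decision the scan in the patch makes for each entry under
/sys/kernel/iommu_groups/<id>/devices/. The demo_* names are hypothetical;
the name format, the bridge class value 0x06 and the meaning of the scan
levels (0 none, 1 devices only, 2 everything) are taken from the quoted
patch.

```c
#include <stdio.h>

#define DEMO_PCI_BASE_CLASS_BRIDGE 0x06

typedef struct DemoPciAddr {
    unsigned domain, bus, dev, fn;
} DemoPciAddr;

/*
 * Parse a sysfs device name of the form "dddd:bb:dd.f" (all hex).
 * Returns 1 on success, 0 for entries such as "." and "..".
 */
int demo_parse_pci_name(const char *name, DemoPciAddr *a)
{
    return sscanf(name, "%x:%x:%x.%x",
                  &a->domain, &a->bus, &a->dev, &a->fn) == 4;
}

/*
 * Decide whether a device with the given 24-bit class code should be
 * handed to the guest for the given scan level.
 */
int demo_should_attach(unsigned scan, unsigned class_code)
{
    if (!scan || !class_code) {
        return 0;   /* scanning disabled, or class could not be read */
    }
    if (scan < 2 && (class_code >> 16) == DEMO_PCI_BASE_CLASS_BRIDGE) {
        return 0;   /* bridges are skipped unless scan == 2 */
    }
    return 1;
}
```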


Lookup is not the issue here, as we can make an fd array indexed by IOMMU group id: the ids
start from 0 and go consecutively, and I do not expect too many of them :) Or we could save this fd
somewhere in the IOMMU device and get it from there somehow (there is no ready API to read
properties though).
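
A rough sketch of the fd array idea, assuming group ids are small and dense.
DEMO_MAX_GROUPS and the demo_group_fd_*() helpers are hypothetical names,
and the limit of 64 is an arbitrary choice for the example.

```c
#include <stddef.h>

#define DEMO_MAX_GROUPS 64   /* assumption: group ids stay below this */

static int demo_group_fds[DEMO_MAX_GROUPS];
static int demo_group_fds_ready;

/* Mark every slot as "no fd" on first use. */
static void demo_group_fds_init(void)
{
    size_t i;

    if (demo_group_fds_ready) {
        return;
    }
    for (i = 0; i < DEMO_MAX_GROUPS; i++) {
        demo_group_fds[i] = -1;
    }
    demo_group_fds_ready = 1;
}

/* Remember the fd for a group; returns 0, or -1 if the id is out of range. */
int demo_group_fd_set(int groupid, int fd)
{
    demo_group_fds_init();
    if (groupid < 0 || groupid >= DEMO_MAX_GROUPS) {
        return -1;
    }
    demo_group_fds[groupid] = fd;
    return 0;
}

/* Constant-time lookup; returns the fd, or -1 if none is recorded. */
int demo_group_fd_get(int groupid)
{
    demo_group_fds_init();
    if (groupid < 0 || groupid >= DEMO_MAX_GROUPS) {
        return -1;
    }
    return demo_group_fds[groupid];
}
```

The alternative of keeping the fd inside an IOMMU device object would avoid
the fixed-size table at the cost of needing some way to read it back.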




>> +                            &phb->dma_window_start,
>> +                            &phb->dma_window_size);
>> +        return 0;
>> +    }
>> +
>> +    phb->dma_window_start = 0;
>> +    phb->dma_window_size = 0x40000000;
>> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
>> +                                         phb->dma_window_size);
>> +
>>      return 0;
>>  }
>>  
>> @@ -599,6 +700,10 @@ static Property spapr_phb_properties[] = {
>>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
>> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
>> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1), /* 0 don't 1 +devices 2 +buses */
>> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
>> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
>> index 145071c..1953a74 100644
>> --- a/hw/spapr_pci.h
>> +++ b/hw/spapr_pci.h
>> @@ -57,6 +57,11 @@ typedef struct sPAPRPHBState {
>>          int nvec;
>>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>>  
>> +    int32_t iommugroupid;
>> +    struct VFIOGroup *iommugroup;
>> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
>> +    uint8_t enable_multifunction, force_addr;
>> +
>>      QLIST_ENTRY(sPAPRPHBState) list;
>>  } sPAPRPHBState;
>>  
>> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
>> index 1ac287f..73681fb 100644
>> --- a/hw/vfio_pci.c
>> +++ b/hw/vfio_pci.c
>> @@ -21,7 +21,6 @@
>>  #include <dirent.h>
>>  #include <stdio.h>
>>  #include <unistd.h>
>> -#include <sys/io.h>
>>  #include <sys/ioctl.h>
>>  #include <sys/mman.h>
>>  #include <sys/types.h>
>> @@ -43,6 +42,12 @@
>>  #include "range.h"
>>  #include "vfio_pci.h"
>>  #include "linux-vfio.h"
>> +#ifndef TARGET_PPC64
>> +#include <sys/io.h>
>> +#else
>> +#include "hw/pci_internals.h"
>> +#include "hw/spapr.h"
>> +#endif
>>  
>>  //#define DEBUG_VFIO
>>  #ifdef DEBUG_VFIO
>> @@ -1581,6 +1586,25 @@ static int vfio_connect_container(VFIOGroup *group)
>>  
>>          memory_listener_register(&container->listener, get_system_memory());
>>  
>> +#define POWERPC_IOMMU           2
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, POWERPC_IOMMU)) {
>> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>> +        if (ret) {
>> +            error_report("vfio: failed to set group container: %s\n",
>> +                         strerror(errno));
>> +            g_free(container);
>> +            close(fd);
>> +            return -1;
>> +        }
>> +
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, POWERPC_IOMMU);
>> +        if (ret) {
>> +            error_report("vfio: failed to set iommu for container: %s\n",
>> +                         strerror(errno));
>> +            g_free(container);
>> +            close(fd);
>> +            return -1;
>> +        }
>>      } else {
>>          error_report("vfio: No available IOMMU models\n");
>>          g_free(container);
>> @@ -1620,7 +1644,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>      }
>>  }
>>  
>> -static VFIOGroup *vfio_get_group(int groupid)
>> +VFIOGroup *vfio_get_group(int groupid)
>>  {
>>      VFIOGroup *group;
>>      char path[32];
>> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
>> index 226607c..d63dd63 100644
>> --- a/hw/vfio_pci.h
>> +++ b/hw/vfio_pci.h
>> @@ -105,4 +105,6 @@ typedef struct VFIOGroup {
>>  #define VFIO_FLAG_IOMMU_SHARED_BIT 0
>>  #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
>>  
>> +VFIOGroup *vfio_get_group(int groupid);
>> +
>>  #endif /* __VFIO_H__ */
> 
> 
> 


-- 
Alexey


* Re: [Qemu-devel] [Qemu-ppc] [PATCH] RFC: vfio-powerpc: added VFIO support (v2)
  2012-07-12 20:54   ` [Qemu-devel] [Qemu-ppc] " Blue Swirl
  2012-07-12 21:37     ` Alex Williamson
@ 2012-07-13  5:24     ` Alexey Kardashevskiy
  2012-07-13 14:33       ` Blue Swirl
  1 sibling, 1 reply; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-13  5:24 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Alex Williamson, qemu-ppc, qemu-devel, David Gibson

Two comments below.

On 13/07/12 06:54, Blue Swirl wrote:
> On Thu, Jul 12, 2012 at 8:52 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>> It literally does the following:
>>
>> 1. POWERPC IOMMU support (the kernel counterpart is required)
>>
>> 2. The patch assumes that IOAPIC calls are going to be replaced
>> with something generic. I have something in my local git but it's
>> too early, we need to extend PCIINTxRoute first.
>>
>> 3. vfio_get_group() made public. I want to open IOMMU group from
>> the sPAPR code to have everything I need for VFIO on sPAPR and
>> avoid ugly workarounds with finalizing PHB setup on sPAPR.
>>
>> 4. Change sPAPR PHB to scan the PCI bus which is used for
>> the IOMMU-VFIO group. Now it is enough to add the following to
>> the QEMU command line to get VFIO up with all the devices from
>> IOMMU group with id=3:
>> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
>> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  hw/ppc/Makefile.objs |    3 ++
>>  hw/spapr.h           |    4 ++
>>  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++
>>  hw/spapr_pci.c       |  115 +++++++++++++++++++++++++++++++++++++++++++++++---
>>  hw/spapr_pci.h       |    5 +++
>>  hw/vfio_pci.c        |   28 +++++++++++-
>>  hw/vfio_pci.h        |    2 +
>>  7 files changed, 237 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index f573a95..c46a049 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>>  # Xilinx PPC peripherals
>>  obj-y += xilinx_ethlite.o
>>
>> +# VFIO PCI device assignment
>> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
>> +
>>  obj-y := $(addprefix ../,$(obj-y))
>> diff --git a/hw/spapr.h b/hw/spapr.h
>> index b37f337..9dca704 100644
>> --- a/hw/spapr.h
>> +++ b/hw/spapr.h
>> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>                        DMAContext *dma);
>>
>> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
>> +                         uint64_t *dma32_window_start,
>> +                         uint64_t *dma32_window_size);
>> +
>>  #endif /* !defined (__HW_SPAPR_H__) */
>> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
>> index 50c288d..0a194e8 100644
>> --- a/hw/spapr_iommu.c
>> +++ b/hw/spapr_iommu.c
>> @@ -16,6 +16,8 @@
>>   * You should have received a copy of the GNU Lesser General Public
>>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>>   */
>> +#include <sys/ioctl.h>
>> +
>>  #include "hw.h"
>>  #include "kvm.h"
>>  #include "qdev.h"
>> @@ -23,6 +25,7 @@
>>  #include "dma.h"
>>
>>  #include "hw/spapr.h"
>> +#include "hw/linux-vfio.h"
>>
>>  #include <libfdt.h>
>>
>> @@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>>      return 0;
>>  }
>>
>> +/* -------- API for POWERPC IOMMU -------- */
>> +
>> +#define POWERPC_IOMMU           2
>> +
>> +struct tce_iommu_info {
> 
> CamelCase.
> 
>> +    __u32 argsz;
>> +    __u32 dma32_window_start;
>> +    __u32 dma32_window_size;
> 
> Please use uint32_t.
> 
>> +};
>> +
>> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>> +
>> +struct tce_iommu_dma_map {
>> +    __u32 argsz;
> 
> The structure may or may not be padded here since there's no
> QEMU_PACKED attribute. If possible, just rearrange the fields.
> 
>> +    __u64 va;
>> +    __u64 dmaaddr;
>> +};
>> +
>> +#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
>> +#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
>> +
>> +typedef struct sPAPRVFIOTable {
>> +    int fd;
>> +    uint32_t liobn;
>> +    QLIST_ENTRY(sPAPRVFIOTable) list;
>> +} sPAPRVFIOTable;
>> +
>> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
>> +
>> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
>> +                         uint64_t *dma32_window_start,
>> +                         uint64_t *dma32_window_size)
>> +{
>> +    sPAPRVFIOTable *t;
>> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
>> +
>> +    if (ioctl(fd, POWERPC_IOMMU_GET_INFO, &info)) {
>> +        fprintf(stderr, "POWERPC_IOMMU_GET_INFO failed %d\n", errno);
>> +        return;
>> +    }
>> +    *dma32_window_start = info.dma32_window_start;
>> +    *dma32_window_size = info.dma32_window_size;
>> +
>> +    t = g_malloc0(sizeof(*t));
>> +    t->fd = fd;
>> +    t->liobn = liobn;
>> +
>> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
>> +}
>> +
>> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
>> +{
>> +    sPAPRVFIOTable *t;
>> +    struct tce_iommu_dma_map map = {
>> +        .argsz = sizeof(map),
>> +        .va = 0,
>> +        .dmaaddr = ioba,
>> +    };
>> +
>> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
>> +        if (t->liobn != liobn) {
>> +            continue;
>> +        }
>> +        if (tce) {
>> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
>> +            if (ioctl(t->fd, POWERPC_IOMMU_MAP_DMA, &map)) {
>> +                fprintf(stderr, "TCE_MAP_DMA: %d\n", errno);
> 
> perror()?
> 
>> +                return H_PARAMETER;
>> +            }
>> +        } else {
>> +            if (ioctl(t->fd, POWERPC_IOMMU_UNMAP_DMA, &map)) {
>> +                fprintf(stderr, "TCE_UNMAP_DMA: %d\n", errno);
>> +                return H_PARAMETER;
>> +            }
>> +        }
>> +        return H_SUCCESS;
>> +    }
>> +    return H_CONTINUE; /* positive non-zero value */
>> +}
>> +
>>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>                                target_ulong opcode, target_ulong *args)
>>  {
>> @@ -203,6 +286,10 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>      if (0 >= ret) {
>>          return ret ? H_PARAMETER : H_SUCCESS;
>>      }
>> +    ret = put_tce_vfio(liobn, ioba, tce);
>> +    if (0 >= ret) {
> 
> This order in expressions is not common, please reverse.
> 
>> +        return ret ? H_PARAMETER : H_SUCCESS;
>> +    }
>>  #ifdef DEBUG_TCE
>>      fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
>>              "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx " ret=%d\n",
>> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
>> index 014297b..92c48b6 100644
>> --- a/hw/spapr_pci.c
>> +++ b/hw/spapr_pci.c
>> @@ -22,6 +22,9 @@
>>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>   * THE SOFTWARE.
>>   */
>> +#include <sys/types.h>
>> +#include <dirent.h>
>> +
>>  #include "hw.h"
>>  #include "pci.h"
>>  #include "msi.h"
>> @@ -29,10 +32,10 @@
>>  #include "pci_host.h"
>>  #include "hw/spapr.h"
>>  #include "hw/spapr_pci.h"
>> +#include "hw/vfio_pci.h"
>>  #include "exec-memory.h"
>>  #include <libfdt.h>
>>  #include "trace.h"
>> -
>>  #include "hw/pci_internals.h"
>>
>>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
>> @@ -440,6 +443,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>>                   level);
>>  }
>>
>> +static int pci_spapr_get_irq(void *opaque, int irq_num)
>> +{
>> +    sPAPRPHBState *phb = opaque;
>> +    return phb->lsi_table[irq_num].dt_irq;
>> +}
>> +
>>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>>                                unsigned size)
>>  {
>> @@ -515,6 +524,82 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>>      return phb->dma;
>>  }
>>
>> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
>> +{
>> +    char iommupath[256];
>> +    DIR *dirp;
>> +    struct dirent *entry;
>> +
>> +    phb->iommugroup = vfio_get_group(phb->iommugroupid);
>> +    if (!phb->iommugroup) {
>> +        fprintf(stderr, "Cannot open IOMMU group %d\n", phb->iommugroupid);
>> +        return -1;
>> +    }
>> +
>> +    if (!phb->scan) {
>> +        printf("Autoscan disabled\n");
>> +        return 0;
>> +    }
>> +
>> +    sprintf(iommupath, "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
> 
> Please use snprintf() or g_strdup_printf().
> 
> 
>> +    dirp = opendir(iommupath);
>> +
>> +    while ((entry = readdir(dirp)) != NULL) {
>> +        char *tmp = alloca(strlen(iommupath) + strlen(entry->d_name) + 32);
>> +        FILE *deviceclassfile;
>> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
>> +        char addr[32];
>> +        DeviceState *dev;
>> +
>> +        if (4 != sscanf(entry->d_name, "%X:%X:%X.%x",
> 
> Please put the constant last.
> 
>> +                        &domainid, &busid, &devid, &fnid)) {
>> +            continue;
>> +        }
>> +
>> +        sprintf(tmp, "%s%s/class", iommupath, entry->d_name);
> 
> Again, snprintf() or g_strdup_printf() (which avoids the alloca() too).
> 
>> +        printf("Reading device class from %s\n", tmp);
> 
> Leftover debugging?
> 
>> +
>> +        deviceclassfile = fopen(tmp, "r");
>> +        if (deviceclassfile) {
>> +            fscanf(deviceclassfile, "%x", &deviceclass);
>> +            fclose(deviceclassfile);
>> +        }
>> +        if (!deviceclass) {
>> +            continue;
>> +        }
>> +#define PCI_BASE_CLASS_BRIDGE           0x06
> 
> This belongs to pci_ids.h.


It should be, but it is not there: QEMU's pci_ids.h is much smaller than the one in the kernel.
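
For reference, the sysfs "class" file holds a 24-bit value (base class, subclass, programming interface), so the base class is the top byte. A minimal sketch of the check, assuming the locally defined PCI_BASE_CLASS_BRIDGE value of 0x06 from the patch; the helper name is made up for illustration and is not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: same value the patch defines locally. */
#define PCI_BASE_CLASS_BRIDGE 0x06

/*
 * Hypothetical helper: returns true if a 24-bit class value, as read
 * from sysfs, describes any kind of bridge (host, ISA, PCI-to-PCI, ...).
 */
static bool is_pci_bridge_class(unsigned deviceclass)
{
    /* sysfs reports base class, subclass and prog-if in 24 bits. */
    assert(deviceclass <= 0xffffff);
    return (deviceclass >> 16) == PCI_BASE_CLASS_BRIDGE;
}
```

With this, 0x060400 (PCI-to-PCI bridge) is skipped when scan < 2, while 0x0c0320 (USB EHCI) is not.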



>> +        if ((phb->scan < 2) && ((deviceclass >> 16) == PCI_BASE_CLASS_BRIDGE)) {
>> +            continue;
>> +        }
>> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
>> +            /* Tweak USB */
>> +            phb->force_addr = 1;
>> +            phb->enable_multifunction = 1;
>> +        }
>> +
>> +        printf("Creating device %X:%X:%X.%x class=0x%X\n",
>> +               domainid, busid, devid, fnid, deviceclass);
> 
> Lower case hex, please.
> 
>> +
>> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
>> +        if (!dev) {
>> +            fprintf(stderr, "failed to create vfio-pci\n");
>> +            continue;
>> +        }
>> +        qdev_prop_parse(dev, "host", entry->d_name);
>> +        if (phb->force_addr) {
>> +            sprintf(addr, "%X.%X", devid, fnid);
> 
> snprintf, lower case hex.
> 
>> +            qdev_prop_parse(dev, "addr", addr);
>> +        }
>> +        if (phb->enable_multifunction) {
>> +            qdev_prop_set_bit(dev, "multifunction", 1);
>> +        }
>> +        qdev_init_nofail(dev);
>> +    }
>> +    closedir(dirp);
>> +
>> +    return 0;
>> +}
>> +
>>  static int spapr_phb_init(SysBusDevice *s)
>>  {
>>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
>> @@ -567,15 +652,13 @@ static int spapr_phb_init(SysBusDevice *s)
>>
>>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>>                             phb->busname ? phb->busname : phb->dtbusname,
>> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
>> +                           pci_spapr_set_irq, pci_spapr_get_irq,
>> +                           pci_spapr_map_irq, phb,
>>                             &phb->memspace, &phb->iospace,
>>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>>      phb->host_state.bus = bus;
>>
>>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
>> -    phb->dma_window_start = 0;
>> -    phb->dma_window_size = 0x40000000;
>> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
>>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
>>
>>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
>> @@ -588,6 +671,24 @@ static int spapr_phb_init(SysBusDevice *s)
>>          }
>>      }
>>
>> +    if (phb->iommugroupid >= 0) {
>> +        if (0 > spapr_pci_scan_vfio(phb)) {
> 
> Order.
> 
>> +            return -1;
>> +        }
>> +        if (!phb->iommugroup || !phb->iommugroup->container) {
>> +            return -1;
>> +        }
>> +        spapr_vfio_init_dma(phb->iommugroup->container->fd, phb->dma_liobn,
>> +                            &phb->dma_window_start,
>> +                            &phb->dma_window_size);
>> +        return 0;
>> +    }
>> +
>> +    phb->dma_window_start = 0;
>> +    phb->dma_window_size = 0x40000000;
>> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
>> +                                         phb->dma_window_size);
>> +
>>      return 0;
>>  }
>>
>> @@ -599,6 +700,10 @@ static Property spapr_phb_properties[] = {
>>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
>> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
>> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1), /* 0 don't 1 +devices 2 +buses */
>> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
>> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>
>> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
>> index 145071c..1953a74 100644
>> --- a/hw/spapr_pci.h
>> +++ b/hw/spapr_pci.h
>> @@ -57,6 +57,11 @@ typedef struct sPAPRPHBState {
>>          int nvec;
>>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>>
>> +    int32_t iommugroupid;
>> +    struct VFIOGroup *iommugroup;
>> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
>> +    uint8_t enable_multifunction, force_addr;
> 
> Use bool for both?



There is no DEFINE_PROP_BOOL. There is DEFINE_PROP_BIT, but it works with bits, not bools,
so uint8_t is the closest match.



>> +
>>      QLIST_ENTRY(sPAPRPHBState) list;
>>  } sPAPRPHBState;
>>
>> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
>> index 1ac287f..73681fb 100644
>> --- a/hw/vfio_pci.c
>> +++ b/hw/vfio_pci.c
>> @@ -21,7 +21,6 @@
>>  #include <dirent.h>
>>  #include <stdio.h>
>>  #include <unistd.h>
>> -#include <sys/io.h>
>>  #include <sys/ioctl.h>
>>  #include <sys/mman.h>
>>  #include <sys/types.h>
>> @@ -43,6 +42,12 @@
>>  #include "range.h"
>>  #include "vfio_pci.h"
>>  #include "linux-vfio.h"
>> +#ifndef TARGET_PPC64
>> +#include <sys/io.h>
>> +#else
>> +#include "hw/pci_internals.h"
>> +#include "hw/spapr.h"
>> +#endif
>>
>>  //#define DEBUG_VFIO
>>  #ifdef DEBUG_VFIO
>> @@ -1581,6 +1586,25 @@ static int vfio_connect_container(VFIOGroup *group)
>>
>>          memory_listener_register(&container->listener, get_system_memory());
>>
>> +#define POWERPC_IOMMU           2
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, POWERPC_IOMMU)) {
>> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>> +        if (ret) {
>> +            error_report("vfio: failed to set group container: %s\n",
>> +                         strerror(errno));
>> +            g_free(container);
>> +            close(fd);
>> +            return -1;
>> +        }
>> +
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, POWERPC_IOMMU);
>> +        if (ret) {
>> +            error_report("vfio: failed to set iommu for container: %s\n",
>> +                         strerror(errno));
>> +            g_free(container);
>> +            close(fd);
>> +            return -1;
>> +        }
>>      } else {
>>          error_report("vfio: No available IOMMU models\n");
>>          g_free(container);
>> @@ -1620,7 +1644,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>      }
>>  }
>>
>> -static VFIOGroup *vfio_get_group(int groupid)
>> +VFIOGroup *vfio_get_group(int groupid)
>>  {
>>      VFIOGroup *group;
>>      char path[32];
>> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
>> index 226607c..d63dd63 100644
>> --- a/hw/vfio_pci.h
>> +++ b/hw/vfio_pci.h
>> @@ -105,4 +105,6 @@ typedef struct VFIOGroup {
>>  #define VFIO_FLAG_IOMMU_SHARED_BIT 0
>>  #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
>>
>> +VFIOGroup *vfio_get_group(int groupid);
>> +
>>  #endif /* __VFIO_H__ */
>> --
>> 1.7.10
>>
>>
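
On the padding remark about tce_iommu_dma_map: a minimal sketch of the problem and of one possible rearrangement. The struct names below are hypothetical, and the explicit 32-bit "flags" field is only an assumption about how the hole could be filled, not necessarily what the kernel header will use:

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as posted: a 32-bit field followed directly by 64-bit fields.
 * Whether the compiler inserts 4 bytes of padding after argsz depends
 * on the ABI's alignment of uint64_t (8 on ppc64/x86-64, 4 on i386). */
struct map_as_posted {
    uint32_t argsz;
    uint64_t va;
    uint64_t dmaaddr;
};

/* Same data with the hole made explicit, so the layout no longer
 * depends on the ABI and no packed attribute is needed. */
struct map_explicit {
    uint32_t argsz;
    uint32_t flags;   /* hypothetical field occupying the former hole */
    uint64_t va;
    uint64_t dmaaddr;
};

/* The explicit layout is the same on common 32- and 64-bit ABIs. */
_Static_assert(offsetof(struct map_explicit, va) == 8, "va at offset 8");
_Static_assert(sizeof(struct map_explicit) == 24, "no hidden padding");
```

The first struct has va at offset 4 or 8 depending on the target, which is exactly the ambiguity a user/kernel interface should avoid.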


-- 
Alexey

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH] RFC: vfio-powerpc: added VFIO support (v2)
  2012-07-12 22:35   ` Scott Wood
@ 2012-07-13  5:31     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-13  5:31 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alex Williamson, qemu-ppc, qemu-devel, David Gibson

On 13/07/12 08:35, Scott Wood wrote:
> On 07/12/2012 03:52 AM, Alexey Kardashevskiy wrote:
>> +/* -------- API for POWERPC IOMMU -------- */
>> +
>> +#define POWERPC_IOMMU           2
>> +
>> +struct tce_iommu_info {
>> +    __u32 argsz;
>> +    __u32 dma32_window_start;
>> +    __u32 dma32_window_size;
>> +};
>> +
>> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>> +
>> +struct tce_iommu_dma_map {
>> +    __u32 argsz;
>> +    __u64 va;
>> +    __u64 dmaaddr;
>> +};
>> +
>> +#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
>> +#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
> 
> I thought you were going to change the name to be less generic...


I will change them indeed; I am just focused on other things right now, such as EOI handlers and
the order of device creation. It will be fixed in the next iteration.



-- 
Alexey

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3)
  2012-07-10  5:51 [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2012-07-12  8:52 ` [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v2) Alexey Kardashevskiy
@ 2012-07-13  7:26 ` Alexey Kardashevskiy
  2012-07-13 14:38   ` Blue Swirl
  2012-07-13 15:07   ` Alex Williamson
  2012-07-18 11:09 ` [Qemu-devel] [PATCH] vfio-powerpc: added VFIO support (v4) Alexey Kardashevskiy
  2012-07-19  4:04 ` [Qemu-devel] [PATCH] vfio-powerpc: added VFIO support (v5) Alexey Kardashevskiy
  6 siblings, 2 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-13  7:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, Alexander Graf, qemu-devel, Blue Swirl,
	qemu-ppc, David Gibson

It literally does the following:

1. POWERPC IOMMU support (the kernel counterpart is required)

2. The patch assumes that IOAPIC calls are going to be replaced
with something generic.

3. vfio_group_iommu_ioctl() has been added to let the sPAPR IOMMU
handler call the VFIO IOMMU driver.

4. Change sPAPR PHB to scan the PCI bus which is used for
the IOMMU-VFIO group. Now it is enough to add the following to
the QEMU command line to get VFIO up with all the devices from
IOMMU group with id=3:
-device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
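
The scan in item 4 walks /sys/kernel/iommu_groups/<id>/devices/ and treats every entry named domain:bus:device.function as a candidate device. A minimal sketch of that name-parsing step, assuming hexadecimal fields as sysfs prints them; the struct and function names are hypothetical and not part of the patch:

```c
#include <stdio.h>

/* Hypothetical holder for a parsed host PCI address. */
struct pci_addr {
    unsigned domain, bus, dev, fn;
};

/*
 * Parses a sysfs entry name such as "0001:00:1f.2".
 * Returns 0 on success, -1 if the name does not start with four
 * hexadecimal fields in domain:bus:dev.fn form (e.g. "." or "..").
 * Trailing characters after the function number are not rejected.
 */
static int parse_pci_addr(const char *name, struct pci_addr *out)
{
    if (sscanf(name, "%x:%x:%x.%x",
               &out->domain, &out->bus, &out->dev, &out->fn) != 4) {
        return -1;
    }
    return 0;
}
```

Entries that fail to parse, such as the "." and ".." returned by readdir(), are simply skipped by the scan loop.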

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/Makefile.objs  |    3 ++
 hw/spapr.h            |    4 ++
 hw/spapr_iommu.c      |   69 ++++++++++++++++++++++++++++++-
 hw/spapr_iommu_vfio.h |   49 ++++++++++++++++++++++
 hw/spapr_pci.c        |  108 ++++++++++++++++++++++++++++++++++++++++++++++---
 hw/spapr_pci.h        |    4 ++
 hw/vfio_pci.c         |   30 ++++++++++++++
 hw/vfio_pci.h         |    2 +
 trace-events          |    1 +
 9 files changed, 264 insertions(+), 6 deletions(-)
 create mode 100644 hw/spapr_iommu_vfio.h

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index f573a95..c46a049 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
 # Xilinx PPC peripherals
 obj-y += xilinx_ethlite.o
 
+# VFIO PCI device assignment
+obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
+
 obj-y := $(addprefix ../,$(obj-y))
diff --git a/hw/spapr.h b/hw/spapr.h
index b37f337..26e26f6 100644
--- a/hw/spapr.h
+++ b/hw/spapr.h
@@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
 int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
                       DMAContext *dma);
 
+void spapr_vfio_init_dma(int group_id, uint32_t liobn,
+                         uint64_t *dma32_window_start,
+                         uint64_t *dma32_window_size);
+
 #endif /* !defined (__HW_SPAPR_H__) */
diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
index 50c288d..e48ced1 100644
--- a/hw/spapr_iommu.c
+++ b/hw/spapr_iommu.c
@@ -23,6 +23,8 @@
 #include "dma.h"
 
 #include "hw/spapr.h"
+#include "hw/spapr_iommu_vfio.h"
+#include "hw/vfio_pci.h"
 
 #include <libfdt.h>
 
@@ -183,6 +185,67 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
     return 0;
 }
 
+typedef struct sPAPRVFIOTable {
+    int group_id;
+    uint32_t liobn;
+    QLIST_ENTRY(sPAPRVFIOTable) list;
+} sPAPRVFIOTable;
+
+QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
+
+void spapr_vfio_init_dma(int group_id, uint32_t liobn,
+                         uint64_t *dma32_window_start,
+                         uint64_t *dma32_window_size)
+{
+    sPAPRVFIOTable *t;
+    struct tce_iommu_info info = { .argsz = sizeof(info) };
+
+    if (vfio_group_iommu_ioctl(group_id, SPAPR_TCE_IOMMU_GET_INFO, &info)) {
+        perror("SPAPR_TCE_IOMMU_GET_INFO failed");
+        return;
+    }
+    *dma32_window_start = info.dma32_window_start;
+    *dma32_window_size = info.dma32_window_size;
+
+    t = g_malloc0(sizeof(*t));
+    t->group_id = group_id;
+    t->liobn = liobn;
+
+    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
+}
+
+static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
+{
+    sPAPRVFIOTable *t;
+    struct tce_iommu_dma_map map = {
+        .argsz = sizeof(map),
+        .va = 0,
+        .dmaaddr = ioba,
+    };
+
+    QLIST_FOREACH(t, &vfio_tce_tables, list) {
+        if (t->liobn != liobn) {
+            continue;
+        }
+        if (tce) {
+            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
+            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_MAP_DMA,
+                                       &map)) {
+                perror("TCE_MAP_DMA");
+                return H_PARAMETER;
+            }
+        } else {
+            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_UNMAP_DMA,
+                                       &map)) {
+                perror("TCE_UNMAP_DMA");
+                return H_PARAMETER;
+            }
+        }
+        return H_SUCCESS;
+    }
+    return H_CONTINUE; /* positive non-zero value */
+}
+
 static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
                               target_ulong opcode, target_ulong *args)
 {
@@ -200,7 +263,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
     ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
 
     ret = put_tce_emu(liobn, ioba, tce);
-    if (0 >= ret) {
+    if (ret <= 0) {
+        return ret ? H_PARAMETER : H_SUCCESS;
+    }
+    ret = put_tce_vfio(liobn, ioba, tce);
+    if (ret <= 0) {
         return ret ? H_PARAMETER : H_SUCCESS;
     }
 #ifdef DEBUG_TCE
diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
new file mode 100644
index 0000000..711e3e4
--- /dev/null
+++ b/hw/spapr_iommu_vfio.h
@@ -0,0 +1,49 @@
+/*
+ * Definitions for VFIO IOMMU driver implementing SPAPR TCE.
+ * This is the copy of the kernel header.
+ *
+ * Copyright (c) 2012 Alexey Kardashevskiy <aik@olabs.ru>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
+#define __HW_SPAPR_IOMMU_VFIO_H__
+
+#include "hw/linux-vfio.h"
+
+#define SPAPR_TCE_IOMMU         2
+
+struct tce_iommu_info {
+    __u32 argsz;
+    __u32 flags;
+    __u32 dma32_window_start;
+    __u32 dma32_window_size;
+    __u64 dma64_window_start;
+    __u64 dma64_window_size;
+};
+
+#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+struct tce_iommu_dma_map {
+    __u32 argsz;
+    __u32 flags;
+    __u64 va;
+    __u64 dmaaddr;
+};
+
+#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
+#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
+
+#endif
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 014297b..836ec4f 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -22,6 +22,9 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
+#include <sys/types.h>
+#include <dirent.h>
+
 #include "hw.h"
 #include "pci.h"
 #include "msi.h"
@@ -32,7 +35,6 @@
 #include "exec-memory.h"
 #include <libfdt.h>
 #include "trace.h"
-
 #include "hw/pci_internals.h"
 
 /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
@@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
                  level);
 }
 
+static int pci_spapr_get_irq(void *opaque, int irq_num)
+{
+    sPAPRPHBState *phb = opaque;
+    return phb->lsi_table[irq_num].dt_irq;
+}
+
 static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
                               unsigned size)
 {
@@ -515,6 +523,79 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
     return phb->dma;
 }
 
+static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
+{
+    char iommupath[256];
+    DIR *dirp;
+    struct dirent *entry;
+
+    if (!phb->scan) {
+        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
+        return 0;
+    }
+
+    snprintf(iommupath, sizeof(iommupath),
+             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
+    dirp = opendir(iommupath);
+
+    while ((entry = readdir(dirp)) != NULL) {
+        char *tmp;
+        FILE *deviceclassfile;
+        unsigned deviceclass = 0, domainid, busid, devid, fnid;
+        char addr[32];
+        DeviceState *dev;
+
+        if (sscanf(entry->d_name, "%X:%X:%X.%x",
+                   &domainid, &busid, &devid, &fnid) != 4) {
+            continue;
+        }
+
+        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
+        trace_spapr_pci("Reading device class from ", tmp);
+
+        deviceclassfile = fopen(tmp, "r");
+        if (deviceclassfile) {
+            fscanf(deviceclassfile, "%x", &deviceclass);
+            fclose(deviceclassfile);
+        }
+        g_free(tmp);
+
+        if (!deviceclass) {
+            continue;
+        }
+        if ((phb->scan < 2) &&
+            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
+            /* Skip _any_ bridge */
+            continue;
+        }
+        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
+            /* Tweak USB */
+            phb->force_addr = 1;
+            phb->enable_multifunction = 1;
+        }
+
+        trace_spapr_pci("Creating device from ", entry->d_name);
+
+        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
+        if (!dev) {
+            fprintf(stderr, "failed to create vfio-pci\n");
+            continue;
+        }
+        qdev_prop_parse(dev, "host", entry->d_name);
+        if (phb->force_addr) {
+            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
+            qdev_prop_parse(dev, "addr", addr);
+        }
+        if (phb->enable_multifunction) {
+            qdev_prop_set_bit(dev, "multifunction", 1);
+        }
+        qdev_init_nofail(dev);
+    }
+    closedir(dirp);
+
+    return 0;
+}
+
 static int spapr_phb_init(SysBusDevice *s)
 {
     sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
@@ -567,15 +648,13 @@ static int spapr_phb_init(SysBusDevice *s)
 
     bus = pci_register_bus(&phb->host_state.busdev.qdev,
                            phb->busname ? phb->busname : phb->dtbusname,
-                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
+                           pci_spapr_set_irq, pci_spapr_get_irq,
+                           pci_spapr_map_irq, phb,
                            &phb->memspace, &phb->iospace,
                            PCI_DEVFN(0, 0), PCI_NUM_PINS);
     phb->host_state.bus = bus;
 
     phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
-    phb->dma_window_start = 0;
-    phb->dma_window_size = 0x40000000;
-    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
     pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
 
     QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
@@ -588,6 +667,21 @@ static int spapr_phb_init(SysBusDevice *s)
         }
     }
 
+    if (phb->iommugroupid >= 0) {
+        if (spapr_pci_scan_vfio(phb) < 0) {
+            return -1;
+        }
+        spapr_vfio_init_dma(phb->iommugroupid, phb->dma_liobn,
+                            &phb->dma_window_start,
+                            &phb->dma_window_size);
+        return 0;
+    }
+
+    phb->dma_window_start = 0;
+    phb->dma_window_size = 0x40000000;
+    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
+                                         phb->dma_window_size);
+
     return 0;
 }
 
@@ -599,6 +693,10 @@ static Property spapr_phb_properties[] = {
     DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
     DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
     DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
+    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
+    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
+    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
+    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index 145071c..f514823 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -57,6 +57,10 @@ typedef struct sPAPRPHBState {
         int nvec;
     } msi_table[SPAPR_MSIX_MAX_DEVS];
 
+    int32_t iommugroupid;
+    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
+    uint8_t enable_multifunction, force_addr;
+
     QLIST_ENTRY(sPAPRPHBState) list;
 } sPAPRPHBState;
 
diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 1ac287f..fc84fb4 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -1581,6 +1581,24 @@ static int vfio_connect_container(VFIOGroup *group)
 
         memory_listener_register(&container->listener, get_system_memory());
 
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
     } else {
         error_report("vfio: No available IOMMU models\n");
         g_free(container);
@@ -2005,3 +2023,15 @@ static void register_vfio_pci_dev_type(void)
 }
 
 type_init(register_vfio_pci_dev_type)
+
+int vfio_group_iommu_ioctl(int iommu_group, int request, void *data)
+{
+    VFIOGroup *group;
+
+    group = vfio_get_group(iommu_group);
+    if (!group->container) {
+        return -EINVAL;
+    }
+
+    return ioctl(group->container->fd, request, data);
+}
diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
index 226607c..f44ff07 100644
--- a/hw/vfio_pci.h
+++ b/hw/vfio_pci.h
@@ -105,4 +105,6 @@ typedef struct VFIOGroup {
 #define VFIO_FLAG_IOMMU_SHARED_BIT 0
 #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
 
+int vfio_group_iommu_ioctl(int iommu_group, int request, void *data);
+
 #endif /* __VFIO_H__ */
diff --git a/trace-events b/trace-events
index e548f86..9100591 100644
--- a/trace-events
+++ b/trace-events
@@ -848,6 +848,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
 qxl_render_update_area_done(void *cookie) "%p"
 
 # hw/spapr_pci.c
+spapr_pci(const char *msg1, const char *msg2) "%s%s"
 spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
 spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
 spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"
-- 
1.7.10


* Re: [Qemu-devel] [Qemu-ppc] [PATCH] RFC: vfio-powerpc: added VFIO support (v2)
  2012-07-13  5:24     ` Alexey Kardashevskiy
@ 2012-07-13 14:33       ` Blue Swirl
  0 siblings, 0 replies; 52+ messages in thread
From: Blue Swirl @ 2012-07-13 14:33 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel, David Gibson

On Fri, Jul 13, 2012 at 5:24 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> Two comments below.
>
> On 13/07/12 06:54, Blue Swirl wrote:
>> On Thu, Jul 12, 2012 at 8:52 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>> It literally does the following:
>>>
>>> 1. POWERPC IOMMU support (the kernel counterpart is required)
>>>
>>> 2. The patch assumes that IOAPIC calls are going to be replaced
>>> with something generic. I have something in my local git but it's
>>> too early, we need to extend PCIINTxRoute first.
>>>
>>> 3. vfio_get_group() made public. I want to open the IOMMU group from
>>> the sPAPR code to have everything I need for VFIO on sPAPR and
>>> avoid ugly workarounds with finalizing PHB setup on sPAPR.
>>>
>>> 4. Change sPAPR PHB to scan the PCI bus which is used for
>>> the IOMMU-VFIO group. Now it is enough to add the following to
>>> the QEMU command line to get VFIO up with all the devices from
>>> IOMMU group with id=3:
>>> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
>>> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>  hw/ppc/Makefile.objs |    3 ++
>>>  hw/spapr.h           |    4 ++
>>>  hw/spapr_iommu.c     |   87 ++++++++++++++++++++++++++++++++++++++
>>>  hw/spapr_pci.c       |  115 +++++++++++++++++++++++++++++++++++++++++++++++---
>>>  hw/spapr_pci.h       |    5 +++
>>>  hw/vfio_pci.c        |   28 +++++++++++-
>>>  hw/vfio_pci.h        |    2 +
>>>  7 files changed, 237 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>> index f573a95..c46a049 100644
>>> --- a/hw/ppc/Makefile.objs
>>> +++ b/hw/ppc/Makefile.objs
>>> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>>>  # Xilinx PPC peripherals
>>>  obj-y += xilinx_ethlite.o
>>>
>>> +# VFIO PCI device assignment
>>> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
>>> +
>>>  obj-y := $(addprefix ../,$(obj-y))
>>> diff --git a/hw/spapr.h b/hw/spapr.h
>>> index b37f337..9dca704 100644
>>> --- a/hw/spapr.h
>>> +++ b/hw/spapr.h
>>> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>>                        DMAContext *dma);
>>>
>>> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
>>> +                         uint64_t *dma32_window_start,
>>> +                         uint64_t *dma32_window_size);
>>> +
>>>  #endif /* !defined (__HW_SPAPR_H__) */
>>> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
>>> index 50c288d..0a194e8 100644
>>> --- a/hw/spapr_iommu.c
>>> +++ b/hw/spapr_iommu.c
>>> @@ -16,6 +16,8 @@
>>>   * You should have received a copy of the GNU Lesser General Public
>>>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>>>   */
>>> +#include <sys/ioctl.h>
>>> +
>>>  #include "hw.h"
>>>  #include "kvm.h"
>>>  #include "qdev.h"
>>> @@ -23,6 +25,7 @@
>>>  #include "dma.h"
>>>
>>>  #include "hw/spapr.h"
>>> +#include "hw/linux-vfio.h"
>>>
>>>  #include <libfdt.h>
>>>
>>> @@ -183,6 +186,86 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>>>      return 0;
>>>  }
>>>
>>> +/* -------- API for POWERPC IOMMU -------- */
>>> +
>>> +#define POWERPC_IOMMU           2
>>> +
>>> +struct tce_iommu_info {
>>
>> CamelCase.
>>
>>> +    __u32 argsz;
>>> +    __u32 dma32_window_start;
>>> +    __u32 dma32_window_size;
>>
>> Please use uint32_t.
>>
>>> +};
>>> +
>>> +#define POWERPC_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>>> +
>>> +struct tce_iommu_dma_map {
>>> +    __u32 argsz;
>>
>> The structure may or may not be padded here since there's no
>> QEMU_PACKED attribute. If possible, just rearrange the fields.
>>
>>> +    __u64 va;
>>> +    __u64 dmaaddr;
>>> +};
>>> +
>>> +#define POWERPC_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
>>> +#define POWERPC_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
>>> +
>>> +typedef struct sPAPRVFIOTable {
>>> +    int fd;
>>> +    uint32_t liobn;
>>> +    QLIST_ENTRY(sPAPRVFIOTable) list;
>>> +} sPAPRVFIOTable;
>>> +
>>> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
>>> +
>>> +void spapr_vfio_init_dma(int fd, uint32_t liobn,
>>> +                         uint64_t *dma32_window_start,
>>> +                         uint64_t *dma32_window_size)
>>> +{
>>> +    sPAPRVFIOTable *t;
>>> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
>>> +
>>> +    if (ioctl(fd, POWERPC_IOMMU_GET_INFO, &info)) {
>>> +        fprintf(stderr, "POWERPC_IOMMU_GET_INFO failed %d\n", errno);
>>> +        return;
>>> +    }
>>> +    *dma32_window_start = info.dma32_window_start;
>>> +    *dma32_window_size = info.dma32_window_size;
>>> +
>>> +    t = g_malloc0(sizeof(*t));
>>> +    t->fd = fd;
>>> +    t->liobn = liobn;
>>> +
>>> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
>>> +}
>>> +
>>> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
>>> +{
>>> +    sPAPRVFIOTable *t;
>>> +    struct tce_iommu_dma_map map = {
>>> +        .argsz = sizeof(map),
>>> +        .va = 0,
>>> +        .dmaaddr = ioba,
>>> +    };
>>> +
>>> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
>>> +        if (t->liobn != liobn) {
>>> +            continue;
>>> +        }
>>> +        if (tce) {
>>> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
>>> +            if (ioctl(t->fd, POWERPC_IOMMU_MAP_DMA, &map)) {
>>> +                fprintf(stderr, "TCE_MAP_DMA: %d\n", errno);
>>
>> perror()?
>>
>>> +                return H_PARAMETER;
>>> +            }
>>> +        } else {
>>> +            if (ioctl(t->fd, POWERPC_IOMMU_UNMAP_DMA, &map)) {
>>> +                fprintf(stderr, "TCE_UNMAP_DMA: %d\n", errno);
>>> +                return H_PARAMETER;
>>> +            }
>>> +        }
>>> +        return H_SUCCESS;
>>> +    }
>>> +    return H_CONTINUE; /* positive non-zero value */
>>> +}
>>> +
>>>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>>                                target_ulong opcode, target_ulong *args)
>>>  {
>>> @@ -203,6 +286,10 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>>      if (0 >= ret) {
>>>          return ret ? H_PARAMETER : H_SUCCESS;
>>>      }
>>> +    ret = put_tce_vfio(liobn, ioba, tce);
>>> +    if (0 >= ret) {
>>
>> This order in expressions is not common, please reverse.
>>
>>> +        return ret ? H_PARAMETER : H_SUCCESS;
>>> +    }
>>>  #ifdef DEBUG_TCE
>>>      fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
>>>              "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx " ret=%d\n",
>>> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
>>> index 014297b..92c48b6 100644
>>> --- a/hw/spapr_pci.c
>>> +++ b/hw/spapr_pci.c
>>> @@ -22,6 +22,9 @@
>>>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>>   * THE SOFTWARE.
>>>   */
>>> +#include <sys/types.h>
>>> +#include <dirent.h>
>>> +
>>>  #include "hw.h"
>>>  #include "pci.h"
>>>  #include "msi.h"
>>> @@ -29,10 +32,10 @@
>>>  #include "pci_host.h"
>>>  #include "hw/spapr.h"
>>>  #include "hw/spapr_pci.h"
>>> +#include "hw/vfio_pci.h"
>>>  #include "exec-memory.h"
>>>  #include <libfdt.h>
>>>  #include "trace.h"
>>> -
>>>  #include "hw/pci_internals.h"
>>>
>>>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
>>> @@ -440,6 +443,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>>>                   level);
>>>  }
>>>
>>> +static int pci_spapr_get_irq(void *opaque, int irq_num)
>>> +{
>>> +    sPAPRPHBState *phb = opaque;
>>> +    return phb->lsi_table[irq_num].dt_irq;
>>> +}
>>> +
>>>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>>>                                unsigned size)
>>>  {
>>> @@ -515,6 +524,82 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>>>      return phb->dma;
>>>  }
>>>
>>> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
>>> +{
>>> +    char iommupath[256];
>>> +    DIR *dirp;
>>> +    struct dirent *entry;
>>> +
>>> +    phb->iommugroup = vfio_get_group(phb->iommugroupid);
>>> +    if (!phb->iommugroup) {
>>> +        fprintf(stderr, "Cannot open IOMMU group %d\n", phb->iommugroupid);
>>> +        return -1;
>>> +    }
>>> +
>>> +    if (!phb->scan) {
>>> +        printf("Autoscan disabled\n");
>>> +        return 0;
>>> +    }
>>> +
>>> +    sprintf(iommupath, "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
>>
>> Please use snprintf() or g_strdup_printf().
>>
>>
>>> +    dirp = opendir(iommupath);
>>> +
>>> +    while ((entry = readdir(dirp)) != NULL) {
>>> +        char *tmp = alloca(strlen(iommupath) + strlen(entry->d_name) + 32);
>>> +        FILE *deviceclassfile;
>>> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
>>> +        char addr[32];
>>> +        DeviceState *dev;
>>> +
>>> +        if (4 != sscanf(entry->d_name, "%X:%X:%X.%x",
>>
>> Please put the constant last.
>>
>>> +                        &domainid, &busid, &devid, &fnid)) {
>>> +            continue;
>>> +        }
>>> +
>>> +        sprintf(tmp, "%s%s/class", iommupath, entry->d_name);
>>
>> Again, snprintf() or g_strdup_printf() (which avoids the alloca() too).
>>
>>> +        printf("Reading device class from %s\n", tmp);
>>
>> Leftover debugging?
>>
>>> +
>>> +        deviceclassfile = fopen(tmp, "r");
>>> +        if (deviceclassfile) {
>>> +            fscanf(deviceclassfile, "%x", &deviceclass);
>>> +            fclose(deviceclassfile);
>>> +        }
>>> +        if (!deviceclass) {
>>> +            continue;
>>> +        }
>>> +#define PCI_BASE_CLASS_BRIDGE           0x06
>>
>> This belongs to pci_ids.h.
>
>
> It should, but it is not there. QEMU's pci_ids.h is way smaller than the one in the kernel.

That is because the policy is to keep the file in sync with the
kernel's copy, except that unused entries may be left out. So please
add the entry to pci_ids.h.
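
For illustration, once the constant comes from pci_ids.h the bridge test
could look roughly like the sketch below. It assumes the value read from
the sysfs "class" file is the 24-bit class code with the base class in
bits 23..16; the helper name pci_class_is_bridge() is hypothetical and
not part of the patch.

```c
#include <stdbool.h>

/* Base class code for bridge devices, named as in the kernel's pci_ids.h. */
#define PCI_BASE_CLASS_BRIDGE 0x06

/*
 * Hypothetical helper: class_code is the 24-bit value read from the
 * sysfs "class" file (base class, sub-class, programming interface).
 */
static bool pci_class_is_bridge(unsigned class_code)
{
    return ((class_code >> 16) & 0xff) == PCI_BASE_CLASS_BRIDGE;
}
```

With this, 0x060400 (a PCI-to-PCI bridge) matches, while 0x0c0310 (the
USB class the patch special-cases) does not.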

>
>
>
>>> +        if ((phb->scan < 2) && ((deviceclass >> 16) == PCI_BASE_CLASS_BRIDGE)) {
>>> +            continue;
>>> +        }
>>> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
>>> +            /* Tweak USB */
>>> +            phb->force_addr = 1;
>>> +            phb->enable_multifunction = 1;
>>> +        }
>>> +
>>> +        printf("Creating device %X:%X:%X.%x class=0x%X\n",
>>> +               domainid, busid, devid, fnid, deviceclass);
>>
>> Lower case hex, please.
>>
>>> +
>>> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
>>> +        if (!dev) {
>>> +            fprintf(stderr, "failed to create vfio-pci\n");
>>> +            continue;
>>> +        }
>>> +        qdev_prop_parse(dev, "host", entry->d_name);
>>> +        if (phb->force_addr) {
>>> +            sprintf(addr, "%X.%X", devid, fnid);
>>
>> snprintf, lower case hex.
>>
>>> +            qdev_prop_parse(dev, "addr", addr);
>>> +        }
>>> +        if (phb->enable_multifunction) {
>>> +            qdev_prop_set_bit(dev, "multifunction", 1);
>>> +        }
>>> +        qdev_init_nofail(dev);
>>> +    }
>>> +    closedir(dirp);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>  static int spapr_phb_init(SysBusDevice *s)
>>>  {
>>>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
>>> @@ -567,15 +652,13 @@ static int spapr_phb_init(SysBusDevice *s)
>>>
>>>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>>>                             phb->busname ? phb->busname : phb->dtbusname,
>>> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
>>> +                           pci_spapr_set_irq, pci_spapr_get_irq,
>>> +                           pci_spapr_map_irq, phb,
>>>                             &phb->memspace, &phb->iospace,
>>>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>>>      phb->host_state.bus = bus;
>>>
>>>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
>>> -    phb->dma_window_start = 0;
>>> -    phb->dma_window_size = 0x40000000;
>>> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
>>>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
>>>
>>>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
>>> @@ -588,6 +671,24 @@ static int spapr_phb_init(SysBusDevice *s)
>>>          }
>>>      }
>>>
>>> +    if (phb->iommugroupid >= 0) {
>>> +        if (0 > spapr_pci_scan_vfio(phb)) {
>>
>> Order.
>>
>>> +            return -1;
>>> +        }
>>> +        if (!phb->iommugroup || !phb->iommugroup->container) {
>>> +            return -1;
>>> +        }
>>> +        spapr_vfio_init_dma(phb->iommugroup->container->fd, phb->dma_liobn,
>>> +                            &phb->dma_window_start,
>>> +                            &phb->dma_window_size);
>>> +        return 0;
>>> +    }
>>> +
>>> +    phb->dma_window_start = 0;
>>> +    phb->dma_window_size = 0x40000000;
>>> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
>>> +                                         phb->dma_window_size);
>>> +
>>>      return 0;
>>>  }
>>>
>>> @@ -599,6 +700,10 @@ static Property spapr_phb_properties[] = {
>>>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>>>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>>>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
>>> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
>>> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1), /* 0 don't 1 +devices 2 +buses */
>>> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
>>> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
>>>      DEFINE_PROP_END_OF_LIST(),
>>>  };
>>>
>>> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
>>> index 145071c..1953a74 100644
>>> --- a/hw/spapr_pci.h
>>> +++ b/hw/spapr_pci.h
>>> @@ -57,6 +57,11 @@ typedef struct sPAPRPHBState {
>>>          int nvec;
>>>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>>>
>>> +    int32_t iommugroupid;
>>> +    struct VFIOGroup *iommugroup;
>>> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
>>> +    uint8_t enable_multifunction, force_addr;
>>
>> Use bool for both?
>
>
>
> There is no DEFINE_PROP_BOOL. There is DEFINE_PROP_BIT but it works with bits, not bools.
> uint8_t is the closest one.
>
>
>
>>> +
>>>      QLIST_ENTRY(sPAPRPHBState) list;
>>>  } sPAPRPHBState;
>>>
>>> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
>>> index 1ac287f..73681fb 100644
>>> --- a/hw/vfio_pci.c
>>> +++ b/hw/vfio_pci.c
>>> @@ -21,7 +21,6 @@
>>>  #include <dirent.h>
>>>  #include <stdio.h>
>>>  #include <unistd.h>
>>> -#include <sys/io.h>
>>>  #include <sys/ioctl.h>
>>>  #include <sys/mman.h>
>>>  #include <sys/types.h>
>>> @@ -43,6 +42,12 @@
>>>  #include "range.h"
>>>  #include "vfio_pci.h"
>>>  #include "linux-vfio.h"
>>> +#ifndef TARGET_PPC64
>>> +#include <sys/io.h>
>>> +#else
>>> +#include "hw/pci_internals.h"
>>> +#include "hw/spapr.h"
>>> +#endif
>>>
>>>  //#define DEBUG_VFIO
>>>  #ifdef DEBUG_VFIO
>>> @@ -1581,6 +1586,25 @@ static int vfio_connect_container(VFIOGroup *group)
>>>
>>>          memory_listener_register(&container->listener, get_system_memory());
>>>
>>> +#define POWERPC_IOMMU           2
>>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, POWERPC_IOMMU)) {
>>> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>> +        if (ret) {
>>> +            error_report("vfio: failed to set group container: %s\n",
>>> +                         strerror(errno));
>>> +            g_free(container);
>>> +            close(fd);
>>> +            return -1;
>>> +        }
>>> +
>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, POWERPC_IOMMU);
>>> +        if (ret) {
>>> +            error_report("vfio: failed to set iommu for container: %s\n",
>>> +                         strerror(errno));
>>> +            g_free(container);
>>> +            close(fd);
>>> +            return -1;
>>> +        }
>>>      } else {
>>>          error_report("vfio: No available IOMMU models\n");
>>>          g_free(container);
>>> @@ -1620,7 +1644,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>>      }
>>>  }
>>>
>>> -static VFIOGroup *vfio_get_group(int groupid)
>>> +VFIOGroup *vfio_get_group(int groupid)
>>>  {
>>>      VFIOGroup *group;
>>>      char path[32];
>>> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
>>> index 226607c..d63dd63 100644
>>> --- a/hw/vfio_pci.h
>>> +++ b/hw/vfio_pci.h
>>> @@ -105,4 +105,6 @@ typedef struct VFIOGroup {
>>>  #define VFIO_FLAG_IOMMU_SHARED_BIT 0
>>>  #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
>>>
>>> +VFIOGroup *vfio_get_group(int groupid);
>>> +
>>>  #endif /* __VFIO_H__ */
>>> --
>>> 1.7.10
>>>
>>>
>
>
> --
> Alexey
>
>


* Re: [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3)
  2012-07-13  7:26 ` [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3) Alexey Kardashevskiy
@ 2012-07-13 14:38   ` Blue Swirl
  2012-07-13 15:07   ` Alex Williamson
  1 sibling, 0 replies; 52+ messages in thread
From: Blue Swirl @ 2012-07-13 14:38 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alexander Graf, Alex Williamson, qemu-ppc, David Gibson

On Fri, Jul 13, 2012 at 7:26 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> It literally does the following:
>
> 1. POWERPC IOMMU support (the kernel counterpart is required)
>
> 2. The patch assumes that IOAPIC calls are going to be replaced
> with something generic.
>
> 3. vfio_group_iommu_ioctl() has been added to let the sPAPR IOMMU
> handler call the VFIO IOMMU driver.
>
> 4. Change sPAPR PHB to scan the PCI bus which is used for
> the IOMMU-VFIO group. Now it is enough to add the following to
> the QEMU command line to get VFIO up with all the devices from
> IOMMU group with id=3:
> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/Makefile.objs  |    3 ++
>  hw/spapr.h            |    4 ++
>  hw/spapr_iommu.c      |   69 ++++++++++++++++++++++++++++++-
>  hw/spapr_iommu_vfio.h |   49 ++++++++++++++++++++++
>  hw/spapr_pci.c        |  108 ++++++++++++++++++++++++++++++++++++++++++++++---
>  hw/spapr_pci.h        |    4 ++
>  hw/vfio_pci.c         |   30 ++++++++++++++
>  hw/vfio_pci.h         |    2 +
>  trace-events          |    1 +
>  9 files changed, 264 insertions(+), 6 deletions(-)
>  create mode 100644 hw/spapr_iommu_vfio.h
>
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index f573a95..c46a049 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>  # Xilinx PPC peripherals
>  obj-y += xilinx_ethlite.o
>
> +# VFIO PCI device assignment
> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
> +
>  obj-y := $(addprefix ../,$(obj-y))
> diff --git a/hw/spapr.h b/hw/spapr.h
> index b37f337..26e26f6 100644
> --- a/hw/spapr.h
> +++ b/hw/spapr.h
> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>                        DMAContext *dma);
>
> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> +                         uint64_t *dma32_window_start,
> +                         uint64_t *dma32_window_size);
> +
>  #endif /* !defined (__HW_SPAPR_H__) */
> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> index 50c288d..e48ced1 100644
> --- a/hw/spapr_iommu.c
> +++ b/hw/spapr_iommu.c
> @@ -23,6 +23,8 @@
>  #include "dma.h"
>
>  #include "hw/spapr.h"
> +#include "hw/spapr_iommu_vfio.h"
> +#include "hw/vfio_pci.h"
>
>  #include <libfdt.h>
>
> @@ -183,6 +185,67 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>      return 0;
>  }
>
> +typedef struct sPAPRVFIOTable {
> +    int group_id;
> +    uint32_t liobn;
> +    QLIST_ENTRY(sPAPRVFIOTable) list;
> +} sPAPRVFIOTable;
> +
> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
> +
> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> +                         uint64_t *dma32_window_start,
> +                         uint64_t *dma32_window_size)
> +{
> +    sPAPRVFIOTable *t;
> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
> +
> +    if (vfio_group_iommu_ioctl(group_id, SPAPR_TCE_IOMMU_GET_INFO, &info)) {
> +        perror("SPAPR_TCE_IOMMU_GET_INFO failed");
> +        return;
> +    }
> +    *dma32_window_start = info.dma32_window_start;
> +    *dma32_window_size = info.dma32_window_size;
> +
> +    t = g_malloc0(sizeof(*t));

It looks like you initialize all fields, so plain g_malloc() can be used.
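
A minimal sketch of the point, using plain malloc() as a stand-in for
g_malloc() and a hypothetical singly linked struct in place of
sPAPRVFIOTable and QLIST: every field is assigned before the entry
becomes reachable, so zeroing it first buys nothing.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for sPAPRVFIOTable; "next" replaces QLIST_ENTRY. */
typedef struct TableSketch {
    int group_id;
    uint32_t liobn;
    struct TableSketch *next;
} TableSketch;

/*
 * Allocate an entry and push it on the list at *head.
 * malloc() stands in for g_malloc(); unlike g_malloc() it can return
 * NULL, so the sketch checks for that.
 */
static TableSketch *table_sketch_add(TableSketch **head, int group_id,
                                     uint32_t liobn)
{
    TableSketch *t = malloc(sizeof(*t));

    if (!t) {
        return NULL;
    }
    /* All three fields are written here, so no prior zeroing is needed. */
    t->group_id = group_id;
    t->liobn = liobn;
    t->next = *head;
    *head = t;
    return t;
}
```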

> +    t->group_id = group_id;
> +    t->liobn = liobn;
> +
> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
> +}
> +
> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
> +{
> +    sPAPRVFIOTable *t;
> +    struct tce_iommu_dma_map map = {
> +        .argsz = sizeof(map),
> +        .va = 0,
> +        .dmaaddr = ioba,
> +    };
> +
> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
> +        if (t->liobn != liobn) {
> +            continue;
> +        }
> +        if (tce) {
> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_MAP_DMA,
> +                                       &map)) {
> +                perror("TCE_MAP_DMA");
> +                return H_PARAMETER;
> +            }
> +        } else {
> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_UNMAP_DMA,
> +                                       &map)) {
> +                perror("TCE_UNMAP_DMA");
> +                return H_PARAMETER;
> +            }
> +        }
> +        return H_SUCCESS;
> +    }
> +    return H_CONTINUE; /* positive non-zero value */
> +}
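
The return convention shared by put_tce_emu() and put_tce_vfio()
(negative: error, zero: handled, positive: not my LIOBN) can be sketched
as a chain of handlers. Everything named *_sketch below is hypothetical,
the SK_* values are stand-ins for the H_* codes, and what happens when
no handler claims the LIOBN is an assumption of the sketch, not
something the patch shows.

```c
#include <stddef.h>

/* Stand-in values; the real H_* codes live in hw/spapr.h. */
enum {
    SK_SUCCESS = 0,     /* handled */
    SK_PARAMETER = -4,  /* handled, but failed */
    SK_CONTINUE = 18    /* positive: "not my LIOBN, try the next handler" */
};

typedef int (*put_tce_sketch_fn)(unsigned liobn);

/* Hypothetical handler that owns LIOBN 1 only. */
static int put_tce_sketch_emu(unsigned liobn)
{
    return liobn == 1 ? SK_SUCCESS : SK_CONTINUE;
}

/* Hypothetical handler that owns LIOBN 2 and always fails on it. */
static int put_tce_sketch_vfio(unsigned liobn)
{
    return liobn == 2 ? SK_PARAMETER : SK_CONTINUE;
}

/* Try each handler in turn; stop at the first one that claims the LIOBN. */
static int put_tce_sketch_dispatch(const put_tce_sketch_fn *handlers,
                                   size_t n, unsigned liobn)
{
    for (size_t i = 0; i < n; i++) {
        int ret = handlers[i](liobn);

        if (ret <= 0) {
            return ret ? SK_PARAMETER : SK_SUCCESS;
        }
    }
    /* Assumption: an unclaimed LIOBN is reported as a parameter error. */
    return SK_PARAMETER;
}
```
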
> +
>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>                                target_ulong opcode, target_ulong *args)
>  {
> @@ -200,7 +263,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>      ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
>
>      ret = put_tce_emu(liobn, ioba, tce);
> -    if (0 >= ret) {
> +    if (ret <= 0) {
> +        return ret ? H_PARAMETER : H_SUCCESS;
> +    }
> +    ret = put_tce_vfio(liobn, ioba, tce);
> +    if (ret <= 0) {
>          return ret ? H_PARAMETER : H_SUCCESS;
>      }
>  #ifdef DEBUG_TCE
> diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
> new file mode 100644
> index 0000000..711e3e4
> --- /dev/null
> +++ b/hw/spapr_iommu_vfio.h
> @@ -0,0 +1,49 @@
> +/*
> + * Definitions for VFIO IOMMU driver implementing SPAPR TCE.
> + * This is the copy of the kernel header.

This file should go somewhere in the linux-headers directory, to match the kernel.

> + *
> + * Copyright (c) 2012 Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
> +#define __HW_SPAPR_IOMMU_VFIO_H__
> +
> +#include "hw/linux-vfio.h"
> +
> +#define SPAPR_TCE_IOMMU         2
> +
> +struct tce_iommu_info {
> +    __u32 argsz;
> +    __u32 flags;
> +    __u32 dma32_window_start;
> +    __u32 dma32_window_size;
> +    __u64 dma64_window_start;
> +    __u64 dma64_window_size;
> +};
> +
> +#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
> +
> +struct tce_iommu_dma_map {
> +    __u32 argsz;
> +    __u32 flags;
> +    __u64 va;
> +    __u64 dmaaddr;
> +};
> +
> +#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
> +#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
> +
> +#endif
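
With the flags member added, the two 32-bit fields fill the first eight
bytes, so the 64-bit fields start naturally aligned and the layout no
longer depends on compiler padding. A small sketch that checks this at
compile time, assuming a C11 compiler; the _sketch struct name is
hypothetical and uses uint32_t/uint64_t in place of __u32/__u64:

```c
#include <stddef.h>
#include <stdint.h>

/* Same field order as tce_iommu_dma_map, with fixed-width standard types. */
struct tce_iommu_dma_map_sketch {
    uint32_t argsz;
    uint32_t flags;
    uint64_t va;
    uint64_t dmaaddr;
};

/* Compile-time checks (C11): no hole before va, no padding at the end. */
_Static_assert(offsetof(struct tce_iommu_dma_map_sketch, va) == 8,
               "hole before va");
_Static_assert(sizeof(struct tce_iommu_dma_map_sketch) == 24,
               "unexpected padding");
```
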
> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> index 014297b..836ec4f 100644
> --- a/hw/spapr_pci.c
> +++ b/hw/spapr_pci.c
> @@ -22,6 +22,9 @@
>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>   * THE SOFTWARE.
>   */
> +#include <sys/types.h>
> +#include <dirent.h>
> +
>  #include "hw.h"
>  #include "pci.h"
>  #include "msi.h"
> @@ -32,7 +35,6 @@
>  #include "exec-memory.h"
>  #include <libfdt.h>
>  #include "trace.h"
> -
>  #include "hw/pci_internals.h"
>
>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
> @@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>                   level);
>  }
>
> +static int pci_spapr_get_irq(void *opaque, int irq_num)
> +{
> +    sPAPRPHBState *phb = opaque;
> +    return phb->lsi_table[irq_num].dt_irq;
> +}
> +
>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>                                unsigned size)
>  {
> @@ -515,6 +523,79 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>      return phb->dma;
>  }
>
> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
> +{
> +    char iommupath[256];
> +    DIR *dirp;
> +    struct dirent *entry;
> +
> +    if (!phb->scan) {
> +        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
> +        return 0;
> +    }
> +
> +    snprintf(iommupath, sizeof(iommupath),
> +             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
> +    dirp = opendir(iommupath);
> +
> +    while ((entry = readdir(dirp)) != NULL) {
> +        char *tmp;
> +        FILE *deviceclassfile;
> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> +        char addr[32];
> +        DeviceState *dev;
> +
> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
> +                   &domainid, &busid, &devid, &fnid) != 4) {
> +            continue;
> +        }
> +
> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
> +        trace_spapr_pci("Reading device class from ", tmp);
> +
> +        deviceclassfile = fopen(tmp, "r");
> +        if (deviceclassfile) {
> +            fscanf(deviceclassfile, "%x", &deviceclass);
> +            fclose(deviceclassfile);
> +        }
> +        g_free(tmp);
> +
> +        if (!deviceclass) {
> +            continue;
> +        }
> +        if ((phb->scan < 2) &&
> +            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
> +            /* Skip _any_ bridge */
> +            continue;
> +        }
> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
> +            /* Tweak USB */
> +            phb->force_addr = 1;
> +            phb->enable_multifunction = 1;
> +        }
> +
> +        trace_spapr_pci("Creating device from ", entry->d_name);
> +
> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
> +        if (!dev) {
> +            fprintf(stderr, "failed to create vfio-pci\n");
> +            continue;
> +        }
> +        qdev_prop_parse(dev, "host", entry->d_name);
> +        if (phb->force_addr) {
> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
> +            qdev_prop_parse(dev, "addr", addr);
> +        }
> +        if (phb->enable_multifunction) {
> +            qdev_prop_set_bit(dev, "multifunction", 1);
> +        }
> +        qdev_init_nofail(dev);
> +    }
> +    closedir(dirp);
> +
> +    return 0;
> +}
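
opendir() returns NULL when the group directory does not exist, and
readdir() on a NULL stream would then crash. A minimal sketch of the
scan loop with that check, in plain POSIX C; the helper name is
hypothetical and it only counts matching entries instead of creating
devices:

```c
#include <dirent.h>
#include <stdio.h>

/*
 * Hypothetical helper: count entries in dirpath that look like PCI
 * addresses (domain:bus:dev.fn). Returns -1 if the directory cannot
 * be opened, instead of passing NULL to readdir().
 */
static int count_pci_entries_sketch(const char *dirpath)
{
    DIR *dirp = opendir(dirpath);
    struct dirent *entry;
    int count = 0;

    if (!dirp) {
        perror(dirpath);
        return -1;
    }
    while ((entry = readdir(dirp)) != NULL) {
        unsigned domainid, busid, devid, fnid;

        /* Skip ".", ".." and anything else that is not a PCI address. */
        if (sscanf(entry->d_name, "%x:%x:%x.%x",
                   &domainid, &busid, &devid, &fnid) == 4) {
            count++;
        }
    }
    closedir(dirp);
    return count;
}
```
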
> +
>  static int spapr_phb_init(SysBusDevice *s)
>  {
>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
> @@ -567,15 +648,13 @@ static int spapr_phb_init(SysBusDevice *s)
>
>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>                             phb->busname ? phb->busname : phb->dtbusname,
> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
> +                           pci_spapr_set_irq, pci_spapr_get_irq,
> +                           pci_spapr_map_irq, phb,
>                             &phb->memspace, &phb->iospace,
>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>      phb->host_state.bus = bus;
>
>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
> -    phb->dma_window_start = 0;
> -    phb->dma_window_size = 0x40000000;
> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
>
>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
> @@ -588,6 +667,21 @@ static int spapr_phb_init(SysBusDevice *s)
>          }
>      }
>
> +    if (phb->iommugroupid >= 0) {
> +        if (spapr_pci_scan_vfio(phb) < 0) {
> +            return -1;
> +        }
> +        spapr_vfio_init_dma(phb->iommugroupid, phb->dma_liobn,
> +                            &phb->dma_window_start,
> +                            &phb->dma_window_size);
> +        return 0;
> +    }
> +
> +    phb->dma_window_start = 0;
> +    phb->dma_window_size = 0x40000000;
> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
> +                                         phb->dma_window_size);
> +
>      return 0;
>  }
>
> @@ -599,6 +693,10 @@ static Property spapr_phb_properties[] = {
>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>
> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
> index 145071c..f514823 100644
> --- a/hw/spapr_pci.h
> +++ b/hw/spapr_pci.h
> @@ -57,6 +57,10 @@ typedef struct sPAPRPHBState {
>          int nvec;
>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>
> +    int32_t iommugroupid;
> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
> +    uint8_t enable_multifunction, force_addr;
> +
>      QLIST_ENTRY(sPAPRPHBState) list;
>  } sPAPRPHBState;
>
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> index 1ac287f..fc84fb4 100644
> --- a/hw/vfio_pci.c
> +++ b/hw/vfio_pci.c
> @@ -1581,6 +1581,24 @@ static int vfio_connect_container(VFIOGroup *group)
>
>          memory_listener_register(&container->listener, get_system_memory());
>
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> +        if (ret) {
> +            error_report("vfio: failed to set group container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
> +
> +        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
> +        if (ret) {
> +            error_report("vfio: failed to set iommu for container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models\n");
>          g_free(container);
> @@ -2005,3 +2023,15 @@ static void register_vfio_pci_dev_type(void)
>  }
>
>  type_init(register_vfio_pci_dev_type)
> +
> +int vfio_group_iommu_ioctl(int iommu_group, int request, void *data)
> +{
> +    VFIOGroup *group;
> +
> +    group = vfio_get_group(iommu_group);
> +    if (!group->container) {
> +        return -EINVAL;
> +    }
> +
> +    return ioctl(group->container->fd, request, data);
> +}
> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
> index 226607c..f44ff07 100644
> --- a/hw/vfio_pci.h
> +++ b/hw/vfio_pci.h
> @@ -105,4 +105,6 @@ typedef struct VFIOGroup {
>  #define VFIO_FLAG_IOMMU_SHARED_BIT 0
>  #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
>
> +int vfio_group_iommu_ioctl(int iommu_group, int request, void *data);
> +
>  #endif /* __VFIO_H__ */
> diff --git a/trace-events b/trace-events
> index e548f86..9100591 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -848,6 +848,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
>  qxl_render_update_area_done(void *cookie) "%p"
>
>  # hw/spapr_pci.c
> +spapr_pci(const char *msg1, const char *msg2) "%s%s"
>  spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
>  spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
>  spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"
> --
> 1.7.10
>


* Re: [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3)
  2012-07-13  7:26 ` [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3) Alexey Kardashevskiy
  2012-07-13 14:38   ` Blue Swirl
@ 2012-07-13 15:07   ` Alex Williamson
  2012-07-14  2:34     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2012-07-13 15:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alexander Graf, Blue Swirl, qemu-ppc, David Gibson

On Fri, 2012-07-13 at 17:26 +1000, Alexey Kardashevskiy wrote:
> It literally does the following:
> 
> 1. POWERPC IOMMU support (the kernel counterpart is required)
> 
> 2. The patch assumes that IOAPIC calls are going to be replaced
> with something generic.
> 
> 3. vfio_group_iommu_ioctl() has been added to let sPAPR IOMMU
> handler to call VFIO IOMMU driver.
> 
> 4. Change sPAPR PHB to scan the PCI bus which is used for
> the IOMMU-VFIO group. Now it is enough to add the following to
> the QEMU command line to get VFIO up with all the devices from
> IOMMU group with id=3:
> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/Makefile.objs  |    3 ++
>  hw/spapr.h            |    4 ++
>  hw/spapr_iommu.c      |   69 ++++++++++++++++++++++++++++++-
>  hw/spapr_iommu_vfio.h |   49 ++++++++++++++++++++++
>  hw/spapr_pci.c        |  108 ++++++++++++++++++++++++++++++++++++++++++++++---
>  hw/spapr_pci.h        |    4 ++
>  hw/vfio_pci.c         |   30 ++++++++++++++
>  hw/vfio_pci.h         |    2 +
>  trace-events          |    1 +
>  9 files changed, 264 insertions(+), 6 deletions(-)
>  create mode 100644 hw/spapr_iommu_vfio.h
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index f573a95..c46a049 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>  # Xilinx PPC peripherals
>  obj-y += xilinx_ethlite.o
>  
> +# VFIO PCI device assignment
> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
> +
>  obj-y := $(addprefix ../,$(obj-y))
> diff --git a/hw/spapr.h b/hw/spapr.h
> index b37f337..26e26f6 100644
> --- a/hw/spapr.h
> +++ b/hw/spapr.h
> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>                        DMAContext *dma);
>  
> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> +                         uint64_t *dma32_window_start,
> +                         uint64_t *dma32_window_size);
> +
>  #endif /* !defined (__HW_SPAPR_H__) */
> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> index 50c288d..e48ced1 100644
> --- a/hw/spapr_iommu.c
> +++ b/hw/spapr_iommu.c
> @@ -23,6 +23,8 @@
>  #include "dma.h"
>  
>  #include "hw/spapr.h"
> +#include "hw/spapr_iommu_vfio.h"
> +#include "hw/vfio_pci.h"
>  
>  #include <libfdt.h>
>  
> @@ -183,6 +185,67 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>      return 0;
>  }
>  
> +typedef struct sPAPRVFIOTable {
> +    int group_id;
> +    uint32_t liobn;
> +    QLIST_ENTRY(sPAPRVFIOTable) list;
> +} sPAPRVFIOTable;
> +
> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
> +
> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> +                         uint64_t *dma32_window_start,
> +                         uint64_t *dma32_window_size)
> +{
> +    sPAPRVFIOTable *t;
> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
> +
> +    if (vfio_group_iommu_ioctl(group_id, SPAPR_TCE_IOMMU_GET_INFO, &info)) {
> +        perror("SPAPR_TCE_IOMMU_GET_INFO failed");
> +        return;
> +    }
> +    *dma32_window_start = info.dma32_window_start;
> +    *dma32_window_size = info.dma32_window_size;
> +
> +    t = g_malloc0(sizeof(*t));
> +    t->group_id = group_id;
> +    t->liobn = liobn;
> +
> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
> +}
> +
> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
> +{
> +    sPAPRVFIOTable *t;
> +    struct tce_iommu_dma_map map = {
> +        .argsz = sizeof(map),
> +        .va = 0,
> +        .dmaaddr = ioba,
> +    };
> +
> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
> +        if (t->liobn != liobn) {
> +            continue;
> +        }
> +        if (tce) {
> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_MAP_DMA,
> +                                       &map)) {
> +                perror("TCE_MAP_DMA");
> +                return H_PARAMETER;
> +            }
> +        } else {
> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_UNMAP_DMA,
> +                                       &map)) {
> +                perror("TCE_UNMAP_DMA");
> +                return H_PARAMETER;
> +            }
> +        }
> +        return H_SUCCESS;
> +    }
> +    return H_CONTINUE; /* positive non-zero value */
> +}
> +
>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>                                target_ulong opcode, target_ulong *args)
>  {
> @@ -200,7 +263,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>      ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
>  
>      ret = put_tce_emu(liobn, ioba, tce);
> -    if (0 >= ret) {
> +    if (ret <= 0) {
> +        return ret ? H_PARAMETER : H_SUCCESS;
> +    }
> +    ret = put_tce_vfio(liobn, ioba, tce);
> +    if (ret <= 0) {
>          return ret ? H_PARAMETER : H_SUCCESS;
>      }
>  #ifdef DEBUG_TCE
> diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
> new file mode 100644
> index 0000000..711e3e4
> --- /dev/null
> +++ b/hw/spapr_iommu_vfio.h
> @@ -0,0 +1,49 @@
> +/*
> + * Definitions for VFIO IOMMU driver implementing SPAPR TCE.
> + * This is the copy of the kernel header.
> + *
> + * Copyright (c) 2012 Alexey Kardashevskiy <aik@olabs.ru>
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
> +#define __HW_SPAPR_IOMMU_VFIO_H__
> +
> +#include "hw/linux-vfio.h"
> +
> +#define SPAPR_TCE_IOMMU         2
> +
> +struct tce_iommu_info {
> +    __u32 argsz;
> +    __u32 flags;
> +    __u32 dma32_window_start;
> +    __u32 dma32_window_size;
> +    __u64 dma64_window_start;
> +    __u64 dma64_window_size;
> +};
> +
> +#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
> +
> +struct tce_iommu_dma_map {
> +    __u32 argsz;
> +    __u32 flags;
> +    __u64 va;
> +    __u64 dmaaddr;
> +};
> +
> +#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
> +#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
> +
> +#endif
> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> index 014297b..836ec4f 100644
> --- a/hw/spapr_pci.c
> +++ b/hw/spapr_pci.c
> @@ -22,6 +22,9 @@
>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>   * THE SOFTWARE.
>   */
> +#include <sys/types.h>
> +#include <dirent.h>
> +
>  #include "hw.h"
>  #include "pci.h"
>  #include "msi.h"
> @@ -32,7 +35,6 @@
>  #include "exec-memory.h"
>  #include <libfdt.h>
>  #include "trace.h"
> -
>  #include "hw/pci_internals.h"
>  
>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
> @@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>                   level);
>  }
>  
> +static int pci_spapr_get_irq(void *opaque, int irq_num)
> +{
> +    sPAPRPHBState *phb = opaque;
> +    return phb->lsi_table[irq_num].dt_irq;
> +}
> +
>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>                                unsigned size)
>  {
> @@ -515,6 +523,79 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>      return phb->dma;
>  }
>  
> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
> +{
> +    char iommupath[256];
> +    DIR *dirp;
> +    struct dirent *entry;
> +
> +    if (!phb->scan) {
> +        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
> +        return 0;
> +    }
> +
> +    snprintf(iommupath, sizeof(iommupath),
> +             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
> +    dirp = opendir(iommupath);
> +
> +    while ((entry = readdir(dirp)) != NULL) {
> +        char *tmp;
> +        FILE *deviceclassfile;
> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> +        char addr[32];
> +        DeviceState *dev;
> +
> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
> +                   &domainid, &busid, &devid, &fnid) != 4) {
> +            continue;
> +        }
> +
> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
> +        trace_spapr_pci("Reading device class from ", tmp);
> +
> +        deviceclassfile = fopen(tmp, "r");
> +        if (deviceclassfile) {
> +            fscanf(deviceclassfile, "%x", &deviceclass);
> +            fclose(deviceclassfile);
> +        }
> +        g_free(tmp);
> +
> +        if (!deviceclass) {
> +            continue;
> +        }
> +        if ((phb->scan < 2) &&
> +            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
> +            /* Skip _any_ bridge */
> +            continue;
> +        }
> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
> +            /* Tweak USB */
> +            phb->force_addr = 1;
> +            phb->enable_multifunction = 1;
> +        }
> +
> +        trace_spapr_pci("Creating devicei from ", entry->d_name);
> +
> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
> +        if (!dev) {
> +            fprintf(stderr, "failed to create vfio-pci\n");
> +            continue;
> +        }
> +        qdev_prop_parse(dev, "host", entry->d_name);
> +        if (phb->force_addr) {
> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
> +            qdev_prop_parse(dev, "addr", addr);
> +        }
> +        if (phb->enable_multifunction) {
> +            qdev_prop_set_bit(dev, "multifunction", 1);
> +        }
> +        qdev_init_nofail(dev);
> +    }
> +    closedir(dirp);
> +
> +    return 0;
> +}
> +
>  static int spapr_phb_init(SysBusDevice *s)
>  {
>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
> @@ -567,15 +648,13 @@ static int spapr_phb_init(SysBusDevice *s)
>  
>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>                             phb->busname ? phb->busname : phb->dtbusname,
> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
> +                           pci_spapr_set_irq, pci_spapr_get_irq,
> +                           pci_spapr_map_irq, phb,
>                             &phb->memspace, &phb->iospace,
>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>      phb->host_state.bus = bus;
>  
>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
> -    phb->dma_window_start = 0;
> -    phb->dma_window_size = 0x40000000;
> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
>  
>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
> @@ -588,6 +667,21 @@ static int spapr_phb_init(SysBusDevice *s)
>          }
>      }
>  
> +    if (phb->iommugroupid >= 0) {
> +        if (spapr_pci_scan_vfio(phb) < 0) {
> +            return -1;
> +        }
> +        spapr_vfio_init_dma(phb->iommugroupid, phb->dma_liobn,
> +                            &phb->dma_window_start,
> +                            &phb->dma_window_size);
> +        return 0;
> +    }
> +
> +    phb->dma_window_start = 0;
> +    phb->dma_window_size = 0x40000000;
> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
> +                                         phb->dma_window_size);
> +
>      return 0;
>  }
>  
> @@ -599,6 +693,10 @@ static Property spapr_phb_properties[] = {
>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
> index 145071c..f514823 100644
> --- a/hw/spapr_pci.h
> +++ b/hw/spapr_pci.h
> @@ -57,6 +57,10 @@ typedef struct sPAPRPHBState {
>          int nvec;
>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>  
> +    int32_t iommugroupid;
> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
> +    uint8_t enable_multifunction, force_addr;
> +
>      QLIST_ENTRY(sPAPRPHBState) list;
>  } sPAPRPHBState;
>  
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> index 1ac287f..fc84fb4 100644
> --- a/hw/vfio_pci.c
> +++ b/hw/vfio_pci.c
> @@ -1581,6 +1581,24 @@ static int vfio_connect_container(VFIOGroup *group)
>  
>          memory_listener_register(&container->listener, get_system_memory());
>  
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> +        if (ret) {
> +            error_report("vfio: failed to set group container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
> +
> +        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
> +        if (ret) {
> +            error_report("vfio: failed to set iommu for container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }

I think we can still do better.  The x86 code sets up a MemoryListener
here with data for that embedded into the VFIOContainer.  You don't
have, need, or want a MemoryListener, but that doesn't mean we can't
follow the model of registering that this group exists here and setting
up map/unmap callbacks.

For instance:

in vfio_pci.h:
struct sPAPRVFIOData {
    uint64_t dma32_window_start;
    uint64_t dma32_window_size;
    ....
    int (*map)(struct tce_iommu_dma_map *);
    int (*unmap)(struct tce_iommu_dma_map *);
};

appended to the spapr tce iommu setup above:

struct tce_iommu_info info;

/* the MemoryListener embedded in container becomes a union to hold
 * iommu specific data. */
container->u.spapr.data->map = vfio_spapr_tce_map;
container->u.spapr.data->unmap = vfio_spapr_tce_unmap;

ioctl(fd, SPAPR_TCE_IOMMU_GET_INFO, &info);

container->u.spapr.data->dma32_window_start = info.dma32_window_start;
container->u.spapr.data->dma32_window_size = info.dma32_window_size;

spapr_register_vfio_container(&container->u.spapr.data);

Then vfio_disconnect_container() could call
spapr_unregister_vfio_container().  Maybe the container contains a
function pointer to an uninit function so we don't have to ifdef between
x86 and power.  Does that make sense?  Thanks,

Alex
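
A minimal, self-contained sketch of the registration model described
above, under stated assumptions. The VFIOContainer here is a simplified
stand-in rather than the real QEMU structure, the spapr data is embedded
in the union instead of being reached through a pointer, the "uninit"
hook, vfio_connect_spapr() and vfio_spapr_uninit() are hypothetical
names, and the map/unmap callbacks only count calls instead of issuing
the SPAPR_TCE_IOMMU_MAP_DMA/UNMAP_DMA ioctls:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct tce_iommu_dma_map from hw/spapr_iommu_vfio.h. */
struct tce_iommu_dma_map {
    uint32_t argsz;
    uint32_t flags;
    uint64_t va;
    uint64_t dmaaddr;
};

/* Per-container sPAPR state, following the struct proposed in the reply. */
typedef struct sPAPRVFIOData {
    uint64_t dma32_window_start;
    uint64_t dma32_window_size;
    int (*map)(struct tce_iommu_dma_map *);
    int (*unmap)(struct tce_iommu_dma_map *);
} sPAPRVFIOData;

/* Simplified container: IOMMU-specific state lives in a union. */
typedef struct VFIOContainer {
    int fd;
    union {
        struct { int unused; } listener;    /* placeholder for the x86 MemoryListener */
        struct { sPAPRVFIOData data; } spapr;
    } u;
    void (*uninit)(struct VFIOContainer *); /* hypothetical teardown hook */
} VFIOContainer;

#define MAX_SPAPR_CONTAINERS 4
static sPAPRVFIOData *spapr_containers[MAX_SPAPR_CONTAINERS];
static int spapr_container_count;
static int map_calls, unmap_calls;

/* sPAPR side: remember that a VFIO container exists. Returns 0 or -1. */
static int spapr_register_vfio_container(sPAPRVFIOData *data)
{
    assert(data != NULL);
    for (size_t i = 0; i < MAX_SPAPR_CONTAINERS; i++) {
        if (!spapr_containers[i]) {
            spapr_containers[i] = data;
            spapr_container_count++;
            return 0;
        }
    }
    return -1; /* registry full */
}

/* sPAPR side: forget a previously registered container, if present. */
static void spapr_unregister_vfio_container(sPAPRVFIOData *data)
{
    for (size_t i = 0; i < MAX_SPAPR_CONTAINERS; i++) {
        if (spapr_containers[i] == data) {
            spapr_containers[i] = NULL;
            spapr_container_count--;
            return;
        }
    }
}

/* Stub callbacks: a real version would issue the map/unmap ioctls. */
static int vfio_spapr_tce_map(struct tce_iommu_dma_map *map)
{
    (void)map;
    map_calls++;
    return 0;
}

static int vfio_spapr_tce_unmap(struct tce_iommu_dma_map *map)
{
    (void)map;
    unmap_calls++;
    return 0;
}

/* Teardown hook installed for sPAPR containers. */
static void vfio_spapr_uninit(VFIOContainer *container)
{
    spapr_unregister_vfio_container(&container->u.spapr.data);
}

/* Hypothetical helper for the sPAPR branch of the container setup. */
static int vfio_connect_spapr(VFIOContainer *container,
                              uint64_t window_start, uint64_t window_size)
{
    sPAPRVFIOData *data = &container->u.spapr.data;

    data->map = vfio_spapr_tce_map;
    data->unmap = vfio_spapr_tce_unmap;
    data->dma32_window_start = window_start; /* would come from GET_INFO */
    data->dma32_window_size = window_size;
    container->uninit = vfio_spapr_uninit;
    return spapr_register_vfio_container(data);
}

/* Generic disconnect path: no ifdef, just call the hook if one is set. */
static void vfio_disconnect_container(VFIOContainer *container)
{
    if (container->uninit) {
        container->uninit(container);
        container->uninit = NULL;
    }
}
```

With this shape the generic disconnect path never needs to know which
IOMMU model is in use; an x86 container would simply install a hook that
unregisters its MemoryListener.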

>      } else {
>          error_report("vfio: No available IOMMU models\n");
>          g_free(container);
> @@ -2005,3 +2023,15 @@ static void register_vfio_pci_dev_type(void)
>  }
>  
>  type_init(register_vfio_pci_dev_type)
> +
> +int vfio_group_iommu_ioctl(int iommu_group, int request, void *data)
> +{
> +    VFIOGroup *group;
> +
> +    group = vfio_get_group(iommu_group);
> +    if (!group->container) {
> +        return -EINVAL;
> +    }
> +
> +    return ioctl(group->container->fd, request, data);
> +}
> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
> index 226607c..f44ff07 100644
> --- a/hw/vfio_pci.h
> +++ b/hw/vfio_pci.h
> @@ -105,4 +105,6 @@ typedef struct VFIOGroup {
>  #define VFIO_FLAG_IOMMU_SHARED_BIT 0
>  #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
>  
> +int vfio_group_iommu_ioctl(int iommu_group, int request, void *data);
> +
>  #endif /* __VFIO_H__ */
> diff --git a/trace-events b/trace-events
> index e548f86..9100591 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -848,6 +848,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
>  qxl_render_update_area_done(void *cookie) "%p"
>  
>  # hw/spapr_pci.c
> +spapr_pci(const char *msg1, const char *msg2) "%s%s"
>  spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
>  spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
>  spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"


* Re: [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3)
  2012-07-13 15:07   ` Alex Williamson
@ 2012-07-14  2:34     ` Alexey Kardashevskiy
  2012-07-16 14:21       ` Alex Williamson
  0 siblings, 1 reply; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-14  2:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Alexander Graf, Blue Swirl, qemu-ppc, David Gibson

On 14/07/12 01:07, Alex Williamson wrote:
> On Fri, 2012-07-13 at 17:26 +1000, Alexey Kardashevskiy wrote:
>> It literally does the following:
>>
>> 1. POWERPC IOMMU support (the kernel counterpart is required)
>>
>> 2. The patch assumes that IOAPIC calls are going to be replaced
>> with something generic.
>>
>> 3. vfio_group_iommu_ioctl() has been added to let sPAPR IOMMU
>> handler to call VFIO IOMMU driver.
>>
>> 4. Change sPAPR PHB to scan the PCI bus which is used for
>> the IOMMU-VFIO group. Now it is enough to add the following to
>> the QEMU command line to get VFIO up with all the devices from
>> IOMMU group with id=3:
>> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
>> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  hw/ppc/Makefile.objs  |    3 ++
>>  hw/spapr.h            |    4 ++
>>  hw/spapr_iommu.c      |   69 ++++++++++++++++++++++++++++++-
>>  hw/spapr_iommu_vfio.h |   49 ++++++++++++++++++++++
>>  hw/spapr_pci.c        |  108 ++++++++++++++++++++++++++++++++++++++++++++++---
>>  hw/spapr_pci.h        |    4 ++
>>  hw/vfio_pci.c         |   30 ++++++++++++++
>>  hw/vfio_pci.h         |    2 +
>>  trace-events          |    1 +
>>  9 files changed, 264 insertions(+), 6 deletions(-)
>>  create mode 100644 hw/spapr_iommu_vfio.h
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index f573a95..c46a049 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>>  # Xilinx PPC peripherals
>>  obj-y += xilinx_ethlite.o
>>  
>> +# VFIO PCI device assignment
>> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
>> +
>>  obj-y := $(addprefix ../,$(obj-y))
>> diff --git a/hw/spapr.h b/hw/spapr.h
>> index b37f337..26e26f6 100644
>> --- a/hw/spapr.h
>> +++ b/hw/spapr.h
>> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>                        DMAContext *dma);
>>  
>> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
>> +                         uint64_t *dma32_window_start,
>> +                         uint64_t *dma32_window_size);
>> +
>>  #endif /* !defined (__HW_SPAPR_H__) */
>> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
>> index 50c288d..e48ced1 100644
>> --- a/hw/spapr_iommu.c
>> +++ b/hw/spapr_iommu.c
>> @@ -23,6 +23,8 @@
>>  #include "dma.h"
>>  
>>  #include "hw/spapr.h"
>> +#include "hw/spapr_iommu_vfio.h"
>> +#include "hw/vfio_pci.h"
>>  
>>  #include <libfdt.h>
>>  
>> @@ -183,6 +185,67 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>>      return 0;
>>  }
>>  
>> +typedef struct sPAPRVFIOTable {
>> +    int group_id;
>> +    uint32_t liobn;
>> +    QLIST_ENTRY(sPAPRVFIOTable) list;
>> +} sPAPRVFIOTable;
>> +
>> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
>> +
>> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
>> +                         uint64_t *dma32_window_start,
>> +                         uint64_t *dma32_window_size)
>> +{
>> +    sPAPRVFIOTable *t;
>> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
>> +
>> +    if (vfio_group_iommu_ioctl(group_id, SPAPR_TCE_IOMMU_GET_INFO, &info)) {
>> +        perror("SPAPR_TCE_IOMMU_GET_INFO failed");
>> +        return;
>> +    }
>> +    *dma32_window_start = info.dma32_window_start;
>> +    *dma32_window_size = info.dma32_window_size;
>> +
>> +    t = g_malloc0(sizeof(*t));
>> +    t->group_id = group_id;
>> +    t->liobn = liobn;
>> +
>> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
>> +}
>> +
>> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
>> +{
>> +    sPAPRVFIOTable *t;
>> +    struct tce_iommu_dma_map map = {
>> +        .argsz = sizeof(map),
>> +        .va = 0,
>> +        .dmaaddr = ioba,
>> +    };
>> +
>> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
>> +        if (t->liobn != liobn) {
>> +            continue;
>> +        }
>> +        if (tce) {
>> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
>> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_MAP_DMA,
>> +                                       &map)) {
>> +                perror("TCE_MAP_DMA");
>> +                return H_PARAMETER;
>> +            }
>> +        } else {
>> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_UNMAP_DMA,
>> +                                       &map)) {
>> +                perror("TCE_UNMAP_DMA");
>> +                return H_PARAMETER;
>> +            }
>> +        }
>> +        return H_SUCCESS;
>> +    }
>> +    return H_CONTINUE; /* positive non-zero value */
>> +}
>> +
>>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>                                target_ulong opcode, target_ulong *args)
>>  {
>> @@ -200,7 +263,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>      ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
>>  
>>      ret = put_tce_emu(liobn, ioba, tce);
>> -    if (0 >= ret) {
>> +    if (ret <= 0) {
>> +        return ret ? H_PARAMETER : H_SUCCESS;
>> +    }
>> +    ret = put_tce_vfio(liobn, ioba, tce);
>> +    if (ret <= 0) {
>>          return ret ? H_PARAMETER : H_SUCCESS;
>>      }
>>  #ifdef DEBUG_TCE
>> diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
>> new file mode 100644
>> index 0000000..711e3e4
>> --- /dev/null
>> +++ b/hw/spapr_iommu_vfio.h
>> @@ -0,0 +1,49 @@
>> +/*
>> + * Definitions for VFIO IOMMU driver implementing SPAPR TCE.
>> + * This is the copy of the kernel header.
>> + *
>> + * Copyright (c) 2012 Alexey Kardashevskiy <aik@olabs.ru>
>> + *
>> + * This library is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU Lesser General Public
>> + * License as published by the Free Software Foundation; either
>> + * version 2 of the License, or (at your option) any later version.
>> + *
>> + * This library is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * Lesser General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU Lesser General Public
>> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
>> +#define __HW_SPAPR_IOMMU_VFIO_H__
>> +
>> +#include "hw/linux-vfio.h"
>> +
>> +#define SPAPR_TCE_IOMMU         2
>> +
>> +struct tce_iommu_info {
>> +    __u32 argsz;
>> +    __u32 flags;
>> +    __u32 dma32_window_start;
>> +    __u32 dma32_window_size;
>> +    __u64 dma64_window_start;
>> +    __u64 dma64_window_size;
>> +};
>> +
>> +#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
>> +
>> +struct tce_iommu_dma_map {
>> +    __u32 argsz;
>> +    __u32 flags;
>> +    __u64 va;
>> +    __u64 dmaaddr;
>> +};
>> +
>> +#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
>> +#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
>> +
>> +#endif
>> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
>> index 014297b..836ec4f 100644
>> --- a/hw/spapr_pci.c
>> +++ b/hw/spapr_pci.c
>> @@ -22,6 +22,9 @@
>>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>   * THE SOFTWARE.
>>   */
>> +#include <sys/types.h>
>> +#include <dirent.h>
>> +
>>  #include "hw.h"
>>  #include "pci.h"
>>  #include "msi.h"
>> @@ -32,7 +35,6 @@
>>  #include "exec-memory.h"
>>  #include <libfdt.h>
>>  #include "trace.h"
>> -
>>  #include "hw/pci_internals.h"
>>  
>>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
>> @@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>>                   level);
>>  }
>>  
>> +static int pci_spapr_get_irq(void *opaque, int irq_num)
>> +{
>> +    sPAPRPHBState *phb = opaque;
>> +    return phb->lsi_table[irq_num].dt_irq;
>> +}
>> +
>>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>>                                unsigned size)
>>  {
>> @@ -515,6 +523,79 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>>      return phb->dma;
>>  }
>>  
>> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
>> +{
>> +    char iommupath[256];
>> +    DIR *dirp;
>> +    struct dirent *entry;
>> +
>> +    if (!phb->scan) {
>> +        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
>> +        return 0;
>> +    }
>> +
>> +    snprintf(iommupath, sizeof(iommupath),
>> +             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
>> +    dirp = opendir(iommupath);
>> +
>> +    while ((entry = readdir(dirp)) != NULL) {
>> +        char *tmp;
>> +        FILE *deviceclassfile;
>> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
>> +        char addr[32];
>> +        DeviceState *dev;
>> +
>> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
>> +                   &domainid, &busid, &devid, &fnid) != 4) {
>> +            continue;
>> +        }
>> +
>> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
>> +        trace_spapr_pci("Reading device class from ", tmp);
>> +
>> +        deviceclassfile = fopen(tmp, "r");
>> +        if (deviceclassfile) {
>> +            fscanf(deviceclassfile, "%x", &deviceclass);
>> +            fclose(deviceclassfile);
>> +        }
>> +        g_free(tmp);
>> +
>> +        if (!deviceclass) {
>> +            continue;
>> +        }
>> +        if ((phb->scan < 2) &&
>> +            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
>> +            /* Skip _any_ bridge */
>> +            continue;
>> +        }
>> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
>> +            /* Tweak USB */
>> +            phb->force_addr = 1;
>> +            phb->enable_multifunction = 1;
>> +        }
>> +
>> +        trace_spapr_pci("Creating devicei from ", entry->d_name);
>> +
>> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
>> +        if (!dev) {
>> +            fprintf(stderr, "failed to create vfio-pci\n");
>> +            continue;
>> +        }
>> +        qdev_prop_parse(dev, "host", entry->d_name);
>> +        if (phb->force_addr) {
>> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
>> +            qdev_prop_parse(dev, "addr", addr);
>> +        }
>> +        if (phb->enable_multifunction) {
>> +            qdev_prop_set_bit(dev, "multifunction", 1);
>> +        }
>> +        qdev_init_nofail(dev);
>> +    }
>> +    closedir(dirp);
>> +
>> +    return 0;
>> +}
>> +
>>  static int spapr_phb_init(SysBusDevice *s)
>>  {
>>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
>> @@ -567,15 +648,13 @@ static int spapr_phb_init(SysBusDevice *s)
>>  
>>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>>                             phb->busname ? phb->busname : phb->dtbusname,
>> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
>> +                           pci_spapr_set_irq, pci_spapr_get_irq,
>> +                           pci_spapr_map_irq, phb,
>>                             &phb->memspace, &phb->iospace,
>>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>>      phb->host_state.bus = bus;
>>  
>>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
>> -    phb->dma_window_start = 0;
>> -    phb->dma_window_size = 0x40000000;
>> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
>>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
>>  
>>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
>> @@ -588,6 +667,21 @@ static int spapr_phb_init(SysBusDevice *s)
>>          }
>>      }
>>  
>> +    if (phb->iommugroupid >= 0) {
>> +        if (spapr_pci_scan_vfio(phb) < 0) {
>> +            return -1;
>> +        }
>> +        spapr_vfio_init_dma(phb->iommugroupid, phb->dma_liobn,
>> +                            &phb->dma_window_start,
>> +                            &phb->dma_window_size);
>> +        return 0;
>> +    }
>> +
>> +    phb->dma_window_start = 0;
>> +    phb->dma_window_size = 0x40000000;
>> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
>> +                                         phb->dma_window_size);
>> +
>>      return 0;
>>  }
>>  
>> @@ -599,6 +693,10 @@ static Property spapr_phb_properties[] = {
>>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
>> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
>> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
>> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
>> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
>> index 145071c..f514823 100644
>> --- a/hw/spapr_pci.h
>> +++ b/hw/spapr_pci.h
>> @@ -57,6 +57,10 @@ typedef struct sPAPRPHBState {
>>          int nvec;
>>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>>  
>> +    int32_t iommugroupid;
>> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
>> +    uint8_t enable_multifunction, force_addr;
>> +
>>      QLIST_ENTRY(sPAPRPHBState) list;
>>  } sPAPRPHBState;
>>  
>> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
>> index 1ac287f..fc84fb4 100644
>> --- a/hw/vfio_pci.c
>> +++ b/hw/vfio_pci.c
>> @@ -1581,6 +1581,24 @@ static int vfio_connect_container(VFIOGroup *group)
>>  
>>          memory_listener_register(&container->listener, get_system_memory());
>>  
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
>> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>> +        if (ret) {
>> +            error_report("vfio: failed to set group container: %s\n",
>> +                         strerror(errno));
>> +            g_free(container);
>> +            close(fd);
>> +            return -1;
>> +        }
>> +
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
>> +        if (ret) {
>> +            error_report("vfio: failed to set iommu for container: %s\n",
>> +                         strerror(errno));
>> +            g_free(container);
>> +            close(fd);
>> +            return -1;
>> +        }
> 
> I think we can still do better.  The x86 code sets up a MemoryListener
> here with data for that embedded into the VFIOContainer.  You don't
> have, need, or want a MemoryListener, but that doesn't mean we can't
> follow the model of registering that this group exists here and setting
> up map/unmap callbacks.
> 
> For instance:
> 
> in vfio_pci.h:
> struct sPAPRVFIOData {
>     uint64_t dma32_window_start;
>     uint64_t dma64_window_size;
>     ....
>     int (*map)(struct tce_iommu_dma_map *);
>     int (*unmap)(struct tce_iommu_dma_map *);
> };
> 
> appended to the above spapr tce iommu setup above:
> 
> struct tce_iommu_info info;
> 
> /* the MemoryListener embedded in container becomes a union to hold
>  * iommu specific data. */
> container->u.spapr.data->map = vfio_spapr_tce_map;
> container->u.spapr.data->unmap = vfio_spapr_tce_unmap;

I could actually reuse x86 callbacks, just wanted to keep POWER ioctls together. The problem here is
getting the DMA window parameters and avoiding MemoryListener.

> ioctl(fd, SPAPR_TCE_IOMMU_GET_INFO, &info))did 
> 
> container->u.spapr.data->dma32_window_start = info.dma32_window_start;
> container->u.spapr.data->dma32_window_size = info.dma32_window_size;
> 
> spapr_register_vfio_container(&container->u.spapr.data)

I assume it is called within vfio_pci.c as we do not want to access VFIOContainer members from
anywhere but vfio_pci.c.
Or are we changing the approach? I am a bit confused.


> Then vfio_disconnect_container() could call
> spapr_unregister_vfio_container().  Maybe the container contains a
> function pointer to an uninit function so we don't have to ifdef between
> x86 and power.  Does that make sense?  Thanks,

We also need to pass the numbers from the info struct to spapr_pci.c in order to tell the guest
where the DMA window resides. Another callback? This is exactly what I avoided in the kernel when we
decided not to extend the IOMMU API with POWER stuff, and I would like to have the same here.

In general, what is the benefit of pulling as much platform-specific stuff as possible into VFIO?

I am trying to keep sPAPR IOMMU stuff away and make it easy to add new platforms to VFIO.

For example, I would rather think of moving the piece of code which checks for SPAPR_TCE_IOMMU out
of VFIO and making it a QEMUMachine callback (together with add-eoi-notifier), as the way the IOMMU
works is definitely a machine-type-specific feature.

For example, int QEMUMachine::init_iommu(VFIOContainer *container), which would not even try
VFIO_TYPE1_IOMMU on POWER or SPAPR_TCE_IOMMU on x86, as it already knows the machine and IOMMU types.


And do something like:
typedef struct VFIOContainer {
    int fd;
    void *platform_iommu_data;
} VFIOContainer;

Create additional file called vfio_iommu_x86.c with:
struct VFIO_Type1_IOMMU {
    MemoryListener listener;
    QLIST_HEAD(, VFIOGroup) group_list;
    QLIST_ENTRY(VFIOContainer) next;
};
and put all MemoryListener stuff there.

For POWER we already have spapr_iommu.c.

Wrong direction? :)
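
A minimal sketch of the split proposed above, for illustration only. Every
name here (DemoContainer, demo_init_iommu_fn, demo_connect_container and so
on) is hypothetical and not part of QEMU or of the patch. The container keeps
only the fd plus an opaque pointer, and the machine supplies the one IOMMU
init routine it knows is valid, so no other IOMMU type is probed. The
0x40000000 window size is just the default from spapr_phb_init() used as a
placeholder; a real backend would query the kernel for it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical generic container: an fd plus opaque per-platform data. */
typedef struct DemoContainer {
    int fd;
    void *platform_iommu_data;
    /* Optional uninit hook so teardown needs no per-platform ifdef. */
    void (*platform_iommu_free)(struct DemoContainer *c);
} DemoContainer;

/* Hypothetical per-machine callback type (the "init_iommu" idea). */
typedef int (*demo_init_iommu_fn)(DemoContainer *c);

/* Stand-in for the x86 data that would hold the MemoryListener. */
typedef struct DemoType1Data {
    int listener_registered;
} DemoType1Data;

/* Stand-in for the POWER data that would hold the DMA window. */
typedef struct DemoSpaprData {
    uint64_t dma32_window_start;
    uint64_t dma32_window_size;
} DemoSpaprData;

static void demo_free_data(DemoContainer *c)
{
    free(c->platform_iommu_data);
    c->platform_iommu_data = NULL;
}

static int demo_type1_init_iommu(DemoContainer *c)
{
    DemoType1Data *d = calloc(1, sizeof(*d));

    if (!d) {
        return -1;
    }
    d->listener_registered = 1; /* placeholder for registering a listener */
    c->platform_iommu_data = d;
    c->platform_iommu_free = demo_free_data;
    return 0;
}

static int demo_spapr_init_iommu(DemoContainer *c)
{
    DemoSpaprData *d = calloc(1, sizeof(*d));

    if (!d) {
        return -1;
    }
    d->dma32_window_start = 0;
    d->dma32_window_size = 0x40000000; /* placeholder window size */
    c->platform_iommu_data = d;
    c->platform_iommu_free = demo_free_data;
    return 0;
}

/* Generic code: calls whichever init routine the machine provided. */
static int demo_connect_container(DemoContainer *c, int fd,
                                  demo_init_iommu_fn init_iommu)
{
    assert(c != NULL);
    c->fd = fd;
    c->platform_iommu_data = NULL;
    c->platform_iommu_free = NULL;
    if (!init_iommu) {
        return -1; /* no IOMMU model available for this machine */
    }
    return init_iommu(c);
}

static void demo_disconnect_container(DemoContainer *c)
{
    if (c->platform_iommu_free) {
        c->platform_iommu_free(c);
    }
}
```

The trade-off is that the generic code never looks inside
platform_iommu_data, at the cost of having to get the init routine from
somewhere machine-specific.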


> Alex
> 
>>      } else {
>>          error_report("vfio: No available IOMMU models\n");
>>          g_free(container);
>> @@ -2005,3 +2023,15 @@ static void register_vfio_pci_dev_type(void)
>>  }
>>  
>>  type_init(register_vfio_pci_dev_type)
>> +
>> +int vfio_group_iommu_ioctl(int iommu_group, int request, void *data)
>> +{
>> +    VFIOGroup *group;
>> +
>> +    group = vfio_get_group(iommu_group);
>> +    if (!group->container) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return ioctl(group->container->fd, request, data);
>> +}
>> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
>> index 226607c..f44ff07 100644
>> --- a/hw/vfio_pci.h
>> +++ b/hw/vfio_pci.h
>> @@ -105,4 +105,6 @@ typedef struct VFIOGroup {
>>  #define VFIO_FLAG_IOMMU_SHARED_BIT 0
>>  #define VFIO_FLAG_IOMMU_SHARED (1U << VFIO_FLAG_UIOMMU_SHARED_BIT)
>>  
>> +int vfio_group_iommu_ioctl(int iommu_group, int request, void *data);
>> +
>>  #endif /* __VFIO_H__ */
>> diff --git a/trace-events b/trace-events
>> index e548f86..9100591 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -848,6 +848,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
>>  qxl_render_update_area_done(void *cookie) "%p"
>>  
>>  # hw/spapr_pci.c
>> +spapr_pci(const char *msg1, const char *msg2) "%s%s"
>>  spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
>>  spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
>>  spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"
> 
> 
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support
  2012-07-12  5:47                   ` Alexey Kardashevskiy
@ 2012-07-16  3:51                     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-16  3:51 UTC (permalink / raw)
  To: Alex Williamson, Jan Kiszka
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

Jan, Alex, ping?


On 12/07/12 15:47, Alexey Kardashevskiy wrote:
> On 12/07/12 15:29, Alex Williamson wrote:
>> On Thu, 2012-07-12 at 14:58 +1000, Alexey Kardashevskiy wrote:
>>> On 12/07/12 14:43, Alex Williamson wrote:
>>>> On Thu, 2012-07-12 at 14:38 +1000, Alexey Kardashevskiy wrote:
>>>>> On 12/07/12 14:31, Alex Williamson wrote:
>>>>>> On Thu, 2012-07-12 at 14:16 +1000, Alexey Kardashevskiy wrote:
>>>>>>> On 12/07/12 12:54, Alex Williamson wrote:
>>>>>>>> On Wed, 2012-07-11 at 12:25 +1000, Alexey Kardashevskiy wrote:
>>>>>>>>> On 11/07/12 02:57, Alex Williamson wrote:
>>>>>>>>>> On Tue, 2012-07-10 at 15:51 +1000, Alexey Kardashevskiy wrote:
>>>>>>>>>>> The two patches in this set are supposed to add VFIO support for POWER.
>>>>>>>>>>>
>>>>>>>>>>> The first one adds one more step in the initalizaion sequence which I am not
>>>>>>>>>>> sure is correct.
>>>>>>>>>>>
>>>>>>>>>>> The second patch adds actual VFIO support. It is not ready to submit but
>>>>>>>>>>> ready to discuss. I would like to get rid of all #ifdef TARGET_PPC64 in patch #2
>>>>>>>>>>> and I wonder if there is any plan to implement some generic EOI support code, etc.
>>>>>>>>>>
>>>>>>>>>> A generic EOI notifier is on my todo list, but I have no idea what it's
>>>>>>>>>> going to look like.  As you know, I've got an ioapic specific notifier
>>>>>>>>>> in my tree, you add a spapr specific one.  I welcome ideas on how to
>>>>>>>>>> create something generic that has a chance of being accepted.  Thanks,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So far the only platform specific call is xxxx_add_gsi_eoi_notifier. The
>>>>>>>>> xxxx_remove_gsi_eoi_notifier only calls notifier_remove, you've got to fix yours
>>>>>>>>> ioapic_remove_gsi_eoi_notifier() as it does too much :)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The only place for placing "add_eoi" callback I can see right now is QEMUMachine as there is no
>>>>>>>>> unified machine interrupt controller - IOAPIC has its own type TYPE_IOAPIC_COMMON and XICS is not
>>>>>>>>> even a SysBusDevice. And the callback is not specific for any kind of bus so it cannot go to PCIBus.
>>>>>>>>>
>>>>>>>>> Does it sound reasonable?
>>>>>>>>
>>>>>>>> I suspect we'd need to somehow tie it into qemu_irq where both handlers
>>>>>>>> and notifiers are allocated so we don't really care the underlying
>>>>>>>> implementation.  Something like qemu_add_irq_eoi_notifier(qemu_irq
>>>>>>>> irq, ...).  It's another mess like adding the PCIBus interrupt line to
>>>>>>>> gsi effort though.  Thanks,
>>>>>>>
>>>>>>>
>>>>>>> Tried. Added add_eoi_notifier() callback to qemu_irq, new IRQ allocator:
>>>>>>> qemu_irq *qemu_allocate_irqs2(qemu_irq_handler handler, void *opaque, int n,
>>>>>>>                               qemu_eoi_add_notifier add_notifier);
>>>>>>> and called it from the XICS initialization code.
>>>>>>>
>>>>>>> It could work out if pci_get_irq() or pci_route_irq_fn() returned qemu_irq but no, they just return
>>>>>>> a global IRQ number (pure or embedded in a struct) and there is no common way to resolve qemu_irq
>>>>>>> (and then add_eoi_notifier()) from that number within vfio_pci.
>>>>>>
>>>>>> Well GSI and qemu_irq are different address spaces.  We still need GSI
>>>>>> for any kind of qemu bypass case.
>>>>>
>>>>> No, that is ok, we also need GSI because XICS and IOAPIC need it in the end.
>>>>>
>>>>>>> May be we could add the callback pointer into PCIINTxRoute?
>>>>>>
>>>>>> Maybe, but why is this PCI specific?  Can't we call it as
>>>>>> qemu_add_irq_eoi_notifier(pdev->irq[0], Notifier)?  That would work much
>>>>>> like qemu_set_irq, extracting the irq number from the IRQState and
>>>>>> passing it through to the add_notifier callback for IRQState until it
>>>>>> got to the ioapic/pic/xics.
>>>>>>
>>>>>> int qemu_add_irq_eoi_notifier(qemu_irq *irq, Notifier notifier)
>>>>>> {
>>>>>>     if (!irq || !irq->add_eoi_notifier)
>>>>>>         return -1;
>>>>>>
>>>>>>    return irq->add_eoi_notifier(irq->opaque, irq->n, notifier);
>>>>>> }
>>>>>>
>>>>>
>>>>> Then we will have to entirely replace qemu_allocate_irqs() with qemu_allocate_irqs2() and pass some
>>>>> non-zero add_eoi_notifier() on every level, at least for PCI for now. I would like to avoid that if
>>>>> possible - hard to get accepted :)
>>>>
>>>> Yep, that's why I said it was the same kind of mess as the PCIBus intx
>>>> routing.  It's intrusive, but qemu_irq is the common interrupt model so
>>>> we need to make use of it.
>>>
>>> There are 2 level of intrusion.
>>>
>>> 1. Fix PCIINTxRoute to return the GSI's qemu_irq as well.
>>
>> Slightly confusing because pdev->irq[] is a qemu_irq, but you want the
>> actual ioapic/pic/xics qemu_irq w/o walking through the various devices,
>> correct?
> 
> Yes. The qemu_irq which corresponds to the GSI which pci_get_irq is returning.
> 
>>  I'm not sure what we do once we have it though.  Do we get to
>> call something like the function outlined above on these "special"
>> qemu_irqs?
> 
> They are not special but just "global". This is what hw/pc_piix.c allocates with qemu_allocate_irqs().
> 
> Assuming we have properly initialized add_eoi_notifier() callback in the qemu_irq struct, we can
> easily add a notifier via this callback.
> 
> Or I did not get the whole idea.
> 
>>
>>> 2. Add add_eoi_notifier to all levels including PCI. As a part of this, we will have to add this
>>> callback to all pci_register_bus() calls to reach global interrupts via platform-specific PCI bus.
>>
>> Just like the PCI INTx route callback, most of these can just be
>> passthrough.  We just need to get to the end qemu_irq that registered a
>> real add notifier.  That might make it possible to do it w/o interfering
>> too much with other callers, I hope.
> 
> Yes. This is why I propose to extend the PCIINTxRoute struct.
> 
> Actually even adding a callback into QEMUMachine is not that bad idea.
> 
> If a pointer to the struct QEMUMachine was passed into QEMUMachineInitFunc(), it would be the right
> place to init such callback, one per machine but not per every qemu_irq as it is the same for the
> whole machine and will not change.
> 
> 
>>> I would stay with 1). Is that bad?
>>
>> It still seems to present a rather large incongruity, but if we're
>> planning to cache the qemu_irq there anyway, maybe it's a secondary use.
> 
> Cannot see how it is different from having pci_get_irq() or pci_route_irq_fn() though.
> 
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3)
  2012-07-14  2:34     ` Alexey Kardashevskiy
@ 2012-07-16 14:21       ` Alex Williamson
  2012-07-16 21:17         ` Alex Williamson
  2012-07-17  7:53         ` Alexey Kardashevskiy
  0 siblings, 2 replies; 52+ messages in thread
From: Alex Williamson @ 2012-07-16 14:21 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alexander Graf, Blue Swirl, qemu-ppc, David Gibson

On Sat, 2012-07-14 at 12:34 +1000, Alexey Kardashevskiy wrote:
> On 14/07/12 01:07, Alex Williamson wrote:
> > On Fri, 2012-07-13 at 17:26 +1000, Alexey Kardashevskiy wrote:
> >> It literally does the following:
> >>
> >> 1. POWERPC IOMMU support (the kernel counterpart is required)
> >>
> >> 2. The patch assumes that IOAPIC calls are going to be replaced
> >> with something generic.
> >>
> >> 3. vfio_group_iommu_ioctl() has been added to let sPAPR IOMMU
> >> handler to call VFIO IOMMU driver.
> >>
> >> 4. Change sPAPR PHB to scan the PCI bus which is used for
> >> the IOMMU-VFIO group. Now it is enough to add the following to
> >> the QEMU command line to get VFIO up with all the devices from
> >> IOMMU group with id=3:
> >> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
> >> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  hw/ppc/Makefile.objs  |    3 ++
> >>  hw/spapr.h            |    4 ++
> >>  hw/spapr_iommu.c      |   69 ++++++++++++++++++++++++++++++-
> >>  hw/spapr_iommu_vfio.h |   49 ++++++++++++++++++++++
> >>  hw/spapr_pci.c        |  108 ++++++++++++++++++++++++++++++++++++++++++++++---
> >>  hw/spapr_pci.h        |    4 ++
> >>  hw/vfio_pci.c         |   30 ++++++++++++++
> >>  hw/vfio_pci.h         |    2 +
> >>  trace-events          |    1 +
> >>  9 files changed, 264 insertions(+), 6 deletions(-)
> >>  create mode 100644 hw/spapr_iommu_vfio.h
> >>
> >> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >> index f573a95..c46a049 100644
> >> --- a/hw/ppc/Makefile.objs
> >> +++ b/hw/ppc/Makefile.objs
> >> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
> >>  # Xilinx PPC peripherals
> >>  obj-y += xilinx_ethlite.o
> >>  
> >> +# VFIO PCI device assignment
> >> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
> >> +
> >>  obj-y := $(addprefix ../,$(obj-y))
> >> diff --git a/hw/spapr.h b/hw/spapr.h
> >> index b37f337..26e26f6 100644
> >> --- a/hw/spapr.h
> >> +++ b/hw/spapr.h
> >> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> >>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
> >>                        DMAContext *dma);
> >>  
> >> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> >> +                         uint64_t *dma32_window_start,
> >> +                         uint64_t *dma32_window_size);
> >> +
> >>  #endif /* !defined (__HW_SPAPR_H__) */
> >> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> >> index 50c288d..e48ced1 100644
> >> --- a/hw/spapr_iommu.c
> >> +++ b/hw/spapr_iommu.c
> >> @@ -23,6 +23,8 @@
> >>  #include "dma.h"
> >>  
> >>  #include "hw/spapr.h"
> >> +#include "hw/spapr_iommu_vfio.h"
> >> +#include "hw/vfio_pci.h"
> >>  
> >>  #include <libfdt.h>
> >>  
> >> @@ -183,6 +185,67 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
> >>      return 0;
> >>  }
> >>  
> >> +typedef struct sPAPRVFIOTable {
> >> +    int group_id;
> >> +    uint32_t liobn;
> >> +    QLIST_ENTRY(sPAPRVFIOTable) list;
> >> +} sPAPRVFIOTable;
> >> +
> >> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
> >> +
> >> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> >> +                         uint64_t *dma32_window_start,
> >> +                         uint64_t *dma32_window_size)
> >> +{
> >> +    sPAPRVFIOTable *t;
> >> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
> >> +
> >> +    if (vfio_group_iommu_ioctl(group_id, SPAPR_TCE_IOMMU_GET_INFO, &info)) {
> >> +        perror("SPAPR_TCE_IOMMU_GET_INFO failed");
> >> +        return;
> >> +    }
> >> +    *dma32_window_start = info.dma32_window_start;
> >> +    *dma32_window_size = info.dma32_window_size;
> >> +
> >> +    t = g_malloc0(sizeof(*t));
> >> +    t->group_id = group_id;
> >> +    t->liobn = liobn;
> >> +
> >> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
> >> +}
> >> +
> >> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
> >> +{
> >> +    sPAPRVFIOTable *t;
> >> +    struct tce_iommu_dma_map map = {
> >> +        .argsz = sizeof(map),
> >> +        .va = 0,
> >> +        .dmaaddr = ioba,
> >> +    };
> >> +
> >> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
> >> +        if (t->liobn != liobn) {
> >> +            continue;
> >> +        }
> >> +        if (tce) {
> >> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
> >> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_MAP_DMA,
> >> +                                       &map)) {
> >> +                perror("TCE_MAP_DMA");
> >> +                return H_PARAMETER;
> >> +            }
> >> +        } else {
> >> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_UNMAP_DMA,
> >> +                                       &map)) {
> >> +                perror("TCE_UNMAP_DMA");
> >> +                return H_PARAMETER;
> >> +            }
> >> +        }
> >> +        return H_SUCCESS;
> >> +    }
> >> +    return H_CONTINUE; /* positive non-zero value */
> >> +}
> >> +
> >>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
> >>                                target_ulong opcode, target_ulong *args)
> >>  {
> >> @@ -200,7 +263,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
> >>      ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
> >>  
> >>      ret = put_tce_emu(liobn, ioba, tce);
> >> -    if (0 >= ret) {
> >> +    if (ret <= 0) {
> >> +        return ret ? H_PARAMETER : H_SUCCESS;
> >> +    }
> >> +    ret = put_tce_vfio(liobn, ioba, tce);
> >> +    if (ret <= 0) {
> >>          return ret ? H_PARAMETER : H_SUCCESS;
> >>      }
> >>  #ifdef DEBUG_TCE
> >> diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
> >> new file mode 100644
> >> index 0000000..711e3e4
> >> --- /dev/null
> >> +++ b/hw/spapr_iommu_vfio.h
> >> @@ -0,0 +1,49 @@
> >> +/*
> >> + * Definitions for VFIO IOMMU driver implementing SPAPR TCE.
> >> + * This is the copy of the kernel header.
> >> + *
> >> + * Copyright (c) 2012 Alexey Kardashevskiy <aik@olabs.ru>
> >> + *
> >> + * This library is free software; you can redistribute it and/or
> >> + * modify it under the terms of the GNU Lesser General Public
> >> + * License as published by the Free Software Foundation; either
> >> + * version 2 of the License, or (at your option) any later version.
> >> + *
> >> + * This library is distributed in the hope that it will be useful,
> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> + * Lesser General Public License for more details.
> >> + *
> >> + * You should have received a copy of the GNU Lesser General Public
> >> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> >> + */
> >> +
> >> +#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
> >> +#define __HW_SPAPR_IOMMU_VFIO_H__
> >> +
> >> +#include "hw/linux-vfio.h"
> >> +
> >> +#define SPAPR_TCE_IOMMU         2
> >> +
> >> +struct tce_iommu_info {
> >> +    __u32 argsz;
> >> +    __u32 flags;
> >> +    __u32 dma32_window_start;
> >> +    __u32 dma32_window_size;
> >> +    __u64 dma64_window_start;
> >> +    __u64 dma64_window_size;
> >> +};
> >> +
> >> +#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
> >> +
> >> +struct tce_iommu_dma_map {
> >> +    __u32 argsz;
> >> +    __u32 flags;
> >> +    __u64 va;
> >> +    __u64 dmaaddr;
> >> +};
> >> +
> >> +#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
> >> +#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
> >> +
> >> +#endif
> >> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> >> index 014297b..836ec4f 100644
> >> --- a/hw/spapr_pci.c
> >> +++ b/hw/spapr_pci.c
> >> @@ -22,6 +22,9 @@
> >>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> >>   * THE SOFTWARE.
> >>   */
> >> +#include <sys/types.h>
> >> +#include <dirent.h>
> >> +
> >>  #include "hw.h"
> >>  #include "pci.h"
> >>  #include "msi.h"
> >> @@ -32,7 +35,6 @@
> >>  #include "exec-memory.h"
> >>  #include <libfdt.h>
> >>  #include "trace.h"
> >> -
> >>  #include "hw/pci_internals.h"
> >>  
> >>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
> >> @@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
> >>                   level);
> >>  }
> >>  
> >> +static int pci_spapr_get_irq(void *opaque, int irq_num)
> >> +{
> >> +    sPAPRPHBState *phb = opaque;
> >> +    return phb->lsi_table[irq_num].dt_irq;
> >> +}
> >> +
> >>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
> >>                                unsigned size)
> >>  {
> >> @@ -515,6 +523,79 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
> >>      return phb->dma;
> >>  }
> >>  
> >> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
> >> +{
> >> +    char iommupath[256];
> >> +    DIR *dirp;
> >> +    struct dirent *entry;
> >> +
> >> +    if (!phb->scan) {
> >> +        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
> >> +        return 0;
> >> +    }
> >> +
> >> +    snprintf(iommupath, sizeof(iommupath),
> >> +             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
> >> +    dirp = opendir(iommupath);
> >> +
> >> +    while ((entry = readdir(dirp)) != NULL) {
> >> +        char *tmp;
> >> +        FILE *deviceclassfile;
> >> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> >> +        char addr[32];
> >> +        DeviceState *dev;
> >> +
> >> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
> >> +                   &domainid, &busid, &devid, &fnid) != 4) {
> >> +            continue;
> >> +        }
> >> +
> >> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
> >> +        trace_spapr_pci("Reading device class from ", tmp);
> >> +
> >> +        deviceclassfile = fopen(tmp, "r");
> >> +        if (deviceclassfile) {
> >> +            fscanf(deviceclassfile, "%x", &deviceclass);
> >> +            fclose(deviceclassfile);
> >> +        }
> >> +        g_free(tmp);
> >> +
> >> +        if (!deviceclass) {
> >> +            continue;
> >> +        }
> >> +        if ((phb->scan < 2) &&
> >> +            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
> >> +            /* Skip _any_ bridge */
> >> +            continue;
> >> +        }
> >> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
> >> +            /* Tweak USB */
> >> +            phb->force_addr = 1;
> >> +            phb->enable_multifunction = 1;
> >> +        }
> >> +
> >> +        trace_spapr_pci("Creating devicei from ", entry->d_name);
> >> +
> >> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
> >> +        if (!dev) {
> >> +            fprintf(stderr, "failed to create vfio-pci\n");
> >> +            continue;
> >> +        }
> >> +        qdev_prop_parse(dev, "host", entry->d_name);
> >> +        if (phb->force_addr) {
> >> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
> >> +            qdev_prop_parse(dev, "addr", addr);
> >> +        }
> >> +        if (phb->enable_multifunction) {
> >> +            qdev_prop_set_bit(dev, "multifunction", 1);
> >> +        }
> >> +        qdev_init_nofail(dev);
> >> +    }
> >> +    closedir(dirp);
> >> +
> >> +    return 0;
> >> +}
> >> +
> >>  static int spapr_phb_init(SysBusDevice *s)
> >>  {
> >>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
> >> @@ -567,15 +648,13 @@ static int spapr_phb_init(SysBusDevice *s)
> >>  
> >>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
> >>                             phb->busname ? phb->busname : phb->dtbusname,
> >> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
> >> +                           pci_spapr_set_irq, pci_spapr_get_irq,
> >> +                           pci_spapr_map_irq, phb,
> >>                             &phb->memspace, &phb->iospace,
> >>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
> >>      phb->host_state.bus = bus;
> >>  
> >>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
> >> -    phb->dma_window_start = 0;
> >> -    phb->dma_window_size = 0x40000000;
> >> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
> >>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
> >>  
> >>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
> >> @@ -588,6 +667,21 @@ static int spapr_phb_init(SysBusDevice *s)
> >>          }
> >>      }
> >>  
> >> +    if (phb->iommugroupid >= 0) {
> >> +        if (spapr_pci_scan_vfio(phb) < 0) {
> >> +            return -1;
> >> +        }
> >> +        spapr_vfio_init_dma(phb->iommugroupid, phb->dma_liobn,
> >> +                            &phb->dma_window_start,
> >> +                            &phb->dma_window_size);
> >> +        return 0;
> >> +    }
> >> +
> >> +    phb->dma_window_start = 0;
> >> +    phb->dma_window_size = 0x40000000;
> >> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
> >> +                                         phb->dma_window_size);
> >> +
> >>      return 0;
> >>  }
> >>  
> >> @@ -599,6 +693,10 @@ static Property spapr_phb_properties[] = {
> >>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
> >>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
> >>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
> >> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
> >> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
> >> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
> >> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>  
> >> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
> >> index 145071c..f514823 100644
> >> --- a/hw/spapr_pci.h
> >> +++ b/hw/spapr_pci.h
> >> @@ -57,6 +57,10 @@ typedef struct sPAPRPHBState {
> >>          int nvec;
> >>      } msi_table[SPAPR_MSIX_MAX_DEVS];
> >>  
> >> +    int32_t iommugroupid;
> >> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
> >> +    uint8_t enable_multifunction, force_addr;
> >> +
> >>      QLIST_ENTRY(sPAPRPHBState) list;
> >>  } sPAPRPHBState;
> >>  
> >> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> >> index 1ac287f..fc84fb4 100644
> >> --- a/hw/vfio_pci.c
> >> +++ b/hw/vfio_pci.c
> >> @@ -1581,6 +1581,24 @@ static int vfio_connect_container(VFIOGroup *group)
> >>  
> >>          memory_listener_register(&container->listener, get_system_memory());
> >>  
> >> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
> >> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >> +        if (ret) {
> >> +            error_report("vfio: failed to set group container: %s\n",
> >> +                         strerror(errno));
> >> +            g_free(container);
> >> +            close(fd);
> >> +            return -1;
> >> +        }
> >> +
> >> +        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
> >> +        if (ret) {
> >> +            error_report("vfio: failed to set iommu for container: %s\n",
> >> +                         strerror(errno));
> >> +            g_free(container);
> >> +            close(fd);
> >> +            return -1;
> >> +        }
> > 
> > I think we can still do better.  The x86 code sets up a MemoryListener
> > here with data for that embedded into the VFIOContainer.  You don't
> > have, need, or want a MemoryListener, but that doesn't mean we can't
> > follow the model of registering that this group exists here and setting
> > up map/unmap callbacks.
> > 
> > For instance:
> > 
> > in vfio_pci.h:
> > struct sPAPRVFIOData {
> >     uint64_t dma32_window_start;
> >     uint64_t dma64_window_size;
> >     ....
> >     int (*map)(struct tce_iommu_dma_map *);
> >     int (*unmap)(struct tce_iommu_dma_map *);
> > };
> > 
> > appended to the above spapr tce iommu setup above:
> > 
> > struct tce_iommu_info info;
> > 
> > /* the MemoryListener embedded in container becomes a union to hold
> >  * iommu specific data. */
> > container->u.spapr.data->map = vfio_spapr_tce_map;
> > container->u.spapr.data->unmap = vfio_spapr_tce_unmap;
> 
> I could actually reuse x86 callbacks, just wanted to keep POWER ioctls together. The problem here is
> getting the DMA window parameters and avoiding MemoryListener.

The callback is pretty trivial though, filling in the data structure and
calling the ioctl.  We have different data structures and different
ioctls, so probably not a lot to leverage.
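
To show how little such a callback has to do, here is a rough sketch. It is
not the patch's code: the demo_* names are made up, the request codes are
placeholders rather than the real SPAPR_TCE_IOMMU_* values, and the ioctl is
passed in as a function pointer (tce_ioctl_fn, also made up) so the example
builds without the kernel counterpart. The struct mirrors tce_iommu_dma_map
from spapr_iommu_vfio.h with fixed-width types.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct tce_iommu_dma_map in the patch. */
struct tce_iommu_dma_map {
    uint32_t argsz;
    uint32_t flags;
    uint64_t va;
    uint64_t dmaaddr;
};

/* Hypothetical stand-in for ioctl(fd, request, arg). */
typedef int (*tce_ioctl_fn)(int fd, unsigned long request, void *arg);

/* Placeholder request codes; the real ones come from the kernel header. */
enum { DEMO_TCE_MAP_DMA = 13, DEMO_TCE_UNMAP_DMA = 14 };

/* Fill in the structure and issue the map request. */
static int demo_spapr_tce_map(int fd, tce_ioctl_fn do_ioctl,
                              uint64_t va, uint64_t ioba)
{
    struct tce_iommu_dma_map map;

    assert(do_ioctl != NULL);
    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.va = va;
    map.dmaaddr = ioba;
    return do_ioctl(fd, DEMO_TCE_MAP_DMA, &map);
}

/* Same for unmap; only the DMA address matters, va stays zero. */
static int demo_spapr_tce_unmap(int fd, tce_ioctl_fn do_ioctl, uint64_t ioba)
{
    struct tce_iommu_dma_map map;

    assert(do_ioctl != NULL);
    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.dmaaddr = ioba;
    return do_ioctl(fd, DEMO_TCE_UNMAP_DMA, &map);
}

/* Recording fake, used only to exercise the two callbacks above. */
static struct tce_iommu_dma_map demo_last_map;
static unsigned long demo_last_request;

static int demo_fake_ioctl(int fd, unsigned long request, void *arg)
{
    (void)fd;
    demo_last_request = request;
    memcpy(&demo_last_map, arg, sizeof(demo_last_map));
    return 0;
}
```

In vfio_pci itself the function pointer would presumably just be a direct
ioctl() on the container fd.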

> > ioctl(fd, SPAPR_TCE_IOMMU_GET_INFO, &info))did 
> > 
> > container->u.spapr.data->dma32_window_start = info.dma32_window_start;
> > container->u.spapr.data->dma32_window_size = info.dma32_window_size;
> > 
> > spapr_register_vfio_container(&container->u.spapr.data)
> 
> I assume it is called within vfio_pci.c as we do not want to access VFIOContainer members from
> anywhere but vfio_pci.c.
> Or we are changing the approach? I am a bit confused.

Yes, the registration function would be called from vfio_pci at the
same point in the spapr iommu test where x86 sets up the memory
listener.  That would register the map/unmap function pointers and dma
window information.  spapr would then make map/unmap calls using those
function pointers, which would be implemented in vfio_pci where they
could dereference the container and therefore get to the container fd.
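
A minimal sketch of the registration model described above, assuming the
spapr side keeps a list keyed by LIOBN. The `data` argument added to the
callbacks, the `liobn` and `next` fields, and spapr_vfio_put_tce() are
assumptions made for illustration; demo_map() and demo_unmap() stand in
for the vfio_pci callbacks that would issue the real ioctls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same fields as the request struct in the patch's header. */
struct tce_iommu_dma_map {
    uint32_t argsz;
    uint32_t flags;
    uint64_t va;
    uint64_t dmaaddr;
};

/* Data vfio_pci would hand to spapr at registration time. */
struct sPAPRVFIOData {
    uint32_t liobn;
    uint64_t dma32_window_start;
    uint64_t dma32_window_size;
    int (*map)(struct sPAPRVFIOData *data, struct tce_iommu_dma_map *req);
    int (*unmap)(struct sPAPRVFIOData *data, struct tce_iommu_dma_map *req);
    struct sPAPRVFIOData *next;
};

static struct sPAPRVFIOData *spapr_vfio_containers;

void spapr_register_vfio_container(struct sPAPRVFIOData *data)
{
    assert(data && data->map && data->unmap);
    data->next = spapr_vfio_containers;
    spapr_vfio_containers = data;
}

void spapr_unregister_vfio_container(struct sPAPRVFIOData *data)
{
    struct sPAPRVFIOData **p;

    for (p = &spapr_vfio_containers; *p; p = &(*p)->next) {
        if (*p == data) {
            *p = data->next;
            data->next = NULL;
            return;
        }
    }
}

/* Returns 1 if no registered container owns this LIOBN, so the caller
 * can fall through to another backend; otherwise returns the callback's
 * result (0 on success). */
int spapr_vfio_put_tce(uint32_t liobn, uint64_t ioba, uint64_t va)
{
    struct sPAPRVFIOData *d;
    struct tce_iommu_dma_map req = {
        .argsz = sizeof(req),
        .va = va,
        .dmaaddr = ioba,
    };

    for (d = spapr_vfio_containers; d; d = d->next) {
        if (d->liobn == liobn) {
            /* A zero va requests an unmap, mirroring the tce == 0 check. */
            return va ? d->map(d, &req) : d->unmap(d, &req);
        }
    }
    return 1;
}

/* Stand-ins for the vfio_pci callbacks; they only record the request. */
static uint64_t demo_last_mapped, demo_last_unmapped;

static int demo_map(struct sPAPRVFIOData *data, struct tce_iommu_dma_map *req)
{
    (void)data;
    demo_last_mapped = req->dmaaddr;
    return 0;
}

static int demo_unmap(struct sPAPRVFIOData *data, struct tce_iommu_dma_map *req)
{
    (void)data;
    demo_last_unmapped = req->dmaaddr;
    return 0;
}
```

With this shape spapr never sees the container fd; it only holds the
data pointer and calls back into vfio_pci.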

> > Then vfio_disconnect_container() could call
> > spapr_unregister_vfio_container().  Maybe the container contains a
> > function pointer to an uninit function so we don't have to ifdef between
> > x86 and power.  Does that make sense?  Thanks,
> 
> We also need to pass the numbers from the info struct to spapr_pci.c in order to tell the guest
> where the DMA window resides. Another callback? This is exactly what I avoided in the kernel when we
> decided not to extend IOMMU API with POWER stuff; I would like to have the same here.

This is an internal API, so there's no penalty for fixing it later.
Callbacks and window info are passed in the data structure outlined
above.

> In general, what is the benefit of pulling as much platform-specific stuff as possible into VFIO?
> 
> I am trying to keep sPAPR IOMMU stuff away and make it easy to add new platforms to VFIO.
> 
> For example, I would rather move the piece of code which checks for SPAPR_TCE_IOMMU out
> of VFIO and make it a QEMUMachine callback (together with add-eoi-notifier), as the way the IOMMU works is
> definitely a machine-type-specific feature.
> 
> For example, int QEMUMachine::init_iommu(VFIOContainer *container) which would not even try
> VFIO_TYPE1_IOMMU on POWER or SPAPR_TCE_IOMMU on x86 as it knows the machine and IOMMU types already.

I don't think tying vfio into the QEMUMachine type has a future.  If you
convince Anthony or Michael otherwise, let me know.  Your attempt to
keep spapr stuff completely out of vfio is requiring private interfaces
to be exposed.  That I think is the wrong direction.

> And do something like:
> typedef struct VFIOContainer {
>     int fd;
>     void *platform_iommu_data;
> } VFIOContainer;
> 
> Create additional file called vfio_iommu_x86.c with:
> struct VFIO_Type1_IOMMU {
>     MemoryListener listener;
>     QLIST_HEAD(, VFIOGroup) group_list;
>     QLIST_ENTRY(VFIOContainer) next;
> };
> and put all MemoryListener stuff there.
> 
> For POWER we already have spapr_iommu.c.
> 
> Wrong direction? :)

This is actually very similar to what I'm proposing above.  Rather than
a platform_iommu_data pointer, I'm suggesting that we make a union in
VFIOContainer.  That allows us to actually dereference the container and
get back to the fd in the callback functions.  Thanks,

Alex
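
A minimal sketch of why the union matters, assuming a simplified
VFIOContainer. The struct names, the placeholder type1 member, and
spapr_data_to_fd() are hypothetical; only the idea of embedding the
iommu-specific data in the container comes from the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Recover the enclosing struct from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Placeholder for the x86 data (a MemoryListener in the real code). */
struct DemoType1Data {
    int listener_placeholder;
};

/* DMA window info the spapr side needs. */
struct DemoSpaprData {
    uint64_t dma32_window_start;
    uint64_t dma32_window_size;
};

typedef struct VFIOContainer {
    int fd;
    union {
        struct DemoType1Data type1;
        struct DemoSpaprData spapr;
    } u;
} VFIOContainer;

/* A callback that is only given the spapr data can still reach the
 * container fd, because the data is embedded rather than pointed to. */
static int spapr_data_to_fd(struct DemoSpaprData *data)
{
    VFIOContainer *container = container_of(data, VFIOContainer, u.spapr);

    assert(container != NULL);
    return container->fd;
}
```

With a bare `void *platform_iommu_data` pointer there is no way back
from the data to the container, so the fd would have to be stored or
passed separately.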


* Re: [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3)
  2012-07-16 14:21       ` Alex Williamson
@ 2012-07-16 21:17         ` Alex Williamson
  2012-07-17  7:53         ` Alexey Kardashevskiy
  1 sibling, 0 replies; 52+ messages in thread
From: Alex Williamson @ 2012-07-16 21:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alexander Graf, Blue Swirl, qemu-ppc, David Gibson

On Mon, 2012-07-16 at 08:21 -0600, Alex Williamson wrote:
> On Sat, 2012-07-14 at 12:34 +1000, Alexey Kardashevskiy wrote:
> > On 14/07/12 01:07, Alex Williamson wrote:
> > > On Fri, 2012-07-13 at 17:26 +1000, Alexey Kardashevskiy wrote:
> > >> It literally does the following:
> > >>
> > >> 1. POWERPC IOMMU support (the kernel counterpart is required)
> > >>
> > >> 2. The patch assumes that IOAPIC calls are going to be replaced
> > >> with something generic.
> > >>
> > >> 3. vfio_group_iommu_ioctl() has been added to let sPAPR IOMMU
> > >> handler to call VFIO IOMMU driver.
> > >>
> > >> 4. Change sPAPR PHB to scan the PCI bus which is used for
> > >> the IOMMU-VFIO group. Now it is enough to add the following to
> > >> the QEMU command line to get VFIO up with all the devices from
> > >> IOMMU group with id=3:
> > >> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
> > >> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
> > >>
> > >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > >> ---
> > >>  hw/ppc/Makefile.objs  |    3 ++
> > >>  hw/spapr.h            |    4 ++
> > >>  hw/spapr_iommu.c      |   69 ++++++++++++++++++++++++++++++-
> > >>  hw/spapr_iommu_vfio.h |   49 ++++++++++++++++++++++
> > >>  hw/spapr_pci.c        |  108 ++++++++++++++++++++++++++++++++++++++++++++++---
> > >>  hw/spapr_pci.h        |    4 ++
> > >>  hw/vfio_pci.c         |   30 ++++++++++++++
> > >>  hw/vfio_pci.h         |    2 +
> > >>  trace-events          |    1 +
> > >>  9 files changed, 264 insertions(+), 6 deletions(-)
> > >>  create mode 100644 hw/spapr_iommu_vfio.h
> > >>
> > >> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> > >> index f573a95..c46a049 100644
> > >> --- a/hw/ppc/Makefile.objs
> > >> +++ b/hw/ppc/Makefile.objs
> > >> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
> > >>  # Xilinx PPC peripherals
> > >>  obj-y += xilinx_ethlite.o
> > >>  
> > >> +# VFIO PCI device assignment
> > >> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
> > >> +
> > >>  obj-y := $(addprefix ../,$(obj-y))
> > >> diff --git a/hw/spapr.h b/hw/spapr.h
> > >> index b37f337..26e26f6 100644
> > >> --- a/hw/spapr.h
> > >> +++ b/hw/spapr.h
> > >> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> > >>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
> > >>                        DMAContext *dma);
> > >>  
> > >> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> > >> +                         uint64_t *dma32_window_start,
> > >> +                         uint64_t *dma32_window_size);
> > >> +
> > >>  #endif /* !defined (__HW_SPAPR_H__) */
> > >> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> > >> index 50c288d..e48ced1 100644
> > >> --- a/hw/spapr_iommu.c
> > >> +++ b/hw/spapr_iommu.c
> > >> @@ -23,6 +23,8 @@
> > >>  #include "dma.h"
> > >>  
> > >>  #include "hw/spapr.h"
> > >> +#include "hw/spapr_iommu_vfio.h"
> > >> +#include "hw/vfio_pci.h"
> > >>  
> > >>  #include <libfdt.h>
> > >>  
> > >> @@ -183,6 +185,67 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
> > >>      return 0;
> > >>  }
> > >>  
> > >> +typedef struct sPAPRVFIOTable {
> > >> +    int group_id;
> > >> +    uint32_t liobn;
> > >> +    QLIST_ENTRY(sPAPRVFIOTable) list;
> > >> +} sPAPRVFIOTable;
> > >> +
> > >> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
> > >> +
> > >> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> > >> +                         uint64_t *dma32_window_start,
> > >> +                         uint64_t *dma32_window_size)
> > >> +{
> > >> +    sPAPRVFIOTable *t;
> > >> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
> > >> +
> > >> +    if (vfio_group_iommu_ioctl(group_id, SPAPR_TCE_IOMMU_GET_INFO, &info)) {
> > >> +        perror("SPAPR_TCE_IOMMU_GET_INFO failed");
> > >> +        return;
> > >> +    }
> > >> +    *dma32_window_start = info.dma32_window_start;
> > >> +    *dma32_window_size = info.dma32_window_size;
> > >> +
> > >> +    t = g_malloc0(sizeof(*t));
> > >> +    t->group_id = group_id;
> > >> +    t->liobn = liobn;
> > >> +
> > >> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
> > >> +}
> > >> +
> > >> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
> > >> +{
> > >> +    sPAPRVFIOTable *t;
> > >> +    struct tce_iommu_dma_map map = {
> > >> +        .argsz = sizeof(map),
> > >> +        .va = 0,
> > >> +        .dmaaddr = ioba,
> > >> +    };
> > >> +
> > >> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
> > >> +        if (t->liobn != liobn) {
> > >> +            continue;
> > >> +        }
> > >> +        if (tce) {
> > >> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
> > >> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_MAP_DMA,
> > >> +                                       &map)) {
> > >> +                perror("TCE_MAP_DMA");
> > >> +                return H_PARAMETER;
> > >> +            }
> > >> +        } else {
> > >> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_UNMAP_DMA,
> > >> +                                       &map)) {
> > >> +                perror("TCE_UNMAP_DMA");
> > >> +                return H_PARAMETER;
> > >> +            }
> > >> +        }
> > >> +        return H_SUCCESS;
> > >> +    }
> > >> +    return H_CONTINUE; /* positive non-zero value */
> > >> +}
> > >> +
> > >>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
> > >>                                target_ulong opcode, target_ulong *args)
> > >>  {
> > >> @@ -200,7 +263,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
> > >>      ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
> > >>  
> > >>      ret = put_tce_emu(liobn, ioba, tce);
> > >> -    if (0 >= ret) {
> > >> +    if (ret <= 0) {
> > >> +        return ret ? H_PARAMETER : H_SUCCESS;
> > >> +    }
> > >> +    ret = put_tce_vfio(liobn, ioba, tce);
> > >> +    if (ret <= 0) {
> > >>          return ret ? H_PARAMETER : H_SUCCESS;
> > >>      }
> > >>  #ifdef DEBUG_TCE
> > >> diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
> > >> new file mode 100644
> > >> index 0000000..711e3e4
> > >> --- /dev/null
> > >> +++ b/hw/spapr_iommu_vfio.h
> > >> @@ -0,0 +1,49 @@
> > >> +/*
> > >> + * Definitions for VFIO IOMMU driver implementing SPAPR TCE.
> > >> + * This is the copy of the kernel header.
> > >> + *
> > >> + * Copyright (c) 2012 Alexey Kardashevskiy <aik@olabs.ru>
> > >> + *
> > >> + * This library is free software; you can redistribute it and/or
> > >> + * modify it under the terms of the GNU Lesser General Public
> > >> + * License as published by the Free Software Foundation; either
> > >> + * version 2 of the License, or (at your option) any later version.
> > >> + *
> > >> + * This library is distributed in the hope that it will be useful,
> > >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > >> + * Lesser General Public License for more details.
> > >> + *
> > >> + * You should have received a copy of the GNU Lesser General Public
> > >> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> > >> + */
> > >> +
> > >> +#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
> > >> +#define __HW_SPAPR_IOMMU_VFIO_H__
> > >> +
> > >> +#include "hw/linux-vfio.h"
> > >> +
> > >> +#define SPAPR_TCE_IOMMU         2
> > >> +
> > >> +struct tce_iommu_info {
> > >> +    __u32 argsz;
> > >> +    __u32 flags;
> > >> +    __u32 dma32_window_start;
> > >> +    __u32 dma32_window_size;
> > >> +    __u64 dma64_window_start;
> > >> +    __u64 dma64_window_size;
> > >> +};
> > >> +
> > >> +#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
> > >> +
> > >> +struct tce_iommu_dma_map {
> > >> +    __u32 argsz;
> > >> +    __u32 flags;
> > >> +    __u64 va;
> > >> +    __u64 dmaaddr;
> > >> +};
> > >> +
> > >> +#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
> > >> +#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
> > >> +
> > >> +#endif
> > >> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> > >> index 014297b..836ec4f 100644
> > >> --- a/hw/spapr_pci.c
> > >> +++ b/hw/spapr_pci.c
> > >> @@ -22,6 +22,9 @@
> > >>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> > >>   * THE SOFTWARE.
> > >>   */
> > >> +#include <sys/types.h>
> > >> +#include <dirent.h>
> > >> +
> > >>  #include "hw.h"
> > >>  #include "pci.h"
> > >>  #include "msi.h"
> > >> @@ -32,7 +35,6 @@
> > >>  #include "exec-memory.h"
> > >>  #include <libfdt.h>
> > >>  #include "trace.h"
> > >> -
> > >>  #include "hw/pci_internals.h"
> > >>  
> > >>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
> > >> @@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
> > >>                   level);
> > >>  }
> > >>  
> > >> +static int pci_spapr_get_irq(void *opaque, int irq_num)
> > >> +{
> > >> +    sPAPRPHBState *phb = opaque;
> > >> +    return phb->lsi_table[irq_num].dt_irq;
> > >> +}
> > >> +
> > >>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
> > >>                                unsigned size)
> > >>  {
> > >> @@ -515,6 +523,79 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
> > >>      return phb->dma;
> > >>  }
> > >>  
> > >> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
> > >> +{
> > >> +    char iommupath[256];
> > >> +    DIR *dirp;
> > >> +    struct dirent *entry;
> > >> +
> > >> +    if (!phb->scan) {
> > >> +        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
> > >> +        return 0;
> > >> +    }
> > >> +
> > >> +    snprintf(iommupath, sizeof(iommupath),
> > >> +             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
> > >> +    dirp = opendir(iommupath);
> > >> +
> > >> +    while ((entry = readdir(dirp)) != NULL) {
> > >> +        char *tmp;
> > >> +        FILE *deviceclassfile;
> > >> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> > >> +        char addr[32];
> > >> +        DeviceState *dev;
> > >> +
> > >> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
> > >> +                   &domainid, &busid, &devid, &fnid) != 4) {
> > >> +            continue;
> > >> +        }
> > >> +
> > >> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
> > >> +        trace_spapr_pci("Reading device class from ", tmp);
> > >> +
> > >> +        deviceclassfile = fopen(tmp, "r");
> > >> +        if (deviceclassfile) {
> > >> +            fscanf(deviceclassfile, "%x", &deviceclass);
> > >> +            fclose(deviceclassfile);
> > >> +        }
> > >> +        g_free(tmp);
> > >> +
> > >> +        if (!deviceclass) {
> > >> +            continue;
> > >> +        }
> > >> +        if ((phb->scan < 2) &&
> > >> +            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
> > >> +            /* Skip _any_ bridge */
> > >> +            continue;
> > >> +        }
> > >> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
> > >> +            /* Tweak USB */
> > >> +            phb->force_addr = 1;
> > >> +            phb->enable_multifunction = 1;
> > >> +        }
> > >> +
> > >> +        trace_spapr_pci("Creating devicei from ", entry->d_name);
> > >> +
> > >> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
> > >> +        if (!dev) {
> > >> +            fprintf(stderr, "failed to create vfio-pci\n");
> > >> +            continue;
> > >> +        }
> > >> +        qdev_prop_parse(dev, "host", entry->d_name);
> > >> +        if (phb->force_addr) {
> > >> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
> > >> +            qdev_prop_parse(dev, "addr", addr);
> > >> +        }
> > >> +        if (phb->enable_multifunction) {
> > >> +            qdev_prop_set_bit(dev, "multifunction", 1);
> > >> +        }
> > >> +        qdev_init_nofail(dev);
> > >> +    }
> > >> +    closedir(dirp);
> > >> +
> > >> +    return 0;
> > >> +}
> > >> +
> > >>  static int spapr_phb_init(SysBusDevice *s)
> > >>  {
> > >>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
> > >> @@ -567,15 +648,13 @@ static int spapr_phb_init(SysBusDevice *s)
> > >>  
> > >>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
> > >>                             phb->busname ? phb->busname : phb->dtbusname,
> > >> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
> > >> +                           pci_spapr_set_irq, pci_spapr_get_irq,
> > >> +                           pci_spapr_map_irq, phb,
> > >>                             &phb->memspace, &phb->iospace,
> > >>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
> > >>      phb->host_state.bus = bus;
> > >>  
> > >>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
> > >> -    phb->dma_window_start = 0;
> > >> -    phb->dma_window_size = 0x40000000;
> > >> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
> > >>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
> > >>  
> > >>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
> > >> @@ -588,6 +667,21 @@ static int spapr_phb_init(SysBusDevice *s)
> > >>          }
> > >>      }
> > >>  
> > >> +    if (phb->iommugroupid >= 0) {
> > >> +        if (spapr_pci_scan_vfio(phb) < 0) {
> > >> +            return -1;
> > >> +        }
> > >> +        spapr_vfio_init_dma(phb->iommugroupid, phb->dma_liobn,
> > >> +                            &phb->dma_window_start,
> > >> +                            &phb->dma_window_size);
> > >> +        return 0;
> > >> +    }
> > >> +
> > >> +    phb->dma_window_start = 0;
> > >> +    phb->dma_window_size = 0x40000000;
> > >> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
> > >> +                                         phb->dma_window_size);
> > >> +
> > >>      return 0;
> > >>  }
> > >>  
> > >> @@ -599,6 +693,10 @@ static Property spapr_phb_properties[] = {
> > >>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
> > >>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
> > >>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
> > >> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
> > >> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
> > >> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
> > >> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
> > >>      DEFINE_PROP_END_OF_LIST(),
> > >>  };
> > >>  
> > >> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
> > >> index 145071c..f514823 100644
> > >> --- a/hw/spapr_pci.h
> > >> +++ b/hw/spapr_pci.h
> > >> @@ -57,6 +57,10 @@ typedef struct sPAPRPHBState {
> > >>          int nvec;
> > >>      } msi_table[SPAPR_MSIX_MAX_DEVS];
> > >>  
> > >> +    int32_t iommugroupid;
> > >> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
> > >> +    uint8_t enable_multifunction, force_addr;
> > >> +
> > >>      QLIST_ENTRY(sPAPRPHBState) list;
> > >>  } sPAPRPHBState;
> > >>  
> > >> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> > >> index 1ac287f..fc84fb4 100644
> > >> --- a/hw/vfio_pci.c
> > >> +++ b/hw/vfio_pci.c
> > >> @@ -1581,6 +1581,24 @@ static int vfio_connect_container(VFIOGroup *group)
> > >>  
> > >>          memory_listener_register(&container->listener, get_system_memory());
> > >>  
> > >> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
> > >> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> > >> +        if (ret) {
> > >> +            error_report("vfio: failed to set group container: %s\n",
> > >> +                         strerror(errno));
> > >> +            g_free(container);
> > >> +            close(fd);
> > >> +            return -1;
> > >> +        }
> > >> +
> > >> +        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
> > >> +        if (ret) {
> > >> +            error_report("vfio: failed to set iommu for container: %s\n",
> > >> +                         strerror(errno));
> > >> +            g_free(container);
> > >> +            close(fd);
> > >> +            return -1;
> > >> +        }
> > > 
> > > I think we can still do better.  The x86 code sets up a MemoryListener
> > > here with data for that embedded into the VFIOContainer.  You don't
> > > have, need, or want a MemoryListener, but that doesn't mean we can't
> > > follow the model of registering that this group exists here and setting
> > > up map/unmap callbacks.
> > > 
> > > For instance:
> > > 
> > > in vfio_pci.h:
> > > struct sPAPRVFIOData {
> > >     uint64_t dma32_window_start;
> > >     uint64_t dma32_window_size;
> > >     ....
> > >     int (*map)(struct tce_iommu_dma_map *);
> > >     int (*unmap)(struct tce_iommu_dma_map *);
> > > };
> > > 
> > > appended to the spapr tce iommu setup above:
> > > 
> > > struct tce_iommu_info info;
> > > 
> > > /* the MemoryListener embedded in container becomes a union to hold
> > >  * iommu specific data. */
> > > container->u.spapr.data->map = vfio_spapr_tce_map;
> > > container->u.spapr.data->unmap = vfio_spapr_tce_unmap;
> > 
> > I could actually reuse x86 callbacks, just wanted to keep POWER ioctls together. The problem here is
> > getting the DMA window parameters and avoiding MemoryListener.
> 
> The callback is pretty trivial though, filling in the data structure and
> calling the ioctl.  We have different data structures and different
> ioctls, so probably not a lot to leverage.
> 
> > > ioctl(fd, SPAPR_TCE_IOMMU_GET_INFO, &info);
> > > 
> > > container->u.spapr.data->dma32_window_start = info.dma32_window_start;
> > > container->u.spapr.data->dma32_window_size = info.dma32_window_size;
> > > 
> > > spapr_register_vfio_container(&container->u.spapr.data)
> > 
> > I assume it is called within vfio_pci.c as we do not want to access VFIOContainer members from
> > anywhere but vfio_pci.c.
> > Or are we changing the approach? I am a bit confused.
> 
> Yes, the registration function would be called from vfio_pci at the
> same point in the spapr iommu test where x86 sets up the memory
> listener.  That would register the map/unmap function pointers and dma
> window information.  spapr would then make map/unmap calls using those
> function pointers, which would be implemented in vfio_pci where they
> could dereference the container and therefore get to the container fd.
> 
> > > Then vfio_disconnect_container() could call
> > > spapr_unregister_vfio_container().  Maybe the container contains a
> > > function pointer to an uninit function so we don't have to ifdef between
> > > x86 and power.  Does that make sense?  Thanks,
> > 
> > We also need to pass the numbers from the info struct to spapr_pci.c in order to tell the guest
> > where the DMA window resides. Another callback? This is exactly what I avoided in the kernel when we
> > decided not to extend IOMMU API with POWER stuff; I would like to have the same here.
> 
> This is an internal API, so there's no penalty for fixing it later.
> Callbacks and window info are passed in the data structure outlined
> above.
> 
> > In general, what is the benefit of pulling as much platform-specific stuff as possible into VFIO?
> > 
> > I am trying to keep sPAPR IOMMU stuff away and make it easy to add new platforms to VFIO.
> > 
> > For example, I would rather move the piece of code which checks for SPAPR_TCE_IOMMU out
> > of VFIO and make it a QEMUMachine callback (together with add-eoi-notifier), as the way the IOMMU works is
> > definitely a machine-type-specific feature.
> > 
> > For example, int QEMUMachine::init_iommu(VFIOContainer *container) which would not even try
> > VFIO_TYPE1_IOMMU on POWER or SPAPR_TCE_IOMMU on x86 as it knows the machine and IOMMU types already.
> 
> I don't think tying vfio into the QEMUMachine type has a future.  If you
> convince Anthony or Michael otherwise, let me know.  Your attempt to
> keep spapr stuff completely out of vfio is requiring private interfaces
> to be exposed.  That I think is the wrong direction.
> 
> > And do something like:
> > typedef struct VFIOContainer {
> >     int fd;
> >     void *platform_iommu_data;
> > } VFIOContainer;
> > 
> > Create additional file called vfio_iommu_x86.c with:
> > struct VFIO_Type1_IOMMU {
> >     MemoryListener listener;
> >     QLIST_HEAD(, VFIOGroup) group_list;
> >     QLIST_ENTRY(VFIOContainer) next;
> > };
> > and put all MemoryListener stuff there.
> > 
> > For POWER we already have spapr_iommu.c.
> > 
> > Wrong direction? :)
> 
> This is actually very similar to what I'm proposing above.  Rather than
> a platform_iommu_data pointer, I'm suggesting that we make a union in
> VFIOContainer.  That allows us to actually dereference the container and
> get back to the fd in the callback functions.  Thanks,

Here's a stab at generalizing the way it works on x86.  The commit
comment outlines how I think it should be extended.
https://github.com/awilliam/qemu-vfio/commit/96e6f8446fc10a8f9d0506df8230218799de31ae
Thanks,
Alex


* Re: [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3)
  2012-07-16 14:21       ` Alex Williamson
  2012-07-16 21:17         ` Alex Williamson
@ 2012-07-17  7:53         ` Alexey Kardashevskiy
  2012-07-17 14:11           ` Alex Williamson
  1 sibling, 1 reply; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-17  7:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Alexander Graf, Blue Swirl, qemu-ppc, David Gibson

On 17/07/12 00:21, Alex Williamson wrote:
> On Sat, 2012-07-14 at 12:34 +1000, Alexey Kardashevskiy wrote:
>> On 14/07/12 01:07, Alex Williamson wrote:
>>> On Fri, 2012-07-13 at 17:26 +1000, Alexey Kardashevskiy wrote:
>>>> It literally does the following:
>>>>
>>>> 1. POWERPC IOMMU support (the kernel counterpart is required)
>>>>
>>>> 2. The patch assumes that IOAPIC calls are going to be replaced
>>>> with something generic.
>>>>
>>>> 3. vfio_group_iommu_ioctl() has been added to let sPAPR IOMMU
>>>> handler to call VFIO IOMMU driver.
>>>>
>>>> 4. Change sPAPR PHB to scan the PCI bus which is used for
>>>> the IOMMU-VFIO group. Now it is enough to add the following to
>>>> the QEMU command line to get VFIO up with all the devices from
>>>> IOMMU group with id=3:
>>>> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
>>>> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>  hw/ppc/Makefile.objs  |    3 ++
>>>>  hw/spapr.h            |    4 ++
>>>>  hw/spapr_iommu.c      |   69 ++++++++++++++++++++++++++++++-
>>>>  hw/spapr_iommu_vfio.h |   49 ++++++++++++++++++++++
>>>>  hw/spapr_pci.c        |  108 ++++++++++++++++++++++++++++++++++++++++++++++---
>>>>  hw/spapr_pci.h        |    4 ++
>>>>  hw/vfio_pci.c         |   30 ++++++++++++++
>>>>  hw/vfio_pci.h         |    2 +
>>>>  trace-events          |    1 +
>>>>  9 files changed, 264 insertions(+), 6 deletions(-)
>>>>  create mode 100644 hw/spapr_iommu_vfio.h
>>>>
>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>> index f573a95..c46a049 100644
>>>> --- a/hw/ppc/Makefile.objs
>>>> +++ b/hw/ppc/Makefile.objs
>>>> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>>>>  # Xilinx PPC peripherals
>>>>  obj-y += xilinx_ethlite.o
>>>>  
>>>> +# VFIO PCI device assignment
>>>> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
>>>> +
>>>>  obj-y := $(addprefix ../,$(obj-y))
>>>> diff --git a/hw/spapr.h b/hw/spapr.h
>>>> index b37f337..26e26f6 100644
>>>> --- a/hw/spapr.h
>>>> +++ b/hw/spapr.h
>>>> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>>>                        DMAContext *dma);
>>>>  
>>>> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
>>>> +                         uint64_t *dma32_window_start,
>>>> +                         uint64_t *dma32_window_size);
>>>> +
>>>>  #endif /* !defined (__HW_SPAPR_H__) */
>>>> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
>>>> index 50c288d..e48ced1 100644
>>>> --- a/hw/spapr_iommu.c
>>>> +++ b/hw/spapr_iommu.c
>>>> @@ -23,6 +23,8 @@
>>>>  #include "dma.h"
>>>>  
>>>>  #include "hw/spapr.h"
>>>> +#include "hw/spapr_iommu_vfio.h"
>>>> +#include "hw/vfio_pci.h"
>>>>  
>>>>  #include <libfdt.h>
>>>>  
>>>> @@ -183,6 +185,67 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>>>>      return 0;
>>>>  }
>>>>  
>>>> +typedef struct sPAPRVFIOTable {
>>>> +    int group_id;
>>>> +    uint32_t liobn;
>>>> +    QLIST_ENTRY(sPAPRVFIOTable) list;
>>>> +} sPAPRVFIOTable;
>>>> +
>>>> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
>>>> +
>>>> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
>>>> +                         uint64_t *dma32_window_start,
>>>> +                         uint64_t *dma32_window_size)
>>>> +{
>>>> +    sPAPRVFIOTable *t;
>>>> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
>>>> +
>>>> +    if (vfio_group_iommu_ioctl(group_id, SPAPR_TCE_IOMMU_GET_INFO, &info)) {
>>>> +        perror("SPAPR_TCE_IOMMU_GET_INFO failed");
>>>> +        return;
>>>> +    }
>>>> +    *dma32_window_start = info.dma32_window_start;
>>>> +    *dma32_window_size = info.dma32_window_size;
>>>> +
>>>> +    t = g_malloc0(sizeof(*t));
>>>> +    t->group_id = group_id;
>>>> +    t->liobn = liobn;
>>>> +
>>>> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
>>>> +}
>>>> +
>>>> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
>>>> +{
>>>> +    sPAPRVFIOTable *t;
>>>> +    struct tce_iommu_dma_map map = {
>>>> +        .argsz = sizeof(map),
>>>> +        .va = 0,
>>>> +        .dmaaddr = ioba,
>>>> +    };
>>>> +
>>>> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
>>>> +        if (t->liobn != liobn) {
>>>> +            continue;
>>>> +        }
>>>> +        if (tce) {
>>>> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
>>>> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_MAP_DMA,
>>>> +                                       &map)) {
>>>> +                perror("TCE_MAP_DMA");
>>>> +                return H_PARAMETER;
>>>> +            }
>>>> +        } else {
>>>> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_UNMAP_DMA,
>>>> +                                       &map)) {
>>>> +                perror("TCE_UNMAP_DMA");
>>>> +                return H_PARAMETER;
>>>> +            }
>>>> +        }
>>>> +        return H_SUCCESS;
>>>> +    }
>>>> +    return H_CONTINUE; /* positive non-zero value */
>>>> +}
>>>> +
>>>>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>>>                                target_ulong opcode, target_ulong *args)
>>>>  {
>>>> @@ -200,7 +263,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>>>      ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
>>>>  
>>>>      ret = put_tce_emu(liobn, ioba, tce);
>>>> -    if (0 >= ret) {
>>>> +    if (ret <= 0) {
>>>> +        return ret ? H_PARAMETER : H_SUCCESS;
>>>> +    }
>>>> +    ret = put_tce_vfio(liobn, ioba, tce);
>>>> +    if (ret <= 0) {
>>>>          return ret ? H_PARAMETER : H_SUCCESS;
>>>>      }
>>>>  #ifdef DEBUG_TCE
>>>> diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
>>>> new file mode 100644
>>>> index 0000000..711e3e4
>>>> --- /dev/null
>>>> +++ b/hw/spapr_iommu_vfio.h
>>>> @@ -0,0 +1,49 @@
>>>> +/*
>>>> + * Definitions for VFIO IOMMU driver implementing SPAPR TCE.
>>>> + * This is the copy of the kernel header.
>>>> + *
>>>> + * Copyright (c) 2012 Alexey Kardashevskiy <aik@olabs.ru>
>>>> + *
>>>> + * This library is free software; you can redistribute it and/or
>>>> + * modify it under the terms of the GNU Lesser General Public
>>>> + * License as published by the Free Software Foundation; either
>>>> + * version 2 of the License, or (at your option) any later version.
>>>> + *
>>>> + * This library is distributed in the hope that it will be useful,
>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> + * Lesser General Public License for more details.
>>>> + *
>>>> + * You should have received a copy of the GNU Lesser General Public
>>>> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>>>> + */
>>>> +
>>>> +#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
>>>> +#define __HW_SPAPR_IOMMU_VFIO_H__
>>>> +
>>>> +#include "hw/linux-vfio.h"
>>>> +
>>>> +#define SPAPR_TCE_IOMMU         2
>>>> +
>>>> +struct tce_iommu_info {
>>>> +    __u32 argsz;
>>>> +    __u32 flags;
>>>> +    __u32 dma32_window_start;
>>>> +    __u32 dma32_window_size;
>>>> +    __u64 dma64_window_start;
>>>> +    __u64 dma64_window_size;
>>>> +};
>>>> +
>>>> +#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
>>>> +
>>>> +struct tce_iommu_dma_map {
>>>> +    __u32 argsz;
>>>> +    __u32 flags;
>>>> +    __u64 va;
>>>> +    __u64 dmaaddr;
>>>> +};
>>>> +
>>>> +#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
>>>> +#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
>>>> +
>>>> +#endif
>>>> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
>>>> index 014297b..836ec4f 100644
>>>> --- a/hw/spapr_pci.c
>>>> +++ b/hw/spapr_pci.c
>>>> @@ -22,6 +22,9 @@
>>>>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>>>   * THE SOFTWARE.
>>>>   */
>>>> +#include <sys/types.h>
>>>> +#include <dirent.h>
>>>> +
>>>>  #include "hw.h"
>>>>  #include "pci.h"
>>>>  #include "msi.h"
>>>> @@ -32,7 +35,6 @@
>>>>  #include "exec-memory.h"
>>>>  #include <libfdt.h>
>>>>  #include "trace.h"
>>>> -
>>>>  #include "hw/pci_internals.h"
>>>>  
>>>>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
>>>> @@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>>>>                   level);
>>>>  }
>>>>  
>>>> +static int pci_spapr_get_irq(void *opaque, int irq_num)
>>>> +{
>>>> +    sPAPRPHBState *phb = opaque;
>>>> +    return phb->lsi_table[irq_num].dt_irq;
>>>> +}
>>>> +
>>>>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>>>>                                unsigned size)
>>>>  {
>>>> @@ -515,6 +523,79 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>>>>      return phb->dma;
>>>>  }
>>>>  
>>>> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
>>>> +{
>>>> +    char iommupath[256];
>>>> +    DIR *dirp;
>>>> +    struct dirent *entry;
>>>> +
>>>> +    if (!phb->scan) {
>>>> +        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    snprintf(iommupath, sizeof(iommupath),
>>>> +             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
>>>> +    dirp = opendir(iommupath);
>>>> +
>>>> +    while ((entry = readdir(dirp)) != NULL) {
>>>> +        char *tmp;
>>>> +        FILE *deviceclassfile;
>>>> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
>>>> +        char addr[32];
>>>> +        DeviceState *dev;
>>>> +
>>>> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
>>>> +                   &domainid, &busid, &devid, &fnid) != 4) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
>>>> +        trace_spapr_pci("Reading device class from ", tmp);
>>>> +
>>>> +        deviceclassfile = fopen(tmp, "r");
>>>> +        if (deviceclassfile) {
>>>> +            fscanf(deviceclassfile, "%x", &deviceclass);
>>>> +            fclose(deviceclassfile);
>>>> +        }
>>>> +        g_free(tmp);
>>>> +
>>>> +        if (!deviceclass) {
>>>> +            continue;
>>>> +        }
>>>> +        if ((phb->scan < 2) &&
>>>> +            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
>>>> +            /* Skip _any_ bridge */
>>>> +            continue;
>>>> +        }
>>>> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
>>>> +            /* Tweak USB */
>>>> +            phb->force_addr = 1;
>>>> +            phb->enable_multifunction = 1;
>>>> +        }
>>>> +
>>>> +        trace_spapr_pci("Creating devicei from ", entry->d_name);
>>>> +
>>>> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
>>>> +        if (!dev) {
>>>> +            fprintf(stderr, "failed to create vfio-pci\n");
>>>> +            continue;
>>>> +        }
>>>> +        qdev_prop_parse(dev, "host", entry->d_name);
>>>> +        if (phb->force_addr) {
>>>> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
>>>> +            qdev_prop_parse(dev, "addr", addr);
>>>> +        }
>>>> +        if (phb->enable_multifunction) {
>>>> +            qdev_prop_set_bit(dev, "multifunction", 1);
>>>> +        }
>>>> +        qdev_init_nofail(dev);
>>>> +    }
>>>> +    closedir(dirp);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>>  static int spapr_phb_init(SysBusDevice *s)
>>>>  {
>>>>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
>>>> @@ -567,15 +648,13 @@ static int spapr_phb_init(SysBusDevice *s)
>>>>  
>>>>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>>>>                             phb->busname ? phb->busname : phb->dtbusname,
>>>> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
>>>> +                           pci_spapr_set_irq, pci_spapr_get_irq,
>>>> +                           pci_spapr_map_irq, phb,
>>>>                             &phb->memspace, &phb->iospace,
>>>>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>>>>      phb->host_state.bus = bus;
>>>>  
>>>>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
>>>> -    phb->dma_window_start = 0;
>>>> -    phb->dma_window_size = 0x40000000;
>>>> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
>>>>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
>>>>  
>>>>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
>>>> @@ -588,6 +667,21 @@ static int spapr_phb_init(SysBusDevice *s)
>>>>          }
>>>>      }
>>>>  
>>>> +    if (phb->iommugroupid >= 0) {
>>>> +        if (spapr_pci_scan_vfio(phb) < 0) {
>>>> +            return -1;
>>>> +        }
>>>> +        spapr_vfio_init_dma(phb->iommugroupid, phb->dma_liobn,
>>>> +                            &phb->dma_window_start,
>>>> +                            &phb->dma_window_size);
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    phb->dma_window_start = 0;
>>>> +    phb->dma_window_size = 0x40000000;
>>>> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
>>>> +                                         phb->dma_window_size);
>>>> +
>>>>      return 0;
>>>>  }
>>>>  
>>>> @@ -599,6 +693,10 @@ static Property spapr_phb_properties[] = {
>>>>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>>>>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>>>>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
>>>> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
>>>> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
>>>> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
>>>> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>>  
>>>> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
>>>> index 145071c..f514823 100644
>>>> --- a/hw/spapr_pci.h
>>>> +++ b/hw/spapr_pci.h
>>>> @@ -57,6 +57,10 @@ typedef struct sPAPRPHBState {
>>>>          int nvec;
>>>>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>>>>  
>>>> +    int32_t iommugroupid;
>>>> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
>>>> +    uint8_t enable_multifunction, force_addr;
>>>> +
>>>>      QLIST_ENTRY(sPAPRPHBState) list;
>>>>  } sPAPRPHBState;
>>>>  
>>>> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
>>>> index 1ac287f..fc84fb4 100644
>>>> --- a/hw/vfio_pci.c
>>>> +++ b/hw/vfio_pci.c
>>>> @@ -1581,6 +1581,24 @@ static int vfio_connect_container(VFIOGroup *group)
>>>>  
>>>>          memory_listener_register(&container->listener, get_system_memory());
>>>>  
>>>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
>>>> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>>> +        if (ret) {
>>>> +            error_report("vfio: failed to set group container: %s\n",
>>>> +                         strerror(errno));
>>>> +            g_free(container);
>>>> +            close(fd);
>>>> +            return -1;
>>>> +        }
>>>> +
>>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
>>>> +        if (ret) {
>>>> +            error_report("vfio: failed to set iommu for container: %s\n",
>>>> +                         strerror(errno));
>>>> +            g_free(container);
>>>> +            close(fd);
>>>> +            return -1;
>>>> +        }
>>>
>>> I think we can still do better.  The x86 code sets up a MemoryListener
>>> here with data for that embedded into the VFIOContainer.  You don't
>>> have, need, or want a MemoryListener, but that doesn't mean we can't
>>> follow the model of registering that this group exists here and setting
>>> up map/unmap callbacks.
>>>
>>> For instance:
>>>
>>> in vfio_pci.h:
>>> struct sPAPRVFIOData {
>>>     uint64_t dma32_window_start;
>>>     uint64_t dma32_window_size;
>>>     ....
>>>     int (*map)(struct tce_iommu_dma_map *);
>>>     int (*unmap)(struct tce_iommu_dma_map *);
>>> };

>>> appended to the spapr tce iommu setup above:
>>>
>>> struct tce_iommu_info info;
>>>
>>> /* the MemoryListener embedded in container becomes a union to hold
>>>  * iommu specific data. */
>>> container->u.spapr.data->map = vfio_spapr_tce_map;
>>> container->u.spapr.data->unmap = vfio_spapr_tce_unmap;
>>
>> I could actually reuse x86 callbacks, just wanted to keep POWER ioctls together. The problem here is
>> getting the DMA window parameters and avoiding MemoryListener.
> 
> The callback is pretty trivial though, filling in the data structure and
> calling the ioctl.  We have different data structures and different
> ioctls, so probably not a lot to leverage.
> 
>>> ioctl(fd, SPAPR_TCE_IOMMU_GET_INFO, &info);
>>>
>>> container->u.spapr.data->dma32_window_start = info.dma32_window_start;
>>> container->u.spapr.data->dma32_window_size = info.dma32_window_size;
>>>
>>> spapr_register_vfio_container(&container->u.spapr.data)
>>
>> I assume it is called within vfio_pci.c as we do not want to access VFIOContainer members from
>> anywhere but vfio_pci.c.
>> Or are we changing the approach? I am a bit confused.
> 
> Yes, the registration function would be called from vfio_pci at the
> equivalent place in the spapr iommu test as x86 is setting up the memory
> listener.  That would register the map/unmap function pointers and dma
> window information.  spapr would then make map/unmap calls using those
> function pointers, those would be implemented in vfio_pci where they
> could dereference the container and therefore get to the container fd.
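
The registration flow described above can be sketched in isolation. This is a minimal illustration only, not code from either patch: spapr_vfio_data, spapr_register_vfio_data(), spapr_vfio_put_tce() and the demo_* callbacks are hypothetical names, the callbacks take the data pointer as an extra argument (the outline in the thread does not), and the sketch simply assumes the LIOBN is known at registration time, which is exactly the matching question raised below.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-container data outlined in the thread. */
struct spapr_vfio_data {
    uint64_t dma32_window_start;
    uint64_t dma32_window_size;
    /* Assumption: the data pointer is passed back to the callbacks so the
     * VFIO side can find its container again. */
    int (*map)(struct spapr_vfio_data *d, uint64_t ioba, uint64_t va);
    int (*unmap)(struct spapr_vfio_data *d, uint64_t ioba);
    uint32_t liobn;               /* assumed to be known at registration */
    struct spapr_vfio_data *next; /* simple singly linked registry */
};

static struct spapr_vfio_data *spapr_vfio_registry;

/* Called by the VFIO side once the IOMMU has been set up. */
void spapr_register_vfio_data(struct spapr_vfio_data *d, uint32_t liobn)
{
    d->liobn = liobn;
    d->next = spapr_vfio_registry;
    spapr_vfio_registry = d;
}

/* Called by the VFIO side when the container is torn down. */
void spapr_unregister_vfio_data(struct spapr_vfio_data *d)
{
    struct spapr_vfio_data **p;

    for (p = &spapr_vfio_registry; *p; p = &(*p)->next) {
        if (*p == d) {
            *p = d->next;
            d->next = NULL;
            return;
        }
    }
}

/* sPAPR-side dispatch. Returns 1 if no registered entry owns the LIOBN,
 * -1 if ioba is outside the DMA window, else the callback's result. */
int spapr_vfio_put_tce(uint32_t liobn, uint64_t ioba, uint64_t va)
{
    struct spapr_vfio_data *d;

    for (d = spapr_vfio_registry; d; d = d->next) {
        if (d->liobn != liobn) {
            continue;
        }
        if (ioba < d->dma32_window_start ||
            ioba - d->dma32_window_start >= d->dma32_window_size) {
            return -1;
        }
        /* As in the patch, a zero value means "unmap". */
        return va ? d->map(d, ioba, va) : d->unmap(d, ioba);
    }
    return 1;
}

/* Demo callbacks standing in for the real ioctl wrappers in vfio_pci.c. */
int demo_calls;

int demo_map(struct spapr_vfio_data *d, uint64_t ioba, uint64_t va)
{
    (void)d; (void)ioba; (void)va;
    demo_calls++;
    return 0;
}

int demo_unmap(struct spapr_vfio_data *d, uint64_t ioba)
{
    (void)d; (void)ioba;
    demo_calls++;
    return 0;
}
```

The return convention mirrors the one h_put_tce() uses in the patch: zero means handled, a negative value means error, and a positive value means "not mine, try the next handler".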


How do we match this data with the PCI bus or device?

Even if we add an IOMMU ID to sPAPRVFIOData along with the map/unmap callbacks, we still do not know which PCI bus it corresponds to if we create devices the way it is done on x86. The sPAPR PHB does not have an IOMMU id/fd and cannot get one from VFIO, as that would be "exposing a private interface".

So I will have to specify an IOMMU id for the PCI bus on the command line.

And I really want to be sure that spapr_register_vfio_container() is called before spapr_pci.c starts populating the device tree with the DMA window parameters, which is currently done in spapr_reset() (a reset function of the sPAPR PHB). Ideally I would like to know where the window is even earlier, in order to initialize the DMAContext, so keeping everything in one function, spapr_phb_init(), seems right to me.


>>> Then vfio_disconnect_container() could call
>>> spapr_unregister_vfio_container().  Maybe the container contains a
>>> function pointer to an uninit function so we don't have to ifdef between
>>> x86 and power.  Does that make sense?  Thanks,
>>
>> We also need to pass the numbers from the info struct to spapr_pci.c in order to tell the guest
>> where the DMA window resides. Another callback? This is exactly what I avoided in the kernel when we
>> decided not to extend the IOMMU API with POWER stuff; I would like to have the same here.
> 
> This is an internal API, there's no penalty for fixing it later.
> Callbacks and window info are passed in the data structure outlined
> above.
> 
>> In general, what is the benefit of pulling as much platform-specific stuff as possible into VFIO?
>>
>> I am trying to keep sPAPR IOMMU stuff away and make it easy to add new platforms to VFIO.
>>
>> For example, I would rather think of moving the piece of code which checks for SPAPR_TCE_IOMMU out
>> of VFIO, make it a QEMUMachine callback (together with add-eoi-notifier) as the way IOMMU works is
>> definitely a machine-type-specific feature.
>>
>> For example, int QEMUMachine::init_iommu(VFIOContainer *container) which would not even try
>> VFIO_TYPE1_IOMMU on POWER or SPAPR_TCE_IOMMU on x86 as it knows the machine and IOMMU types already.
> 
> I don't think tying vfio into the QEMUMachine type has a future.  If you
> convince Anthony or Michael otherwise, let me know.  


This is mostly because it has "vfio" in its name.
If it were something generic like IOMMU-via-fd with no mention of VFIO, it would have had a chance :)
Seriously, QEMUMachine already has a set of various unrelated flags; I do not see why we should not add something that really belongs to it.


> Your attempt to
> keep spapr stuff completely out of vfio is requiring private interfaces
> to be exposed.  That I think is the wrong direction.


int vfio_group_iommu_ioctl(int iommu_group, int request, void *data)
does not expose anything private from VFIO. The IOMMU id is not private data of any kind, and struct tce_iommu_info is not data that VFIO needs to know about. The calling code (spapr_pci.c) has to know the IOMMU id either way, and only it knows how map/unmap works, so let's keep it there.

I would only add this to the VFIO API:
int vfio_group_iommu_init(int iommu_group)
to make group initialization explicit.

What is wrong with such a solution?
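
For concreteness, the shape of that interface could look roughly like the sketch below. It is an illustration under assumptions, not the code from the patch: the demo_* names, the fixed-size group table, and the pluggable backend hook (used here so the sketch runs without /dev/vfio) are all hypothetical, and the init function takes the container fd as a parameter where a real implementation would open and configure the container itself.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX_GROUPS 8

/* Private to the VFIO side: which container fd serves which IOMMU group. */
struct demo_group_entry {
    int used;
    int group_id;
    int container_fd;
};

static struct demo_group_entry demo_groups[DEMO_MAX_GROUPS];

/* Backend hook. A real build would point this at a thin wrapper around
 * ioctl(2); a stub keeps the sketch testable. */
typedef int (*demo_ioctl_fn)(int fd, unsigned long request, void *data);
static demo_ioctl_fn demo_backend_fn;

void demo_set_backend(demo_ioctl_fn fn)
{
    demo_backend_fn = fn;
}

/* Explicit group initialization: record the group/container pairing. */
int demo_group_iommu_init(int group_id, int container_fd)
{
    int i, free_slot = -1;

    for (i = 0; i < DEMO_MAX_GROUPS; i++) {
        if (demo_groups[i].used && demo_groups[i].group_id == group_id) {
            return -EEXIST;
        }
        if (!demo_groups[i].used && free_slot < 0) {
            free_slot = i;
        }
    }
    if (free_slot < 0) {
        return -ENOSPC;
    }
    demo_groups[free_slot].used = 1;
    demo_groups[free_slot].group_id = group_id;
    demo_groups[free_slot].container_fd = container_fd;
    return 0;
}

/* The caller names only the group id; the fd never leaves this file. */
int demo_group_iommu_ioctl(int group_id, unsigned long request, void *data)
{
    int i;

    for (i = 0; i < DEMO_MAX_GROUPS; i++) {
        if (demo_groups[i].used && demo_groups[i].group_id == group_id) {
            if (!demo_backend_fn) {
                return -ENOSYS;
            }
            return demo_backend_fn(demo_groups[i].container_fd, request, data);
        }
    }
    return -ENODEV;
}

/* Stub backend: reports which fd the request was forwarded to. */
int demo_backend(int fd, unsigned long request, void *data)
{
    (void)request;
    *(int *)data = fd;
    return 0;
}
```

The point of the sketch is only that the caller passes a group id and an opaque request; the container fd and the lookup stay inside the VFIO side.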


>> And do something like:
>> typedef struct VFIOContainer {
>>     int fd;
>>     void *platform_iommu_data;
>> } VFIOContainer;
>>
>> Create additional file called vfio_iommu_x86.c with:
>> struct VFIO_Type1_IOMMU {
>>     MemoryListener listener;
>>     QLIST_HEAD(, VFIOGroup) group_list;
>>     QLIST_ENTRY(VFIOContainer) next;
>> };
>> and put all MemoryListener stuff there.
>>
>> For POWER we already have spapr_iommu.c.
>>
>> Wrong direction? :)
> 
> This is actually very similar to what I'm proposing above.  Rather than
> a platform_iommu_data pointer, I'm suggesting that we make a union in
> VFIOContainer.  That allows us to actually dereference the container and
> get back to the fd in the callback functions.  Thanks,
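
The difference between the two layouts can be shown with a small sketch. All names here are hypothetical simplifications, not the real VFIOContainer: with the IOMMU-specific data embedded in a union, a callback that is handed a pointer to that data can step back to the enclosing container with the usual container_of idiom, which a separately allocated platform_iommu_data blob does not allow without an extra back-pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Classic container_of: recover the enclosing struct from a member pointer. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for the per-IOMMU data. */
struct demo_type1_data {
    int listener_registered;
};

struct demo_spapr_data {
    uint64_t dma32_window_start;
    uint64_t dma32_window_size;
};

enum demo_iommu_kind {
    DEMO_IOMMU_TYPE1,
    DEMO_IOMMU_SPAPR_TCE,
};

struct demo_container {
    int fd;
    enum demo_iommu_kind kind;
    union {
        struct demo_type1_data type1;
        struct demo_spapr_data spapr;
    } u;
};

/* What a map/unmap callback would do first: get from the sPAPR data back
 * to the container fd. Returns -1 if the container is not an sPAPR one. */
int demo_spapr_data_to_fd(struct demo_spapr_data *d)
{
    struct demo_container *c =
        demo_container_of(d, struct demo_container, u.spapr);

    return c->kind == DEMO_IOMMU_SPAPR_TCE ? c->fd : -1;
}
```

The kind tag is an addition of the sketch; some discriminator is needed so that code reading the union knows which member is live.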


-- 
Alexey

* Re: [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3)
  2012-07-17  7:53         ` Alexey Kardashevskiy
@ 2012-07-17 14:11           ` Alex Williamson
  0 siblings, 0 replies; 52+ messages in thread
From: Alex Williamson @ 2012-07-17 14:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alexander Graf, Blue Swirl, qemu-ppc, David Gibson

On Tue, 2012-07-17 at 17:53 +1000, Alexey Kardashevskiy wrote:
> On 17/07/12 00:21, Alex Williamson wrote:
> > On Sat, 2012-07-14 at 12:34 +1000, Alexey Kardashevskiy wrote:
> >> On 14/07/12 01:07, Alex Williamson wrote:
> >>> On Fri, 2012-07-13 at 17:26 +1000, Alexey Kardashevskiy wrote:
> >>>> It literally does the following:
> >>>>
> >>>> 1. POWERPC IOMMU support (the kernel counterpart is required)
> >>>>
> >>>> 2. The patch assumes that IOAPIC calls are going to be replaced
> >>>> with something generic.
> >>>>
> >>>> 3. vfio_group_iommu_ioctl() has been added to let the sPAPR IOMMU
> >>>> handler call the VFIO IOMMU driver.
> >>>>
> >>>> 4. Change sPAPR PHB to scan the PCI bus which is used for
> >>>> the IOMMU-VFIO group. Now it is enough to add the following to
> >>>> the QEMU command line to get VFIO up with all the devices from
> >>>> IOMMU group with id=3:
> >>>> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
> >>>> mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  hw/ppc/Makefile.objs  |    3 ++
> >>>>  hw/spapr.h            |    4 ++
> >>>>  hw/spapr_iommu.c      |   69 ++++++++++++++++++++++++++++++-
> >>>>  hw/spapr_iommu_vfio.h |   49 ++++++++++++++++++++++
> >>>>  hw/spapr_pci.c        |  108 ++++++++++++++++++++++++++++++++++++++++++++++---
> >>>>  hw/spapr_pci.h        |    4 ++
> >>>>  hw/vfio_pci.c         |   30 ++++++++++++++
> >>>>  hw/vfio_pci.h         |    2 +
> >>>>  trace-events          |    1 +
> >>>>  9 files changed, 264 insertions(+), 6 deletions(-)
> >>>>  create mode 100644 hw/spapr_iommu_vfio.h
> >>>>
> >>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >>>> index f573a95..c46a049 100644
> >>>> --- a/hw/ppc/Makefile.objs
> >>>> +++ b/hw/ppc/Makefile.objs
> >>>> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
> >>>>  # Xilinx PPC peripherals
> >>>>  obj-y += xilinx_ethlite.o
> >>>>  
> >>>> +# VFIO PCI device assignment
> >>>> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
> >>>> +
> >>>>  obj-y := $(addprefix ../,$(obj-y))
> >>>> diff --git a/hw/spapr.h b/hw/spapr.h
> >>>> index b37f337..26e26f6 100644
> >>>> --- a/hw/spapr.h
> >>>> +++ b/hw/spapr.h
> >>>> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> >>>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
> >>>>                        DMAContext *dma);
> >>>>  
> >>>> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> >>>> +                         uint64_t *dma32_window_start,
> >>>> +                         uint64_t *dma32_window_size);
> >>>> +
> >>>>  #endif /* !defined (__HW_SPAPR_H__) */
> >>>> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> >>>> index 50c288d..e48ced1 100644
> >>>> --- a/hw/spapr_iommu.c
> >>>> +++ b/hw/spapr_iommu.c
> >>>> @@ -23,6 +23,8 @@
> >>>>  #include "dma.h"
> >>>>  
> >>>>  #include "hw/spapr.h"
> >>>> +#include "hw/spapr_iommu_vfio.h"
> >>>> +#include "hw/vfio_pci.h"
> >>>>  
> >>>>  #include <libfdt.h>
> >>>>  
> >>>> @@ -183,6 +185,67 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
> >>>>      return 0;
> >>>>  }
> >>>>  
> >>>> +typedef struct sPAPRVFIOTable {
> >>>> +    int group_id;
> >>>> +    uint32_t liobn;
> >>>> +    QLIST_ENTRY(sPAPRVFIOTable) list;
> >>>> +} sPAPRVFIOTable;
> >>>> +
> >>>> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
> >>>> +
> >>>> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> >>>> +                         uint64_t *dma32_window_start,
> >>>> +                         uint64_t *dma32_window_size)
> >>>> +{
> >>>> +    sPAPRVFIOTable *t;
> >>>> +    struct tce_iommu_info info = { .argsz = sizeof(info) };
> >>>> +
> >>>> +    if (vfio_group_iommu_ioctl(group_id, SPAPR_TCE_IOMMU_GET_INFO, &info)) {
> >>>> +        perror("SPAPR_TCE_IOMMU_GET_INFO failed");
> >>>> +        return;
> >>>> +    }
> >>>> +    *dma32_window_start = info.dma32_window_start;
> >>>> +    *dma32_window_size = info.dma32_window_size;
> >>>> +
> >>>> +    t = g_malloc0(sizeof(*t));
> >>>> +    t->group_id = group_id;
> >>>> +    t->liobn = liobn;
> >>>> +
> >>>> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
> >>>> +}
> >>>> +
> >>>> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
> >>>> +{
> >>>> +    sPAPRVFIOTable *t;
> >>>> +    struct tce_iommu_dma_map map = {
> >>>> +        .argsz = sizeof(map),
> >>>> +        .va = 0,
> >>>> +        .dmaaddr = ioba,
> >>>> +    };
> >>>> +
> >>>> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
> >>>> +        if (t->liobn != liobn) {
> >>>> +            continue;
> >>>> +        }
> >>>> +        if (tce) {
> >>>> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
> >>>> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_MAP_DMA,
> >>>> +                                       &map)) {
> >>>> +                perror("TCE_MAP_DMA");
> >>>> +                return H_PARAMETER;
> >>>> +            }
> >>>> +        } else {
> >>>> +            if (vfio_group_iommu_ioctl(t->group_id, SPAPR_TCE_IOMMU_UNMAP_DMA,
> >>>> +                                       &map)) {
> >>>> +                perror("TCE_UNMAP_DMA");
> >>>> +                return H_PARAMETER;
> >>>> +            }
> >>>> +        }
> >>>> +        return H_SUCCESS;
> >>>> +    }
> >>>> +    return H_CONTINUE; /* positive non-zero value */
> >>>> +}
> >>>> +
> >>>>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
> >>>>                                target_ulong opcode, target_ulong *args)
> >>>>  {
> >>>> @@ -200,7 +263,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
> >>>>      ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
> >>>>  
> >>>>      ret = put_tce_emu(liobn, ioba, tce);
> >>>> -    if (0 >= ret) {
> >>>> +    if (ret <= 0) {
> >>>> +        return ret ? H_PARAMETER : H_SUCCESS;
> >>>> +    }
> >>>> +    ret = put_tce_vfio(liobn, ioba, tce);
> >>>> +    if (ret <= 0) {
> >>>>          return ret ? H_PARAMETER : H_SUCCESS;
> >>>>      }
> >>>>  #ifdef DEBUG_TCE
> >>>> diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
> >>>> new file mode 100644
> >>>> index 0000000..711e3e4
> >>>> --- /dev/null
> >>>> +++ b/hw/spapr_iommu_vfio.h
> >>>> @@ -0,0 +1,49 @@
> >>>> +/*
> >>>> + * Definitions for VFIO IOMMU driver implementing SPAPR TCE.
> >>>> + * This is the copy of the kernel header.
> >>>> + *
> >>>> + * Copyright (c) 2012 Alexey Kardashevskiy <aik@olabs.ru>
> >>>> + *
> >>>> + * This library is free software; you can redistribute it and/or
> >>>> + * modify it under the terms of the GNU Lesser General Public
> >>>> + * License as published by the Free Software Foundation; either
> >>>> + * version 2 of the License, or (at your option) any later version.
> >>>> + *
> >>>> + * This library is distributed in the hope that it will be useful,
> >>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >>>> + * Lesser General Public License for more details.
> >>>> + *
> >>>> + * You should have received a copy of the GNU Lesser General Public
> >>>> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> >>>> + */
> >>>> +
> >>>> +#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
> >>>> +#define __HW_SPAPR_IOMMU_VFIO_H__
> >>>> +
> >>>> +#include "hw/linux-vfio.h"
> >>>> +
> >>>> +#define SPAPR_TCE_IOMMU         2
> >>>> +
> >>>> +struct tce_iommu_info {
> >>>> +    __u32 argsz;
> >>>> +    __u32 flags;
> >>>> +    __u32 dma32_window_start;
> >>>> +    __u32 dma32_window_size;
> >>>> +    __u64 dma64_window_start;
> >>>> +    __u64 dma64_window_size;
> >>>> +};
> >>>> +
> >>>> +#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
> >>>> +
> >>>> +struct tce_iommu_dma_map {
> >>>> +    __u32 argsz;
> >>>> +    __u32 flags;
> >>>> +    __u64 va;
> >>>> +    __u64 dmaaddr;
> >>>> +};
> >>>> +
> >>>> +#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
> >>>> +#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
> >>>> +
> >>>> +#endif
> >>>> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> >>>> index 014297b..836ec4f 100644
> >>>> --- a/hw/spapr_pci.c
> >>>> +++ b/hw/spapr_pci.c
> >>>> @@ -22,6 +22,9 @@
> >>>>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> >>>>   * THE SOFTWARE.
> >>>>   */
> >>>> +#include <sys/types.h>
> >>>> +#include <dirent.h>
> >>>> +
> >>>>  #include "hw.h"
> >>>>  #include "pci.h"
> >>>>  #include "msi.h"
> >>>> @@ -32,7 +35,6 @@
> >>>>  #include "exec-memory.h"
> >>>>  #include <libfdt.h>
> >>>>  #include "trace.h"
> >>>> -
> >>>>  #include "hw/pci_internals.h"
> >>>>  
> >>>>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
> >>>> @@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
> >>>>                   level);
> >>>>  }
> >>>>  
> >>>> +static int pci_spapr_get_irq(void *opaque, int irq_num)
> >>>> +{
> >>>> +    sPAPRPHBState *phb = opaque;
> >>>> +    return phb->lsi_table[irq_num].dt_irq;
> >>>> +}
> >>>> +
> >>>>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
> >>>>                                unsigned size)
> >>>>  {
> >>>> @@ -515,6 +523,79 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
> >>>>      return phb->dma;
> >>>>  }
> >>>>  
> >>>> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
> >>>> +{
> >>>> +    char iommupath[256];
> >>>> +    DIR *dirp;
> >>>> +    struct dirent *entry;
> >>>> +
> >>>> +    if (!phb->scan) {
> >>>> +        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
> >>>> +        return 0;
> >>>> +    }
> >>>> +
> >>>> +    snprintf(iommupath, sizeof(iommupath),
> >>>> +             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
> >>>> +    dirp = opendir(iommupath);
> >>>> +
> >>>> +    while ((entry = readdir(dirp)) != NULL) {
> >>>> +        char *tmp;
> >>>> +        FILE *deviceclassfile;
> >>>> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> >>>> +        char addr[32];
> >>>> +        DeviceState *dev;
> >>>> +
> >>>> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
> >>>> +                   &domainid, &busid, &devid, &fnid) != 4) {
> >>>> +            continue;
> >>>> +        }
> >>>> +
> >>>> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
> >>>> +        trace_spapr_pci("Reading device class from ", tmp);
> >>>> +
> >>>> +        deviceclassfile = fopen(tmp, "r");
> >>>> +        if (deviceclassfile) {
> >>>> +            fscanf(deviceclassfile, "%x", &deviceclass);
> >>>> +            fclose(deviceclassfile);
> >>>> +        }
> >>>> +        g_free(tmp);
> >>>> +
> >>>> +        if (!deviceclass) {
> >>>> +            continue;
> >>>> +        }
> >>>> +        if ((phb->scan < 2) &&
> >>>> +            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
> >>>> +            /* Skip _any_ bridge */
> >>>> +            continue;
> >>>> +        }
> >>>> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
> >>>> +            /* Tweak USB */
> >>>> +            phb->force_addr = 1;
> >>>> +            phb->enable_multifunction = 1;
> >>>> +        }
> >>>> +
> >>>> +        trace_spapr_pci("Creating devicei from ", entry->d_name);
> >>>> +
> >>>> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
> >>>> +        if (!dev) {
> >>>> +            fprintf(stderr, "failed to create vfio-pci\n");
> >>>> +            continue;
> >>>> +        }
> >>>> +        qdev_prop_parse(dev, "host", entry->d_name);
> >>>> +        if (phb->force_addr) {
> >>>> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
> >>>> +            qdev_prop_parse(dev, "addr", addr);
> >>>> +        }
> >>>> +        if (phb->enable_multifunction) {
> >>>> +            qdev_prop_set_bit(dev, "multifunction", 1);
> >>>> +        }
> >>>> +        qdev_init_nofail(dev);
> >>>> +    }
> >>>> +    closedir(dirp);
> >>>> +
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>>  static int spapr_phb_init(SysBusDevice *s)
> >>>>  {
> >>>>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
> >>>> @@ -567,15 +648,13 @@ static int spapr_phb_init(SysBusDevice *s)
> >>>>  
> >>>>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
> >>>>                             phb->busname ? phb->busname : phb->dtbusname,
> >>>> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
> >>>> +                           pci_spapr_set_irq, pci_spapr_get_irq,
> >>>> +                           pci_spapr_map_irq, phb,
> >>>>                             &phb->memspace, &phb->iospace,
> >>>>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
> >>>>      phb->host_state.bus = bus;
> >>>>  
> >>>>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
> >>>> -    phb->dma_window_start = 0;
> >>>> -    phb->dma_window_size = 0x40000000;
> >>>> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
> >>>>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
> >>>>  
> >>>>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
> >>>> @@ -588,6 +667,21 @@ static int spapr_phb_init(SysBusDevice *s)
> >>>>          }
> >>>>      }
> >>>>  
> >>>> +    if (phb->iommugroupid >= 0) {
> >>>> +        if (spapr_pci_scan_vfio(phb) < 0) {
> >>>> +            return -1;
> >>>> +        }
> >>>> +        spapr_vfio_init_dma(phb->iommugroupid, phb->dma_liobn,
> >>>> +                            &phb->dma_window_start,
> >>>> +                            &phb->dma_window_size);
> >>>> +        return 0;
> >>>> +    }
> >>>> +
> >>>> +    phb->dma_window_start = 0;
> >>>> +    phb->dma_window_size = 0x40000000;
> >>>> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
> >>>> +                                         phb->dma_window_size);
> >>>> +
> >>>>      return 0;
> >>>>  }
> >>>>  
> >>>> @@ -599,6 +693,10 @@ static Property spapr_phb_properties[] = {
> >>>>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
> >>>>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
> >>>>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
> >>>> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
> >>>> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
> >>>> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
> >>>> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
> >>>>      DEFINE_PROP_END_OF_LIST(),
> >>>>  };
> >>>>  
> >>>> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
> >>>> index 145071c..f514823 100644
> >>>> --- a/hw/spapr_pci.h
> >>>> +++ b/hw/spapr_pci.h
> >>>> @@ -57,6 +57,10 @@ typedef struct sPAPRPHBState {
> >>>>          int nvec;
> >>>>      } msi_table[SPAPR_MSIX_MAX_DEVS];
> >>>>  
> >>>> +    int32_t iommugroupid;
> >>>> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
> >>>> +    uint8_t enable_multifunction, force_addr;
> >>>> +
> >>>>      QLIST_ENTRY(sPAPRPHBState) list;
> >>>>  } sPAPRPHBState;
> >>>>  
> >>>> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> >>>> index 1ac287f..fc84fb4 100644
> >>>> --- a/hw/vfio_pci.c
> >>>> +++ b/hw/vfio_pci.c
> >>>> @@ -1581,6 +1581,24 @@ static int vfio_connect_container(VFIOGroup *group)
> >>>>  
> >>>>          memory_listener_register(&container->listener, get_system_memory());
> >>>>  
> >>>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
> >>>> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >>>> +        if (ret) {
> >>>> +            error_report("vfio: failed to set group container: %s\n",
> >>>> +                         strerror(errno));
> >>>> +            g_free(container);
> >>>> +            close(fd);
> >>>> +            return -1;
> >>>> +        }
> >>>> +
> >>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
> >>>> +        if (ret) {
> >>>> +            error_report("vfio: failed to set iommu for container: %s\n",
> >>>> +                         strerror(errno));
> >>>> +            g_free(container);
> >>>> +            close(fd);
> >>>> +            return -1;
> >>>> +        }
> >>>
> >>> I think we can still do better.  The x86 code sets up a MemoryListener
> >>> here with data for that embedded into the VFIOContainer.  You don't
> >>> have, need, or want a MemoryListener, but that doesn't mean we can't
> >>> follow the model of registering that this group exists here and setting
> >>> up map/unmap callbacks.
> >>>
> >>> For instance:
> >>>
> >>> in vfio_pci.h:
> >>> struct sPAPRVFIOData {
> >>>     uint64_t dma32_window_start;
> >>>     uint64_t dma32_window_size;
> >>>     ....
> >>>     int (*map)(struct tce_iommu_dma_map *);
> >>>     int (*unmap)(struct tce_iommu_dma_map *);
> >>> };
> 
> >>> appended to the above spapr tce iommu setup above:
> >>>
> >>> struct tce_iommu_info info;
> >>>
> >>> /* the MemoryListener embedded in container becomes a union to hold
> >>>  * iommu specific data. */
> >>> container->u.spapr.data->map = vfio_spapr_tce_map;
> >>> container->u.spapr.data->unmap = vfio_spapr_tce_unmap;
> >>
> >> I could actually reuse x86 callbacks, just wanted to keep POWER ioctls together. The problem here is
> >> getting the DMA window parameters and avoiding MemoryListener.
> > 
> > The callback is pretty trivial though, filling in the data structure and
> > calling the ioctl.  We have different data structures and different
> > ioctls, so probably not a lot to leverage.
> > 
> >>> ioctl(fd, SPAPR_TCE_IOMMU_GET_INFO, &info);
> >>>
> >>> container->u.spapr.data->dma32_window_start = info.dma32_window_start;
> >>> container->u.spapr.data->dma32_window_size = info.dma32_window_size;
> >>>
> >>> spapr_register_vfio_container(&container->u.spapr.data)
> >>
> >> I assume it is called within vfio_pci.c as we do not want to access VFIOContainer members from
> >> anywhere but vfio_pci.c.
> >> Or we are changing the approach? I am a bit confused.
> > 
> > Yes, the registration function would be called from vfio_pci at the
> > equivalent place in the spapr iommu test as x86 is setting up the memory
> > listener.  That would register the map/unmap function pointers and dma
> > window information.  spapr would then make map/unmap calls using those
> > function pointers, those would be implemented in vfio_pci where they
> > could dereference the container and therefore get to the container fd.
> 
> 
> How do we match this data with the PCI bus or device?

At the point where we're initializing the vfio iommu, you have a
PCIDevice, you can pass the PCIBus from that if you prefer.

> Even if we add IOMMU ID to sPAPRVFIOData and map/unmap callbacks, we
> still do not know which PCI bus it corresponds to if we create devices
> as it is done on x86. sPAPR PHB does not have an IOMMU id/fd and
> cannot get it from VFIO as it would be "exposing a private interface".
> 
> So I will have to specify an IOMMU id for the PCI bus from the command
> line.
> 
> And I really want to be sure that spapr_register_vfio_container() is
> called before spapr_pci.c starts populating the device tree with the
> DMA window parameters, which is currently done in spapr_reset() (a reset
> function of sPAPR PHB). Ideally I would like to know where the window is
> even earlier, in order to initialize DMAContext, so keeping everything
> in one function, spapr_phb_init(), seems right to me.

Disagree.  You can only get the iommu info once a group is attached to
a container.  By calling out to spapr code when that happens, you
guarantee the info is valid.  By calling into vfio, you may shift code
around and find out you're now trying to get iommu info before you have
vfio privileges to do so.
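
The ordering argument above can be sketched as a toy model (not QEMU code). It assumes the window information only becomes valid once a group is attached, so the VFIO side pushes it to the platform code at that moment rather than the platform code pulling it earlier. Every name here (toy_container, toy_get_info, toy_attach_group, toy_platform_register) is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical window description handed to platform code. */
struct toy_window {
    uint32_t start;
    uint32_t size;
};

/* Hypothetical container: the window is only meaningful once attached. */
struct toy_container {
    bool attached;
    struct toy_window win;
};

/* Platform-side state, filled in by the callout. */
static struct toy_window toy_platform_win;
static int toy_platform_registered;

/* Callout implemented by platform code (stands in for a register hook). */
static void toy_platform_register(const struct toy_window *win)
{
    assert(win != NULL);
    toy_platform_win = *win;
    toy_platform_registered++;
}

/* Pull model: fails if the caller runs before a group is attached. */
static int toy_get_info(const struct toy_container *c, struct toy_window *out)
{
    if (!c->attached) {
        return -1;
    }
    *out = c->win;
    return 0;
}

/* Push model: attaching is where the info becomes valid, so call out here. */
static int toy_attach_group(struct toy_container *c,
                            uint32_t start, uint32_t size)
{
    if (c->attached) {
        return -1;
    }
    c->attached = true;
    c->win.start = start;
    c->win.size = size;
    toy_platform_register(&c->win);
    return 0;
}
```

In this model a caller that pulls before the attach gets an error, while the callout cannot run too early by construction, which is the property being argued for.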

> >>> Then vfio_disconnect_container() could call
> >>> spapr_unregister_vfio_container().  Maybe the container contains a
> >>> function pointer to an uninit function so we don't have to ifdef between
> >>> x86 and power.  Does that make sense?  Thanks,
> >>
> >> We also need to pass the numbers from the info struct to spapr_pci.c in order to tell the guest
> >> where the DMA window resides. Another callback? This is exactly what I avoided in the kernel when we
> >> decided not to extend IOMMU API with POWER stuff, I would like to have the same here.
> > 
> > This is an internal API, there's no penalty for fixing it later.
> > Callbacks and window info are passed in the data structure outlined
> > above.
> > 
> >> In general, what is the benefit of pulling as much platform-specific stuff as possible into VFIO?
> >>
> >> I am trying to keep sPAPR IOMMU stuff away and make it easy to add new platforms to VFIO.
> >>
> >> For example, I would rather think of moving the piece of code which checks for SPAPR_TCE_IOMMU out
> >> of VFIO and making it a QEMUMachine callback (together with add-eoi-notifier), as the way the IOMMU
> >> works is definitely a machine-type-specific feature.
> >>
> >> For example, int QEMUMachine::init_iommu(VFIOContainer *container) which would not even try
> >> VFIO_TYPE1_IOMMU on POWER or SPAPR_TCE_IOMMU on x86 as it knows the machine and IOMMU types already.
> > 
> > I don't think tying vfio into the QEMUMachine type has a future.  If you
> > convince Anthony or Michael otherwise, let me know.  
> 
> 
> This is mostly because it has "vfio" in its name.
> If it was something generic like IOMMU-via-fd with no mention of VFIO, then it would have had a chance :)
> Seriously, it already has a set of various unrelated flags, so I do not see why we should not add something that really belongs to it.
> 
> 
> > Your attempt to
> > keep spapr stuff completely out of vfio is requiring private interfaces
> > to be exposed.  That I think is the wrong direction.
> 
> 
> int vfio_group_iommu_ioctl(int iommu_group, int request, void *data)
> does not expose anything private from VFIO. IOMMU id is not any kind
> of private data. And struct tce_iommu_info is not data that VFIO
> really wants to know. The calling code (spapr_pci.c) should know the
> IOMMU id either way and only it knows how map/unmap works, so let's keep
> it there.

It's completely asynchronous to anything in vfio.  We can't call out to
spapr code and let it know that group went away.  We're exposing a raw
ioctl to all of qemu.  It's a pretty ugly interface (yes, I know I
suggested it).

> I would only add to VFIO API this:
> int vfio_group_iommu_init(int iommu_group)
> to make a group initialization explicit.
> 
> What is wrong with such a solution?

And when do you propose calling that?  You can *only* init the iommu
once a group is attached to a container.  This gives vfio the privilege
it needs to init the iommu.  vfio_group_iommu_init therefore logically
happens at the point when we set the iommu, which is where I'm
suggesting the callout happen, just like x86 does with the memory
listener.  Thanks,

Alex

> >> And do something like:
> >> typedef struct VFIOContainer {
> >>     int fd;
> >>     void *platform_iommu_data;
> >> } VFIOContainer;
> >>
> >> Create additional file called vfio_iommu_x86.c with:
> >> struct VFIO_Type1_IOMMU {
> >>     MemoryListener listener;
> >>     QLIST_HEAD(, VFIOGroup) group_list;
> >>     QLIST_ENTRY(VFIOContainer) next;
> >> };
> >> and put all MemoryListener stuff there.
> >>
> >> For POWER we already have spapr_iommu.c.
> >>
> >> Wrong direction? :)
> > 
> > This is actually very similar to what I'm proposing above.  Rather than
> > a platform_iommu_data pointer, I'm suggesting that we make a union in
> > VFIOContainer.  That allows us to actually dereference the container and
> > get back to the fd in the callback functions.  Thanks,
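
A minimal sketch of why embedding the platform data in a union inside the container helps, assuming simplified stand-in types: given only a pointer to the embedded member, a callback can walk back to the enclosing container and reach its fd. The names toy_container, toy_spapr_data, toy_spapr_map and TOY_CONTAINER_OF are hypothetical, not QEMU's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the structures discussed in the thread. */
struct toy_spapr_data {
    int (*map)(struct toy_spapr_data *data, unsigned long dmaaddr);
};

struct toy_listener {
    int dummy;
};

struct toy_container {
    int fd;
    union {
        struct toy_listener listener;  /* x86-style data */
        struct toy_spapr_data spapr;   /* POWER-style data */
    } iommu_data;
};

/* Recover the enclosing structure from a pointer to an embedded member. */
#define TOY_CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Records which fd the callback reached, for inspection by the caller. */
static int toy_last_fd = -1;

/* The callback sees only the embedded data, yet still finds the fd. */
static int toy_spapr_map(struct toy_spapr_data *data, unsigned long dmaaddr)
{
    struct toy_container *c;

    assert(data != NULL);
    c = TOY_CONTAINER_OF(data, struct toy_container, iommu_data.spapr);
    (void)dmaaddr;
    toy_last_fd = c->fd;  /* a real callback would issue an ioctl on c->fd */
    return 0;
}
```

With a separately allocated void pointer there is no such path back to the container, so the fd would have to be duplicated into the platform data or looked up by group id.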
> 
> 


* [Qemu-devel] [PATCH] vfio-powerpc: added VFIO support (v4)
  2012-07-10  5:51 [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2012-07-13  7:26 ` [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3) Alexey Kardashevskiy
@ 2012-07-18 11:09 ` Alexey Kardashevskiy
  2012-07-18 14:14   ` Alex Williamson
  2012-07-19  4:04 ` [Qemu-devel] [PATCH] vfio-powerpc: added VFIO support (v5) Alexey Kardashevskiy
  6 siblings, 1 reply; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-18 11:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-devel, Alexander Graf, qemu-ppc, David Gibson

It literally does the following:

1. POWERPC IOMMU support (the kernel counterpart is required)

2. The patch assumes that IOAPIC calls are going to be replaced
with something generic.

3. Added sPAPRVFIOData (hw/spapr_iommu_vfio.h) which describes
the interface between VFIO and sPAPR IOMMU.

4. Change sPAPR PHB to scan the PCI bus which is used for
the IOMMU-VFIO group. Now it is enough to add the following to
the QEMU command line to get VFIO up with all the devices from
IOMMU group with id=3:
-device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
 mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000

With the patches posted a bit earlier today, this patch fully supports
VFIO, including MSI-X.

ps. yes, I know that linux_vfio.h has moved, will fix it later :)

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/linux-vfio.h       |   26 +++++++++++
 hw/ppc/Makefile.objs  |    3 ++
 hw/spapr.h            |    4 ++
 hw/spapr_iommu.c      |   62 ++++++++++++++++++++++++-
 hw/spapr_iommu_vfio.h |   34 ++++++++++++++
 hw/spapr_pci.c        |  124 +++++++++++++++++++++++++++++++++++++++++++++++--
 hw/spapr_pci.h        |    6 +++
 hw/vfio_pci.c         |   64 +++++++++++++++++++++++++
 hw/vfio_pci.h         |    2 +
 trace-events          |    1 +
 10 files changed, 320 insertions(+), 6 deletions(-)
 create mode 100644 hw/spapr_iommu_vfio.h

diff --git a/hw/linux-vfio.h b/hw/linux-vfio.h
index 300d49b..27a0501 100644
--- a/hw/linux-vfio.h
+++ b/hw/linux-vfio.h
@@ -442,4 +442,30 @@ struct vfio_iommu_type1_dma_unmap {
 
 #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
 
+/*
+ * Interface to SPAPR TCE (POWERPC Book3S)
+ */
+#define SPAPR_TCE_IOMMU         2
+
+struct tce_iommu_info {
+    __u32 argsz;
+    __u32 flags;
+    __u32 dma32_window_start;
+    __u32 dma32_window_size;
+    __u64 dma64_window_start;
+    __u64 dma64_window_size;
+};
+
+#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+struct tce_iommu_dma_map {
+    __u32 argsz;
+    __u32 flags;
+    __u64 va;
+    __u64 dmaaddr;
+};
+
+#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
+#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
+
 #endif /* VFIO_H */
diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index f573a95..c46a049 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
 # Xilinx PPC peripherals
 obj-y += xilinx_ethlite.o
 
+# VFIO PCI device assignment
+obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
+
 obj-y := $(addprefix ../,$(obj-y))
diff --git a/hw/spapr.h b/hw/spapr.h
index b37f337..0c15c88 100644
--- a/hw/spapr.h
+++ b/hw/spapr.h
@@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
 int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
                       DMAContext *dma);
 
+struct sPAPRVFIOData;
+void spapr_vfio_init_dma(int group_id, uint32_t liobn,
+                         struct sPAPRVFIOData *data);
+
 #endif /* !defined (__HW_SPAPR_H__) */
diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
index 50c288d..0a82842 100644
--- a/hw/spapr_iommu.c
+++ b/hw/spapr_iommu.c
@@ -23,6 +23,8 @@
 #include "dma.h"
 
 #include "hw/spapr.h"
+#include "hw/spapr_iommu_vfio.h"
+#include "hw/vfio_pci.h"
 
 #include <libfdt.h>
 
@@ -183,6 +185,60 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
     return 0;
 }
 
+typedef struct sPAPRVFIOTable {
+    struct sPAPRVFIOData *data;
+    uint32_t liobn;
+    QLIST_ENTRY(sPAPRVFIOTable) list;
+} sPAPRVFIOTable;
+
+QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
+
+void spapr_vfio_init_dma(int group_id, uint32_t liobn,
+                         struct sPAPRVFIOData *data)
+{
+    sPAPRVFIOTable *t;
+
+    t = g_malloc0(sizeof(*t));
+    t->data = data;
+    t->liobn = liobn;
+
+    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
+}
+
+static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
+{
+    sPAPRVFIOTable *t;
+    struct tce_iommu_dma_map map = {
+        .argsz = sizeof(map),
+        .va = 0,
+        .dmaaddr = ioba,
+    };
+
+    QLIST_FOREACH(t, &vfio_tce_tables, list) {
+        if (t->liobn != liobn) {
+            continue;
+        }
+        if (!t->data) {
+            return H_NO_MEM;
+        }
+        if (tce) {
+            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
+
+            if (t->data->map(t->data->groupid, &map)) {
+                perror("TCE_MAP_DMA");
+                return H_PARAMETER;
+            }
+        } else {
+            if (t->data->unmap(t->data->groupid, &map)) {
+                perror("TCE_UNMAP_DMA");
+                return H_PARAMETER;
+            }
+        }
+        return H_SUCCESS;
+    }
+    return H_CONTINUE; /* positive non-zero value */
+}
+
 static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
                               target_ulong opcode, target_ulong *args)
 {
@@ -200,7 +256,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
     ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
 
     ret = put_tce_emu(liobn, ioba, tce);
-    if (0 >= ret) {
+    if (ret <= 0) {
+        return ret ? H_PARAMETER : H_SUCCESS;
+    }
+    ret = put_tce_vfio(liobn, ioba, tce);
+    if (ret <= 0) {
         return ret ? H_PARAMETER : H_SUCCESS;
     }
 #ifdef DEBUG_TCE
diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
new file mode 100644
index 0000000..b3d6115
--- /dev/null
+++ b/hw/spapr_iommu_vfio.h
@@ -0,0 +1,34 @@
+/*
+ * Definitions for VFIO IOMMU implementation for SPAPR TCE.
+ *
+ * Copyright (c) 2012 Alexey Kardashevskiy <aik@olabs.ru>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
+#define __HW_SPAPR_IOMMU_VFIO_H__
+
+#include "hw/linux-vfio.h"
+
+struct sPAPRVFIOData {
+    int groupid;
+    struct tce_iommu_info info;
+    int (*map)(int groupid, struct tce_iommu_dma_map *param);
+    int (*unmap)(int groupid, struct tce_iommu_dma_map *param);
+};
+
+void spapr_register_vfio_container(struct sPAPRVFIOData *data);
+
+#endif
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 88c8f2c..75743e7 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -22,6 +22,9 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
+#include <sys/types.h>
+#include <dirent.h>
+
 #include "hw.h"
 #include "pci.h"
 #include "msi.h"
@@ -32,7 +35,6 @@
 #include "exec-memory.h"
 #include <libfdt.h>
 #include "trace.h"
-
 #include "hw/pci_internals.h"
 
 /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
@@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
                  level);
 }
 
+static int pci_spapr_get_irq(void *opaque, int irq_num)
+{
+    sPAPRPHBState *phb = opaque;
+    return phb->lsi_table[irq_num].dt_irq;
+}
+
 static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
                               unsigned size)
 {
@@ -515,6 +523,93 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
     return phb->dma;
 }
 
+void spapr_register_vfio_container(struct sPAPRVFIOData *data)
+{
+    sPAPRPHBState *phb;
+
+    QLIST_FOREACH(phb, &spapr->phbs, list) {
+        if (phb->iommugroupid == data->groupid) {
+            phb->vfio_data = *data;
+            phb->dma_window_start = phb->vfio_data.info.dma32_window_start;
+            phb->dma_window_size = phb->vfio_data.info.dma32_window_size;
+            return;
+        }
+    }
+}
+
+static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
+{
+    char iommupath[256];
+    DIR *dirp;
+    struct dirent *entry;
+
+    if (!phb->scan) {
+        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
+        return 0;
+    }
+
+    snprintf(iommupath, sizeof(iommupath),
+             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
+    dirp = opendir(iommupath);
+
+    while ((entry = readdir(dirp)) != NULL) {
+        char *tmp;
+        FILE *deviceclassfile;
+        unsigned deviceclass = 0, domainid, busid, devid, fnid;
+        char addr[32];
+        DeviceState *dev;
+
+        if (sscanf(entry->d_name, "%X:%X:%X.%x",
+                   &domainid, &busid, &devid, &fnid) != 4) {
+            continue;
+        }
+
+        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
+        trace_spapr_pci("Reading device class from ", tmp);
+
+        deviceclassfile = fopen(tmp, "r");
+        if (deviceclassfile) {
+            fscanf(deviceclassfile, "%x", &deviceclass);
+            fclose(deviceclassfile);
+        }
+        g_free(tmp);
+
+        if (!deviceclass) {
+            continue;
+        }
+        if ((phb->scan < 2) &&
+            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
+            /* Skip _any_ bridge */
+            continue;
+        }
+        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
+            /* Tweak USB */
+            phb->force_addr = 1;
+            phb->enable_multifunction = 1;
+        }
+
+        trace_spapr_pci("Creating devicei from ", entry->d_name);
+
+        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
+        if (!dev) {
+            fprintf(stderr, "failed to create vfio-pci\n");
+            continue;
+        }
+        qdev_prop_parse(dev, "host", entry->d_name);
+        if (phb->force_addr) {
+            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
+            qdev_prop_parse(dev, "addr", addr);
+        }
+        if (phb->enable_multifunction) {
+            qdev_prop_set_bit(dev, "multifunction", 1);
+        }
+        qdev_init_nofail(dev);
+    }
+    closedir(dirp);
+
+    return 0;
+}
+
 static int spapr_phb_init(SysBusDevice *s)
 {
     sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
@@ -567,15 +662,13 @@ static int spapr_phb_init(SysBusDevice *s)
 
     bus = pci_register_bus(&phb->host_state.busdev.qdev,
                            phb->busname ? phb->busname : phb->dtbusname,
-                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
+                           pci_spapr_set_irq, pci_spapr_get_irq,
+                           pci_spapr_map_irq, phb,
                            &phb->memspace, &phb->iospace,
                            PCI_DEVFN(0, 0), PCI_NUM_PINS);
     phb->host_state.bus = bus;
 
     phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
-    phb->dma_window_start = 0;
-    phb->dma_window_size = 0x40000000;
-    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
     pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
 
     QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
@@ -588,6 +681,19 @@ static int spapr_phb_init(SysBusDevice *s)
         }
     }
 
+    if (phb->iommugroupid >= 0) {
+        if (spapr_pci_scan_vfio(phb) < 0) {
+            return -1;
+        }
+        spapr_vfio_init_dma(phb->iommugroupid, phb->dma_liobn, &phb->vfio_data);
+        return 0;
+    }
+
+    phb->dma_window_start = 0;
+    phb->dma_window_size = 0x40000000;
+    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
+                                         phb->dma_window_size);
+
     return 0;
 }
 
@@ -599,6 +705,10 @@ static Property spapr_phb_properties[] = {
     DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
     DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
     DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
+    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
+    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
+    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
+    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -729,6 +839,10 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
                      sizeof(interrupt_map)));
 
+    if (!phb->dma_window_size) {
+        fprintf(stderr, "Unexpected error: DMA window is zero, exiting\n");
+        exit(1);
+    }
     spapr_dma_dt(fdt, bus_off, "ibm,dma-window",
                  phb->dma_liobn, phb->dma_window_start,
                  phb->dma_window_size);
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index 145071c..ed17053 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -26,6 +26,7 @@
 #include "hw/pci.h"
 #include "hw/pci_host.h"
 #include "hw/xics.h"
+#include "hw/spapr_iommu_vfio.h"
 
 #define SPAPR_MSIX_MAX_DEVS 32
 
@@ -57,6 +58,11 @@ typedef struct sPAPRPHBState {
         int nvec;
     } msi_table[SPAPR_MSIX_MAX_DEVS];
 
+    struct sPAPRVFIOData vfio_data;
+    int32_t iommugroupid;
+    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
+    uint8_t enable_multifunction, force_addr;
+
     QLIST_ENTRY(sPAPRPHBState) list;
 } sPAPRPHBState;
 
diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 0ad4761..0023d41 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -43,6 +43,7 @@
 #include "range.h"
 #include "vfio_pci.h"
 #include "linux-vfio.h"
+#include "hw/spapr_iommu_vfio.h"
 
 //#define DEBUG_VFIO
 #ifdef DEBUG_VFIO
@@ -55,6 +56,8 @@
 
 #define MSIX_CAP_LENGTH 12
 
+static VFIOGroup *vfio_get_group(int groupid);
+
 static QLIST_HEAD(, VFIOContainer)
     container_list = QLIST_HEAD_INITIALIZER(container_list);
 
@@ -1088,6 +1091,31 @@ static void vfio_listener_release(VFIOContainer *container)
 }
 
 /*
+ * sPAPR TCE DMA interface
+ */
+static int spapr_tce_map(int groupid, struct tce_iommu_dma_map *param)
+{
+    VFIOGroup *group = vfio_get_group(groupid);
+
+    if (!group || !group->container) {
+        return -1;
+    }
+
+    return ioctl(group->container->fd, SPAPR_TCE_IOMMU_MAP_DMA, param);
+}
+
+static int spapr_tce_unmap(int groupid, struct tce_iommu_dma_map *param)
+{
+    VFIOGroup *group = vfio_get_group(groupid);
+
+    if (!group || !group->container) {
+        return -1;
+    }
+
+    return ioctl(group->container->fd, SPAPR_TCE_IOMMU_UNMAP_DMA, param);
+}
+
+/*
  * Interrupt setup
  */
 static void vfio_disable_interrupts(VFIODevice *vdev)
@@ -1590,6 +1618,42 @@ static int vfio_connect_container(VFIOGroup *group)
         memory_listener_register(&container->iommu_data.listener,
                                  get_system_memory());
 
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
+
+        container->iommu_data.spapr.info.argsz =
+                sizeof(container->iommu_data.spapr.info);
+        ret = ioctl(fd, SPAPR_TCE_IOMMU_GET_INFO,
+                    &container->iommu_data.spapr.info);
+        if (ret) {
+            error_report("vfio: failed to get iommu info for container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
+
+        container->iommu_data.spapr.map = spapr_tce_map;
+        container->iommu_data.spapr.unmap = spapr_tce_unmap;
+        container->iommu_data.spapr.groupid = group->groupid;
+        spapr_register_vfio_container(&container->iommu_data.spapr);
+
     } else {
         error_report("vfio: No available IOMMU models\n");
         g_free(container);
diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
index 00bb3dd..ddb5332 100644
--- a/hw/vfio_pci.h
+++ b/hw/vfio_pci.h
@@ -6,6 +6,7 @@
 #include "pci.h"
 #include "ioapic.h"
 #include "event_notifier.h"
+#include "hw/spapr_iommu_vfio.h"
 
 typedef struct VFIOPCIHostDevice {
     uint16_t seg;
@@ -59,6 +60,7 @@ typedef struct VFIOContainer {
     struct {
         union {
             MemoryListener listener;
+            struct sPAPRVFIOData spapr;
         };
         void (*release)(struct VFIOContainer *);
     } iommu_data;
diff --git a/trace-events b/trace-events
index e548f86..9100591 100644
--- a/trace-events
+++ b/trace-events
@@ -848,6 +848,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
 qxl_render_update_area_done(void *cookie) "%p"
 
 # hw/spapr_pci.c
+spapr_pci(const char *msg1, const char *msg2) "%s%s"
 spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
 spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
 spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"
-- 
1.7.10.4
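
The dispatch order in the h_put_tce() hunk of the patch above can be sketched as a small chain of backends: each one returns a positive value when the LIOBN is not its own, and the first non-positive result is final. This is a toy model under stated assumptions. The return codes are illustrative constants rather than the real H_* values, and the LIOBN numbers, the alignment check and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative return codes for this sketch only (not the real H_* values). */
enum { TOY_OK = 0, TOY_ERR = -1, TOY_NOT_MINE = 1 };

typedef int (*toy_put_fn)(uint32_t liobn, uint64_t ioba, uint64_t tce);

/* Backend owning LIOBN 0x1000 (stands in for the emulated tables). */
static int toy_put_emu(uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    (void)ioba;
    (void)tce;
    return liobn == 0x1000 ? TOY_OK : TOY_NOT_MINE;
}

/* Backend owning LIOBN 0x2000 (stands in for VFIO); rejects unaligned ioba. */
static int toy_put_vfio(uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    (void)tce;
    if (liobn != 0x2000) {
        return TOY_NOT_MINE;
    }
    return (ioba & 0xfff) ? TOY_ERR : TOY_OK;
}

/* Try each backend in order; the first non-positive result is final. */
static int toy_put_tce(uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    static const toy_put_fn backends[] = { toy_put_emu, toy_put_vfio };
    size_t i;

    for (i = 0; i < sizeof(backends) / sizeof(backends[0]); i++) {
        int ret;

        assert(backends[i] != NULL);
        ret = backends[i](liobn, ioba, tce);
        if (ret <= 0) {
            return ret;
        }
    }
    return TOY_NOT_MINE;  /* no backend recognised this LIOBN */
}
```

A table of backends keeps the caller unchanged when another IOMMU type is added, whereas the patch spells the two calls out one after the other.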


* Re: [Qemu-devel] [PATCH] vfio-powerpc: added VFIO support (v4)
  2012-07-18 11:09 ` [Qemu-devel] [PATCH] vfio-powerpc: added VFIO support (v4) Alexey Kardashevskiy
@ 2012-07-18 14:14   ` Alex Williamson
  2012-07-19  4:01     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2012-07-18 14:14 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Wed, 2012-07-18 at 21:09 +1000, Alexey Kardashevskiy wrote:
> It literally does the following:
> 
> 1. POWERPC IOMMU support (the kernel counterpart is required)
> 
> 2. The patch assumes that IOAPIC calls are going to be replaced
> with something generic.
> 
> 3. Added sPAPRVFIOData (hw/spapr_iommu_vfio.h) which describes
> the interface between VFIO and sPAPR IOMMU.
> 
> 4. Change sPAPR PHB to scan the PCI bus which is used for
> the IOMMU-VFIO group. Now it is enough to add the following to
> the QEMU command line to get VFIO up with all the devices from
> IOMMU group with id=3:
> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
>  mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
> 
> With the patches posted a bit earlier today, this patch fully supports
> VFIO, including MSI-X.
> 
> ps. yes, I know that linux_vfio.h has moved, will fix it later :)
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/linux-vfio.h       |   26 +++++++++++
>  hw/ppc/Makefile.objs  |    3 ++
>  hw/spapr.h            |    4 ++
>  hw/spapr_iommu.c      |   62 ++++++++++++++++++++++++-
>  hw/spapr_iommu_vfio.h |   34 ++++++++++++++
>  hw/spapr_pci.c        |  124 +++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/spapr_pci.h        |    6 +++
>  hw/vfio_pci.c         |   64 +++++++++++++++++++++++++
>  hw/vfio_pci.h         |    2 +
>  trace-events          |    1 +
>  10 files changed, 320 insertions(+), 6 deletions(-)
>  create mode 100644 hw/spapr_iommu_vfio.h
> 
> diff --git a/hw/linux-vfio.h b/hw/linux-vfio.h
> index 300d49b..27a0501 100644
> --- a/hw/linux-vfio.h
> +++ b/hw/linux-vfio.h
> @@ -442,4 +442,30 @@ struct vfio_iommu_type1_dma_unmap {
>  
>  #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
>  
> +/*
> + * Interface to SPAPR TCE (POWERPC Book3S)
> + */
> +#define SPAPR_TCE_IOMMU         2
> +
> +struct tce_iommu_info {
> +    __u32 argsz;
> +    __u32 flags;
> +    __u32 dma32_window_start;
> +    __u32 dma32_window_size;
> +    __u64 dma64_window_start;
> +    __u64 dma64_window_size;
> +};
> +
> +#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
> +
> +struct tce_iommu_dma_map {
> +    __u32 argsz;
> +    __u32 flags;
> +    __u64 va;
> +    __u64 dmaaddr;
> +};
> +
> +#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
> +#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
> +
>  #endif /* VFIO_H */
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index f573a95..c46a049 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>  # Xilinx PPC peripherals
>  obj-y += xilinx_ethlite.o
>  
> +# VFIO PCI device assignment
> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
> +
>  obj-y := $(addprefix ../,$(obj-y))
> diff --git a/hw/spapr.h b/hw/spapr.h
> index b37f337..0c15c88 100644
> --- a/hw/spapr.h
> +++ b/hw/spapr.h
> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>                        DMAContext *dma);
>  
> +struct sPAPRVFIOData;
> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> +                         struct sPAPRVFIOData *data);
> +
>  #endif /* !defined (__HW_SPAPR_H__) */
> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> index 50c288d..0a82842 100644
> --- a/hw/spapr_iommu.c
> +++ b/hw/spapr_iommu.c
> @@ -23,6 +23,8 @@
>  #include "dma.h"
>  
>  #include "hw/spapr.h"
> +#include "hw/spapr_iommu_vfio.h"
> +#include "hw/vfio_pci.h"
>  
>  #include <libfdt.h>
>  
> @@ -183,6 +185,60 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>      return 0;
>  }
>  
> +typedef struct sPAPRVFIOTable {
> +    struct sPAPRVFIOData *data;
> +    uint32_t liobn;
> +    QLIST_ENTRY(sPAPRVFIOTable) list;
> +} sPAPRVFIOTable;
> +
> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
> +
> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
> +                         struct sPAPRVFIOData *data)
> +{
> +    sPAPRVFIOTable *t;
> +
> +    t = g_malloc0(sizeof(*t));
> +    t->data = data;
> +    t->liobn = liobn;
> +
> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
> +}
> +
> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
> +{
> +    sPAPRVFIOTable *t;
> +    struct tce_iommu_dma_map map = {
> +        .argsz = sizeof(map),
> +        .va = 0,
> +        .dmaaddr = ioba,
> +    };
> +
> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
> +        if (t->liobn != liobn) {
> +            continue;
> +        }
> +        if (!t->data) {
> +            return H_NO_MEM;
> +        }

Why would this ever happen?

> +        if (tce) {
> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
> +
> +            if (t->data->map(t->data->groupid, &map)) {

Just pass t->data, this is why the VFIOContainer has a union.

> +                perror("TCE_MAP_DMA");
> +                return H_PARAMETER;
> +            }
> +        } else {
> +            if (t->data->unmap(t->data->groupid, &map)) {
> +                perror("TCE_UNMAP_DMA");
> +                return H_PARAMETER;
> +            }
> +        }
> +        return H_SUCCESS;
> +    }
> +    return H_CONTINUE; /* positive non-zero value */
> +}
> +
>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>                                target_ulong opcode, target_ulong *args)
>  {
> @@ -200,7 +256,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>      ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
>  
>      ret = put_tce_emu(liobn, ioba, tce);
> -    if (0 >= ret) {
> +    if (ret <= 0) {
> +        return ret ? H_PARAMETER : H_SUCCESS;
> +    }
> +    ret = put_tce_vfio(liobn, ioba, tce);
> +    if (ret <= 0) {
>          return ret ? H_PARAMETER : H_SUCCESS;
>      }
>  #ifdef DEBUG_TCE
> diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
> new file mode 100644
> index 0000000..b3d6115
> --- /dev/null
> +++ b/hw/spapr_iommu_vfio.h
> @@ -0,0 +1,34 @@
> +/*
> + * Definitions for VFIO IOMMU implementation for SPAPR TCE.
> + *
> + * Copyright (c) 2012 Alexey Kardashevskiy <aik@olabs.ru>
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
> +#define __HW_SPAPR_IOMMU_VFIO_H__
> +
> +#include "hw/linux-vfio.h"
> +
> +struct sPAPRVFIOData {
> +    int groupid;
> +    struct tce_iommu_info info;

Seems a little lazy to embed this whole thing here, what does anyone
need argsz and flags for later?

> +    int (*map)(int groupid, struct tce_iommu_dma_map *param);
> +    int (*unmap)(int groupid, struct tce_iommu_dma_map *param);

s/int groupid/struct sPAPRVFIOData */ in map/unmap

> +};
> +
> +void spapr_register_vfio_container(struct sPAPRVFIOData *data);
> +
> +#endif
> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> index 88c8f2c..75743e7 100644
> --- a/hw/spapr_pci.c
> +++ b/hw/spapr_pci.c
> @@ -22,6 +22,9 @@
>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>   * THE SOFTWARE.
>   */
> +#include <sys/types.h>
> +#include <dirent.h>
> +
>  #include "hw.h"
>  #include "pci.h"
>  #include "msi.h"
> @@ -32,7 +35,6 @@
>  #include "exec-memory.h"
>  #include <libfdt.h>
>  #include "trace.h"
> -
>  #include "hw/pci_internals.h"
>  
>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
> @@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>                   level);
>  }
>  
> +static int pci_spapr_get_irq(void *opaque, int irq_num)
> +{
> +    sPAPRPHBState *phb = opaque;
> +    return phb->lsi_table[irq_num].dt_irq;
> +}
> +
>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>                                unsigned size)
>  {
> @@ -515,6 +523,93 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>      return phb->dma;
>  }
>  
> +void spapr_register_vfio_container(struct sPAPRVFIOData *data)
> +{
> +    sPAPRPHBState *phb;
> +
> +    QLIST_FOREACH(phb, &spapr->phbs, list) {
> +        if (phb->iommugroupid == data->groupid) {
> +            phb->vfio_data = *data;
> +            phb->dma_window_start = phb->vfio_data.info.dma32_window_start;
> +            phb->dma_window_size = phb->vfio_data.info.dma32_window_size;
> +            return;
> +        }
> +    }
> +}
> +
> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
> +{
> +    char iommupath[256];
> +    DIR *dirp;
> +    struct dirent *entry;
> +
> +    if (!phb->scan) {
> +        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
> +        return 0;
> +    }
> +
> +    snprintf(iommupath, sizeof(iommupath),
> +             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
> +    dirp = opendir(iommupath);
> +
> +    while ((entry = readdir(dirp)) != NULL) {
> +        char *tmp;
> +        FILE *deviceclassfile;
> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> +        char addr[32];
> +        DeviceState *dev;
> +
> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
> +                   &domainid, &busid, &devid, &fnid) != 4) {
> +            continue;
> +        }
> +
> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
> +        trace_spapr_pci("Reading device class from ", tmp);
> +
> +        deviceclassfile = fopen(tmp, "r");
> +        if (deviceclassfile) {
> +            fscanf(deviceclassfile, "%x", &deviceclass);
> +            fclose(deviceclassfile);
> +        }
> +        g_free(tmp);
> +
> +        if (!deviceclass) {
> +            continue;
> +        }
> +        if ((phb->scan < 2) &&
> +            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
> +            /* Skip _any_ bridge */
> +            continue;
> +        }
> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
> +            /* Tweak USB */
> +            phb->force_addr = 1;
> +            phb->enable_multifunction = 1;
> +        }
> +
> +        trace_spapr_pci("Creating devicei from ", entry->d_name);
> +
> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
> +        if (!dev) {
> +            fprintf(stderr, "failed to create vfio-pci\n");
> +            continue;
> +        }
> +        qdev_prop_parse(dev, "host", entry->d_name);
> +        if (phb->force_addr) {
> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
> +            qdev_prop_parse(dev, "addr", addr);
> +        }
> +        if (phb->enable_multifunction) {
> +            qdev_prop_set_bit(dev, "multifunction", 1);
> +        }
> +        qdev_init_nofail(dev);
> +    }
> +    closedir(dirp);
> +
> +    return 0;
> +}
> +
>  static int spapr_phb_init(SysBusDevice *s)
>  {
>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
> @@ -567,15 +662,13 @@ static int spapr_phb_init(SysBusDevice *s)
>  
>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>                             phb->busname ? phb->busname : phb->dtbusname,
> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
> +                           pci_spapr_set_irq, pci_spapr_get_irq,
> +                           pci_spapr_map_irq, phb,
>                             &phb->memspace, &phb->iospace,
>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>      phb->host_state.bus = bus;
>  
>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
> -    phb->dma_window_start = 0;
> -    phb->dma_window_size = 0x40000000;
> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
>  
>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
> @@ -588,6 +681,19 @@ static int spapr_phb_init(SysBusDevice *s)
>          }
>      }
>  
> +    if (phb->iommugroupid >= 0) {
> +        if (spapr_pci_scan_vfio(phb) < 0) {
> +            return -1;
> +        }
> +        spapr_vfio_init_dma(phb->iommugroupid, phb->dma_liobn, &phb->vfio_data);
> +        return 0;
> +    }
> +
> +    phb->dma_window_start = 0;
> +    phb->dma_window_size = 0x40000000;
> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
> +                                         phb->dma_window_size);
> +
>      return 0;
>  }
>  
> @@ -599,6 +705,10 @@ static Property spapr_phb_properties[] = {
>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -729,6 +839,10 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>                       sizeof(interrupt_map)));
>  
> +    if (!phb->dma_window_size) {
> +        fprintf(stderr, "Unexpected error: DMA window is zero, exiting\n");
> +        exit(1);
> +    }
>      spapr_dma_dt(fdt, bus_off, "ibm,dma-window",
>                   phb->dma_liobn, phb->dma_window_start,
>                   phb->dma_window_size);
> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
> index 145071c..ed17053 100644
> --- a/hw/spapr_pci.h
> +++ b/hw/spapr_pci.h
> @@ -26,6 +26,7 @@
>  #include "hw/pci.h"
>  #include "hw/pci_host.h"
>  #include "hw/xics.h"
> +#include "hw/spapr_iommu_vfio.h"
>  
>  #define SPAPR_MSIX_MAX_DEVS 32
>  
> @@ -57,6 +58,11 @@ typedef struct sPAPRPHBState {
>          int nvec;
>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>  
> +    struct sPAPRVFIOData vfio_data;
> +    int32_t iommugroupid;
> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
> +    uint8_t enable_multifunction, force_addr;
> +
>      QLIST_ENTRY(sPAPRPHBState) list;
>  } sPAPRPHBState;
>  
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> index 0ad4761..0023d41 100644
> --- a/hw/vfio_pci.c
> +++ b/hw/vfio_pci.c
> @@ -43,6 +43,7 @@
>  #include "range.h"
>  #include "vfio_pci.h"
>  #include "linux-vfio.h"
> +#include "hw/spapr_iommu_vfio.h"
>  
>  //#define DEBUG_VFIO
>  #ifdef DEBUG_VFIO
> @@ -55,6 +56,8 @@
>  
>  #define MSIX_CAP_LENGTH 12
>  
> +static VFIOGroup *vfio_get_group(int groupid);
> +

Unnecessary, because...

>  static QLIST_HEAD(, VFIOContainer)
>      container_list = QLIST_HEAD_INITIALIZER(container_list);
>  
> @@ -1088,6 +1091,31 @@ static void vfio_listener_release(VFIOContainer *container)
>  }
>  
>  /*
> + * sPAPR TCE DMA interface
> + */
> +static int spapr_tce_map(int groupid, struct tce_iommu_dma_map *param)
> +{
> +    VFIOGroup *group = vfio_get_group(groupid);
> +
> +    if (!group || !group->container) {
> +        return -1;
> +    }

This should be:

static int spapr_tce_map(sPAPRVFIOData *spapr_data, struct tce_iommu_dma_map *param)
{
    VFIOContainer *container;

    container = container_of(spapr_data, VFIOContainer, iommu_data.spapr);

    return ioctl(container->fd, ....
}
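
For readers unfamiliar with the idiom suggested above, here is a minimal, self-contained sketch of how container_of recovers the enclosing object from a pointer to an embedded member. The type and function names (vfio_container, spapr_vfio_data, spapr_data_to_fd) are simplified stand-ins invented for illustration, not the real QEMU definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for sPAPRVFIOData; only the embedding matters. */
struct spapr_vfio_data {
    unsigned dma32_window_start;
    unsigned dma32_window_size;
};

/* Hypothetical stand-in for VFIOContainer with the data embedded in it. */
struct vfio_container {
    int fd;
    struct {
        struct spapr_vfio_data spapr;
    } iommu_data;
};

/* Recover a pointer to the enclosing struct from a pointer to a member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The callback receives only the embedded data but can still reach the fd,
 * so no group lookup is needed. */
static int spapr_data_to_fd(struct spapr_vfio_data *data)
{
    struct vfio_container *c =
        container_of(data, struct vfio_container, iommu_data.spapr);

    return c->fd;
}
```

Note that this only works when the callback is handed the address of the member that actually lives inside the container; a copy of the structure stored elsewhere would yield a bogus container pointer.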


> +
> +    return ioctl(group->container->fd, SPAPR_TCE_IOMMU_MAP_DMA, param);
> +}
> +
> +static int spapr_tce_unmap(int groupid, struct tce_iommu_dma_map *param)
> +{
> +    VFIOGroup *group = vfio_get_group(groupid);
> +
> +    if (!group || !group->container) {
> +        return -1;
> +    }
> +
> +    return ioctl(group->container->fd, SPAPR_TCE_IOMMU_UNMAP_DMA, param);
> +}
> +
> +/*
>   * Interrupt setup
>   */
>  static void vfio_disable_interrupts(VFIODevice *vdev)
> @@ -1590,6 +1618,42 @@ static int vfio_connect_container(VFIOGroup *group)
>          memory_listener_register(&container->iommu_data.listener,
>                                   get_system_memory());
>  
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> +        if (ret) {
> +            error_report("vfio: failed to set group container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
> +
> +        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
> +        if (ret) {
> +            error_report("vfio: failed to set iommu for container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
> +
> +        container->iommu_data.spapr.info.argsz =
> +                sizeof(container->iommu_data.spapr.info);
> +        ret = ioctl(fd, SPAPR_TCE_IOMMU_GET_INFO,
> +                    &container->iommu_data.spapr.info);
> +        if (ret) {
> +            error_report("vfio: failed to get iommu info for container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
> +
> +        container->iommu_data.spapr.map = spapr_tce_map;
> +        container->iommu_data.spapr.unmap = spapr_tce_unmap;
> +        container->iommu_data.spapr.groupid = group->groupid;

This at least deserves a comment because x86 doesn't have a 1:1 mapping
of container to groupid.  I really think it might make more sense to
pass a PCIBus here instead of a groupid.  Whatever you choose, this
shouldn't be part of sPAPRVFIOData, it should be another parameter to
spapr_register_vfio_container.

> +        spapr_register_vfio_container(&container->iommu_data.spapr);
> +
>      } else {
>          error_report("vfio: No available IOMMU models\n");
>          g_free(container);
> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
> index 00bb3dd..ddb5332 100644
> --- a/hw/vfio_pci.h
> +++ b/hw/vfio_pci.h
> @@ -6,6 +6,7 @@
>  #include "pci.h"
>  #include "ioapic.h"
>  #include "event_notifier.h"
> +#include "hw/spapr_iommu_vfio.h"
>  
>  typedef struct VFIOPCIHostDevice {
>      uint16_t seg;
> @@ -59,6 +60,7 @@ typedef struct VFIOContainer {
>      struct {
>          union {
>              MemoryListener listener;
> +            struct sPAPRVFIOData spapr;
>          };
>          void (*release)(struct VFIOContainer *);
>      } iommu_data;
> diff --git a/trace-events b/trace-events
> index e548f86..9100591 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -848,6 +848,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
>  qxl_render_update_area_done(void *cookie) "%p"
>  
>  # hw/spapr_pci.c
> +spapr_pci(const char *msg1, const char *msg2) "%s%s"
>  spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
>  spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
>  spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"


* Re: [Qemu-devel] [PATCH] vfio-powerpc: added VFIO support (v4)
  2012-07-18 14:14   ` Alex Williamson
@ 2012-07-19  4:01     ` Alexey Kardashevskiy
From: Alexey Kardashevskiy @ 2012-07-19  4:01 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On 19/07/12 00:14, Alex Williamson wrote:
> On Wed, 2012-07-18 at 21:09 +1000, Alexey Kardashevskiy wrote:
>> It literally does the following:
>>
>> 1. POWERPC IOMMU support (the kernel counterpart is required)
>>
>> 2. The patch assumes that IOAPIC calls are going to be replaced
>> with something generic.
>>
>> 3. Added sPAPRVFIOData (hw/spapr_iommu_vfio.h) which describes
>> the interface between VFIO and sPAPR IOMMU.
>>
>> 4. Change sPAPR PHB to scan the PCI bus which is used for
>> the IOMMU-VFIO group. Now it is enough to add the following to
>> the QEMU command line to get VFIO up with all the devices from
>> IOMMU group with id=3:
>> -device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
>>  mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000
>>
>> With the patches posted a bit earlier today, this patch fully supports
>> VFIO, including MSIX.
>>
>> ps. yes, I know that linux_vfio.h has moved, will fix it later :)
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  hw/linux-vfio.h       |   26 +++++++++++
>>  hw/ppc/Makefile.objs  |    3 ++
>>  hw/spapr.h            |    4 ++
>>  hw/spapr_iommu.c      |   62 ++++++++++++++++++++++++-
>>  hw/spapr_iommu_vfio.h |   34 ++++++++++++++
>>  hw/spapr_pci.c        |  124 +++++++++++++++++++++++++++++++++++++++++++++++--
>>  hw/spapr_pci.h        |    6 +++
>>  hw/vfio_pci.c         |   64 +++++++++++++++++++++++++
>>  hw/vfio_pci.h         |    2 +
>>  trace-events          |    1 +
>>  10 files changed, 320 insertions(+), 6 deletions(-)
>>  create mode 100644 hw/spapr_iommu_vfio.h
>>
>> diff --git a/hw/linux-vfio.h b/hw/linux-vfio.h
>> index 300d49b..27a0501 100644
>> --- a/hw/linux-vfio.h
>> +++ b/hw/linux-vfio.h
>> @@ -442,4 +442,30 @@ struct vfio_iommu_type1_dma_unmap {
>>  
>>  #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
>>  
>> +/*
>> + * Interface to SPAPR TCE (POWERPC Book3S)
>> + */
>> +#define SPAPR_TCE_IOMMU         2
>> +
>> +struct tce_iommu_info {
>> +    __u32 argsz;
>> +    __u32 flags;
>> +    __u32 dma32_window_start;
>> +    __u32 dma32_window_size;
>> +    __u64 dma64_window_start;
>> +    __u64 dma64_window_size;
>> +};
>> +
>> +#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
>> +
>> +struct tce_iommu_dma_map {
>> +    __u32 argsz;
>> +    __u32 flags;
>> +    __u64 va;
>> +    __u64 dmaaddr;
>> +};
>> +
>> +#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
>> +#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
>> +
>>  #endif /* VFIO_H */
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index f573a95..c46a049 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
>>  # Xilinx PPC peripherals
>>  obj-y += xilinx_ethlite.o
>>  
>> +# VFIO PCI device assignment
>> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
>> +
>>  obj-y := $(addprefix ../,$(obj-y))
>> diff --git a/hw/spapr.h b/hw/spapr.h
>> index b37f337..0c15c88 100644
>> --- a/hw/spapr.h
>> +++ b/hw/spapr.h
>> @@ -340,4 +340,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>                        DMAContext *dma);
>>  
>> +struct sPAPRVFIOData;
>> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
>> +                         struct sPAPRVFIOData *data);
>> +
>>  #endif /* !defined (__HW_SPAPR_H__) */
>> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
>> index 50c288d..0a82842 100644
>> --- a/hw/spapr_iommu.c
>> +++ b/hw/spapr_iommu.c
>> @@ -23,6 +23,8 @@
>>  #include "dma.h"
>>  
>>  #include "hw/spapr.h"
>> +#include "hw/spapr_iommu_vfio.h"
>> +#include "hw/vfio_pci.h"
>>  
>>  #include <libfdt.h>
>>  
>> @@ -183,6 +185,60 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
>>      return 0;
>>  }
>>  
>> +typedef struct sPAPRVFIOTable {
>> +    struct sPAPRVFIOData *data;
>> +    uint32_t liobn;
>> +    QLIST_ENTRY(sPAPRVFIOTable) list;
>> +} sPAPRVFIOTable;
>> +
>> +QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
>> +
>> +void spapr_vfio_init_dma(int group_id, uint32_t liobn,
>> +                         struct sPAPRVFIOData *data)
>> +{
>> +    sPAPRVFIOTable *t;
>> +
>> +    t = g_malloc0(sizeof(*t));
>> +    t->data = data;
>> +    t->liobn = liobn;
>> +
>> +    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
>> +}
>> +
>> +static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
>> +{
>> +    sPAPRVFIOTable *t;
>> +    struct tce_iommu_dma_map map = {
>> +        .argsz = sizeof(map),
>> +        .va = 0,
>> +        .dmaaddr = ioba,
>> +    };
>> +
>> +    QLIST_FOREACH(t, &vfio_tce_tables, list) {
>> +        if (t->liobn != liobn) {
>> +            continue;
>> +        }
>> +        if (!t->data) {
>> +            return H_NO_MEM;
>> +        }
> 
> Why would this ever happen?
> 
>> +        if (tce) {
>> +            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
>> +
>> +            if (t->data->map(t->data->groupid, &map)) {
> 
> Just pass t->data, this is why the VFIOContainer has a union.
>
>> +                perror("TCE_MAP_DMA");
>> +                return H_PARAMETER;
>> +            }
>> +        } else {
>> +            if (t->data->unmap(t->data->groupid, &map)) {
>> +                perror("TCE_UNMAP_DMA");
>> +                return H_PARAMETER;
>> +            }
>> +        }
>> +        return H_SUCCESS;
>> +    }
>> +    return H_CONTINUE; /* positive non-zero value */
>> +}
>> +
>>  static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>                                target_ulong opcode, target_ulong *args)
>>  {
>> @@ -200,7 +256,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
>>      ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
>>  
>>      ret = put_tce_emu(liobn, ioba, tce);
>> -    if (0 >= ret) {
>> +    if (ret <= 0) {
>> +        return ret ? H_PARAMETER : H_SUCCESS;
>> +    }
>> +    ret = put_tce_vfio(liobn, ioba, tce);
>> +    if (ret <= 0) {
>>          return ret ? H_PARAMETER : H_SUCCESS;
>>      }
>>  #ifdef DEBUG_TCE
>> diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
>> new file mode 100644
>> index 0000000..b3d6115
>> --- /dev/null
>> +++ b/hw/spapr_iommu_vfio.h
>> @@ -0,0 +1,34 @@
>> +/*
>> + * Definitions for VFIO IOMMU implementation for SPAPR TCE.
>> + *
>> + * Copyright (c) 2012 Alexey Kardashevskiy <aik@olabs.ru>
>> + *
>> + * This library is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU Lesser General Public
>> + * License as published by the Free Software Foundation; either
>> + * version 2 of the License, or (at your option) any later version.
>> + *
>> + * This library is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * Lesser General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU Lesser General Public
>> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
>> +#define __HW_SPAPR_IOMMU_VFIO_H__
>> +
>> +#include "hw/linux-vfio.h"
>> +
>> +struct sPAPRVFIOData {
>> +    int groupid;
>> +    struct tce_iommu_info info;
> 
> Seems a little lazy to embed this whole thing here, what does anyone
> need argsz and flags for later?


I am trying to keep the same structures for both QEMU and the kernel, since they are part of the SPAPR_TCE protocol. Otherwise I would have to copy the whole structure except argsz/flags.
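
As an illustration of the alternative being weighed here, the following sketch copies only the DMA window fields out of the ioctl structure and drops argsz/flags. It assumes a structure with the same layout as tce_iommu_info from the patch; the names tce_info_sketch, spapr_dma_windows, and spapr_windows_from_info are hypothetical.

```c
#include <stdint.h>

/* Same field layout as tce_iommu_info in the patch; the name is hypothetical. */
struct tce_info_sketch {
    uint32_t argsz;
    uint32_t flags;
    uint32_t dma32_window_start;
    uint32_t dma32_window_size;
    uint64_t dma64_window_start;
    uint64_t dma64_window_size;
};

/* Hypothetical trimmed structure: only what the PHB code needs later. */
struct spapr_dma_windows {
    uint32_t dma32_start;
    uint32_t dma32_size;
    uint64_t dma64_start;
    uint64_t dma64_size;
};

/* Copy the window description and leave the ioctl bookkeeping behind. */
static struct spapr_dma_windows
spapr_windows_from_info(const struct tce_info_sketch *info)
{
    struct spapr_dma_windows w = {
        .dma32_start = info->dma32_window_start,
        .dma32_size  = info->dma32_window_size,
        .dma64_start = info->dma64_window_start,
        .dma64_size  = info->dma64_window_size,
    };

    return w;
}
```

The cost is one extra structure and a copy; the benefit is that the QEMU-side data no longer carries fields that only matter at ioctl time.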

 
>> +    int (*map)(int groupid, struct tce_iommu_dma_map *param);
>> +    int (*unmap)(int groupid, struct tce_iommu_dma_map *param);
> 
> s/int groupid/struct sPAPRVFIOData */ in map/unmap
> 
>> +};
>> +
>> +void spapr_register_vfio_container(struct sPAPRVFIOData *data);
>> +
>> +#endif
>> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
>> index 88c8f2c..75743e7 100644
>> --- a/hw/spapr_pci.c
>> +++ b/hw/spapr_pci.c
>> @@ -22,6 +22,9 @@
>>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>   * THE SOFTWARE.
>>   */
>> +#include <sys/types.h>
>> +#include <dirent.h>
>> +
>>  #include "hw.h"
>>  #include "pci.h"
>>  #include "msi.h"
>> @@ -32,7 +35,6 @@
>>  #include "exec-memory.h"
>>  #include <libfdt.h>
>>  #include "trace.h"
>> -
>>  #include "hw/pci_internals.h"
>>  
>>  /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
>> @@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
>>                   level);
>>  }
>>  
>> +static int pci_spapr_get_irq(void *opaque, int irq_num)
>> +{
>> +    sPAPRPHBState *phb = opaque;
>> +    return phb->lsi_table[irq_num].dt_irq;
>> +}
>> +
>>  static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
>>                                unsigned size)
>>  {
>> @@ -515,6 +523,93 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>>      return phb->dma;
>>  }
>>  
>> +void spapr_register_vfio_container(struct sPAPRVFIOData *data)
>> +{
>> +    sPAPRPHBState *phb;
>> +
>> +    QLIST_FOREACH(phb, &spapr->phbs, list) {
>> +        if (phb->iommugroupid == data->groupid) {
>> +            phb->vfio_data = *data;
>> +            phb->dma_window_start = phb->vfio_data.info.dma32_window_start;
>> +            phb->dma_window_size = phb->vfio_data.info.dma32_window_size;
>> +            return;
>> +        }
>> +    }
>> +}
>> +
>> +static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
>> +{
>> +    char iommupath[256];
>> +    DIR *dirp;
>> +    struct dirent *entry;
>> +
>> +    if (!phb->scan) {
>> +        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
>> +        return 0;
>> +    }
>> +
>> +    snprintf(iommupath, sizeof(iommupath),
>> +             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
>> +    dirp = opendir(iommupath);
>> +
>> +    while ((entry = readdir(dirp)) != NULL) {
>> +        char *tmp;
>> +        FILE *deviceclassfile;
>> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
>> +        char addr[32];
>> +        DeviceState *dev;
>> +
>> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
>> +                   &domainid, &busid, &devid, &fnid) != 4) {
>> +            continue;
>> +        }
>> +
>> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
>> +        trace_spapr_pci("Reading device class from ", tmp);
>> +
>> +        deviceclassfile = fopen(tmp, "r");
>> +        if (deviceclassfile) {
>> +            fscanf(deviceclassfile, "%x", &deviceclass);
>> +            fclose(deviceclassfile);
>> +        }
>> +        g_free(tmp);
>> +
>> +        if (!deviceclass) {
>> +            continue;
>> +        }
>> +        if ((phb->scan < 2) &&
>> +            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
>> +            /* Skip _any_ bridge */
>> +            continue;
>> +        }
>> +        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
>> +            /* Tweak USB */
>> +            phb->force_addr = 1;
>> +            phb->enable_multifunction = 1;
>> +        }
>> +
>> +        trace_spapr_pci("Creating devicei from ", entry->d_name);
>> +
>> +        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
>> +        if (!dev) {
>> +            fprintf(stderr, "failed to create vfio-pci\n");
>> +            continue;
>> +        }
>> +        qdev_prop_parse(dev, "host", entry->d_name);
>> +        if (phb->force_addr) {
>> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
>> +            qdev_prop_parse(dev, "addr", addr);
>> +        }
>> +        if (phb->enable_multifunction) {
>> +            qdev_prop_set_bit(dev, "multifunction", 1);
>> +        }
>> +        qdev_init_nofail(dev);
>> +    }
>> +    closedir(dirp);
>> +
>> +    return 0;
>> +}
>> +
>>  static int spapr_phb_init(SysBusDevice *s)
>>  {
>>      sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
>> @@ -567,15 +662,13 @@ static int spapr_phb_init(SysBusDevice *s)
>>  
>>      bus = pci_register_bus(&phb->host_state.busdev.qdev,
>>                             phb->busname ? phb->busname : phb->dtbusname,
>> -                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
>> +                           pci_spapr_set_irq, pci_spapr_get_irq,
>> +                           pci_spapr_map_irq, phb,
>>                             &phb->memspace, &phb->iospace,
>>                             PCI_DEVFN(0, 0), PCI_NUM_PINS);
>>      phb->host_state.bus = bus;
>>  
>>      phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
>> -    phb->dma_window_start = 0;
>> -    phb->dma_window_size = 0x40000000;
>> -    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
>>      pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
>>  
>>      QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
>> @@ -588,6 +681,19 @@ static int spapr_phb_init(SysBusDevice *s)
>>          }
>>      }
>>  
>> +    if (phb->iommugroupid >= 0) {
>> +        if (spapr_pci_scan_vfio(phb) < 0) {
>> +            return -1;
>> +        }
>> +        spapr_vfio_init_dma(phb->iommugroupid, phb->dma_liobn, &phb->vfio_data);
>> +        return 0;
>> +    }
>> +
>> +    phb->dma_window_start = 0;
>> +    phb->dma_window_size = 0x40000000;
>> +    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
>> +                                         phb->dma_window_size);
>> +
>>      return 0;
>>  }
>>  
>> @@ -599,6 +705,10 @@ static Property spapr_phb_properties[] = {
>>      DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
>>      DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
>>      DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
>> +    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
>> +    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
>> +    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
>> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> @@ -729,6 +839,10 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>>                       sizeof(interrupt_map)));
>>  
>> +    if (!phb->dma_window_size) {
>> +        fprintf(stderr, "Unexpected error: DMA window is zero, exiting\n");
>> +        exit(1);
>> +    }
>>      spapr_dma_dt(fdt, bus_off, "ibm,dma-window",
>>                   phb->dma_liobn, phb->dma_window_start,
>>                   phb->dma_window_size);
>> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
>> index 145071c..ed17053 100644
>> --- a/hw/spapr_pci.h
>> +++ b/hw/spapr_pci.h
>> @@ -26,6 +26,7 @@
>>  #include "hw/pci.h"
>>  #include "hw/pci_host.h"
>>  #include "hw/xics.h"
>> +#include "hw/spapr_iommu_vfio.h"
>>  
>>  #define SPAPR_MSIX_MAX_DEVS 32
>>  
>> @@ -57,6 +58,11 @@ typedef struct sPAPRPHBState {
>>          int nvec;
>>      } msi_table[SPAPR_MSIX_MAX_DEVS];
>>  
>> +    struct sPAPRVFIOData vfio_data;
>> +    int32_t iommugroupid;
>> +    uint8_t scan; /* 0 don't scan 1 scan only devices 2 scan everything */
>> +    uint8_t enable_multifunction, force_addr;
>> +
>>      QLIST_ENTRY(sPAPRPHBState) list;
>>  } sPAPRPHBState;
>>  
>> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
>> index 0ad4761..0023d41 100644
>> --- a/hw/vfio_pci.c
>> +++ b/hw/vfio_pci.c
>> @@ -43,6 +43,7 @@
>>  #include "range.h"
>>  #include "vfio_pci.h"
>>  #include "linux-vfio.h"
>> +#include "hw/spapr_iommu_vfio.h"
>>  
>>  //#define DEBUG_VFIO
>>  #ifdef DEBUG_VFIO
>> @@ -55,6 +56,8 @@
>>  
>>  #define MSIX_CAP_LENGTH 12
>>  
>> +static VFIOGroup *vfio_get_group(int groupid);
>> +
> 
> Unnecessary, because...
> 
>>  static QLIST_HEAD(, VFIOContainer)
>>      container_list = QLIST_HEAD_INITIALIZER(container_list);
>>  
>> @@ -1088,6 +1091,31 @@ static void vfio_listener_release(VFIOContainer *container)
>>  }
>>  
>>  /*
>> + * sPAPR TCE DMA interface
>> + */
>> +static int spapr_tce_map(int groupid, struct tce_iommu_dma_map *param)
>> +{
>> +    VFIOGroup *group = vfio_get_group(groupid);
>> +
>> +    if (!group || !group->container) {
>> +        return -1;
>> +    }
> 
> This should be:
> 
> static int spapr_tce_map(sPAPRVFIOData *spapr_data, struct tce_iommu_dma_map *param)
> {
>     VFIOContainer *container;
> 
>     container = container_of(spapr_data, VFIOContainer, iommu_data.spapr);
> 
>     return ioctl(container->fd, ....
> }
> 
> 
>> +
>> +    return ioctl(group->container->fd, SPAPR_TCE_IOMMU_MAP_DMA, param);
>> +}
>> +
>> +static int spapr_tce_unmap(int groupid, struct tce_iommu_dma_map *param)
>> +{
>> +    VFIOGroup *group = vfio_get_group(groupid);
>> +
>> +    if (!group || !group->container) {
>> +        return -1;
>> +    }
>> +
>> +    return ioctl(group->container->fd, SPAPR_TCE_IOMMU_UNMAP_DMA, param);
>> +}
>> +
>> +/*
>>   * Interrupt setup
>>   */
>>  static void vfio_disable_interrupts(VFIODevice *vdev)
>> @@ -1590,6 +1618,42 @@ static int vfio_connect_container(VFIOGroup *group)
>>          memory_listener_register(&container->iommu_data.listener,
>>                                   get_system_memory());
>>  
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
>> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>> +        if (ret) {
>> +            error_report("vfio: failed to set group container: %s\n",
>> +                         strerror(errno));
>> +            g_free(container);
>> +            close(fd);
>> +            return -1;
>> +        }
>> +
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
>> +        if (ret) {
>> +            error_report("vfio: failed to set iommu for container: %s\n",
>> +                         strerror(errno));
>> +            g_free(container);
>> +            close(fd);
>> +            return -1;
>> +        }
>> +
>> +        container->iommu_data.spapr.info.argsz =
>> +                sizeof(container->iommu_data.spapr.info);
>> +        ret = ioctl(fd, SPAPR_TCE_IOMMU_GET_INFO,
>> +                    &container->iommu_data.spapr.info);
>> +        if (ret) {
>> +            error_report("vfio: failed to get iommu info for container: %s\n",
>> +                         strerror(errno));
>> +            g_free(container);
>> +            close(fd);
>> +            return -1;
>> +        }
>> +
>> +        container->iommu_data.spapr.map = spapr_tce_map;
>> +        container->iommu_data.spapr.unmap = spapr_tce_unmap;
>> +        container->iommu_data.spapr.groupid = group->groupid;
> 
> This at least deserves a comment because x86 doesn't have a 1:1 mapping
> of container to groupid.


Give me a good example and add a comment about why you need a MemoryListener, given that powerpc does not need the whole RAM to be mapped as a DMA window :)
Honestly, I do not know what to write there that would make sense, explain to at least a powerpc-familiar person what is going on, and not turn into a 100-line essay. I will give it some thought.


> I really think it might make more sense to pass a PCIBus here instead of a groupid.

Or a PCIDevice if I understand SRIOV right. Many choices. Let's stay with groupid for now.


> Whatever you choose, this
> shouldn't be part of sPAPRVFIOData, it should be another parameter to
> spapr_register_vfio_container.


Done. I will resend the patch. Thanks for your comments.
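
For reference, a minimal sketch of the container_of approach suggested above. The Demo* types are hypothetical stand-ins for VFIOContainer and sPAPRVFIOData, and the callback just returns the container's fd instead of issuing an ioctl, so this only illustrates the pointer arithmetic, not the real code:

```c
#include <assert.h>
#include <stddef.h>

/* Recover the enclosing struct from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for sPAPRVFIOData: callbacks receive the data itself. */
typedef struct DemoIommuData DemoIommuData;
struct DemoIommuData {
    int (*map)(DemoIommuData *data, unsigned long ioba);
};

/* Hypothetical stand-in for VFIOContainer, which embeds the IOMMU data. */
typedef struct DemoContainer {
    int fd;
    struct {
        DemoIommuData spapr;
    } iommu_data;
} DemoContainer;

/* No group lookup needed: the container is found from the embedded member. */
static int demo_map(DemoIommuData *data, unsigned long ioba)
{
    DemoContainer *c = container_of(data, DemoContainer, iommu_data.spapr);

    (void)ioba;      /* a real callback would pass this on to the ioctl */
    return c->fd;    /* stands in for ioctl(c->fd, ...) */
}
```

With this shape the groupid no longer has to live in the shared structure; it can be passed separately at registration time.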



>> +        spapr_register_vfio_container(&container->iommu_data.spapr);
>> +
>>      } else {
>>          error_report("vfio: No available IOMMU models\n");
>>          g_free(container);
>> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
>> index 00bb3dd..ddb5332 100644
>> --- a/hw/vfio_pci.h
>> +++ b/hw/vfio_pci.h
>> @@ -6,6 +6,7 @@
>>  #include "pci.h"
>>  #include "ioapic.h"
>>  #include "event_notifier.h"
>> +#include "hw/spapr_iommu_vfio.h"
>>  
>>  typedef struct VFIOPCIHostDevice {
>>      uint16_t seg;
>> @@ -59,6 +60,7 @@ typedef struct VFIOContainer {
>>      struct {
>>          union {
>>              MemoryListener listener;
>> +            struct sPAPRVFIOData spapr;
>>          };
>>          void (*release)(struct VFIOContainer *);
>>      } iommu_data;
>> diff --git a/trace-events b/trace-events
>> index e548f86..9100591 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -848,6 +848,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
>>  qxl_render_update_area_done(void *cookie) "%p"
>>  
>>  # hw/spapr_pci.c
>> +spapr_pci(const char *msg1, const char *msg2) "%s%s"
>>  spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
>>  spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
>>  spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"
> 
> 
> 


-- 
Alexey


* [Qemu-devel] [PATCH] vfio-powerpc: added VFIO support (v5)
  2012-07-10  5:51 [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2012-07-18 11:09 ` [Qemu-devel] [PATCH] vfio-powerpc: added VFIO support (v4) Alexey Kardashevskiy
@ 2012-07-19  4:04 ` Alexey Kardashevskiy
  6 siblings, 0 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-19  4:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-devel, Alexander Graf, qemu-ppc, David Gibson

It literally does the following:

1. POWERPC IOMMU support (the kernel counterpart is required)

2. The patch assumes that IOAPIC calls are going to be replaced
with something generic.

3. Added sPAPRVFIOData (hw/spapr_iommu_vfio.h) which describes
the interface between VFIO and sPAPR IOMMU.

4. Change sPAPR PHB to scan the PCI bus which is used for
the IOMMU-VFIO group. Now it is enough to add the following to
the QEMU command line to get VFIO up with all the devices from
IOMMU group with id=3:
-device spapr-pci-host-bridge,busname=E1000E,buid=0x3,iommu=3,\
 mem_win_addr=0x230000000000,io_win_addr=0x240000000000,msi_win_addr=0x250000000000

With the patches posted earlier, this patch fully supports
VFIO, which includes MSI-X as well.

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexander Graf <agraf@suse.de>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/Makefile.objs       |    3 ++
 hw/spapr.h                 |    3 ++
 hw/spapr_iommu.c           |   58 ++++++++++++++++++++-
 hw/spapr_iommu_vfio.h      |   34 ++++++++++++
 hw/spapr_pci.c             |  124 ++++++++++++++++++++++++++++++++++++++++++--
 hw/spapr_pci.h             |    6 +++
 hw/vfio_pci.c              |   63 ++++++++++++++++++++++
 hw/vfio_pci.h              |    2 +
 linux-headers/linux/vfio.h |   26 ++++++++++
 trace-events               |    1 +
 10 files changed, 314 insertions(+), 6 deletions(-)
 create mode 100644 hw/spapr_iommu_vfio.h

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index f573a95..c46a049 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -25,4 +25,7 @@ obj-$(CONFIG_FDT) += ../device_tree.o
 # Xilinx PPC peripherals
 obj-y += xilinx_ethlite.o
 
+# VFIO PCI device assignment
+obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
+
 obj-y := $(addprefix ../,$(obj-y))
diff --git a/hw/spapr.h b/hw/spapr.h
index b37f337..aae6aee 100644
--- a/hw/spapr.h
+++ b/hw/spapr.h
@@ -340,4 +340,7 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
 int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
                       DMAContext *dma);
 
+#include "hw/spapr_iommu_vfio.h"
+void spapr_vfio_init_dma(uint32_t liobn, sPAPRVFIOData *data);
+
 #endif /* !defined (__HW_SPAPR_H__) */
diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
index 50c288d..86a37d2 100644
--- a/hw/spapr_iommu.c
+++ b/hw/spapr_iommu.c
@@ -23,6 +23,8 @@
 #include "dma.h"
 
 #include "hw/spapr.h"
+#include "hw/spapr_iommu_vfio.h"
+#include "hw/vfio_pci.h"
 
 #include <libfdt.h>
 
@@ -183,6 +185,56 @@ static int put_tce_emu(target_ulong liobn, target_ulong ioba, target_ulong tce)
     return 0;
 }
 
+typedef struct sPAPRVFIOTable {
+    sPAPRVFIOData *data;
+    uint32_t liobn;
+    QLIST_ENTRY(sPAPRVFIOTable) list;
+} sPAPRVFIOTable;
+
+QLIST_HEAD(vfio_tce_tables, sPAPRVFIOTable) vfio_tce_tables;
+
+void spapr_vfio_init_dma(uint32_t liobn, sPAPRVFIOData *data)
+{
+    sPAPRVFIOTable *t;
+
+    t = g_malloc0(sizeof(*t));
+    t->data = data;
+    t->liobn = liobn;
+
+    QLIST_INSERT_HEAD(&vfio_tce_tables, t, list);
+}
+
+static int put_tce_vfio(uint32_t liobn, target_ulong ioba, target_ulong tce)
+{
+    sPAPRVFIOTable *t;
+    struct tce_iommu_dma_map map = {
+        .argsz = sizeof(map),
+        .va = 0,
+        .dmaaddr = ioba,
+    };
+
+    QLIST_FOREACH(t, &vfio_tce_tables, list) {
+        if (t->liobn != liobn) {
+            continue;
+        }
+        if (tce) {
+            map.va = (uintptr_t)qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK);
+
+            if (t->data->map(t->data, &map)) {
+                perror("TCE_MAP_DMA");
+                return H_PARAMETER;
+            }
+        } else {
+            if (t->data->unmap(t->data, &map)) {
+                perror("TCE_UNMAP_DMA");
+                return H_PARAMETER;
+            }
+        }
+        return H_SUCCESS;
+    }
+    return H_CONTINUE; /* positive non-zero value */
+}
+
 static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
                               target_ulong opcode, target_ulong *args)
 {
@@ -200,7 +252,11 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
     ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
 
     ret = put_tce_emu(liobn, ioba, tce);
-    if (0 >= ret) {
+    if (ret <= 0) {
+        return ret ? H_PARAMETER : H_SUCCESS;
+    }
+    ret = put_tce_vfio(liobn, ioba, tce);
+    if (ret <= 0) {
         return ret ? H_PARAMETER : H_SUCCESS;
     }
 #ifdef DEBUG_TCE
diff --git a/hw/spapr_iommu_vfio.h b/hw/spapr_iommu_vfio.h
new file mode 100644
index 0000000..cc2d368
--- /dev/null
+++ b/hw/spapr_iommu_vfio.h
@@ -0,0 +1,34 @@
+/*
+ * Definitions for VFIO IOMMU implementation for SPAPR TCE.
+ *
+ * Copyright (c) 2012 Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#if !defined(__HW_SPAPR_IOMMU_VFIO_H__)
+#define __HW_SPAPR_IOMMU_VFIO_H__
+
+#include "hw/linux-vfio.h"
+
+typedef struct sPAPRVFIOData sPAPRVFIOData;
+struct sPAPRVFIOData {
+    struct tce_iommu_info info;
+    int (*map)(sPAPRVFIOData *data, struct tce_iommu_dma_map *param);
+    int (*unmap)(sPAPRVFIOData *data, struct tce_iommu_dma_map *param);
+};
+
+void spapr_register_vfio_container(int groupid, sPAPRVFIOData *data);
+
+#endif
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 88c8f2c..b1b7514 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -22,6 +22,9 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
+#include <sys/types.h>
+#include <dirent.h>
+
 #include "hw.h"
 #include "pci.h"
 #include "msi.h"
@@ -32,7 +35,6 @@
 #include "exec-memory.h"
 #include <libfdt.h>
 #include "trace.h"
-
 #include "hw/pci_internals.h"
 
 /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
@@ -440,6 +442,12 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
                  level);
 }
 
+static int pci_spapr_get_irq(void *opaque, int irq_num)
+{
+    sPAPRPHBState *phb = opaque;
+    return phb->lsi_table[irq_num].dt_irq;
+}
+
 static uint64_t spapr_io_read(void *opaque, target_phys_addr_t addr,
                               unsigned size)
 {
@@ -515,6 +523,93 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
     return phb->dma;
 }
 
+void spapr_register_vfio_container(int groupid, sPAPRVFIOData *data)
+{
+    sPAPRPHBState *phb;
+
+    QLIST_FOREACH(phb, &spapr->phbs, list) {
+        if (phb->iommugroupid == groupid) {
+            phb->vfio_data = data;
+            phb->dma_window_start = phb->vfio_data->info.dma32_window_start;
+            phb->dma_window_size = phb->vfio_data->info.dma32_window_size;
+            return;
+        }
+    }
+}
+
+static int spapr_pci_scan_vfio(sPAPRPHBState *phb)
+{
+    char iommupath[256];
+    DIR *dirp;
+    struct dirent *entry;
+
+    if (!phb->scan) {
+        trace_spapr_pci("Autoscan disabled for ", phb->dtbusname);
+        return 0;
+    }
+
+    snprintf(iommupath, sizeof(iommupath),
+             "/sys/kernel/iommu_groups/%d/devices/", phb->iommugroupid);
+    dirp = opendir(iommupath);
+
+    while ((entry = readdir(dirp)) != NULL) {
+        char *tmp;
+        FILE *deviceclassfile;
+        unsigned deviceclass = 0, domainid, busid, devid, fnid;
+        char addr[32];
+        DeviceState *dev;
+
+        if (sscanf(entry->d_name, "%X:%X:%X.%x",
+                   &domainid, &busid, &devid, &fnid) != 4) {
+            continue;
+        }
+
+        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
+        trace_spapr_pci("Reading device class from ", tmp);
+
+        deviceclassfile = fopen(tmp, "r");
+        if (deviceclassfile) {
+            fscanf(deviceclassfile, "%x", &deviceclass);
+            fclose(deviceclassfile);
+        }
+        g_free(tmp);
+
+        if (!deviceclass) {
+            continue;
+        }
+        if ((phb->scan < 2) &&
+            ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8))) {
+            /* Skip _any_ bridge */
+            continue;
+        }
+        if ((deviceclass == 0xc0310) || (deviceclass == 0xc0320)) {
+            /* Tweak USB */
+            phb->force_addr = 1;
+            phb->enable_multifunction = 1;
+        }
+
+        trace_spapr_pci("Creating device from ", entry->d_name);
+
+        dev = qdev_create(&phb->host_state.bus->qbus, "vfio-pci");
+        if (!dev) {
+            fprintf(stderr, "failed to create vfio-pci\n");
+            continue;
+        }
+        qdev_prop_parse(dev, "host", entry->d_name);
+        if (phb->force_addr) {
+            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
+            qdev_prop_parse(dev, "addr", addr);
+        }
+        if (phb->enable_multifunction) {
+            qdev_prop_set_bit(dev, "multifunction", 1);
+        }
+        qdev_init_nofail(dev);
+    }
+    closedir(dirp);
+
+    return 0;
+}
+
 static int spapr_phb_init(SysBusDevice *s)
 {
     sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
@@ -567,15 +662,13 @@ static int spapr_phb_init(SysBusDevice *s)
 
     bus = pci_register_bus(&phb->host_state.busdev.qdev,
                            phb->busname ? phb->busname : phb->dtbusname,
-                           pci_spapr_set_irq, NULL, pci_spapr_map_irq, phb,
+                           pci_spapr_set_irq, pci_spapr_get_irq,
+                           pci_spapr_map_irq, phb,
                            &phb->memspace, &phb->iospace,
                            PCI_DEVFN(0, 0), PCI_NUM_PINS);
     phb->host_state.bus = bus;
 
     phb->dma_liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
-    phb->dma_window_start = 0;
-    phb->dma_window_size = 0x40000000;
-    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn, phb->dma_window_size);
     pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
 
     QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
@@ -588,6 +681,19 @@ static int spapr_phb_init(SysBusDevice *s)
         }
     }
 
+    if (phb->iommugroupid >= 0) {
+        if (spapr_pci_scan_vfio(phb) < 0) {
+            return -1;
+        }
+        spapr_vfio_init_dma(phb->dma_liobn, phb->vfio_data);
+        return 0;
+    }
+
+    phb->dma_window_start = 0;
+    phb->dma_window_size = 0x40000000;
+    phb->dma = spapr_tce_new_dma_context(phb->dma_liobn,
+                                         phb->dma_window_size);
+
     return 0;
 }
 
@@ -599,6 +705,10 @@ static Property spapr_phb_properties[] = {
     DEFINE_PROP_HEX64("io_win_addr", sPAPRPHBState, io_win_addr, 0),
     DEFINE_PROP_HEX64("io_win_size", sPAPRPHBState, io_win_size, 0x10000),
     DEFINE_PROP_HEX64("msi_win_addr", sPAPRPHBState, msi_win_addr, 0),
+    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
+    DEFINE_PROP_UINT8("scan", sPAPRPHBState, scan, 1),
+    DEFINE_PROP_UINT8("mf", sPAPRPHBState, enable_multifunction, 0),
+    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBState, force_addr, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -729,6 +839,10 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
                      sizeof(interrupt_map)));
 
+    if (!phb->dma_window_size) {
+        fprintf(stderr, "Unexpected error: DMA window is zero, exiting\n");
+        exit(1);
+    }
     spapr_dma_dt(fdt, bus_off, "ibm,dma-window",
                  phb->dma_liobn, phb->dma_window_start,
                  phb->dma_window_size);
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index 145071c..ad68bf0 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -26,6 +26,7 @@
 #include "hw/pci.h"
 #include "hw/pci_host.h"
 #include "hw/xics.h"
+#include "hw/spapr_iommu_vfio.h"
 
 #define SPAPR_MSIX_MAX_DEVS 32
 
@@ -57,6 +58,11 @@ typedef struct sPAPRPHBState {
         int nvec;
     } msi_table[SPAPR_MSIX_MAX_DEVS];
 
+    struct sPAPRVFIOData *vfio_data;
+    int32_t iommugroupid;
+    uint8_t scan; /* 0: don't scan, 1: scan only devices, 2: scan everything */
+    uint8_t enable_multifunction, force_addr;
+
     QLIST_ENTRY(sPAPRPHBState) list;
 } sPAPRPHBState;
 
diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 11b43b8..bf071b0 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -45,6 +45,14 @@
 #include "range.h"
 #include "vfio_pci.h"
 
+#ifndef TARGET_PPC64
+#include <sys/io.h>
+#else
+#include "hw/pci_internals.h"
+#include "hw/spapr.h"
+#include "hw/spapr_iommu_vfio.h"
+#endif
+
 //#define DEBUG_VFIO
 #ifdef DEBUG_VFIO
 #define DPRINTF(fmt, ...) \
@@ -1089,6 +1097,21 @@ static void vfio_listener_release(VFIOContainer *container)
 }
 
 /*
+ * sPAPR TCE DMA interface
+ */
+static int spapr_tce_map(sPAPRVFIOData *data, struct tce_iommu_dma_map *param)
+{
+    VFIOContainer *container = container_of(data, VFIOContainer, iommu_data.spapr);
+    return ioctl(container->fd, SPAPR_TCE_IOMMU_MAP_DMA, param);
+}
+
+static int spapr_tce_unmap(sPAPRVFIOData *data, struct tce_iommu_dma_map *param)
+{
+    VFIOContainer *container = container_of(data, VFIOContainer, iommu_data.spapr);
+    return ioctl(container->fd, SPAPR_TCE_IOMMU_UNMAP_DMA, param);
+}
+
+/*
  * Interrupt setup
  */
 static void vfio_disable_interrupts(VFIODevice *vdev)
@@ -1591,6 +1614,46 @@ static int vfio_connect_container(VFIOGroup *group)
         memory_listener_register(&container->iommu_data.listener,
                                  get_system_memory());
 
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, SPAPR_TCE_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, SPAPR_TCE_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
+
+        container->iommu_data.spapr.info.argsz =
+                sizeof(container->iommu_data.spapr.info);
+        ret = ioctl(fd, SPAPR_TCE_IOMMU_GET_INFO,
+                    &container->iommu_data.spapr.info);
+        if (ret) {
+            error_report("vfio: failed to get iommu info for container: %s\n",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
+
+        /*
+         * At the moment of adding VFIO for SPAPR (server POWERPC), only one
+         * group per container is supported. This may change later.
+         */
+        container->iommu_data.spapr.map = spapr_tce_map;
+        container->iommu_data.spapr.unmap = spapr_tce_unmap;
+        spapr_register_vfio_container(group->groupid,
+                                      &container->iommu_data.spapr);
+
     } else {
         error_report("vfio: No available IOMMU models\n");
         g_free(container);
diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
index 00bb3dd..35df7e3 100644
--- a/hw/vfio_pci.h
+++ b/hw/vfio_pci.h
@@ -6,6 +6,7 @@
 #include "pci.h"
 #include "ioapic.h"
 #include "event_notifier.h"
+#include "hw/spapr_iommu_vfio.h"
 
 typedef struct VFIOPCIHostDevice {
     uint16_t seg;
@@ -59,6 +60,7 @@ typedef struct VFIOContainer {
     struct {
         union {
             MemoryListener listener;
+            sPAPRVFIOData spapr;
         };
         void (*release)(struct VFIOContainer *);
     } iommu_data;
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 300d49b..27a0501 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -442,4 +442,30 @@ struct vfio_iommu_type1_dma_unmap {
 
 #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
 
+/*
+ * Interface to SPAPR TCE (POWERPC Book3S)
+ */
+#define SPAPR_TCE_IOMMU         2
+
+struct tce_iommu_info {
+    __u32 argsz;
+    __u32 flags;
+    __u32 dma32_window_start;
+    __u32 dma32_window_size;
+    __u64 dma64_window_start;
+    __u64 dma64_window_size;
+};
+
+#define SPAPR_TCE_IOMMU_GET_INFO        _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+struct tce_iommu_dma_map {
+    __u32 argsz;
+    __u32 flags;
+    __u64 va;
+    __u64 dmaaddr;
+};
+
+#define SPAPR_TCE_IOMMU_MAP_DMA         _IO(VFIO_TYPE, VFIO_BASE + 13)
+#define SPAPR_TCE_IOMMU_UNMAP_DMA       _IO(VFIO_TYPE, VFIO_BASE + 14)
+
 #endif /* VFIO_H */
diff --git a/trace-events b/trace-events
index e548f86..9100591 100644
--- a/trace-events
+++ b/trace-events
@@ -848,6 +848,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
 qxl_render_update_area_done(void *cookie) "%p"
 
 # hw/spapr_pci.c
+spapr_pci(const char *msg1, const char *msg2) "%s%s"
 spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
 spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
 spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"
-- 
1.7.10.4


* [Qemu-devel] [PATCH 0/3] vfio-pci: reworking end-of-interrupt
  2012-07-12  5:29                 ` Alex Williamson
  2012-07-12  5:47                   ` Alexey Kardashevskiy
@ 2012-07-23  5:32                   ` Alexey Kardashevskiy
  2012-07-23  5:32                     ` [Qemu-devel] [PATCH 1/3] xics: added end-of-interrupt (EOI) handlers Alexey Kardashevskiy
                                       ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-23  5:32 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

Here is a small patchset just to keep things moving.

The main problem here is how we implement add_eoi_notifier(),
which is supposed to be a callback of a global interrupt controller
that does not exist at the QEMUMachine level.
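
To make the question concrete, here is a rough sketch of one possible shape for such a callback. Everything in it is an assumption for discussion: the Demo* names are hypothetical, the machine-level hook does not exist in QEMU, and the notifier is a simplified singly linked stand-in for the real Notifier:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_IRQS 4

/* Hypothetical, simplified stand-in for QEMU's Notifier. */
typedef struct DemoNotifier DemoNotifier;
struct DemoNotifier {
    void (*notify)(DemoNotifier *n, void *data);
    DemoNotifier *next;
};

/* Hypothetical controller with one EOI notifier list per interrupt source. */
typedef struct DemoIrqController {
    unsigned offset;                  /* first source number it handles */
    DemoNotifier *eoi[DEMO_NR_IRQS];  /* list heads, indexed by srcno - offset */
} DemoIrqController;

/* Assumed machine-level hook: devices ask the machine, not a controller. */
typedef struct DemoMachine {
    int (*add_eoi_notifier)(void *opaque, DemoNotifier *n, unsigned srcno);
    void *opaque;
} DemoMachine;

static int demo_add_eoi_notifier(void *opaque, DemoNotifier *n, unsigned srcno)
{
    DemoIrqController *ic = opaque;

    if (srcno < ic->offset || srcno - ic->offset >= DEMO_NR_IRQS) {
        return -1;                    /* source not owned by this controller */
    }
    n->next = ic->eoi[srcno - ic->offset];
    ic->eoi[srcno - ic->offset] = n;
    return 0;
}

/* Called by the controller when the guest signals EOI for srcno. */
static void demo_eoi(DemoIrqController *ic, unsigned srcno)
{
    DemoNotifier *n;

    if (srcno < ic->offset || srcno - ic->offset >= DEMO_NR_IRQS) {
        return;
    }
    for (n = ic->eoi[srcno - ic->offset]; n; n = n->next) {
        n->notify(n, NULL);
    }
}

/* Example consumer: a device that counts the EOIs it receives. */
typedef struct DemoDevice {
    DemoNotifier eoi;
    int eoi_count;
} DemoDevice;

static void demo_device_eoi(DemoNotifier *n, void *data)
{
    DemoDevice *dev = (DemoDevice *)((char *)n - offsetof(DemoDevice, eoi));

    (void)data;
    dev->eoi_count++;
}
```

A device would register through the machine hook and never name the IOAPIC or XICS directly, which is the property the #ifdef-free version would need.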

Alexey Kardashevskiy (3):
  xics: added end-of-interrupt (EOI) handlers
  ioapic: removed obsolete ioapic_remove_gsi_eoi_notifier
  vfio-pci: rework of EOI

 hw/ioapic.c   |   19 ++-----------------
 hw/ioapic.h   |    1 -
 hw/vfio_pci.c |   24 ++++++++++++++++--------
 hw/vfio_pci.h |    1 -
 hw/xics.c     |   13 +++++++++++++
 hw/xics.h     |    4 ++++
 6 files changed, 35 insertions(+), 27 deletions(-)

-- 
1.7.10.4


* [Qemu-devel] [PATCH 1/3] xics: added end-of-interrupt (EOI) handlers
  2012-07-23  5:32                   ` [Qemu-devel] [PATCH 0/3] vfio-pci: reworking end-of-interrupt Alexey Kardashevskiy
@ 2012-07-23  5:32                     ` Alexey Kardashevskiy
  2012-07-23  5:32                     ` [Qemu-devel] [PATCH 2/3] ioapic: removed obsolete ioapic_remove_gsi_eoi_notifier Alexey Kardashevskiy
  2012-07-23  5:32                     ` [Qemu-devel] [PATCH 3/3] vfio-pci: rework of EOI Alexey Kardashevskiy
  2 siblings, 0 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-23  5:32 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

The patch adds an EOI handler to process the h_eoi hypercall correctly
for PCI legacy interrupts.

This functionality is going to be used in VFIO later.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/xics.c |   13 +++++++++++++
 hw/xics.h |    4 ++++
 2 files changed, 17 insertions(+)

diff --git a/hw/xics.c b/hw/xics.c
index 668a0d6..d36d62c 100644
--- a/hw/xics.c
+++ b/hw/xics.c
@@ -170,6 +170,7 @@ struct ics_irq_state {
     int sent:1;
     int rejected:1;
     int masked_pending:1;
+    NotifierList eoi_notifier;
 };
 
 struct ics_state {
@@ -309,6 +310,8 @@ static void ics_eoi(struct ics_state *ics, int nr)
     if (irq->type == XICS_LSI) {
         irq->sent = 0;
     }
+
+    notifier_list_notify(&irq->eoi_notifier, NULL);
 }
 
 /*
@@ -536,6 +539,7 @@ struct icp_state *xics_system_init(int nr_irqs)
     for (i = 0; i < nr_irqs; i++) {
         ics->irqs[i].priority = 0xff;
         ics->irqs[i].saved_priority = 0xff;
+        notifier_list_init(&ics->irqs[i].eoi_notifier);
     }
 
     ics->qirqs = qemu_allocate_irqs(ics_set_irq, ics, nr_irqs);
@@ -552,3 +556,12 @@ struct icp_state *xics_system_init(int nr_irqs)
 
     return icp;
 }
+
+void xics_add_eoi_notifier(Notifier *notify, uint32_t srcno)
+{
+    struct ics_state *ics = spapr->icp->ics;
+    struct ics_irq_state *irq = &ics->irqs[srcno - ics->offset];
+
+    notifier_list_add(&irq->eoi_notifier, notify);
+}
+
diff --git a/hw/xics.h b/hw/xics.h
index 2080159..ca75fac 100644
--- a/hw/xics.h
+++ b/hw/xics.h
@@ -27,6 +27,8 @@
 #if !defined(__XICS_H__)
 #define __XICS_H__
 
+#include "notify.h"
+
 #define XICS_IPI        0x2
 
 struct icp_state;
@@ -41,4 +43,6 @@ qemu_irq xics_assign_irq(struct icp_state *icp, int irq,
 
 struct icp_state *xics_system_init(int nr_irqs);
 
+void xics_add_eoi_notifier(Notifier *notify, uint32_t srcno);
+
 #endif /* __XICS_H__ */
-- 
1.7.10.4


* [Qemu-devel] [PATCH 2/3] ioapic: removed obsolete ioapic_remove_gsi_eoi_notifier
  2012-07-23  5:32                   ` [Qemu-devel] [PATCH 0/3] vfio-pci: reworking end-of-interrupt Alexey Kardashevskiy
  2012-07-23  5:32                     ` [Qemu-devel] [PATCH 1/3] xics: added end-of-interrupt (EOI) handlers Alexey Kardashevskiy
@ 2012-07-23  5:32                     ` Alexey Kardashevskiy
  2012-07-23  5:32                     ` [Qemu-devel] [PATCH 3/3] vfio-pci: rework of EOI Alexey Kardashevskiy
  2 siblings, 0 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-23  5:32 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

As the Notifier struct contains everything it needs to be removed
from the notifier list, there is no need for ioapic_remove_gsi_eoi_notifier().
This patch removes it.
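
As an illustration of why no list head or GSI is needed for removal, here is a minimal sketch assuming the list entry stores a back-pointer the way QLIST-style lists do. DemoNode and the demo_* helpers are hypothetical names, not the QEMU API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list entry in the style of a QLIST node. */
typedef struct DemoNode DemoNode;
struct DemoNode {
    DemoNode *next;
    DemoNode **pprev;   /* address of the pointer that points at this node */
};

static void demo_insert_head(DemoNode **head, DemoNode *n)
{
    n->next = *head;
    if (*head) {
        (*head)->pprev = &n->next;
    }
    *head = n;
    n->pprev = head;
}

/* Removal takes only the node: the back-pointer locates the predecessor. */
static void demo_remove(DemoNode *n)
{
    if (n->next) {
        n->next->pprev = n->pprev;
    }
    *n->pprev = n->next;
}
```

Because the node can unlink itself, the caller does not have to look up which pin's list it was added to.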

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ioapic.c |   19 ++-----------------
 hw/ioapic.h |    1 -
 2 files changed, 2 insertions(+), 18 deletions(-)

diff --git a/hw/ioapic.c b/hw/ioapic.c
index a6e0387..ead1b5f 100644
--- a/hw/ioapic.c
+++ b/hw/ioapic.c
@@ -139,8 +139,7 @@ void ioapic_eoi_broadcast(int vector)
     }
 }
 
-static void ioapic_update_gsi_eoi_notifier(Notifier *notify, uint32_t gsi,
-                                           bool add)
+void ioapic_add_gsi_eoi_notifier(Notifier *notify, uint32_t gsi)
 {
     IOAPICCommonState *s;
     int i;
@@ -159,25 +158,11 @@ static void ioapic_update_gsi_eoi_notifier(Notifier *notify, uint32_t gsi,
             continue;
         }
 
-        if (add) {
-            notifier_list_add(&s->eoi_notifiers[pin], notify);
-        } else {
-            notifier_remove(notify);
-        }
+        notifier_list_add(&s->eoi_notifiers[pin], notify);
         return;
     }
 }
 
-void ioapic_add_gsi_eoi_notifier(Notifier *notify, uint32_t gsi)
-{
-    ioapic_update_gsi_eoi_notifier(notify, gsi, true);
-}
-
-void ioapic_remove_gsi_eoi_notifier(Notifier *notify, uint32_t gsi)
-{
-    ioapic_update_gsi_eoi_notifier(notify, gsi, false);
-}
-
 static uint64_t
 ioapic_mem_read(void *opaque, target_phys_addr_t addr, unsigned int size)
 {
diff --git a/hw/ioapic.h b/hw/ioapic.h
index a28fada..2d7d6a2 100644
--- a/hw/ioapic.h
+++ b/hw/ioapic.h
@@ -27,6 +27,5 @@
 
 void ioapic_eoi_broadcast(int vector);
 void ioapic_add_gsi_eoi_notifier(Notifier *notify, uint32_t gsi);
-void ioapic_remove_gsi_eoi_notifier(Notifier *notify, uint32_t gsi);
 
 #endif /* !HW_IOAPIC_H */
-- 
1.7.10.4


* [Qemu-devel] [PATCH 3/3] vfio-pci: rework of EOI
  2012-07-23  5:32                   ` [Qemu-devel] [PATCH 0/3] vfio-pci: reworking end-of-interrupt Alexey Kardashevskiy
  2012-07-23  5:32                     ` [Qemu-devel] [PATCH 1/3] xics: added end-of-interrupt (EOI) handlers Alexey Kardashevskiy
  2012-07-23  5:32                     ` [Qemu-devel] [PATCH 2/3] ioapic: removed obsolete ioapic_remove_gsi_eoi_notifier Alexey Kardashevskiy
@ 2012-07-23  5:32                     ` Alexey Kardashevskiy
  2 siblings, 0 replies; 52+ messages in thread
From: Alexey Kardashevskiy @ 2012-07-23  5:32 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

Originally VFIO was coded to support IOAPIC only (i.e. x86).
The patch adds XICS (the POWERPC interrupt controller) and replaces
ioapic_add_gsi_eoi_notifier with a unified macro to keep the number of
#ifdef TARGET_PPC64 blocks as small as possible.

It still needs some rework to get rid of #ifdef TARGET_PPC64 entirely.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio_pci.c |   24 ++++++++++++++++--------
 hw/vfio_pci.h |    1 -
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index fd65731..cd68fe0 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -21,7 +21,6 @@
 #include <dirent.h>
 #include <stdio.h>
 #include <unistd.h>
-#include <sys/io.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
 #include <sys/types.h>
@@ -44,6 +43,15 @@
 #include "range.h"
 #include "vfio_pci.h"
 
+#ifndef TARGET_PPC64
+#include <sys/io.h>
+#include "ioapic.h"
+#define vfio_irq_add_eoi_notifier   ioapic_add_gsi_eoi_notifier
+#else
+#include "xics.h"
+#define vfio_irq_add_eoi_notifier   xics_add_eoi_notifier
+#endif
+
 //#define DEBUG_VFIO
 #ifdef DEBUG_VFIO
 #define DPRINTF(fmt, ...) \
@@ -258,7 +266,7 @@ static void vfio_enable_intx_kvm(VFIODevice *vdev)
     irqfd.fd = event_notifier_get_fd(&vdev->intx.interrupt);
 
     qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
-    ioapic_remove_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
+    notifier_remove(&vdev->intx.eoi);
     vfio_mask_intx(vdev);
     vdev->intx.pending = false;
     qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
@@ -294,7 +302,7 @@ static void vfio_enable_intx_kvm(VFIODevice *vdev)
     return;
 
 fail:
-    ioapic_add_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
+    vfio_irq_add_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
     qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
     vfio_unmask_intx(vdev);
 #endif
@@ -341,7 +349,7 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev)
 
     event_notifier_cleanup(&vdev->intx.unmask);
 
-    ioapic_add_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
+    vfio_irq_add_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
     qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
     vfio_unmask_intx(vdev);
 
@@ -366,7 +374,7 @@ static void vfio_update_irq(Notifier *notify, void *data)
             vdev->host.func, vdev->intx.irq, irq);
 
     vfio_disable_intx_kvm(vdev);
-    ioapic_remove_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
+    notifier_remove(&vdev->intx.eoi);
 
     vdev->intx.irq = irq;
 
@@ -375,7 +383,7 @@ static void vfio_update_irq(Notifier *notify, void *data)
         return;
     }
 
-    ioapic_add_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
+    vfio_irq_add_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
     vfio_enable_intx_kvm(vdev);
 
     /* Re-enable the interrupt in cased we missed an EOI */
@@ -404,7 +412,7 @@ static int vfio_enable_intx(VFIODevice *vdev)
     vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
     vdev->intx.irq = pci_get_irq(&vdev->pdev, vdev->intx.pin);
     vdev->intx.eoi.notify = vfio_eoi;
-    ioapic_add_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
+    vfio_irq_add_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
 
     vdev->intx.update_irq.notify = vfio_update_irq;
     pci_add_irq_update_notifier(&vdev->pdev, &vdev->intx.update_irq);
@@ -441,7 +449,7 @@ static void vfio_disable_intx(VFIODevice *vdev)
     vfio_disable_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX);
 
     pci_remove_irq_update_notifier(&vdev->intx.update_irq);
-    ioapic_remove_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq);
+    notifier_remove(&vdev->intx.eoi);
 
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, NULL, NULL, vdev);
diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
index 00bb3dd..d1a7434 100644
--- a/hw/vfio_pci.h
+++ b/hw/vfio_pci.h
@@ -4,7 +4,6 @@
 #include "qemu-common.h"
 #include "qemu-queue.h"
 #include "pci.h"
-#include "ioapic.h"
 #include "event_notifier.h"
 
 typedef struct VFIOPCIHostDevice {
-- 
1.7.10.4
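
The hunks above swap ioapic_remove_gsi_eoi_notifier(&vdev->intx.eoi, vdev->intx.irq) for notifier_remove(&vdev->intx.eoi), which takes no IRQ number. That is possible when each notifier is linked intrusively into its list, so the entry alone identifies the slot to unlink. The following is a minimal sketch of that idea, not QEMU's actual code: every Demo* and demo_* name is hypothetical, and the real Notifier is assumed to behave similarly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a QEMU-style Notifier; not the real definition. */
typedef struct DemoNotifier DemoNotifier;
struct DemoNotifier {
    void (*notify)(DemoNotifier *n, void *data);
    DemoNotifier *next;    /* next entry in the list, or NULL */
    DemoNotifier **pprev;  /* address of the pointer that points at this entry */
};

/* One list per interrupt source (for example, per GSI or per XICS irq). */
typedef struct DemoNotifierList {
    DemoNotifier *head;
} DemoNotifierList;

/* Insert at the head; the caller picks the list, so an IRQ is needed here. */
void demo_notifier_list_add(DemoNotifierList *list, DemoNotifier *n)
{
    n->next = list->head;
    if (list->head) {
        list->head->pprev = &n->next;
    }
    list->head = n;
    n->pprev = &list->head;
}

/* Removal needs only the entry: pprev already names the slot to patch. */
void demo_notifier_remove(DemoNotifier *n)
{
    if (n->next) {
        n->next->pprev = n->pprev;
    }
    *n->pprev = n->next;
    n->next = NULL;
    n->pprev = NULL;
}

/* Call every registered handler; tolerate a handler removing itself. */
void demo_notifier_list_notify(DemoNotifierList *list, void *data)
{
    DemoNotifier *n = list->head;
    while (n) {
        DemoNotifier *next = n->next;
        n->notify(n, data);
        n = next;
    }
}

/* Example handler: counts how many EOI notifications were delivered. */
void demo_count_eoi(DemoNotifier *n, void *data)
{
    (void)n;
    ++*(int *)data;
}
```

Under this layout, only the "add" side has to know which interrupt controller and which IRQ the list belongs to, which matches the patch keeping a per-platform vfio_irq_add_eoi_notifier macro while using one generic removal call.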


end of thread, other threads: [~2012-07-23  5:33 UTC]

Thread overview: 52+ messages
2012-07-10  5:51 [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alexey Kardashevskiy
2012-07-10  5:51 ` [Qemu-devel] [PATCH 1/2] pseries pci: spapr_finalize_pci_setup introduced Alexey Kardashevskiy
2012-07-10  5:51 ` [Qemu-devel] [PATCH 2/2] vfio-powerpc: added VFIO support Alexey Kardashevskiy
2012-07-10 16:55   ` Alex Williamson
2012-07-10 21:32     ` Benjamin Herrenschmidt
2012-07-10 21:48       ` Alex Williamson
2012-07-10 21:53         ` Benjamin Herrenschmidt
2012-07-11  2:54     ` Alexey Kardashevskiy
2012-07-11  3:10       ` Benjamin Herrenschmidt
2012-07-12  3:11       ` Alex Williamson
2012-07-12  8:47         ` Alexey Kardashevskiy
2012-07-10 22:26   ` Scott Wood
2012-07-10 23:55     ` Alexey Kardashevskiy
2012-07-11  0:04       ` Benjamin Herrenschmidt
2012-07-11  0:17         ` Alexey Kardashevskiy
2012-07-11  0:26           ` Benjamin Herrenschmidt
2012-07-10 16:57 ` [Qemu-devel] [PATCH 0/2] RFC: powerpc-vfio: adding support Alex Williamson
2012-07-11  2:25   ` Alexey Kardashevskiy
2012-07-12  2:54     ` Alex Williamson
2012-07-12  4:16       ` Alexey Kardashevskiy
2012-07-12  4:31         ` Alex Williamson
2012-07-12  4:38           ` Alexey Kardashevskiy
2012-07-12  4:43             ` Alex Williamson
2012-07-12  4:58               ` Alexey Kardashevskiy
2012-07-12  5:29                 ` Alex Williamson
2012-07-12  5:47                   ` Alexey Kardashevskiy
2012-07-16  3:51                     ` Alexey Kardashevskiy
2012-07-23  5:32                   ` [Qemu-devel] [PATCH 0/3] vfio-pci: reworking end-of-interrupt Alexey Kardashevskiy
2012-07-23  5:32                     ` [Qemu-devel] [PATCH 1/3] xics: added end-of-interrupt (EOI) handlers Alexey Kardashevskiy
2012-07-23  5:32                     ` [Qemu-devel] [PATCH 2/3] ioapic: removed obsolete ioapic_remove_gsi_eoi_notifier Alexey Kardashevskiy
2012-07-23  5:32                     ` [Qemu-devel] [PATCH 3/3] vfio-pci: rework of EOI Alexey Kardashevskiy
2012-07-12  8:52 ` [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v2) Alexey Kardashevskiy
2012-07-12 20:54   ` [Qemu-devel] [Qemu-ppc] " Blue Swirl
2012-07-12 21:37     ` Alex Williamson
2012-07-13  5:24     ` Alexey Kardashevskiy
2012-07-13 14:33       ` Blue Swirl
2012-07-12 22:35   ` Scott Wood
2012-07-13  5:31     ` Alexey Kardashevskiy
2012-07-13  3:47   ` [Qemu-devel] " Alex Williamson
2012-07-13  5:03     ` Alexey Kardashevskiy
2012-07-13  7:26 ` [Qemu-devel] [PATCH] RFC: vfio-powerpc: added VFIO support (v3) Alexey Kardashevskiy
2012-07-13 14:38   ` Blue Swirl
2012-07-13 15:07   ` Alex Williamson
2012-07-14  2:34     ` Alexey Kardashevskiy
2012-07-16 14:21       ` Alex Williamson
2012-07-16 21:17         ` Alex Williamson
2012-07-17  7:53         ` Alexey Kardashevskiy
2012-07-17 14:11           ` Alex Williamson
2012-07-18 11:09 ` [Qemu-devel] [PATCH] vfio-powerpc: added VFIO support (v4) Alexey Kardashevskiy
2012-07-18 14:14   ` Alex Williamson
2012-07-19  4:01     ` Alexey Kardashevskiy
2012-07-19  4:04 ` [Qemu-devel] [PATCH] vfio-powerpc: added VFIO support (v5) Alexey Kardashevskiy
