All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough
@ 2014-04-09 15:33 Eric Auger
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 1/6] hw/arm/virt: add a xgmac device Eric Auger
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Eric Auger @ 2014-04-09 15:33 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel
  Cc: peter.maydell, kim.phillips, Eric Auger, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

This is an RFC about QEMU VFIO platform device implementation.
This work aims at enabling KVM platform device passthrough,
where a guest OS is directly assigned a platform device: meaning
it can access the register space without trapping, and receive IRQ
from the device. Also the system MMU downstream to the device is
programmed properly to make sure the device can only see the VM
assigned physical address space.

This work was tested on Calxeda Midway where one xgmac is assigned to
KVM host while the second one is assigned to the guest. The current
status is the guest is able to ping the gateway meaning the basic
functionality is achieved for this device.

Nethertheless some key work remains to be done:
- unbind/migration
- multi-instantiation testing
- multiple IRQ testing
- management of platform devices with more complex device tree node

- The 1st RFC patch works around the limitation of QEMU not being able
to create platform devices from the command line, and should only
be used for testing the second patch.  A discussion of how to address
the limitation is encouraged, however, in the context of the first patch.

- The 2d patch addresses A.Williamson's comment
to have the platform device code separated from the PCI device
code, by firstly moving the existing VFIO PCI code into its
own hw/vfio/.

- the 3d RFC patch achieves MMIO direct access to the device.

- the 4th RFC brings basic IRQ support

- the 5th RFC enables the QEMU end-user to pass the device he wants
  to assign to his guest in the command line.

- the last RFC brings some cleanup and paves the way for cleaner exit

Here are the instructions to test on a Calxeda Midway:

https://wiki.linaro.org/LEG/Engineering/Virtualization/Platform_Device_Passthrough_on_Midway

The code is based on today's: git://git.qemu.org/qemu.git

some patches need to be applied on host linux to correct issues
not yet upstreamed. They can be found on vfio-dev branch of
git://git.linaro.org/people/eric.auger/linux.git

- commit 997691b: enables unmapping stage2 entries on memory region deletion
- commit 731e308: corrects an in read/write function of vfio platform driver
- iommu/arm-smmu commit serie and especially bf5a852 fixing an issue with
  Calxeda Midway sMMU.

QEMU commits can be found on vfio-dev-integ
git://git.linaro.org/people/eric.auger/qemu.git

Best Regards

Eric


Eric Auger (3):
  vfio: Add initial IRQ support in QEMU platform device
  virt: Assign a VFIO platform device to a virt VM in QEMU command line
  vfio: add exit function and IRQ disable functions

Kim Phillips (3):
  hw/arm/virt: add a xgmac device
  vfio: move hw/misc/vfio.c to hw/vfio/pci.c
  vfio: add vfio-platform support

 LICENSE                        |   2 +-
 MAINTAINERS                    |   2 +-
 hw/Makefile.objs               |   1 +
 hw/arm/virt.c                  | 161 ++++++++-
 hw/misc/Makefile.objs          |   1 -
 hw/vfio/Makefile.objs          |   5 +
 hw/vfio/common.c               | 486 ++++++++++++++++++++++++++
 hw/{misc/vfio.c => vfio/pci.c} | 480 +-------------------------
 hw/vfio/platform.c             | 765 +++++++++++++++++++++++++++++++++++++++++
 hw/vfio/vfio-common.h          |  55 +++
 10 files changed, 1480 insertions(+), 478 deletions(-)
 create mode 100644 hw/vfio/Makefile.objs
 create mode 100644 hw/vfio/common.c
 rename hw/{misc/vfio.c => vfio/pci.c} (89%)
 create mode 100644 hw/vfio/platform.c
 create mode 100644 hw/vfio/vfio-common.h

-- 
1.8.3.2

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Qemu-devel] [RFC v2 1/6] hw/arm/virt: add a xgmac device
  2014-04-09 15:33 [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough Eric Auger
@ 2014-04-09 15:33 ` Eric Auger
  2014-04-10 13:26   ` Peter Crosthwaite
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 2/6] vfio: move hw/misc/vfio.c to hw/vfio/pci.c Eric Auger
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Eric Auger @ 2014-04-09 15:33 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel
  Cc: peter.maydell, kim.phillips, Kim Phillips, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

From: Kim Phillips <kim.phillips@linaro.org>

This is a hack and only serves as an example of what needs to be
done to make the next RFC - add vfio-platform support - work
for development purposes on a Calxeda Midway system.  We don't want
mach-virt to always create this ethernet device - DO NOT APPLY, etc.

Initial attempts to convince QEMU to create a memory mapped device
on the command line (e.g., -device vfio-platform,name=fff51000.ethernet)
would fail with "Parameter 'driver' expects pluggable device type".
Any guidance as to how to overcome this apparent design limitation
is welcome.

RAM is reduced from 30 to 1GiB such as to not overlap the xgmac device's
physical address.  Not sure if the 30GiB RAM (or whatever the user sets
it to with -m) could be set up above 0x1_0000_0000, but there is probably
extra work needed to resolve this type of conflict.

note: vfio-platform interrupt support development may want interrupt
property data filled; here it's omitted for the time being.

Not-signed-off-by: Kim Phillips <kim.phillips@linaro.org>
---
 hw/arm/virt.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 2bbc931..5d43cf0 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -65,6 +65,7 @@ enum {
     VIRT_GIC_CPU,
     VIRT_UART,
     VIRT_MMIO,
+    VIRT_ETHERNET,
 };
 
 typedef struct MemMapEntry {
@@ -106,7 +107,8 @@ static const MemMapEntry a15memmap[] = {
     [VIRT_MMIO] = { 0xa000000, 0x200 },
     /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */
     /* 0x10000000 .. 0x40000000 reserved for PCI */
-    [VIRT_MEM] = { 0x40000000, 30ULL * 1024 * 1024 * 1024 },
+    [VIRT_MEM] = { 0x40000000, 1ULL * 1024 * 1024 * 1024 },
+    [VIRT_ETHERNET] = { 0xfff51000, 0x1000 },
 };
 
 static const int a15irqmap[] = {
@@ -291,6 +293,25 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic)
     g_free(nodename);
 }
 
+static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
+{
+    char *nodename;
+    hwaddr base = vbi->memmap[VIRT_ETHERNET].base;
+    hwaddr size = vbi->memmap[VIRT_ETHERNET].size;
+    const char compat[] = "calxeda,hb-xgmac";
+
+    sysbus_create_simple("vfio-platform", base, NULL);
+
+    nodename = g_strdup_printf("/ethernet@%" PRIx64, base);
+    qemu_fdt_add_subnode(vbi->fdt, nodename);
+
+    /* Note that we can't use setprop_string because of the embedded NUL */
+    qemu_fdt_setprop(vbi->fdt, nodename, "compatible", compat, sizeof(compat));
+    qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", 2, base, 2, size);
+
+    g_free(nodename);
+}
+
 static void create_virtio_devices(const VirtBoardInfo *vbi, qemu_irq *pic)
 {
     int i;
@@ -425,6 +446,7 @@ static void machvirt_init(QEMUMachineInitArgs *args)
     }
 
     create_uart(vbi, pic);
+    create_ethernet(vbi, pic);
 
     /* Create mmio transports, so the user can create virtio backends
      * (which will be automatically plugged in to the transports). If
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [Qemu-devel] [RFC v2 2/6] vfio: move hw/misc/vfio.c to hw/vfio/pci.c
  2014-04-09 15:33 [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough Eric Auger
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 1/6] hw/arm/virt: add a xgmac device Eric Auger
@ 2014-04-09 15:33 ` Eric Auger
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 3/6] vfio: add vfio-platform support Eric Auger
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Eric Auger @ 2014-04-09 15:33 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel
  Cc: peter.maydell, kim.phillips, Kim Phillips, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

From: Kim Phillips <kim.phillips@linaro.org>

This is done in preparation for the addition of VFIO platform
device support.

Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
---
 LICENSE                        | 2 +-
 MAINTAINERS                    | 2 +-
 hw/Makefile.objs               | 1 +
 hw/misc/Makefile.objs          | 1 -
 hw/vfio/Makefile.objs          | 3 +++
 hw/{misc/vfio.c => vfio/pci.c} | 0
 6 files changed, 6 insertions(+), 3 deletions(-)
 create mode 100644 hw/vfio/Makefile.objs
 rename hw/{misc/vfio.c => vfio/pci.c} (100%)

diff --git a/LICENSE b/LICENSE
index da70e94..0e0b4b9 100644
--- a/LICENSE
+++ b/LICENSE
@@ -11,7 +11,7 @@ option) any later version.
 
 As of July 2013, contributions under version 2 of the GNU General Public
 License (and no later version) are only accepted for the following files
-or directories: bsd-user/, linux-user/, hw/misc/vfio.c, hw/xen/xen_pt*.
+or directories: bsd-user/, linux-user/, hw/vfio/, hw/xen/xen_pt*.
 
 3) The Tiny Code Generator (TCG) is released under the BSD license
    (see license headers in files).
diff --git a/MAINTAINERS b/MAINTAINERS
index 7d17f83..9b5d569 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -605,7 +605,7 @@ F: hw/usb/*
 VFIO
 M: Alex Williamson <alex.williamson@redhat.com>
 S: Supported
-F: hw/misc/vfio.c
+F: hw/vfio/*
 
 vhost
 M: Michael S. Tsirkin <mst@redhat.com>
diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index d178b65..b16dada 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -26,6 +26,7 @@ devices-dirs-$(CONFIG_SOFTMMU) += ssi/
 devices-dirs-$(CONFIG_SOFTMMU) += timer/
 devices-dirs-$(CONFIG_TPM) += tpm/
 devices-dirs-$(CONFIG_SOFTMMU) += usb/
+devices-dirs-$(CONFIG_SOFTMMU) += vfio/
 devices-dirs-$(CONFIG_VIRTIO) += virtio/
 devices-dirs-$(CONFIG_SOFTMMU) += watchdog/
 devices-dirs-$(CONFIG_SOFTMMU) += xen/
diff --git a/hw/misc/Makefile.objs b/hw/misc/Makefile.objs
index f674365..656570c 100644
--- a/hw/misc/Makefile.objs
+++ b/hw/misc/Makefile.objs
@@ -21,7 +21,6 @@ common-obj-$(CONFIG_MACIO) += macio/
 
 ifeq ($(CONFIG_PCI), y)
 obj-$(CONFIG_KVM) += ivshmem.o
-obj-$(CONFIG_LINUX) += vfio.o
 endif
 
 obj-$(CONFIG_REALVIEW) += arm_sysctl.o
diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
new file mode 100644
index 0000000..31c7dab
--- /dev/null
+++ b/hw/vfio/Makefile.objs
@@ -0,0 +1,3 @@
+ifeq ($(CONFIG_LINUX), y)
+obj-$(CONFIG_PCI) += pci.o
+endif
diff --git a/hw/misc/vfio.c b/hw/vfio/pci.c
similarity index 100%
rename from hw/misc/vfio.c
rename to hw/vfio/pci.c
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [Qemu-devel] [RFC v2 3/6] vfio: add vfio-platform support
  2014-04-09 15:33 [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough Eric Auger
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 1/6] hw/arm/virt: add a xgmac device Eric Auger
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 2/6] vfio: move hw/misc/vfio.c to hw/vfio/pci.c Eric Auger
@ 2014-04-09 15:33 ` Eric Auger
       [not found]   ` <1397832861.3060.93.camel@ul30vt.home>
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 4/6] vfio: Add initial IRQ support in QEMU platform device Eric Auger
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Eric Auger @ 2014-04-09 15:33 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel
  Cc: peter.maydell, kim.phillips, Kim Phillips, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

From: Kim Phillips <kim.phillips@linaro.org>

Functions for which PCI and platform device support share are moved
into common.c.  The common vfio_{get,put}_group() get an additional
argument, a pointer to a vfio_reset_handler(), for which to pass on to
qemu_register_reset, but only if it exists (the platform device code
currently passes a NULL as its reset_handler).

For the platform device code, we basically use SysBusDevice
instead of PCIDevice.  Since realize() returns void, unlike
PCIDevice's initfn, error codes are moved into the
error message text with %m.

Currently only MMIO access is supported at this time.

The perceived path for future QEMU development is:

- add support for interrupts
- verify and test platform dev unmap path
- test existing PCI path for any regressions
- add support for creating platform devices on the qemu command line
  - currently device address specification is hardcoded for test
    development on Calxeda Midway's fff51000.ethernet device
- reset is not supported and registration of reset functions is
  bypassed for platform devices.
  - there is no standard means of resetting a platform device,
    unsure if it suffices to be handled at device--VFIO binding time

Signed-off-by: Kim Phillips <kim.phillips@linaro.org>

[1] http://www.spinics.net/lists/kvm-arm/msg08195.html
---
 hw/vfio/Makefile.objs |   2 +
 hw/vfio/common.c      | 486 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c         | 480 ++-----------------------------------------------
 hw/vfio/platform.c    | 381 +++++++++++++++++++++++++++++++++++++++
 hw/vfio/vfio-common.h |  55 ++++++
 5 files changed, 937 insertions(+), 467 deletions(-)
 create mode 100644 hw/vfio/common.c
 create mode 100644 hw/vfio/platform.c
 create mode 100644 hw/vfio/vfio-common.h

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index 31c7dab..c5c76fe 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,3 +1,5 @@
 ifeq ($(CONFIG_LINUX), y)
+obj-$(CONFIG_SOFTMMU) += common.o
 obj-$(CONFIG_PCI) += pci.o
+obj-$(CONFIG_SOFTMMU) += platform.o
 endif
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
new file mode 100644
index 0000000..9d1f723
--- /dev/null
+++ b/hw/vfio/common.c
@@ -0,0 +1,486 @@
+/*
+ * vfio based device assignment support
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on qemu-kvm device-assignment:
+ *  Adapted for KVM by Qumranet.
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
+ *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
+ */
+
+#include <dirent.h>
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "config.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/pci.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
+
+#include "vfio-common.h"
+
+#define DEBUG_VFIO
+#ifdef DEBUG_VFIO
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+static QLIST_HEAD(, VFIOContainer)
+    container_list = QLIST_HEAD_INITIALIZER(container_list);
+
+QLIST_HEAD(, VFIOGroup)
+    group_list = QLIST_HEAD_INITIALIZER(group_list);
+
+
+struct VFIODevice;
+
+#ifdef CONFIG_KVM
+/*
+ * We have a single VFIO pseudo device per KVM VM.  Once created it lives
+ * for the life of the VM.  Closing the file descriptor only drops our
+ * reference to it and the device's reference to kvm.  Therefore once
+ * initialized, this file descriptor is only released on QEMU exit and
+ * we'll re-use it should another vfio device be attached before then.
+ */
+static int vfio_kvm_device_fd = -1;
+#endif
+
+/*
+ * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
+ */
+static int vfio_dma_unmap(VFIOContainer *container,
+                          hwaddr iova, ram_addr_t size)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = 0,
+        .iova = iova,
+        .size = size,
+    };
+
+    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        DPRINTF("VFIO_UNMAP_DMA: %d\n", -errno);
+        return -errno;
+    }
+
+    return 0;
+}
+
+static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                        ram_addr_t size, void *vaddr, bool readonly)
+{
+    struct vfio_iommu_type1_dma_map map = {
+        .argsz = sizeof(map),
+        .flags = VFIO_DMA_MAP_FLAG_READ,
+        .vaddr = (__u64)(uintptr_t)vaddr,
+        .iova = iova,
+        .size = size,
+    };
+
+    if (!readonly) {
+        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
+    }
+
+    /*
+     * Try the mapping, if it fails with EBUSY, unmap the region and try
+     * again.  This shouldn't be necessary, but we sometimes see it in
+     * the the VGA ROM space.
+     */
+    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
+        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
+         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
+        return 0;
+    }
+
+    DPRINTF("VFIO_MAP_DMA: %d\n", -errno);
+    return -errno;
+}
+
+static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+{
+    return !memory_region_is_ram(section->mr) ||
+           /*
+            * Sizing an enabled 64-bit BAR can cause spurious mappings to
+            * addresses in the upper part of the 64-bit address space.  These
+            * are never accessed by the CPU and beyond the address width of
+            * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
+            */
+           section->offset_within_address_space & (1ULL << 63);
+}
+
+static void vfio_listener_region_add(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.type1.listener);
+    hwaddr iova, end;
+    void *vaddr;
+    int ret;
+
+    assert(!memory_region_is_iommu(section->mr));
+
+    if (vfio_listener_skipped_section(section)) {
+        DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
+                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    end = (section->offset_within_address_space + int128_get64(section->size)) &
+          TARGET_PAGE_MASK;
+
+    if (iova >= end) {
+        return;
+    }
+
+    vaddr = memory_region_get_ram_ptr(section->mr) +
+            section->offset_within_region +
+            (iova - section->offset_within_address_space);
+
+    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
+            iova, end - 1, vaddr);
+
+    memory_region_ref(section->mr);
+    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
+    if (ret) {
+        error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+                     "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                     container, iova, end - iova, vaddr, ret);
+
+        /*
+         * On the initfn path, store the first error in the container so we
+         * can gracefully fail.  Runtime, there's not much we can do other
+         * than throw a hardware error.
+         */
+        if (!container->iommu_data.type1.initialized) {
+            if (!container->iommu_data.type1.error) {
+                container->iommu_data.type1.error = ret;
+            }
+        } else {
+            hw_error("vfio: DMA mapping failed, unable to continue");
+        }
+    }
+}
+
+static void vfio_listener_region_del(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.type1.listener);
+    hwaddr iova, end;
+    int ret;
+
+    if (vfio_listener_skipped_section(section)) {
+        DPRINTF("SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n",
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
+                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    end = (section->offset_within_address_space + int128_get64(section->size)) &
+          TARGET_PAGE_MASK;
+
+    if (iova >= end) {
+        return;
+    }
+
+    DPRINTF("region_del %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
+            iova, end - 1);
+
+    ret = vfio_dma_unmap(container, iova, end - iova);
+    memory_region_unref(section->mr);
+    if (ret) {
+        error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                     "0x%"HWADDR_PRIx") = %d (%m)",
+                     container, iova, end - iova, ret);
+    }
+}
+
+static MemoryListener vfio_memory_listener = {
+    .region_add = vfio_listener_region_add,
+    .region_del = vfio_listener_region_del,
+};
+
+static void vfio_listener_release(VFIOContainer *container)
+{
+    memory_listener_unregister(&container->iommu_data.type1.listener);
+}
+
+static void vfio_kvm_device_add_group(VFIOGroup *group)
+{
+#ifdef CONFIG_KVM
+    struct kvm_device_attr attr = {
+        .group = KVM_DEV_VFIO_GROUP,
+        .attr = KVM_DEV_VFIO_GROUP_ADD,
+        .addr = (uint64_t)(unsigned long)&group->fd,
+    };
+
+    if (!kvm_enabled()) {
+        return;
+    }
+
+    if (vfio_kvm_device_fd < 0) {
+        struct kvm_create_device cd = {
+            .type = KVM_DEV_TYPE_VFIO,
+        };
+
+        if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
+            DPRINTF("KVM_CREATE_DEVICE: %m\n");
+            return;
+        }
+
+        vfio_kvm_device_fd = cd.fd;
+    }
+
+    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+        error_report("Failed to add group %d to KVM VFIO device: %m",
+                     group->groupid);
+    }
+#endif
+}
+
+static void vfio_kvm_device_del_group(VFIOGroup *group)
+{
+#ifdef CONFIG_KVM
+    struct kvm_device_attr attr = {
+        .group = KVM_DEV_VFIO_GROUP,
+        .attr = KVM_DEV_VFIO_GROUP_DEL,
+        .addr = (uint64_t)(unsigned long)&group->fd,
+    };
+
+    if (vfio_kvm_device_fd < 0) {
+        return;
+    }
+
+    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+        error_report("Failed to remove group %d from KVM VFIO device: %m",
+                     group->groupid);
+    }
+#endif
+}
+
+static int vfio_connect_container(VFIOGroup *group)
+{
+    VFIOContainer *container;
+    int ret, fd;
+
+    if (group->container) {
+        return 0;
+    }
+
+    QLIST_FOREACH(container, &container_list, next) {
+        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+            group->container = container;
+            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+            return 0;
+        }
+    }
+
+    fd = qemu_open("/dev/vfio/vfio", O_RDWR);
+    if (fd < 0) {
+        error_report("vfio: failed to open /dev/vfio/vfio: %m");
+        return -errno;
+    }
+
+    ret = ioctl(fd, VFIO_GET_API_VERSION);
+    if (ret != VFIO_API_VERSION) {
+        error_report("vfio: supported vfio version: %d, "
+                     "reported version: %d", VFIO_API_VERSION, ret);
+        close(fd);
+        return -EINVAL;
+    }
+
+    container = g_malloc0(sizeof(*container));
+    container->fd = fd;
+
+    if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %m");
+            g_free(container);
+            close(fd);
+            return -errno;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %m");
+            g_free(container);
+            close(fd);
+            return -errno;
+        }
+
+        container->iommu_data.type1.listener = vfio_memory_listener;
+        container->iommu_data.release = vfio_listener_release;
+
+        memory_listener_register(&container->iommu_data.type1.listener,
+                                 &address_space_memory);
+
+        if (container->iommu_data.type1.error) {
+            ret = container->iommu_data.type1.error;
+            vfio_listener_release(container);
+            g_free(container);
+            close(fd);
+            error_report("vfio: memory listener initialization failed for container");
+            return ret;
+        }
+
+        container->iommu_data.type1.initialized = true;
+
+    } else {
+        error_report("vfio: No available IOMMU models");
+        g_free(container);
+        close(fd);
+        return -EINVAL;
+    }
+
+    QLIST_INIT(&container->group_list);
+    QLIST_INSERT_HEAD(&container_list, container, next);
+
+    group->container = container;
+    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+
+    return 0;
+}
+
+static void vfio_disconnect_container(VFIOGroup *group)
+{
+    VFIOContainer *container = group->container;
+
+    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
+        error_report("vfio: error disconnecting group %d from container",
+                     group->groupid);
+    }
+
+    QLIST_REMOVE(group, container_next);
+    group->container = NULL;
+
+    if (QLIST_EMPTY(&container->group_list)) {
+        if (container->iommu_data.release) {
+            container->iommu_data.release(container);
+        }
+        QLIST_REMOVE(container, next);
+        DPRINTF("vfio_disconnect_container: close container->fd\n");
+        close(container->fd);
+        g_free(container);
+    }
+}
+
+VFIOGroup *vfio_get_group(int groupid, QEMUResetHandler *reset_handler)
+{
+    VFIOGroup *group;
+    char path[32];
+    struct vfio_group_status status = { .argsz = sizeof(status) };
+
+    QLIST_FOREACH(group, &group_list, next) {
+        if (group->groupid == groupid) {
+            return group;
+        }
+    }
+
+    group = g_malloc0(sizeof(*group));
+
+    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
+    group->fd = qemu_open(path, O_RDWR);
+    if (group->fd < 0) {
+        error_report("vfio: error opening %s: %m", path);
+        g_free(group);
+        return NULL;
+    }
+
+    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
+        error_report("vfio: error getting group status: %m");
+        close(group->fd);
+        g_free(group);
+        return NULL;
+    }
+
+    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+        error_report("vfio: error, group %d is not viable, please ensure "
+                     "all devices within the iommu_group are bound to their "
+                     "vfio bus driver.", groupid);
+        close(group->fd);
+        g_free(group);
+        return NULL;
+    }
+
+    group->groupid = groupid;
+    QLIST_INIT(&group->device_list);
+
+    if (vfio_connect_container(group)) {
+        error_report("vfio: failed to setup container for group %d", groupid);
+        close(group->fd);
+        g_free(group);
+        return NULL;
+    }
+
+    if (QLIST_EMPTY(&group_list) && reset_handler) {
+        qemu_register_reset(reset_handler, NULL);
+    }
+
+    QLIST_INSERT_HEAD(&group_list, group, next);
+
+    vfio_kvm_device_add_group(group);
+
+    return group;
+}
+
+void vfio_put_group(VFIOGroup *group, QEMUResetHandler *reset_handler)
+{
+    if (!QLIST_EMPTY(&group->device_list)) {
+        return;
+    }
+
+    vfio_kvm_device_del_group(group);
+    vfio_disconnect_container(group);
+    QLIST_REMOVE(group, next);
+    DPRINTF("vfio_put_group: close group->fd\n");
+    close(group->fd);
+    g_free(group);
+
+    if (QLIST_EMPTY(&group_list) && reset_handler) {
+        qemu_unregister_reset(reset_handler, NULL);
+    }
+}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 9cf5b84..9e70d68 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1,5 +1,5 @@
 /*
- * vfio based device assignment support
+ * vfio based device assignment support - PCI devices
  *
  * Copyright Red Hat, Inc. 2012
  *
@@ -40,6 +40,8 @@
 #include "sysemu/kvm.h"
 #include "sysemu/sysemu.h"
 
+#include "vfio-common.h"
+
 /* #define DEBUG_VFIO */
 #ifdef DEBUG_VFIO
 #define DPRINTF(fmt, ...) \
@@ -55,6 +57,8 @@
 #define VFIO_ALLOW_KVM_MSI 1
 #define VFIO_ALLOW_KVM_MSIX 1
 
+extern QLIST_HEAD(, VFIOGroup) group_list;
+
 struct VFIODevice;
 
 typedef struct VFIOQuirk {
@@ -135,25 +139,6 @@ enum {
 
 struct VFIOGroup;
 
-typedef struct VFIOType1 {
-    MemoryListener listener;
-    int error;
-    bool initialized;
-} VFIOType1;
-
-typedef struct VFIOContainer {
-    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
-    struct {
-        /* enable abstraction to support various iommu backends */
-        union {
-            VFIOType1 type1;
-        };
-        void (*release)(struct VFIOContainer *);
-    } iommu_data;
-    QLIST_HEAD(, VFIOGroup) group_list;
-    QLIST_ENTRY(VFIOContainer) next;
-} VFIOContainer;
-
 /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
 typedef struct VFIOMSIXInfo {
     uint8_t table_bar;
@@ -200,15 +185,6 @@ typedef struct VFIODevice {
     bool rom_read_failed;
 } VFIODevice;
 
-typedef struct VFIOGroup {
-    int fd;
-    int groupid;
-    VFIOContainer *container;
-    QLIST_HEAD(, VFIODevice) device_list;
-    QLIST_ENTRY(VFIOGroup) next;
-    QLIST_ENTRY(VFIOGroup) container_next;
-} VFIOGroup;
-
 typedef struct VFIORomBlacklistEntry {
     uint16_t vendor_id;
     uint16_t device_id;
@@ -234,23 +210,6 @@ static const VFIORomBlacklistEntry romblacklist[] = {
 
 #define MSIX_CAP_LENGTH 12
 
-static QLIST_HEAD(, VFIOContainer)
-    container_list = QLIST_HEAD_INITIALIZER(container_list);
-
-static QLIST_HEAD(, VFIOGroup)
-    group_list = QLIST_HEAD_INITIALIZER(group_list);
-
-#ifdef CONFIG_KVM
-/*
- * We have a single VFIO pseudo device per KVM VM.  Once created it lives
- * for the life of the VM.  Closing the file descriptor only drops our
- * reference to it and the device's reference to kvm.  Therefore once
- * initialized, this file descriptor is only released on QEMU exit and
- * we'll re-use it should another vfio device be attached before then.
- */
-static int vfio_kvm_device_fd = -1;
-#endif
-
 static void vfio_disable_interrupts(VFIODevice *vdev);
 static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
 static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
@@ -2180,183 +2139,6 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
 }
 
 /*
- * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
- */
-static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size)
-{
-    struct vfio_iommu_type1_dma_unmap unmap = {
-        .argsz = sizeof(unmap),
-        .flags = 0,
-        .iova = iova,
-        .size = size,
-    };
-
-    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
-        DPRINTF("VFIO_UNMAP_DMA: %d\n", -errno);
-        return -errno;
-    }
-
-    return 0;
-}
-
-static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                        ram_addr_t size, void *vaddr, bool readonly)
-{
-    struct vfio_iommu_type1_dma_map map = {
-        .argsz = sizeof(map),
-        .flags = VFIO_DMA_MAP_FLAG_READ,
-        .vaddr = (__u64)(uintptr_t)vaddr,
-        .iova = iova,
-        .size = size,
-    };
-
-    if (!readonly) {
-        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
-    }
-
-    /*
-     * Try the mapping, if it fails with EBUSY, unmap the region and try
-     * again.  This shouldn't be necessary, but we sometimes see it in
-     * the the VGA ROM space.
-     */
-    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
-         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
-        return 0;
-    }
-
-    DPRINTF("VFIO_MAP_DMA: %d\n", -errno);
-    return -errno;
-}
-
-static bool vfio_listener_skipped_section(MemoryRegionSection *section)
-{
-    return !memory_region_is_ram(section->mr) ||
-           /*
-            * Sizing an enabled 64-bit BAR can cause spurious mappings to
-            * addresses in the upper part of the 64-bit address space.  These
-            * are never accessed by the CPU and beyond the address width of
-            * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
-            */
-           section->offset_within_address_space & (1ULL << 63);
-}
-
-static void vfio_listener_region_add(MemoryListener *listener,
-                                     MemoryRegionSection *section)
-{
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            iommu_data.type1.listener);
-    hwaddr iova, end;
-    void *vaddr;
-    int ret;
-
-    assert(!memory_region_is_iommu(section->mr));
-
-    if (vfio_listener_skipped_section(section)) {
-        DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
-                section->offset_within_address_space,
-                section->offset_within_address_space +
-                int128_get64(int128_sub(section->size, int128_one())));
-        return;
-    }
-
-    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
-                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
-        error_report("%s received unaligned region", __func__);
-        return;
-    }
-
-    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
-    end = (section->offset_within_address_space + int128_get64(section->size)) &
-          TARGET_PAGE_MASK;
-
-    if (iova >= end) {
-        return;
-    }
-
-    vaddr = memory_region_get_ram_ptr(section->mr) +
-            section->offset_within_region +
-            (iova - section->offset_within_address_space);
-
-    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
-            iova, end - 1, vaddr);
-
-    memory_region_ref(section->mr);
-    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
-    if (ret) {
-        error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                     "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                     container, iova, end - iova, vaddr, ret);
-
-        /*
-         * On the initfn path, store the first error in the container so we
-         * can gracefully fail.  Runtime, there's not much we can do other
-         * than throw a hardware error.
-         */
-        if (!container->iommu_data.type1.initialized) {
-            if (!container->iommu_data.type1.error) {
-                container->iommu_data.type1.error = ret;
-            }
-        } else {
-            hw_error("vfio: DMA mapping failed, unable to continue");
-        }
-    }
-}
-
-static void vfio_listener_region_del(MemoryListener *listener,
-                                     MemoryRegionSection *section)
-{
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            iommu_data.type1.listener);
-    hwaddr iova, end;
-    int ret;
-
-    if (vfio_listener_skipped_section(section)) {
-        DPRINTF("SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n",
-                section->offset_within_address_space,
-                section->offset_within_address_space +
-                int128_get64(int128_sub(section->size, int128_one())));
-        return;
-    }
-
-    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
-                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
-        error_report("%s received unaligned region", __func__);
-        return;
-    }
-
-    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
-    end = (section->offset_within_address_space + int128_get64(section->size)) &
-          TARGET_PAGE_MASK;
-
-    if (iova >= end) {
-        return;
-    }
-
-    DPRINTF("region_del %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
-            iova, end - 1);
-
-    ret = vfio_dma_unmap(container, iova, end - iova);
-    memory_region_unref(section->mr);
-    if (ret) {
-        error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                     "0x%"HWADDR_PRIx") = %d (%m)",
-                     container, iova, end - iova, ret);
-    }
-}
-
-static MemoryListener vfio_memory_listener = {
-    .region_add = vfio_listener_region_add,
-    .region_del = vfio_listener_region_del,
-};
-
-static void vfio_listener_release(VFIOContainer *container)
-{
-    memory_listener_unregister(&container->iommu_data.type1.listener);
-}
-
-/*
  * Interrupt setup
  */
 static void vfio_disable_interrupts(VFIODevice *vdev)
@@ -3221,244 +3003,8 @@ static void vfio_pci_reset_handler(void *opaque)
     }
 }
 
-static void vfio_kvm_device_add_group(VFIOGroup *group)
-{
-#ifdef CONFIG_KVM
-    struct kvm_device_attr attr = {
-        .group = KVM_DEV_VFIO_GROUP,
-        .attr = KVM_DEV_VFIO_GROUP_ADD,
-        .addr = (uint64_t)(unsigned long)&group->fd,
-    };
-
-    if (!kvm_enabled()) {
-        return;
-    }
-
-    if (vfio_kvm_device_fd < 0) {
-        struct kvm_create_device cd = {
-            .type = KVM_DEV_TYPE_VFIO,
-        };
-
-        if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
-            DPRINTF("KVM_CREATE_DEVICE: %m\n");
-            return;
-        }
-
-        vfio_kvm_device_fd = cd.fd;
-    }
-
-    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-        error_report("Failed to add group %d to KVM VFIO device: %m",
-                     group->groupid);
-    }
-#endif
-}
-
-static void vfio_kvm_device_del_group(VFIOGroup *group)
-{
-#ifdef CONFIG_KVM
-    struct kvm_device_attr attr = {
-        .group = KVM_DEV_VFIO_GROUP,
-        .attr = KVM_DEV_VFIO_GROUP_DEL,
-        .addr = (uint64_t)(unsigned long)&group->fd,
-    };
-
-    if (vfio_kvm_device_fd < 0) {
-        return;
-    }
-
-    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-        error_report("Failed to remove group %d from KVM VFIO device: %m",
-                     group->groupid);
-    }
-#endif
-}
-
-static int vfio_connect_container(VFIOGroup *group)
-{
-    VFIOContainer *container;
-    int ret, fd;
-
-    if (group->container) {
-        return 0;
-    }
-
-    QLIST_FOREACH(container, &container_list, next) {
-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
-            group->container = container;
-            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-            return 0;
-        }
-    }
-
-    fd = qemu_open("/dev/vfio/vfio", O_RDWR);
-    if (fd < 0) {
-        error_report("vfio: failed to open /dev/vfio/vfio: %m");
-        return -errno;
-    }
-
-    ret = ioctl(fd, VFIO_GET_API_VERSION);
-    if (ret != VFIO_API_VERSION) {
-        error_report("vfio: supported vfio version: %d, "
-                     "reported version: %d", VFIO_API_VERSION, ret);
-        close(fd);
-        return -EINVAL;
-    }
-
-    container = g_malloc0(sizeof(*container));
-    container->fd = fd;
-
-    if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
-        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
-        if (ret) {
-            error_report("vfio: failed to set group container: %m");
-            g_free(container);
-            close(fd);
-            return -errno;
-        }
-
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
-        if (ret) {
-            error_report("vfio: failed to set iommu for container: %m");
-            g_free(container);
-            close(fd);
-            return -errno;
-        }
-
-        container->iommu_data.type1.listener = vfio_memory_listener;
-        container->iommu_data.release = vfio_listener_release;
-
-        memory_listener_register(&container->iommu_data.type1.listener,
-                                 &address_space_memory);
-
-        if (container->iommu_data.type1.error) {
-            ret = container->iommu_data.type1.error;
-            vfio_listener_release(container);
-            g_free(container);
-            close(fd);
-            error_report("vfio: memory listener initialization failed for container");
-            return ret;
-        }
-
-        container->iommu_data.type1.initialized = true;
-
-    } else {
-        error_report("vfio: No available IOMMU models");
-        g_free(container);
-        close(fd);
-        return -EINVAL;
-    }
-
-    QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&container_list, container, next);
-
-    group->container = container;
-    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-
-    return 0;
-}
-
-static void vfio_disconnect_container(VFIOGroup *group)
-{
-    VFIOContainer *container = group->container;
-
-    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
-        error_report("vfio: error disconnecting group %d from container",
-                     group->groupid);
-    }
-
-    QLIST_REMOVE(group, container_next);
-    group->container = NULL;
-
-    if (QLIST_EMPTY(&container->group_list)) {
-        if (container->iommu_data.release) {
-            container->iommu_data.release(container);
-        }
-        QLIST_REMOVE(container, next);
-        DPRINTF("vfio_disconnect_container: close container->fd\n");
-        close(container->fd);
-        g_free(container);
-    }
-}
-
-static VFIOGroup *vfio_get_group(int groupid)
-{
-    VFIOGroup *group;
-    char path[32];
-    struct vfio_group_status status = { .argsz = sizeof(status) };
-
-    QLIST_FOREACH(group, &group_list, next) {
-        if (group->groupid == groupid) {
-            return group;
-        }
-    }
-
-    group = g_malloc0(sizeof(*group));
-
-    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open(path, O_RDWR);
-    if (group->fd < 0) {
-        error_report("vfio: error opening %s: %m", path);
-        g_free(group);
-        return NULL;
-    }
-
-    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
-        error_report("vfio: error getting group status: %m");
-        close(group->fd);
-        g_free(group);
-        return NULL;
-    }
-
-    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
-        error_report("vfio: error, group %d is not viable, please ensure "
-                     "all devices within the iommu_group are bound to their "
-                     "vfio bus driver.", groupid);
-        close(group->fd);
-        g_free(group);
-        return NULL;
-    }
-
-    group->groupid = groupid;
-    QLIST_INIT(&group->device_list);
-
-    if (vfio_connect_container(group)) {
-        error_report("vfio: failed to setup container for group %d", groupid);
-        close(group->fd);
-        g_free(group);
-        return NULL;
-    }
-
-    if (QLIST_EMPTY(&group_list)) {
-        qemu_register_reset(vfio_pci_reset_handler, NULL);
-    }
-
-    QLIST_INSERT_HEAD(&group_list, group, next);
-
-    vfio_kvm_device_add_group(group);
-
-    return group;
-}
-
-static void vfio_put_group(VFIOGroup *group)
-{
-    if (!QLIST_EMPTY(&group->device_list)) {
-        return;
-    }
-
-    vfio_kvm_device_del_group(group);
-    vfio_disconnect_container(group);
-    QLIST_REMOVE(group, next);
-    DPRINTF("vfio_put_group: close group->fd\n");
-    close(group->fd);
-    g_free(group);
-
-    if (QLIST_EMPTY(&group_list)) {
-        qemu_unregister_reset(vfio_pci_reset_handler, NULL);
-    }
-}
-
-static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev)
+static int vfio_get_device(VFIOGroup *group, const char *name,
+                           struct VFIODevice *vdev)
 {
     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
@@ -3485,7 +3031,7 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev)
         goto error;
     }
 
-    DPRINTF("Device %s flags: %u, regions: %u, irgs: %u\n", name,
+    DPRINTF("Device %s flags: %u, regions: %u, irqs: %u\n", name,
             dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
 
     if (!(dev_info.flags & VFIO_DEVICE_FLAGS_PCI)) {
@@ -3768,7 +3314,7 @@ static int vfio_initfn(PCIDevice *pdev)
     DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
             vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
 
-    group = vfio_get_group(groupid);
+    group = vfio_get_group(groupid, vfio_pci_reset_handler);
     if (!group) {
         error_report("vfio: failed to get group %d", groupid);
         return -ENOENT;
@@ -3785,7 +3331,7 @@ static int vfio_initfn(PCIDevice *pdev)
             pvdev->host.function == vdev->host.function) {
 
             error_report("vfio: error: device %s is already attached", path);
-            vfio_put_group(group);
+            vfio_put_group(group, vfio_pci_reset_handler);
             return -EBUSY;
         }
     }
@@ -3793,7 +3339,7 @@ static int vfio_initfn(PCIDevice *pdev)
     ret = vfio_get_device(group, path, vdev);
     if (ret) {
         error_report("vfio: failed to get device %s", path);
-        vfio_put_group(group);
+        vfio_put_group(group, vfio_pci_reset_handler);
         return ret;
     }
 
@@ -3879,7 +3425,7 @@ out_teardown:
 out_put:
     g_free(vdev->emulated_config_bits);
     vfio_put_device(vdev);
-    vfio_put_group(group);
+    vfio_put_group(group, vfio_pci_reset_handler);
     return ret;
 }
 
@@ -3899,7 +3445,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     g_free(vdev->emulated_config_bits);
     g_free(vdev->rom);
     vfio_put_device(vdev);
-    vfio_put_group(group);
+    vfio_put_group(group, vfio_pci_reset_handler);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
new file mode 100644
index 0000000..138fb13
--- /dev/null
+++ b/hw/vfio/platform.c
@@ -0,0 +1,381 @@
+/*
+ * vfio based device assignment support - platform devices
+ *
+ * Copyright Linaro Limited, 2014
+ *
+ * Authors:
+ *  Kim Phillips <kim.phillips@linaro.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on vfio based PCI device assignment support:
+ *  Copyright Red Hat, Inc. 2012
+ */
+
+#include <dirent.h>
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "config.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
+#include "hw/qdev-properties.h"
+#include "migration/vmstate.h"
+#include "hw/hw.h"
+#include "hw/sysbus.h"
+
+#include "vfio-common.h"
+
+#define DEBUG_VFIO
+#ifdef DEBUG_VFIO
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+/* Extra debugging, trap acceleration paths for more logging */
+#define VFIO_ALLOW_MMAP 1
+
+#define TYPE_VFIO_PLATFORM "vfio-platform"
+
+typedef struct VFIORegion {
+    off_t fd_offset; /* offset of region within device fd */
+    int fd; /* device fd, allows us to pass VFIORegion as opaque data */
+    MemoryRegion mem; /* slow, read/write access */
+    MemoryRegion mmap_mem; /* direct mapped access */
+    void *mmap;
+    size_t size;
+    uint32_t flags; /* VFIO region flags (rd/wr/mmap) */
+    uint8_t nr; /* cache the region number for debug */
+} VFIORegion;
+
+typedef struct VFIODevice {
+    SysBusDevice sbdev;
+    int fd;
+    int num_regions;
+    VFIORegion *regions;
+    QLIST_ENTRY(VFIODevice) next;
+    struct VFIOGroup *group;
+    char *name;
+} VFIODevice;
+
+static int vfio_mmap_region(VFIODevice *vdev, VFIORegion *region,
+                         MemoryRegion *mem, MemoryRegion *submem,
+                         void **map, size_t size, off_t offset,
+                         const char *name)
+{
+    int ret = 0;
+
+    if (VFIO_ALLOW_MMAP && size && region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
+        int prot = 0;
+        ret = 0;
+
+        if (region->flags & VFIO_REGION_INFO_FLAG_READ) {
+            prot |= PROT_READ;
+        }
+
+        if (region->flags & VFIO_REGION_INFO_FLAG_WRITE) {
+            prot |= PROT_WRITE;
+        }
+
+        *map = mmap(NULL, size, prot, MAP_SHARED,
+                    region->fd, region->fd_offset + offset);
+        if (*map == MAP_FAILED) {
+            ret = -errno;
+            *map = NULL;
+            goto error;
+        }
+
+        memory_region_init_ram_ptr(submem, OBJECT(vdev), name, size, *map);
+    }
+
+    memory_region_add_subregion(mem, offset, submem);
+
+error:
+    return ret;
+}
+
+/*
+ * IO Port/MMIO - Beware of the endians, VFIO is always little endian
+ */
+static void vfio_region_write(void *opaque, hwaddr addr,
+                              uint64_t data, unsigned size)
+{
+    VFIORegion *region = opaque;
+    union {
+        uint8_t byte;
+        uint16_t word;
+        uint32_t dword;
+        uint64_t qword;
+    } buf;
+
+    switch (size) {
+    case 1:
+        buf.byte = data;
+        break;
+    case 2:
+        buf.word = data;
+        break;
+    case 4:
+        buf.dword = data;
+        break;
+    default:
+        hw_error("vfio: unsupported write size, %d bytes\n", size);
+        break;
+    }
+
+    if (pwrite(region->fd, &buf, size, region->fd_offset + addr) != size) {
+        error_report("%s(,0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m",
+                     __func__, addr, data, size);
+    }
+
+    DPRINTF("%s(region %d+0x%"HWADDR_PRIx", 0x%"PRIx64", %d)\n",
+             __func__, region->nr, addr, data, size);
+}
+
+static uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size)
+{
+    VFIORegion *region = opaque;
+    union {
+        uint8_t byte;
+        uint16_t word;
+        uint32_t dword;
+        uint64_t qword;
+    } buf;
+    uint64_t data = 0;
+
+    if (pread(region->fd, &buf, size, region->fd_offset + addr) != size) {
+        error_report("%s(,0x%"HWADDR_PRIx", %d) failed: %m",
+                     __func__, addr, size);
+        return (uint64_t)-1;
+    }
+
+    switch (size) {
+    case 1:
+        data = buf.byte;
+        break;
+    case 2:
+        data = buf.word;
+        break;
+    case 4:
+        data = buf.dword;
+        break;
+    default:
+        hw_error("vfio: unsupported read size, %d bytes\n", size);
+        break;
+    }
+
+    DPRINTF("%s(region %d+0x%"HWADDR_PRIx", %d) = 0x%"PRIx64"\n",
+            __func__, region->nr, addr, size, data);
+
+    return data;
+}
+
+static const MemoryRegionOps vfio_region_ops = {
+    .read = vfio_region_read,
+    .write = vfio_region_write,
+    .endianness = DEVICE_NATIVE_ENDIAN,
+};
+
+static void vfio_map_region(VFIODevice *vdev, int nr)
+{
+    VFIORegion *region = &vdev->regions[nr];
+    unsigned size = region->size;
+    char name[64];
+
+    snprintf(name, sizeof(name), "VFIO %s region %d", vdev->name, nr);
+
+    /* A "slow" read/write mapping underlies all regions  */
+    memory_region_init_io(&region->mem, OBJECT(vdev), &vfio_region_ops,
+                          region, name, size);
+
+    strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
+    if (vfio_mmap_region(vdev, region, &region->mem,
+                         &region->mmap_mem, &region->mmap, size, 0, name)) {
+        error_report("%s unsupported. Performance may be slow", name);
+    }
+}
+
+static int vfio_get_device(VFIOGroup *group, const char *name,
+                           struct VFIODevice *vdev)
+{
+    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
+    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
+    struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
+    int ret, i;
+
+    ret = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    if (ret < 0) {
+        error_report("vfio: error getting device %s from group %d: %m",
+                     name, group->groupid);
+        error_printf("Verify all devices in group %d are bound to the vfio "
+                     "platform driver and are not already in use\n",
+                     group->groupid);
+        return ret;
+    }
+
+    vdev->fd = ret;
+    vdev->group = group;
+    QLIST_INSERT_HEAD(&group->device_list, vdev, next);
+
+    /* Sanity check device */
+    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_INFO, &dev_info);
+    if (ret) {
+        error_report("vfio: error getting device info: %m");
+        goto error;
+    }
+
+    DPRINTF("Device %s flags: %u, regions: %u, irqs: %u\n", name,
+            dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
+
+    vdev->regions = g_malloc0(sizeof(VFIORegion) * dev_info.num_regions);
+    if (!vdev->regions) {
+            error_report("vfio: Error allocating space for %d regions",
+                         dev_info.num_regions);
+            ret = -ENOMEM;
+            goto error;
+    }
+
+    vdev->num_regions = dev_info.num_regions;
+
+    for (i = 0; i < dev_info.num_regions; i++) {
+        reg_info.index = i;
+
+        ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+        if (ret) {
+            error_report("vfio: Error getting region %d info: %m", i);
+            goto error;
+        }
+
+        DPRINTF("Device %s region %d:\n", name, i);
+        DPRINTF("  size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n",
+                (unsigned long)reg_info.size, (unsigned long)reg_info.offset,
+                (unsigned long)reg_info.flags);
+
+        vdev->regions[i].flags = reg_info.flags;
+        vdev->regions[i].size = reg_info.size;
+        vdev->regions[i].fd_offset = reg_info.offset;
+        vdev->regions[i].fd = vdev->fd;
+        vdev->regions[i].nr = i;
+    }
+
+error:
+    if (ret) {
+        g_free(vdev->regions);
+        QLIST_REMOVE(vdev, next);
+        vdev->group = NULL;
+        close(vdev->fd);
+    }
+    return ret;
+}
+
+static void vfio_platform_realize(DeviceState *dev, Error **errp)
+{
+    SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
+    VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, sbdev, sbdev);
+    VFIOGroup *group;
+    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
+    ssize_t len;
+    struct stat st;
+    int groupid, i, ret;
+
+    /* TODO: pass device name on command line */
+    vdev->name = malloc(PATH_MAX);
+    strcpy(vdev->name, "fff51000.ethernet");
+
+    /* Check that the host device exists */
+    snprintf(path, sizeof(path), "/sys/bus/platform/devices/%s/", vdev->name);
+    if (stat(path, &st) < 0) {
+        error_report("vfio: error: no such host device: %s", path);
+        return;
+    }
+
+    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
+
+    len = readlink(path, iommu_group_path, PATH_MAX);
+    if (len <= 0) {
+        error_report("vfio: error no iommu_group for device");
+        return;
+    }
+
+    iommu_group_path[len] = 0;
+    group_name = basename(iommu_group_path);
+
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_report("vfio: error reading %s: %m", path);
+        return;
+    }
+
+    DPRINTF("%s(%s) group %d\n", __func__, vdev->name, groupid);
+
+    group = vfio_get_group(groupid, NULL);
+    if (!group) {
+        error_report("vfio: failed to get group %d", groupid);
+        return;
+    }
+
+    snprintf(path, sizeof(path), "%s", vdev->name);
+
+    QLIST_FOREACH(pvdev, &group->device_list, next) {
+        if (strcmp(pvdev->name, vdev->name) == 0) {
+            error_report("vfio: error: device %s is already attached", path);
+            vfio_put_group(group, NULL);
+            return;
+        }
+    }
+
+    ret = vfio_get_device(group, path, vdev);
+    if (ret) {
+        error_report("vfio: failed to get device %s", path);
+        vfio_put_group(group, NULL);
+        return;
+    }
+
+    for (i = 0; i < vdev->num_regions; i++) {
+        vfio_map_region(vdev, i);
+        sysbus_init_mmio(sbdev, &vdev->regions[i].mem);
+    }
+}
+
+static const VMStateDescription vfio_platform_vmstate = {
+    .name = TYPE_VFIO_PLATFORM,
+    .unmigratable = 1,
+};
+
+static void vfio_platform_dev_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->realize = vfio_platform_realize;
+    dc->vmsd = &vfio_platform_vmstate;
+    dc->desc = "VFIO-based platform device assignment";
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+}
+
+static const TypeInfo vfio_platform_dev_info = {
+    .name = TYPE_VFIO_PLATFORM,
+    .parent = TYPE_SYS_BUS_DEVICE,
+    .instance_size = sizeof(VFIODevice),
+    .class_init = vfio_platform_dev_class_init,
+};
+
+static void register_vfio_platform_dev_type(void)
+{
+    type_register_static(&vfio_platform_dev_info);
+}
+
+type_init(register_vfio_platform_dev_type)
diff --git a/hw/vfio/vfio-common.h b/hw/vfio/vfio-common.h
new file mode 100644
index 0000000..21148ef
--- /dev/null
+++ b/hw/vfio/vfio-common.h
@@ -0,0 +1,55 @@
+/*
+ * common header for vfio based device assignment support
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on qemu-kvm device-assignment:
+ *  Adapted for KVM by Qumranet.
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
+ *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
+ */
+
+struct VFIODevice;
+
+struct VFIOGroup;
+
+typedef struct VFIOType1 {
+    MemoryListener listener;
+    int error;
+    bool initialized;
+} VFIOType1;
+
+typedef struct VFIOContainer {
+    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
+    struct {
+        /* enable abstraction to support various iommu backends */
+        union {
+            VFIOType1 type1;
+        };
+        void (*release)(struct VFIOContainer *);
+    } iommu_data;
+    QLIST_HEAD(, VFIOGroup) group_list;
+    QLIST_ENTRY(VFIOContainer) next;
+} VFIOContainer;
+
+typedef struct VFIOGroup {
+    int fd;
+    int groupid;
+    VFIOContainer *container;
+    QLIST_HEAD(, VFIODevice) device_list;
+    QLIST_ENTRY(VFIOGroup) next;
+    QLIST_ENTRY(VFIOGroup) container_next;
+} VFIOGroup;
+
+
+VFIOGroup *vfio_get_group(int groupid, QEMUResetHandler *reset_handler);
+void vfio_put_group(VFIOGroup *group, QEMUResetHandler *reset_handler);
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [Qemu-devel] [RFC v2 4/6] vfio: Add initial IRQ support in QEMU platform device
  2014-04-09 15:33 [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough Eric Auger
                   ` (2 preceding siblings ...)
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 3/6] vfio: add vfio-platform support Eric Auger
@ 2014-04-09 15:33 ` Eric Auger
  2014-04-11  1:34   ` Kim Phillips
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 5/6] virt: Assign a VFIO platform device to a virt VM in QEMU command line Eric Auger
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Eric Auger @ 2014-04-09 15:33 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel
  Cc: peter.maydell, kim.phillips, Eric Auger, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

This work is inspired of PCI INTx. The code was prepared to support
multiple IRQs but this was not tested at that stage. Similarly to what
is done on PCI, the device register space is RAM unmapped on IRQ hit
in order to trap the end of interrupt (EOI). On mmap timer hit, if the
EOI was trapped, the mmapping is restored. the physical interrupt is
unmasked on the first read/write with the MMIO register space.

Tested on Calxeda Midway xgmac which can be directly assigned to one
virt VM.

Acknowledgements:
- vfio device name, guest physical address and IRQ indexes are
  hardcoded. This will be improved in another patch
- currently tested on a single IRQ (xgmac main IRQ)
- a KVM patch is required to invalidate stage2 entries on RAM memory
  region destruction (see [PATCH] ARM: KVM: Handle IPA unmapping on
  memory region deletion)

Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 hw/arm/virt.c      |  13 ++-
 hw/vfio/platform.c | 316 +++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 304 insertions(+), 25 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 5d43cf0..31ae7d2 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -108,12 +108,13 @@ static const MemMapEntry a15memmap[] = {
     /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */
     /* 0x10000000 .. 0x40000000 reserved for PCI */
     [VIRT_MEM] = { 0x40000000, 1ULL * 1024 * 1024 * 1024 },
-    [VIRT_ETHERNET] = { 0xfff51000, 0x1000 },
+    [VIRT_ETHERNET] = { 0xfff41000, 0x1000 },
 };
 
 static const int a15irqmap[] = {
     [VIRT_UART] = 1,
     [VIRT_MMIO] = 16, /* ...to 16 + NUM_VIRTIO_TRANSPORTS - 1 */
+    [VIRT_ETHERNET] = 77,
 };
 
 static VirtBoardInfo machines[] = {
@@ -299,8 +300,12 @@ static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
     hwaddr base = vbi->memmap[VIRT_ETHERNET].base;
     hwaddr size = vbi->memmap[VIRT_ETHERNET].size;
     const char compat[] = "calxeda,hb-xgmac";
+    int irqm = vbi->irqmap[VIRT_ETHERNET];
+    int irqp = irqm+1;
+    int irqlp = irqm+2;
 
-    sysbus_create_simple("vfio-platform", base, NULL);
+    sysbus_create_varargs("vfio-platform", base,
+                          pic[irqm], pic[irqp], pic[irqlp], NULL);
 
     nodename = g_strdup_printf("/ethernet@%" PRIx64, base);
     qemu_fdt_add_subnode(vbi->fdt, nodename);
@@ -308,6 +313,10 @@ static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
     /* Note that we can't use setprop_string because of the embedded NUL */
     qemu_fdt_setprop(vbi->fdt, nodename, "compatible", compat, sizeof(compat));
     qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", 2, base, 2, size);
+    qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts",
+                                0x0, irqm, 0x4,
+                                0x0, irqp, 0x4,
+                                0x0, irqlp, 0x4);
 
     g_free(nodename);
 }
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 138fb13..f148edd 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -24,6 +24,7 @@
 #include "config.h"
 #include "exec/address-spaces.h"
 #include "exec/memory.h"
+
 #include "qemu-common.h"
 #include "qemu/error-report.h"
 #include "qemu/event_notifier.h"
@@ -31,6 +32,7 @@
 #include "qemu/range.h"
 #include "sysemu/kvm.h"
 #include "sysemu/sysemu.h"
+
 #include "hw/qdev-properties.h"
 #include "migration/vmstate.h"
 #include "hw/hw.h"
@@ -41,12 +43,15 @@
 #define DEBUG_VFIO
 #ifdef DEBUG_VFIO
 #define DPRINTF(fmt, ...) \
-    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
+    do { fprintf(stderr, "vfio: %s: " fmt, __func__, ## __VA_ARGS__); } \
+        while (0)
 #else
 #define DPRINTF(fmt, ...) \
     do { } while (0)
 #endif
 
+#define PLATFORM_NUM_REGIONS 10
+
 /* Extra debugging, trap acceleration paths for more logging */
 #define VFIO_ALLOW_MMAP 1
 
@@ -63,16 +68,240 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+
+#define VFIO_INT_INTp 4
+
+typedef struct VFIOINTp {
+    QLIST_ENTRY(VFIOINTp) next;
+    EventNotifier interrupt; /* eventfd triggered on interrupt */
+    EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
+    qemu_irq qemuirq;
+    struct VFIODevice *vdev; /* back pointer to device */
+    bool pending; /* interrupt pending */
+    bool kvm_accel; /* set when QEMU bypass through KVM enabled */
+    uint8_t pin; /* index */
+    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
+    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
+} VFIOINTp;
+
+
+
 typedef struct VFIODevice {
     SysBusDevice sbdev;
     int fd;
     int num_regions;
-    VFIORegion *regions;
+    int num_irqs;
+    int interrupt; /* type of the interrupt, might disappear */
+    char *name;
+    uint32_t mmap_timeout; /* mmap timeout value in ms */
+    VFIORegion regions[PLATFORM_NUM_REGIONS];
     QLIST_ENTRY(VFIODevice) next;
     struct VFIOGroup *group;
-    char *name;
+    QLIST_HEAD(, VFIOINTp) intp_list;
 } VFIODevice;
 
+
+
+static void vfio_unmask_intp(VFIODevice *vdev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK,
+        .index = index,
+        .start = 0,
+        .count = 1,
+    };
+
+    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+
+
+
+static void vfio_intp_mmap_enable(void *opaque)
+{
+    VFIOINTp * intp = (VFIOINTp *)opaque;
+    VFIODevice *vdev = intp->vdev;
+
+    if (intp->pending) {
+        DPRINTF("IRQ still pending, re-schedule the mmap timer\n");
+        timer_mod(intp->mmap_timer,
+                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + intp->mmap_timeout);
+        return;
+    }
+
+    DPRINTF("IRQ EOI'ed sets mmap again\n");
+    VFIORegion *region = &vdev->regions[0];
+    memory_region_set_enabled(&region->mmap_mem, true);
+}
+
+
+
+static void vfio_intp_interrupt(void *opaque)
+{
+    int ret;
+    VFIOINTp *intp = (VFIOINTp *)opaque;
+    VFIODevice *vdev = intp->vdev;
+
+    DPRINTF("pin = %d fd = %d\n",
+            intp->pin, event_notifier_get_fd(&intp->interrupt));
+
+    ret = event_notifier_test_and_clear(&intp->interrupt);
+    if (!ret) {
+        DPRINTF("Error when clearing fd=%d\n",
+                event_notifier_get_fd(&intp->interrupt));
+    }
+
+    intp->pending = true;
+
+    /* TODO: fix this number of regions,
+     * currently a single region is handled
+     */
+
+    VFIORegion *region = &vdev->regions[0];
+    memory_region_set_enabled(&region->mmap_mem, false);
+
+    qemu_set_irq(intp->qemuirq, 1);
+
+    /* schedule the mmap timer which will restote mmap path after EOI*/
+    if (intp->mmap_timeout) {
+        timer_mod(intp->mmap_timer,
+                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + intp->mmap_timeout);
+    }
+
+}
+
+
+
+static void vfio_irq_eoi(VFIODevice *vdev)
+{
+
+    VFIOINTp *intp;
+    bool eoi_done = false;
+
+    QLIST_FOREACH(intp, &vdev->intp_list, next) {
+        if (intp->pending) {
+            if (eoi_done) {
+                DPRINTF("several IRQ pending simultaneously: \
+                         this is not a supported case yet\n");
+            }
+            DPRINTF("EOI IRQ #%d fd=%d\n",
+                    intp->pin, event_notifier_get_fd(&intp->interrupt));
+            intp->pending = false;
+            qemu_set_irq(intp->qemuirq, 0);
+            vfio_unmask_intp(vdev, intp->pin);
+            eoi_done = true;
+        }
+    }
+
+    return;
+
+}
+
+
+
+#if 0
+static void vfio_list_intp(VFIODevice *vdev)
+{
+    VFIOINTp *intp;
+    int i = 0;
+    QLIST_FOREACH(intp, &vdev->intp_list, next) {
+        DPRINTF("IRQ #%d\n", i);
+        DPRINTF("- pin = %d\n", intp->pin);
+        DPRINTF("- fd = %d\n", event_notifier_get_fd(&intp->interrupt));
+        DPRINTF("- pending = %d\n", (int)intp->pending);
+        DPRINTF("- kvm_accel = %d\n", (int)intp->kvm_accel);
+        i++;
+    }
+}
+#endif
+
+static int vfio_enable_intp(VFIODevice *vdev, unsigned int index)
+{
+    struct vfio_irq_set *irq_set; /* irq structure passed to vfio kernel */
+    int32_t *pfd; /* pointer to the eventfd */
+    int ret, argsz;
+
+    int device = vdev->fd;
+    SysBusDevice *sbdev = SYS_BUS_DEVICE(vdev);
+
+    vdev->interrupt = VFIO_INT_INTp;
+
+    /* allocate and populate a new VFIOINTp structure put in a queue list */
+    VFIOINTp *intp = g_malloc0(sizeof(*intp));
+    intp->vdev = vdev;
+    intp->pin = index;
+    intp->pending = false;
+    intp->mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+                                    vfio_intp_mmap_enable, intp);
+    intp->mmap_timeout = 1100;
+    /* TO DO: timeout as parameter */
+
+    /* incr sysbus num_irq and sets sysbus->irqp[n] = &intp->qemuirq
+     * only the address of the qemu_irq is set here
+     */
+
+    sysbus_init_irq(sbdev, &intp->qemuirq);
+
+    /* content is set in sysbus_connect_irq (currently in machine definition) */
+
+    ret = event_notifier_init(&intp->interrupt, 0);
+    if (ret) {
+        error_report("vfio: Error: event_notifier_init failed ");
+        return ret;
+    }
+
+    /* build the irq_set to be passed to the vfio kernel driver */
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+
+    if (!irq_set) {
+        DPRINTF("failure while allocating memory for irq\n");
+        qemu_log_mask(LOG_GUEST_ERROR,
+                 "VFIO platform: failure while allocating memory for irq");
+        return -errno;
+    }
+
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = index;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = event_notifier_get_fd(&intp->interrupt);
+
+    DPRINTF("register fd=%d/irq index=%d to kernel\n", *pfd, index);
+
+    qemu_set_fd_handler(*pfd, vfio_intp_interrupt, NULL, intp);
+
+    /* pass the index/fd binding to the kernel driver so that it
+     * triggers this fd on HW IRQ
+     */
+    ret = ioctl(device, VFIO_DEVICE_SET_IRQS, irq_set);
+
+    g_free(irq_set);
+
+    if (ret) {
+        error_report("vfio: Error: Failed to pass Int fd to the driver: %m");
+        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
+        close(*pfd); /* TO DO : replace by event_notifier_cleanup */
+        return -errno;
+    }
+
+    /* store the new intp in qlist */
+
+    QLIST_INSERT_HEAD(&vdev->intp_list, intp, next);
+
+    return 0;
+
+}
+
+
+
+
 static int vfio_mmap_region(VFIODevice *vdev, VFIORegion *region,
                          MemoryRegion *mem, MemoryRegion *submem,
                          void **map, size_t size, off_t offset,
@@ -112,6 +341,7 @@ error:
 /*
  * IO Port/MMIO - Beware of the endians, VFIO is always little endian
  */
+
 static void vfio_region_write(void *opaque, hwaddr addr,
                               uint64_t data, unsigned size)
 {
@@ -139,12 +369,15 @@ static void vfio_region_write(void *opaque, hwaddr addr,
     }
 
     if (pwrite(region->fd, &buf, size, region->fd_offset + addr) != size) {
-        error_report("%s(,0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m",
-                     __func__, addr, data, size);
+        error_report("(,0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m",
+                     addr, data, size);
     }
 
-    DPRINTF("%s(region %d+0x%"HWADDR_PRIx", 0x%"PRIx64", %d)\n",
-             __func__, region->nr, addr, data, size);
+    DPRINTF("(region %d+0x%"HWADDR_PRIx", 0x%"PRIx64", %d)\n",
+            region->nr, addr, data, size);
+
+    vfio_irq_eoi(container_of(region, VFIODevice, regions[region->nr]));
+
 }
 
 static uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size)
@@ -159,8 +392,8 @@ static uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size)
     uint64_t data = 0;
 
     if (pread(region->fd, &buf, size, region->fd_offset + addr) != size) {
-        error_report("%s(,0x%"HWADDR_PRIx", %d) failed: %m",
-                     __func__, addr, size);
+        error_report("(,0x%"HWADDR_PRIx", %d) failed: %m",
+                     addr, size);
         return (uint64_t)-1;
     }
 
@@ -179,18 +412,22 @@ static uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size)
         break;
     }
 
-    DPRINTF("%s(region %d+0x%"HWADDR_PRIx", %d) = 0x%"PRIx64"\n",
-            __func__, region->nr, addr, size, data);
+    DPRINTF("(region %d+0x%"HWADDR_PRIx", %d) = 0x%"PRIx64"\n",
+            region->nr, addr, size, data);
+
+    vfio_irq_eoi(container_of(region, VFIODevice, regions[region->nr]));
 
     return data;
 }
 
+
 static const MemoryRegionOps vfio_region_ops = {
     .read = vfio_region_read,
     .write = vfio_region_write,
     .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
+
 static void vfio_map_region(VFIODevice *vdev, int nr)
 {
     VFIORegion *region = &vdev->regions[nr];
@@ -242,14 +479,6 @@ static int vfio_get_device(VFIOGroup *group, const char *name,
     DPRINTF("Device %s flags: %u, regions: %u, irqs: %u\n", name,
             dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
 
-    vdev->regions = g_malloc0(sizeof(VFIORegion) * dev_info.num_regions);
-    if (!vdev->regions) {
-            error_report("vfio: Error allocating space for %d regions",
-                         dev_info.num_regions);
-            ret = -ENOMEM;
-            goto error;
-    }
-
     vdev->num_regions = dev_info.num_regions;
 
     for (i = 0; i < dev_info.num_regions; i++) {
@@ -273,6 +502,38 @@ static int vfio_get_device(VFIOGroup *group, const char *name,
         vdev->regions[i].nr = i;
     }
 
+   /* IRQ */
+
+    DPRINTF("Num IRQS: %d\n", dev_info.num_irqs);
+
+    vdev->num_irqs = dev_info.num_irqs;
+
+    for (i = 0; i < dev_info.num_irqs; i++) {
+
+        struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+
+        irq.index = i;
+
+        DPRINTF("Retrieve IRQ info from vfio platform driver ...\n");
+
+        ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);
+
+        if (ret) {
+            error_report("vfio: error getting device %s irq info",
+                         name);
+            error_printf("vfio: error getting device %s irq info",
+                         name);
+        }
+
+        DPRINTF("- IRQ index %d: count %d, flags=0x%x\n",
+                                irq.index,
+                                irq.count,
+                                irq.flags);
+
+        vfio_enable_intp(vdev, irq.index);
+
+    }
+
 error:
     if (ret) {
         g_free(vdev->regions);
@@ -286,13 +547,16 @@ error:
 static void vfio_platform_realize(DeviceState *dev, Error **errp)
 {
     SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
-    VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, sbdev, sbdev);
+    VFIODevice *vdev = DO_UPCAST(VFIODevice, sbdev, sbdev);
+    VFIODevice *pvdev;
+
     VFIOGroup *group;
     char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
     ssize_t len;
     struct stat st;
     int groupid, i, ret;
 
+
     /* TODO: pass device name on command line */
     vdev->name = malloc(PATH_MAX);
     strcpy(vdev->name, "fff51000.ethernet");
@@ -320,24 +584,29 @@ static void vfio_platform_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    DPRINTF("%s(%s) group %d\n", __func__, vdev->name, groupid);
+    DPRINTF("%s belongs to VFIO group %d\n", vdev->name, groupid);
 
     group = vfio_get_group(groupid, NULL);
     if (!group) {
         error_report("vfio: failed to get group %d", groupid);
         return;
     }
-
     snprintf(path, sizeof(path), "%s", vdev->name);
 
     QLIST_FOREACH(pvdev, &group->device_list, next) {
+        DPRINTF("compare %s versus %s\n", pvdev->name, vdev->name);
         if (strcmp(pvdev->name, vdev->name) == 0) {
-            error_report("vfio: error: device %s is already attached", path);
+
+            DPRINTF("vfio device %s already is attached to group %d\n",
+                    vdev->name, groupid);
+
             vfio_put_group(group, NULL);
             return;
         }
     }
 
+    DPRINTF("Calling vfio_get_device ...\n");
+
     ret = vfio_get_device(group, path, vdev);
     if (ret) {
         error_report("vfio: failed to get device %s", path);
@@ -363,6 +632,7 @@ static void vfio_platform_dev_class_init(ObjectClass *klass, void *data)
     dc->realize = vfio_platform_realize;
     dc->vmsd = &vfio_platform_vmstate;
     dc->desc = "VFIO-based platform device assignment";
+    dc->cannot_instantiate_with_device_add_yet = true;
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 }
 
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [Qemu-devel] [RFC v2 5/6] virt: Assign a VFIO platform device to a virt VM in QEMU command line
  2014-04-09 15:33 [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough Eric Auger
                   ` (3 preceding siblings ...)
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 4/6] vfio: Add initial IRQ support in QEMU platform device Eric Auger
@ 2014-04-09 15:33 ` Eric Auger
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 6/6] vfio: add exit function and IRQ disable functions Eric Auger
  2014-04-11  1:28 ` [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough Kim Phillips
  6 siblings, 0 replies; 14+ messages in thread
From: Eric Auger @ 2014-04-09 15:33 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel
  Cc: peter.maydell, kim.phillips, Eric Auger, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

This patch aims at allowing the end-user to specify the device he
wants to directly assign to his virt VM in the QEMU command line.
The QEMU platform device becomes generic.

Current choice is to reuse the "-device" option.

For example when assigning Calxeda Midway xgmac device this option is
used:
-device vfio-platform,vfio_device="fff51000.ethernet",\
compat="calxeda/hb-xgmac",mmap-timeout-ms=1000

where
- fff51000.ethernet is the name of the device in
  /sys/bus/platform/devices/
- calxeda/hb-xgma is the compatibility where the standard coma
  separator is replaced by "/" since coma is specifically used by QEMU
  command line parser
- mmap-timeout-ms is minimal amount of time (ms) during which the IP
  register space stays MMIO mapped after an IRQ triggers in order to
  trap the end of interrupt (EOI). This is an optional parameter
  (default value set to 1100 ms).

virt machine was modified to interpret this line and automatically
- map the device at a chosen guest physical address in
  [0xa004000, 0x10000000],
- map the device IRQs after 48,
- create the associated guest device tree with the provided
  compatibility.

The "-device" option underlying implementation is not standard
which can be argued. Indeed normaly it induces the call to the QEMU
device realize function once after the virtual machine init execution.
In this case QDEV mappings and device tree creation must happen.
Since virt is the place where the whole memory and IRQ mapping is
known and device tree is created, it was chosen to interpret the option
line there. This means the realize function is called twice, once in
virt.c and once after machine init return. The second call returns
immediatly since the QEMU device is recognized as already existing.
Another way to implement this would be to create a new option in QEMU.

Acknowledgements:
- a single compatibility currently is supported
- IRQ properties set in the device tree should be refined
- More generally devices with more complex device tree nodes must be
  studied and are not currently handled
- cases where multiple VFIO devices are assigned could not be tested

Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 hw/arm/virt.c      | 178 +++++++++++++++++++++++++++++++++++++++++++----------
 hw/vfio/platform.c |  43 ++++++++++---
 2 files changed, 181 insertions(+), 40 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 31ae7d2..1fb66ef 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -40,6 +40,17 @@
 #include "exec/address-spaces.h"
 #include "qemu/bitops.h"
 #include "qemu/error-report.h"
+#include "monitor/qdev.h"
+#include "qemu/config-file.h"
+
+/*
+ * this function is implemented in vfio/platform.c
+ * it returns the name, compatibility, IRQ number and register set size.
+ * the function only is implemented for VFIO platform devices
+ */
+void vfio_get_props(SysBusDevice *s, char **pname, char **pcompat,
+                    int *pnum_irqs, size_t *psize);
+
 
 #define NUM_VIRTIO_TRANSPORTS 32
 
@@ -65,7 +76,7 @@ enum {
     VIRT_GIC_CPU,
     VIRT_UART,
     VIRT_MMIO,
-    VIRT_ETHERNET,
+    VIRT_VFIO,
 };
 
 typedef struct MemMapEntry {
@@ -79,7 +90,10 @@ typedef struct VirtBoardInfo {
     const char *qdevname;
     const char *gic_compatible;
     const MemMapEntry *memmap;
+    qemu_irq pic[NUM_IRQS];
     const int *irqmap;
+    hwaddr avail_vfio_base;
+    int avail_vfio_irq;
     int smp_cpus;
     void *fdt;
     int fdt_size;
@@ -105,16 +119,16 @@ static const MemMapEntry a15memmap[] = {
     [VIRT_GIC_CPU] = { 0x8002000, 0x1000 },
     [VIRT_UART] = { 0x9000000, 0x1000 },
     [VIRT_MMIO] = { 0xa000000, 0x200 },
+    [VIRT_VFIO] = { 0xa004000, 0x0 }, /* size is dynamically populated */
     /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */
     /* 0x10000000 .. 0x40000000 reserved for PCI */
-    [VIRT_MEM] = { 0x40000000, 1ULL * 1024 * 1024 * 1024 },
-    [VIRT_ETHERNET] = { 0xfff41000, 0x1000 },
+    [VIRT_MEM] = { 0x40000000, 30ULL * 1024 * 1024 * 1024 },
 };
 
 static const int a15irqmap[] = {
     [VIRT_UART] = 1,
     [VIRT_MMIO] = 16, /* ...to 16 + NUM_VIRTIO_TRANSPORTS - 1 */
-    [VIRT_ETHERNET] = 77,
+    [VIRT_VFIO] = 48,
 };
 
 static VirtBoardInfo machines[] = {
@@ -266,7 +280,7 @@ static void fdt_add_gic_node(const VirtBoardInfo *vbi)
     qemu_fdt_setprop_cell(vbi->fdt, "/intc", "phandle", gic_phandle);
 }
 
-static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic)
+static void create_uart(const VirtBoardInfo *vbi)
 {
     char *nodename;
     hwaddr base = vbi->memmap[VIRT_UART].base;
@@ -275,7 +289,7 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic)
     const char compat[] = "arm,pl011\0arm,primecell";
     const char clocknames[] = "uartclk\0apb_pclk";
 
-    sysbus_create_simple("pl011", base, pic[irq]);
+    sysbus_create_simple("pl011", base, vbi->pic[irq]);
 
     nodename = g_strdup_printf("/pl011@%" PRIx64, base);
     qemu_fdt_add_subnode(vbi->fdt, nodename);
@@ -294,34 +308,133 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic)
     g_free(nodename);
 }
 
-static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
+/*
+ * Function called for each vfio-platform device option found in the
+ * qemu user command line:
+ * -device vfio-platform,vfio-device="<device>",compat"<compat>"
+ * for instance <device> can be fff51000.ethernet (device unbound from
+ * original driver and bound to vfio driver)
+ * for instance <compat> can be calxeda/hb-xgmac
+ * note "/" replaces normal ",". Indeed "," would be interpreted by QEMU as
+ * a separator
+ */
+
+static int vfio_init_func(QemuOpts *opts, void *opaque)
 {
+    const char *driver;
+    DeviceState *dev;
+    SysBusDevice *s;
+    VirtBoardInfo *vbi = (VirtBoardInfo *)opaque;
+    driver = qemu_opt_get(opts, "driver");
+
+    /* index the first IRQ should be mapped */
+    int irq_start = vbi->avail_vfio_irq;
+
     char *nodename;
-    hwaddr base = vbi->memmap[VIRT_ETHERNET].base;
-    hwaddr size = vbi->memmap[VIRT_ETHERNET].size;
-    const char compat[] = "calxeda,hb-xgmac";
-    int irqm = vbi->irqmap[VIRT_ETHERNET];
-    int irqp = irqm+1;
-    int irqlp = irqm+2;
 
-    sysbus_create_varargs("vfio-platform", base,
-                          pic[irqm], pic[irqp], pic[irqlp], NULL);
+    /* this will contain the capatibility string with the "/"
+     * replaced by ","
+     */
+    char *corrected_compat;
 
-    nodename = g_strdup_printf("/ethernet@%" PRIx64, base);
-    qemu_fdt_add_subnode(vbi->fdt, nodename);
+    char **pname  = g_malloc0(sizeof(char *));
+    char **pcompat = g_malloc0(sizeof(char *));
 
-    /* Note that we can't use setprop_string because of the embedded NUL */
-    qemu_fdt_setprop(vbi->fdt, nodename, "compatible", compat, sizeof(compat));
-    qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", 2, base, 2, size);
-    qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts",
-                                0x0, irqm, 0x4,
-                                0x0, irqp, 0x4,
-                                0x0, irqlp, 0x4);
+    int num_irqs;
+    size_t size;
+    int i;
+    uint32_t *irq_attr;
 
-    g_free(nodename);
+    if (!driver) {
+        qerror_report(QERR_MISSING_PARAMETER, "driver");
+        return -1 ;
+    }
+
+    if (strcasecmp(driver, "vfio-platform") == 0) {
+        dev = qdev_device_add(opts);
+        if (!dev) {
+            return -1;
+        }
+        s = SYS_BUS_DEVICE(dev);
+
+        vfio_get_props(s, pname, pcompat, &num_irqs, &size);
+
+        if (vbi->avail_vfio_base + size >= 0x10000000) {
+            /* register space size exceeds remaining VFIO space */
+            qerror_report(QERR_DEVICE_INIT_FAILED, *pname);
+        } else if (irq_start + num_irqs >= NUM_IRQS) {
+            /* VFIO IRQ number exceeded */
+            qerror_report(QERR_DEVICE_INIT_FAILED, *pname);
+        }
+
+        /*
+         * process compatibility property string passed by end-user
+         * replaces / by ,
+         * currently a single property compatibility value is supported!
+         */
+        corrected_compat = g_strdup(*pcompat);
+        char *slash = strchr(corrected_compat, '/');
+        *slash = ',';
+
+        sysbus_mmio_map(s, 0, vbi->avail_vfio_base);
+
+        nodename = g_strdup_printf("/%s@%" PRIx64,
+                                   *pname, vbi->avail_vfio_base);
+        qemu_fdt_add_subnode(vbi->fdt, nodename);
+
+        qemu_fdt_setprop(vbi->fdt, nodename, "compatible",
+                             corrected_compat, strlen(corrected_compat));
+
+        qemu_fdt_setprop_sized_cells(vbi->fdt, nodename,
+                             "reg", 2, vbi->avail_vfio_base, 2, size);
+
+        irq_attr = g_malloc0(num_irqs*3*sizeof(uint32_t));
+        for (i = 0; i < num_irqs; i++) {
+            sysbus_connect_irq(s, i, vbi->pic[irq_start+i]);
+
+            irq_attr[3*i] = cpu_to_be32(0);
+            irq_attr[3*i+1] = cpu_to_be32(irq_start+i);
+            irq_attr[3*i+2] = cpu_to_be32(0x4);
+        }
+
+        qemu_fdt_setprop(vbi->fdt, nodename, "interrupts",
+                         irq_attr, num_irqs*3*sizeof(uint32_t));
+
+        /* increment base address and IRQ index for next VFIO device */
+        vbi->avail_vfio_base += size;
+        vbi->avail_vfio_irq += num_irqs;
+
+        g_free(pcompat);
+        g_free(pname);
+        g_free(nodename);
+        g_free(corrected_compat);
+        g_free(irq_attr);
+
+        object_unref(OBJECT(dev));
+
+    }
+
+  return 0;
+}
+
+/*
+ * parses the option line and look for -device option
+ * for each of time vfio_init_func is called.
+ * this later only applies to -device vfio-platform ones
+ */
+
+static void create_vfio_devices(VirtBoardInfo *vbi)
+{
+    vbi->avail_vfio_base = vbi->memmap[VIRT_VFIO].base;
+    vbi->avail_vfio_irq =  vbi->irqmap[VIRT_VFIO];
+
+    if (qemu_opts_foreach(qemu_find_opts("device"),
+                        vfio_init_func, (void *)vbi, 1) != 0)
+        exit(1);
 }
 
-static void create_virtio_devices(const VirtBoardInfo *vbi, qemu_irq *pic)
+
+static void create_virtio_devices(const VirtBoardInfo *vbi)
 {
     int i;
     hwaddr size = vbi->memmap[VIRT_MMIO].size;
@@ -335,7 +448,7 @@ static void create_virtio_devices(const VirtBoardInfo *vbi, qemu_irq *pic)
         int irq = vbi->irqmap[VIRT_MMIO] + i;
         hwaddr base = vbi->memmap[VIRT_MMIO].base + i * size;
 
-        sysbus_create_simple("virtio-mmio", base, pic[irq]);
+        sysbus_create_simple("virtio-mmio", base, vbi->pic[irq]);
     }
 
     for (i = NUM_VIRTIO_TRANSPORTS - 1; i >= 0; i--) {
@@ -366,7 +479,6 @@ static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size)
 
 static void machvirt_init(QEMUMachineInitArgs *args)
 {
-    qemu_irq pic[NUM_IRQS];
     MemoryRegion *sysmem = get_system_memory();
     int n;
     MemoryRegion *ram = g_new(MemoryRegion, 1);
@@ -451,17 +563,19 @@ static void machvirt_init(QEMUMachineInitArgs *args)
     }
 
     for (n = 0; n < NUM_IRQS; n++) {
-        pic[n] = qdev_get_gpio_in(dev, n);
+        vbi->pic[n] = qdev_get_gpio_in(dev, n);
     }
 
-    create_uart(vbi, pic);
-    create_ethernet(vbi, pic);
+    create_uart(vbi);
+
+    /* create vfio platform devices if any are passed in command line*/
+    create_vfio_devices(vbi);
 
     /* Create mmio transports, so the user can create virtio backends
      * (which will be automatically plugged in to the transports). If
      * no backend is created the transport will just sit harmlessly idle.
      */
-    create_virtio_devices(vbi, pic);
+    create_virtio_devices(vbi);
 
     vbi->bootinfo.ram_size = args->ram_size;
     vbi->bootinfo.kernel_filename = args->kernel_filename;
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index f148edd..8f30d41 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -93,6 +93,7 @@ typedef struct VFIODevice {
     int num_irqs;
     int interrupt; /* type of the interrupt, might disappear */
     char *name;
+    char *compat;
     uint32_t mmap_timeout; /* mmap timeout value in ms */
     VFIORegion regions[PLATFORM_NUM_REGIONS];
     QLIST_ENTRY(VFIODevice) next;
@@ -100,6 +101,22 @@ typedef struct VFIODevice {
     QLIST_HEAD(, VFIOINTp) intp_list;
 } VFIODevice;
 
+/*
+ * returns properties from a QEMU VFIO device such as
+ * name, compatibility, num IRQs, size of the register set
+ */
+void vfio_get_props(SysBusDevice *s, char **pname,
+                    char **pcompat, int *pnum_irqs, size_t *psize);
+
+void vfio_get_props(SysBusDevice *s, char **pname,
+                    char **pcompat, int *pnum_irqs, size_t *psize) {
+
+     VFIODevice *vdev = DO_UPCAST(VFIODevice, sbdev, s);
+     *pname = vdev->name;
+     *pcompat = vdev->compat;
+     *pnum_irqs = vdev->num_irqs;
+     *psize = vdev->regions[0].size;
+}
 
 
 static void vfio_unmask_intp(VFIODevice *vdev, int index)
@@ -556,11 +573,6 @@ static void vfio_platform_realize(DeviceState *dev, Error **errp)
     struct stat st;
     int groupid, i, ret;
 
-
-    /* TODO: pass device name on command line */
-    vdev->name = malloc(PATH_MAX);
-    strcpy(vdev->name, "fff51000.ethernet");
-
     /* Check that the host device exists */
     snprintf(path, sizeof(path), "/sys/bus/platform/devices/%s/", vdev->name);
     if (stat(path, &st) < 0) {
@@ -568,6 +580,8 @@ static void vfio_platform_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    DPRINTF("vfio device %s, compat = %s\n", path, vdev->compat);
+
     strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
 
     len = readlink(path, iommu_group_path, PATH_MAX);
@@ -596,10 +610,15 @@ static void vfio_platform_realize(DeviceState *dev, Error **errp)
     QLIST_FOREACH(pvdev, &group->device_list, next) {
         DPRINTF("compare %s versus %s\n", pvdev->name, vdev->name);
         if (strcmp(pvdev->name, vdev->name) == 0) {
-
+            /*
+             * in current implementation realize is called twice:
+             * 1) once in the virt. machine (where qdev stuff are done +
+             *    device tree generation,
+             * 2) once in vl.c (-device standard handling)
+             * on 2) realize completes here.
+             */
             DPRINTF("vfio device %s already is attached to group %d\n",
                     vdev->name, groupid);
-
             vfio_put_group(group, NULL);
             return;
         }
@@ -625,14 +644,22 @@ static const VMStateDescription vfio_platform_vmstate = {
     .unmigratable = 1,
 };
 
+static Property vfio_platform_dev_properties[] = {
+DEFINE_PROP_STRING("vfio_device", VFIODevice, name),
+DEFINE_PROP_STRING("compat", VFIODevice, compat),
+DEFINE_PROP_UINT32("mmap-timeout-ms", VFIODevice, mmap_timeout, 1100),
+DEFINE_PROP_END_OF_LIST(),
+};
+
 static void vfio_platform_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
 
     dc->realize = vfio_platform_realize;
     dc->vmsd = &vfio_platform_vmstate;
+    dc->props = vfio_platform_dev_properties;
     dc->desc = "VFIO-based platform device assignment";
-    dc->cannot_instantiate_with_device_add_yet = true;
+    dc->cannot_instantiate_with_device_add_yet = false;
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 }
 
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [Qemu-devel] [RFC v2 6/6] vfio: add exit function and IRQ disable functions
  2014-04-09 15:33 [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough Eric Auger
                   ` (4 preceding siblings ...)
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 5/6] virt: Assign a VFIO platform device to a virt VM in QEMU command line Eric Auger
@ 2014-04-09 15:33 ` Eric Auger
  2014-04-11  1:28 ` [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough Kim Phillips
  6 siblings, 0 replies; 14+ messages in thread
From: Eric Auger @ 2014-04-09 15:33 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel
  Cc: peter.maydell, kim.phillips, Eric Auger, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

this patch brings misc improvements:
- improve comments
- remove interrupt field in VFIODevice
- add IRQ disable routines used by new exitfn function

Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 hw/vfio/platform.c | 167 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 127 insertions(+), 40 deletions(-)

diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 8f30d41..c4a4286 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -40,7 +40,6 @@
 
 #include "vfio-common.h"
 
-#define DEBUG_VFIO
 #ifdef DEBUG_VFIO
 #define DPRINTF(fmt, ...) \
     do { fprintf(stderr, "vfio: %s: " fmt, __func__, ## __VA_ARGS__); } \
@@ -69,7 +68,9 @@ typedef struct VFIORegion {
 } VFIORegion;
 
 
-#define VFIO_INT_INTp 4
+/*
+ * The IRQ structure inspired from PCI VFIOINTx
+ */
 
 typedef struct VFIOINTp {
     QLIST_ENTRY(VFIOINTp) next;
@@ -91,9 +92,8 @@ typedef struct VFIODevice {
     int fd;
     int num_regions;
     int num_irqs;
-    int interrupt; /* type of the interrupt, might disappear */
     char *name;
-    char *compat;
+    char *compat; /* compatibility string */
     uint32_t mmap_timeout; /* mmap timeout value in ms */
     VFIORegion regions[PLATFORM_NUM_REGIONS];
     QLIST_ENTRY(VFIODevice) next;
@@ -101,9 +101,11 @@ typedef struct VFIODevice {
     QLIST_HEAD(, VFIOINTp) intp_list;
 } VFIODevice;
 
+
 /*
  * returns properties from a QEMU VFIO device such as
  * name, compatibility, num IRQs, size of the register set
+ * currently used by virt machine
  */
 void vfio_get_props(SysBusDevice *s, char **pname,
                     char **pcompat, int *pnum_irqs, size_t *psize);
@@ -118,6 +120,20 @@ void vfio_get_props(SysBusDevice *s, char **pname,
      *psize = vdev->regions[0].size;
 }
 
+static void vfio_disable_irqindex(VFIODevice *vdev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
+        .index = index,
+        .start = 0,
+        .count = 0,
+    };
+
+    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+
 
 static void vfio_unmask_intp(VFIODevice *vdev, int index)
 {
@@ -132,12 +148,17 @@ static void vfio_unmask_intp(VFIODevice *vdev, int index)
     ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
 }
 
-
+/*
+ * Checks whether the IRQ was EOI'ed. In the positive the fast path
+ * is restored (reg space is mmaped). In the negative the reg space
+ * stays as an MMIO region (ops) and the mmap timer is reprogrammed
+ * to check the same condition after mmap_timeout ms
+ */
 
 
 static void vfio_intp_mmap_enable(void *opaque)
 {
-    VFIOINTp * intp = (VFIOINTp *)opaque;
+    VFIOINTp *intp = (VFIOINTp *)opaque;
     VFIODevice *vdev = intp->vdev;
 
     if (intp->pending) {
@@ -153,6 +174,10 @@ static void vfio_intp_mmap_enable(void *opaque)
 }
 
 
+/*
+ * The fd handler
+ */
+
 
 static void vfio_intp_interrupt(void *opaque)
 {
@@ -176,11 +201,14 @@ static void vfio_intp_interrupt(void *opaque)
      */
 
     VFIORegion *region = &vdev->regions[0];
+
+    /* register space is unmapped to trap EOI */
     memory_region_set_enabled(&region->mmap_mem, false);
 
+    /* trigger the virtual IRQ */
     qemu_set_irq(intp->qemuirq, 1);
 
-    /* schedule the mmap timer which will restote mmap path after EOI*/
+    /* schedule the mmap timer which will restore mmap path after EOI*/
     if (intp->mmap_timeout) {
         timer_mod(intp->mmap_timer,
                   qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + intp->mmap_timeout);
@@ -200,13 +228,18 @@ static void vfio_irq_eoi(VFIODevice *vdev)
         if (intp->pending) {
             if (eoi_done) {
                 DPRINTF("several IRQ pending simultaneously: \
-                         this is not a supported case yet\n");
+                         this case is not tested yet\n");
             }
+
             DPRINTF("EOI IRQ #%d fd=%d\n",
                     intp->pin, event_notifier_get_fd(&intp->interrupt));
+
             intp->pending = false;
+
+            /* deassert the virtual IRQ and unmask physical one */
             qemu_set_irq(intp->qemuirq, 0);
             vfio_unmask_intp(vdev, intp->pin);
+
             eoi_done = true;
         }
     }
@@ -217,22 +250,6 @@ static void vfio_irq_eoi(VFIODevice *vdev)
 
 
 
-#if 0
-static void vfio_list_intp(VFIODevice *vdev)
-{
-    VFIOINTp *intp;
-    int i = 0;
-    QLIST_FOREACH(intp, &vdev->intp_list, next) {
-        DPRINTF("IRQ #%d\n", i);
-        DPRINTF("- pin = %d\n", intp->pin);
-        DPRINTF("- fd = %d\n", event_notifier_get_fd(&intp->interrupt));
-        DPRINTF("- pending = %d\n", (int)intp->pending);
-        DPRINTF("- kvm_accel = %d\n", (int)intp->kvm_accel);
-        i++;
-    }
-}
-#endif
-
 static int vfio_enable_intp(VFIODevice *vdev, unsigned int index)
 {
     struct vfio_irq_set *irq_set; /* irq structure passed to vfio kernel */
@@ -242,8 +259,6 @@ static int vfio_enable_intp(VFIODevice *vdev, unsigned int index)
     int device = vdev->fd;
     SysBusDevice *sbdev = SYS_BUS_DEVICE(vdev);
 
-    vdev->interrupt = VFIO_INT_INTp;
-
     /* allocate and populate a new VFIOINTp structure put in a queue list */
     VFIOINTp *intp = g_malloc0(sizeof(*intp));
     intp->vdev = vdev;
@@ -251,17 +266,12 @@ static int vfio_enable_intp(VFIODevice *vdev, unsigned int index)
     intp->pending = false;
     intp->mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
                                     vfio_intp_mmap_enable, intp);
-    intp->mmap_timeout = 1100;
-    /* TO DO: timeout as parameter */
 
-    /* incr sysbus num_irq and sets sysbus->irqp[n] = &intp->qemuirq
-     * only the address of the qemu_irq is set here
-     */
+    /* TO DO: currently each IRQ has the same mmap timeout */
+    intp->mmap_timeout = intp->vdev->mmap_timeout;
 
     sysbus_init_irq(sbdev, &intp->qemuirq);
 
-    /* content is set in sysbus_connect_irq (currently in machine definition) */
-
     ret = event_notifier_init(&intp->interrupt, 0);
     if (ret) {
         error_report("vfio: Error: event_notifier_init failed ");
@@ -302,7 +312,7 @@ static int vfio_enable_intp(VFIODevice *vdev, unsigned int index)
     g_free(irq_set);
 
     if (ret) {
-        error_report("vfio: Error: Failed to pass Int fd to the driver: %m");
+        error_report("vfio: Error: Failed to pass IRQ fd to the driver: %m");
         qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
         close(*pfd); /* TO DO : replace by event_notifier_cleanup */
         return -errno;
@@ -317,6 +327,30 @@ static int vfio_enable_intp(VFIODevice *vdev, unsigned int index)
 }
 
 
+static void vfio_disable_intp(VFIODevice *vdev)
+{
+    VFIOINTp *intp;
+    int fd;
+
+    QLIST_FOREACH(intp, &vdev->intp_list, next) {
+        fd = event_notifier_get_fd(&intp->interrupt);
+        DPRINTF("close IRQ pin=%d fd=%d\n", intp->pin, fd);
+
+        vfio_disable_irqindex(vdev, intp->pin);
+        intp->pending = false;
+        qemu_set_irq(intp->qemuirq, 0);
+
+        qemu_set_fd_handler(fd, NULL, NULL, vdev);
+        event_notifier_cleanup(&intp->interrupt);
+    }
+
+    VFIORegion *region = &vdev->regions[0];
+    memory_region_set_enabled(&region->mmap_mem, true);
+
+}
+
+
+
 
 
 static int vfio_mmap_region(VFIODevice *vdev, VFIORegion *region,
@@ -386,11 +420,11 @@ static void vfio_region_write(void *opaque, hwaddr addr,
     }
 
     if (pwrite(region->fd, &buf, size, region->fd_offset + addr) != size) {
-        error_report("(,0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m",
+        error_report("(0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m",
                      addr, data, size);
     }
 
-    DPRINTF("(region %d+0x%"HWADDR_PRIx", 0x%"PRIx64", %d)\n",
+    DPRINTF("(region %d, addr=0x%"HWADDR_PRIx", data= 0x%"PRIx64", %d)\n",
             region->nr, addr, data, size);
 
     vfio_irq_eoi(container_of(region, VFIODevice, regions[region->nr]));
@@ -409,7 +443,7 @@ static uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size)
     uint64_t data = 0;
 
     if (pread(region->fd, &buf, size, region->fd_offset + addr) != size) {
-        error_report("(,0x%"HWADDR_PRIx", %d) failed: %m",
+        error_report("(0x%"HWADDR_PRIx", %d) failed: %m",
                      addr, size);
         return (uint64_t)-1;
     }
@@ -429,7 +463,7 @@ static uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size)
         break;
     }
 
-    DPRINTF("(region %d+0x%"HWADDR_PRIx", %d) = 0x%"PRIx64"\n",
+    DPRINTF("(region %d, addr= 0x%"HWADDR_PRIx", data=%d) = 0x%"PRIx64"\n",
             region->nr, addr, size, data);
 
     vfio_irq_eoi(container_of(region, VFIODevice, regions[region->nr]));
@@ -464,6 +498,17 @@ static void vfio_map_region(VFIODevice *vdev, int nr)
     }
 }
 
+
+static void vfio_unmap_region(VFIODevice *vdev, int nr)
+{
+VFIORegion *region = &vdev->regions[nr];
+memory_region_del_subregion(&region->mem, &region->mmap_mem);
+munmap(region->mmap, memory_region_size(&region->mmap_mem));
+memory_region_destroy(&region->mmap_mem);
+}
+
+
+
 static int vfio_get_device(VFIOGroup *group, const char *name,
                            struct VFIODevice *vdev)
 {
@@ -624,8 +669,6 @@ static void vfio_platform_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    DPRINTF("Calling vfio_get_device ...\n");
-
     ret = vfio_get_device(group, path, vdev);
     if (ret) {
         error_report("vfio: failed to get device %s", path);
@@ -639,11 +682,50 @@ static void vfio_platform_realize(DeviceState *dev, Error **errp)
     }
 }
 
+static void vfio_platform_reset_handler(void *opaque)
+{
+}
+
+
+static void vfio_put_device(VFIODevice *vdev)
+{
+    QLIST_REMOVE(vdev, next);
+    vdev->group = NULL;
+    DPRINTF("vfio_put_device: close vdev->fd\n");
+    close(vdev->fd);
+}
+
+static void vfio_platform_exitfn(VFIODevice *dev)
+{
+    VFIOGroup *group = dev->group;
+    VFIOINTp *intp, *next_intp;
+
+    vfio_disable_intp(dev);
+
+    QLIST_FOREACH_SAFE(intp, &dev->intp_list, next, next_intp) {
+        QLIST_REMOVE(intp, next);
+        if (intp->mmap_timer) {
+            timer_free(intp->mmap_timer);
+        }
+        g_free(intp);
+    }
+
+    vfio_unmap_region(dev, 0);
+    vfio_put_device(dev);
+    vfio_put_group(group, vfio_platform_reset_handler);
+}
+
+
 static const VMStateDescription vfio_platform_vmstate = {
     .name = TYPE_VFIO_PLATFORM,
     .unmigratable = 1,
 };
 
+typedef struct VFIOPlatformDeviceClass {
+    DeviceClass parent_class;
+    void (*exit)(VFIODevice *);
+} VFIOPlatformDeviceClass;
+
 static Property vfio_platform_dev_properties[] = {
 DEFINE_PROP_STRING("vfio_device", VFIODevice, name),
 DEFINE_PROP_STRING("compat", VFIODevice, compat),
@@ -651,15 +733,20 @@ DEFINE_PROP_UINT32("mmap-timeout-ms", VFIODevice, mmap_timeout, 1100),
 DEFINE_PROP_END_OF_LIST(),
 };
 
+#define VFIO_PLATFORM_DEVICE_CLASS(klass) \
+     OBJECT_CLASS_CHECK(VFIOPlatformDeviceClass, (klass), TYPE_VFIO_PLATFORM)
+
 static void vfio_platform_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
+    VFIOPlatformDeviceClass *vc = VFIO_PLATFORM_DEVICE_CLASS(klass);
 
     dc->realize = vfio_platform_realize;
     dc->vmsd = &vfio_platform_vmstate;
     dc->props = vfio_platform_dev_properties;
     dc->desc = "VFIO-based platform device assignment";
     dc->cannot_instantiate_with_device_add_yet = false;
+    vc->exit = vfio_platform_exitfn;
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 }
 
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [RFC v2 1/6] hw/arm/virt: add a xgmac device
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 1/6] hw/arm/virt: add a xgmac device Eric Auger
@ 2014-04-10 13:26   ` Peter Crosthwaite
  2014-04-10 13:48     ` Alexander Graf
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Crosthwaite @ 2014-04-10 13:26 UTC (permalink / raw)
  To: Eric Auger, Alistair Francis
  Cc: Peter Maydell, Kim Phillips, eric.auger, Kim Phillips,
	qemu-devel@nongnu.org Developers, Alexander Graf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm,
	Christoffer Dall

On Thu, Apr 10, 2014 at 1:33 AM, Eric Auger <eric.auger@linaro.org> wrote:
> From: Kim Phillips <kim.phillips@linaro.org>
>
> This is a hack and only serves as an example of what needs to be
> done to make the next RFC - add vfio-platform support - work
> for development purposes on a Calxeda Midway system.  We don't want
> mach-virt to always create this ethernet device - DO NOT APPLY, etc.
>
> Initial attempts to convince QEMU to create a memory mapped device
> on the command line (e.g., -device vfio-platform,name=fff51000.ethernet)
> would fail with "Parameter 'driver' expects pluggable device type".

Alistair is working on this. cc.

Regards,
Peter

> Any guidance as to how to overcome this apparent design limitation
> is welcome.
>
> RAM is reduced from 30 to 1GiB such as to not overlap the xgmac device's
> physical address.  Not sure if the 30GiB RAM (or whatever the user sets
> it to with -m) could be set up above 0x1_0000_0000, but there is probably
> extra work needed to resolve this type of conflict.
>
> note: vfio-platform interrupt support development may want interrupt
> property data filled; here it's omitted for the time being.
>
> Not-signed-off-by: Kim Phillips <kim.phillips@linaro.org>
> ---
>  hw/arm/virt.c | 24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 2bbc931..5d43cf0 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -65,6 +65,7 @@ enum {
>      VIRT_GIC_CPU,
>      VIRT_UART,
>      VIRT_MMIO,
> +    VIRT_ETHERNET,
>  };
>
>  typedef struct MemMapEntry {
> @@ -106,7 +107,8 @@ static const MemMapEntry a15memmap[] = {
>      [VIRT_MMIO] = { 0xa000000, 0x200 },
>      /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */
>      /* 0x10000000 .. 0x40000000 reserved for PCI */
> -    [VIRT_MEM] = { 0x40000000, 30ULL * 1024 * 1024 * 1024 },
> +    [VIRT_MEM] = { 0x40000000, 1ULL * 1024 * 1024 * 1024 },
> +    [VIRT_ETHERNET] = { 0xfff51000, 0x1000 },
>  };
>
>  static const int a15irqmap[] = {
> @@ -291,6 +293,25 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic)
>      g_free(nodename);
>  }
>
> +static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
> +{
> +    char *nodename;
> +    hwaddr base = vbi->memmap[VIRT_ETHERNET].base;
> +    hwaddr size = vbi->memmap[VIRT_ETHERNET].size;
> +    const char compat[] = "calxeda,hb-xgmac";
> +
> +    sysbus_create_simple("vfio-platform", base, NULL);
> +
> +    nodename = g_strdup_printf("/ethernet@%" PRIx64, base);
> +    qemu_fdt_add_subnode(vbi->fdt, nodename);
> +
> +    /* Note that we can't use setprop_string because of the embedded NUL */
> +    qemu_fdt_setprop(vbi->fdt, nodename, "compatible", compat, sizeof(compat));
> +    qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", 2, base, 2, size);
> +
> +    g_free(nodename);
> +}
> +
>  static void create_virtio_devices(const VirtBoardInfo *vbi, qemu_irq *pic)
>  {
>      int i;
> @@ -425,6 +446,7 @@ static void machvirt_init(QEMUMachineInitArgs *args)
>      }
>
>      create_uart(vbi, pic);
> +    create_ethernet(vbi, pic);
>
>      /* Create mmio transports, so the user can create virtio backends
>       * (which will be automatically plugged in to the transports). If
> --
> 1.8.3.2
>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [RFC v2 1/6] hw/arm/virt: add a xgmac device
  2014-04-10 13:26   ` Peter Crosthwaite
@ 2014-04-10 13:48     ` Alexander Graf
  2014-04-11  5:41       ` Alistair Francis
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Graf @ 2014-04-10 13:48 UTC (permalink / raw)
  To: Peter Crosthwaite, Eric Auger, Alistair Francis
  Cc: Peter Maydell, Kim Phillips, eric.auger, Kim Phillips,
	qemu-devel@nongnu.org Developers, stuart.yoder, alex.williamson,
	christophe.barnichon, a.motakis, kvmarm, Christoffer Dall


On 10.04.14 15:26, Peter Crosthwaite wrote:
> On Thu, Apr 10, 2014 at 1:33 AM, Eric Auger <eric.auger@linaro.org> wrote:
>> From: Kim Phillips <kim.phillips@linaro.org>
>>
>> This is a hack and only serves as an example of what needs to be
>> done to make the next RFC - add vfio-platform support - work
>> for development purposes on a Calxeda Midway system.  We don't want
>> mach-virt to always create this ethernet device - DO NOT APPLY, etc.
>>
>> Initial attempts to convince QEMU to create a memory mapped device
>> on the command line (e.g., -device vfio-platform,name=fff51000.ethernet)
>> would fail with "Parameter 'driver' expects pluggable device type".
> Alistair is working on this. cc.

Alaistair, I've had patches tackle this on the mailing list a few months 
ago and received good comments from Anthony on what to change. How far 
in are you already? I'd like to make sure we're on the same page here 
(and don't duplicate work).


Alex

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough
  2014-04-09 15:33 [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough Eric Auger
                   ` (5 preceding siblings ...)
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 6/6] vfio: add exit function and IRQ disable functions Eric Auger
@ 2014-04-11  1:28 ` Kim Phillips
  6 siblings, 0 replies; 14+ messages in thread
From: Kim Phillips @ 2014-04-11  1:28 UTC (permalink / raw)
  To: Eric Auger
  Cc: peter.maydell, eric.auger, qemu-devel, agraf, alex.williamson,
	christophe.barnichon, stuart.yoder, a.motakis, kvmarm,
	christoffer.dall

On Wed, 9 Apr 2014 16:33:03 +0100
Eric Auger <eric.auger@linaro.org> wrote:

> some patches need to be applied on host linux to correct issues
> not yet upstreamed. They can be found on vfio-dev branch of
> git://git.linaro.org/people/eric.auger/linux.git
> 
> - commit 997691b: enables unmapping stage2 entries on memory region deletion
> - commit 731e308: corrects an in read/write function of vfio platform driver
> - iommu/arm-smmu commit serie and especially bf5a852 fixing an issue with
>   Calxeda Midway sMMU.

also these:

"ARM: KVM: Enable the KVM-VFIO device":
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-March/008629.html

"ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping":
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-March/008630.html

Kim

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [RFC v2 4/6] vfio: Add initial IRQ support in QEMU platform device
  2014-04-09 15:33 ` [Qemu-devel] [RFC v2 4/6] vfio: Add initial IRQ support in QEMU platform device Eric Auger
@ 2014-04-11  1:34   ` Kim Phillips
  0 siblings, 0 replies; 14+ messages in thread
From: Kim Phillips @ 2014-04-11  1:34 UTC (permalink / raw)
  To: Eric Auger
  Cc: peter.maydell, eric.auger, qemu-devel, agraf, alex.williamson,
	christophe.barnichon, stuart.yoder, a.motakis, kvmarm,
	christoffer.dall

On Wed, 9 Apr 2014 16:33:07 +0100
Eric Auger <eric.auger@linaro.org> wrote:

> This work is inspired of PCI INTx. The code was prepared to support
> multiple IRQs but this was not tested at that stage. Similarly to what
> is done on PCI, the device register space is RAM unmapped on IRQ hit
> in order to trap the end of interrupt (EOI). On mmap timer hit, if the
> EOI was trapped, the mmapping is restored. the physical interrupt is
> unmasked on the first read/write with the MMIO register space.
> 
> Tested on Calxeda Midway xgmac which can be directly assigned to one
> virt VM.
> 
> Acknowledgements:
> - vfio device name, guest physical address and IRQ indexes are
>   hardcoded. This will be improved in another patch
> - currently tested on a single IRQ (xgmac main IRQ)
> - a KVM patch is required to invalidate stage2 entries on RAM memory
>   region destruction (see [PATCH] ARM: KVM: Handle IPA unmapping on
>   memory region deletion)
> 
> Signed-off-by: Eric Auger <eric.auger@linaro.org>
> ---

thanks for this, Eric.

>  hw/arm/virt.c      |  13 ++-
>  hw/vfio/platform.c | 316 +++++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 304 insertions(+), 25 deletions(-)
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 5d43cf0..31ae7d2 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -108,12 +108,13 @@ static const MemMapEntry a15memmap[] = {
>      /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */
>      /* 0x10000000 .. 0x40000000 reserved for PCI */
>      [VIRT_MEM] = { 0x40000000, 1ULL * 1024 * 1024 * 1024 },
> -    [VIRT_ETHERNET] = { 0xfff51000, 0x1000 },
> +    [VIRT_ETHERNET] = { 0xfff41000, 0x1000 },

this change isn't explained (the device is at physical 0xfff51000,
not 0xfff41000), and looks like it belongs in the first patch of the
series anyway.   Note: see e.g., commit f5fdcd6e5 "hw/arm: Add
'virt' platform" for an example of how to re-compose commit message
text for patches that undergo a change of author.

>  };
>  
>  static const int a15irqmap[] = {
>      [VIRT_UART] = 1,
>      [VIRT_MMIO] = 16, /* ...to 16 + NUM_VIRTIO_TRANSPORTS - 1 */
> +    [VIRT_ETHERNET] = 77,
>  };
>  
>  static VirtBoardInfo machines[] = {
> @@ -299,8 +300,12 @@ static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
>      hwaddr base = vbi->memmap[VIRT_ETHERNET].base;
>      hwaddr size = vbi->memmap[VIRT_ETHERNET].size;
>      const char compat[] = "calxeda,hb-xgmac";
> +    int irqm = vbi->irqmap[VIRT_ETHERNET];
> +    int irqp = irqm+1;
> +    int irqlp = irqm+2;
>  
> -    sysbus_create_simple("vfio-platform", base, NULL);
> +    sysbus_create_varargs("vfio-platform", base,
> +                          pic[irqm], pic[irqp], pic[irqlp], NULL);
>  
>      nodename = g_strdup_printf("/ethernet@%" PRIx64, base);
>      qemu_fdt_add_subnode(vbi->fdt, nodename);
> @@ -308,6 +313,10 @@ static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
>      /* Note that we can't use setprop_string because of the embedded NUL */
>      qemu_fdt_setprop(vbi->fdt, nodename, "compatible", compat, sizeof(compat));
>      qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", 2, base, 2, size);
> +    qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts",
> +                                0x0, irqm, 0x4,
> +                                0x0, irqp, 0x4,
> +                                0x0, irqlp, 0x4);

fwiw, it would have been nice to reuse the irq variable names used
in hw/net/xgmac.c, but whatever...

>      g_free(nodename);
>  }
> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
> index 138fb13..f148edd 100644
> --- a/hw/vfio/platform.c
> +++ b/hw/vfio/platform.c
> @@ -24,6 +24,7 @@
>  #include "config.h"
>  #include "exec/address-spaces.h"
>  #include "exec/memory.h"
> +
>  #include "qemu-common.h"
>  #include "qemu/error-report.h"
>  #include "qemu/event_notifier.h"
> @@ -31,6 +32,7 @@
>  #include "qemu/range.h"
>  #include "sysemu/kvm.h"
>  #include "sysemu/sysemu.h"
> +
>  #include "hw/qdev-properties.h"
>  #include "migration/vmstate.h"
>  #include "hw/hw.h"
> @@ -41,12 +43,15 @@
>  #define DEBUG_VFIO
>  #ifdef DEBUG_VFIO
>  #define DPRINTF(fmt, ...) \
> -    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> +    do { fprintf(stderr, "vfio: %s: " fmt, __func__, ## __VA_ARGS__); } \
> +        while (0)

these changes are unrelated to the subject of this "add IRQ" patch -
either make them their own patch (in between the last one and this
one), or merge them with the prior patch (that introduces the code),
but since their nature starts to drift from the corresponding PCI
code counterpart, I can't help think why they shouldn't be applied
there, too.  Probably best dropping them for now - these and
cleaning up more of the superfluous DPRINTFs added in this series.

>  #else
>  #define DPRINTF(fmt, ...) \
>      do { } while (0)
>  #endif
>  
> +#define PLATFORM_NUM_REGIONS 10
> +

this is a regression from the third patch in this series - see below.

>  /* Extra debugging, trap acceleration paths for more logging */
>  #define VFIO_ALLOW_MMAP 1
>  
> @@ -63,16 +68,240 @@ typedef struct VFIORegion {
>      uint8_t nr; /* cache the region number for debug */
>  } VFIORegion;
>  
> +
> +#define VFIO_INT_INTp 4

I've just cloned your qemu tree, vfio-dev-integ branch, and did a
grep for where this is used, and got no occurrences at all, so this
must be being removed in a later patch in the series...which makes
a proper review, well, difficult.

> +typedef struct VFIOINTp {
> +    QLIST_ENTRY(VFIOINTp) next;
> +    EventNotifier interrupt; /* eventfd triggered on interrupt */
> +    EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
> +    qemu_irq qemuirq;
> +    struct VFIODevice *vdev; /* back pointer to device */
> +    bool pending; /* interrupt pending */
> +    bool kvm_accel; /* set when QEMU bypass through KVM enabled */
> +    uint8_t pin; /* index */
> +    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
> +    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
> +} VFIOINTp;
> +

please look at merging this with hw/vfio/pci.c's VFIOINTx definition
and putting the result in hw/vfio/vfio-common.h.

> +
>  typedef struct VFIODevice {
>      SysBusDevice sbdev;
>      int fd;
>      int num_regions;
> -    VFIORegion *regions;
> +    int num_irqs;
> +    int interrupt; /* type of the interrupt, might disappear */

sigh, and it does disappear, presumably in a later patch :(

> +    char *name;
> +    uint32_t mmap_timeout; /* mmap timeout value in ms */
> +    VFIORegion regions[PLATFORM_NUM_REGIONS];

this is a regression over the last version of the third patch I sent
you; 'regions' should remain dynamically allocated - here, they're
being fixed.

>      QLIST_ENTRY(VFIODevice) next;
>      struct VFIOGroup *group;
> -    char *name;
> +    QLIST_HEAD(, VFIOINTp) intp_list;
>  } VFIODevice;
>  
> +
> +
> +static void vfio_unmask_intp(VFIODevice *vdev, int index)
> +{
> +    struct vfio_irq_set irq_set = {
> +        .argsz = sizeof(irq_set),
> +        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK,
> +        .index = index,
> +        .start = 0,
> +        .count = 1,
> +    };
> +
> +    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +}
> +
> +
> +
> +

extra whitespace

> +static void vfio_intp_mmap_enable(void *opaque)
> +{
> +    VFIOINTp * intp = (VFIOINTp *)opaque;
> +    VFIODevice *vdev = intp->vdev;
> +
> +    if (intp->pending) {
> +        DPRINTF("IRQ still pending, re-schedule the mmap timer\n");
> +        timer_mod(intp->mmap_timer,
> +                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + intp->mmap_timeout);
> +        return;
> +    }
> +
> +    DPRINTF("IRQ EOI'ed sets mmap again\n");
> +    VFIORegion *region = &vdev->regions[0];
> +    memory_region_set_enabled(&region->mmap_mem, true);
> +}
> +
> +
> +
> +static void vfio_intp_interrupt(void *opaque)
> +{
> +    int ret;
> +    VFIOINTp *intp = (VFIOINTp *)opaque;
> +    VFIODevice *vdev = intp->vdev;
> +
> +    DPRINTF("pin = %d fd = %d\n",
> +            intp->pin, event_notifier_get_fd(&intp->interrupt));
> +
> +    ret = event_notifier_test_and_clear(&intp->interrupt);
> +    if (!ret) {
> +        DPRINTF("Error when clearing fd=%d\n",
> +                event_notifier_get_fd(&intp->interrupt));
> +    }
> +
> +    intp->pending = true;
> +
> +    /* TODO: fix this number of regions,
> +     * currently a single region is handled
> +     */

please do :)

> +    VFIORegion *region = &vdev->regions[0];
> +    memory_region_set_enabled(&region->mmap_mem, false);
> +
> +    qemu_set_irq(intp->qemuirq, 1);
> +
> +    /* schedule the mmap timer which will restote mmap path after EOI*/

restore

> +    if (intp->mmap_timeout) {
> +        timer_mod(intp->mmap_timer,
> +                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + intp->mmap_timeout);
> +    }
> +
> +}
> +
> +
> +
> +static void vfio_irq_eoi(VFIODevice *vdev)
> +{
> +
> +    VFIOINTp *intp;
> +    bool eoi_done = false;
> +
> +    QLIST_FOREACH(intp, &vdev->intp_list, next) {
> +        if (intp->pending) {
> +            if (eoi_done) {
> +                DPRINTF("several IRQ pending simultaneously: \
> +                         this is not a supported case yet\n");
> +            }

If the thing to do in this case is not remap MMIO to the 'fast
path', then we should do that.  Otherwise, I think I'd protect this
with DEBUG in the meantime..

> +            DPRINTF("EOI IRQ #%d fd=%d\n",
> +                    intp->pin, event_notifier_get_fd(&intp->interrupt));
> +            intp->pending = false;
> +            qemu_set_irq(intp->qemuirq, 0);
> +            vfio_unmask_intp(vdev, intp->pin);
> +            eoi_done = true;
> +        }
> +    }
> +
> +    return;
> +
> +}
> +
> +
> +
> +#if 0

please, no dead code.

> +static void vfio_list_intp(VFIODevice *vdev)
> +{
> +    VFIOINTp *intp;
> +    int i = 0;
> +    QLIST_FOREACH(intp, &vdev->intp_list, next) {
> +        DPRINTF("IRQ #%d\n", i);
> +        DPRINTF("- pin = %d\n", intp->pin);
> +        DPRINTF("- fd = %d\n", event_notifier_get_fd(&intp->interrupt));
> +        DPRINTF("- pending = %d\n", (int)intp->pending);
> +        DPRINTF("- kvm_accel = %d\n", (int)intp->kvm_accel);
> +        i++;
> +    }
> +}
> +#endif
> +
> +static int vfio_enable_intp(VFIODevice *vdev, unsigned int index)
> +{
> +    struct vfio_irq_set *irq_set; /* irq structure passed to vfio kernel */
> +    int32_t *pfd; /* pointer to the eventfd */
> +    int ret, argsz;
> +
> +    int device = vdev->fd;
> +    SysBusDevice *sbdev = SYS_BUS_DEVICE(vdev);
> +
> +    vdev->interrupt = VFIO_INT_INTp;
> +
> +    /* allocate and populate a new VFIOINTp structure put in a queue list */
> +    VFIOINTp *intp = g_malloc0(sizeof(*intp));

mixed declarations - here and elsewhere - have you read the
qemu/CODING_STYLE file?

> +    intp->vdev = vdev;
> +    intp->pin = index;
> +    intp->pending = false;
> +    intp->mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
> +                                    vfio_intp_mmap_enable, intp);
> +    intp->mmap_timeout = 1100;
> +    /* TO DO: timeout as parameter */

please do :)

> +    /* incr sysbus num_irq and sets sysbus->irqp[n] = &intp->qemuirq
> +     * only the address of the qemu_irq is set here
> +     */

/*
 * multi-line comments are
 * formatted like this
 */


> +    sysbus_init_irq(sbdev, &intp->qemuirq);
> +
> +    /* content is set in sysbus_connect_irq (currently in machine definition) */
> +
> +    ret = event_notifier_init(&intp->interrupt, 0);
> +    if (ret) {
> +        error_report("vfio: Error: event_notifier_init failed ");
> +        return ret;
> +    }
> +
> +    /* build the irq_set to be passed to the vfio kernel driver */
> +
> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
> +
> +    irq_set = g_malloc0(argsz);
> +
> +    if (!irq_set) {
> +        DPRINTF("failure while allocating memory for irq\n");
> +        qemu_log_mask(LOG_GUEST_ERROR,
> +                 "VFIO platform: failure while allocating memory for irq");
> +        return -errno;
> +    }
> +
> +    irq_set->argsz = argsz;
> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
> +    irq_set->index = index;
> +    irq_set->start = 0;
> +    irq_set->count = 1;
> +    pfd = (int32_t *)&irq_set->data;
> +
> +    *pfd = event_notifier_get_fd(&intp->interrupt);
> +
> +    DPRINTF("register fd=%d/irq index=%d to kernel\n", *pfd, index);
> +
> +    qemu_set_fd_handler(*pfd, vfio_intp_interrupt, NULL, intp);
> +
> +    /* pass the index/fd binding to the kernel driver so that it
> +     * triggers this fd on HW IRQ
> +     */
> +    ret = ioctl(device, VFIO_DEVICE_SET_IRQS, irq_set);
> +
> +    g_free(irq_set);
> +
> +    if (ret) {
> +        error_report("vfio: Error: Failed to pass Int fd to the driver: %m");
> +        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
> +        close(*pfd); /* TO DO : replace by event_notifier_cleanup */
> +        return -errno;
> +    }
> +
> +    /* store the new intp in qlist */
> +
> +    QLIST_INSERT_HEAD(&vdev->intp_list, intp, next);
> +
> +    return 0;
> +
> +}
> +
> +
> +
> +
>  static int vfio_mmap_region(VFIODevice *vdev, VFIORegion *region,
>                           MemoryRegion *mem, MemoryRegion *submem,
>                           void **map, size_t size, off_t offset,
> @@ -112,6 +341,7 @@ error:
>  /*
>   * IO Port/MMIO - Beware of the endians, VFIO is always little endian
>   */
> +
>  static void vfio_region_write(void *opaque, hwaddr addr,
>                                uint64_t data, unsigned size)
>  {
> @@ -139,12 +369,15 @@ static void vfio_region_write(void *opaque, hwaddr addr,
>      }
>  
>      if (pwrite(region->fd, &buf, size, region->fd_offset + addr) != size) {
> -        error_report("%s(,0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m",
> -                     __func__, addr, data, size);
> +        error_report("(,0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m",
> +                     addr, data, size);
>      }
>  
> -    DPRINTF("%s(region %d+0x%"HWADDR_PRIx", 0x%"PRIx64", %d)\n",
> -             __func__, region->nr, addr, data, size);
> +    DPRINTF("(region %d+0x%"HWADDR_PRIx", 0x%"PRIx64", %d)\n",
> +            region->nr, addr, data, size);
> +
> +    vfio_irq_eoi(container_of(region, VFIODevice, regions[region->nr]));
> +
>  }
>  
>  static uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size)
> @@ -159,8 +392,8 @@ static uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size)
>      uint64_t data = 0;
>  
>      if (pread(region->fd, &buf, size, region->fd_offset + addr) != size) {
> -        error_report("%s(,0x%"HWADDR_PRIx", %d) failed: %m",
> -                     __func__, addr, size);
> +        error_report("(,0x%"HWADDR_PRIx", %d) failed: %m",
> +                     addr, size);



>          return (uint64_t)-1;
>      }
>  
> @@ -179,18 +412,22 @@ static uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size)
>          break;
>      }
>  
> -    DPRINTF("%s(region %d+0x%"HWADDR_PRIx", %d) = 0x%"PRIx64"\n",
> -            __func__, region->nr, addr, size, data);
> +    DPRINTF("(region %d+0x%"HWADDR_PRIx", %d) = 0x%"PRIx64"\n",
> +            region->nr, addr, size, data);
> +
> +    vfio_irq_eoi(container_of(region, VFIODevice, regions[region->nr]));
>  
>      return data;
>  }
>  
> +
>  static const MemoryRegionOps vfio_region_ops = {
>      .read = vfio_region_read,
>      .write = vfio_region_write,
>      .endianness = DEVICE_NATIVE_ENDIAN,
>  };
>  
> +
>  static void vfio_map_region(VFIODevice *vdev, int nr)
>  {
>      VFIORegion *region = &vdev->regions[nr];

unrelated whitespace changes

> @@ -242,14 +479,6 @@ static int vfio_get_device(VFIOGroup *group, const char *name,
>      DPRINTF("Device %s flags: %u, regions: %u, irqs: %u\n", name,
>              dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
>  
> -    vdev->regions = g_malloc0(sizeof(VFIORegion) * dev_info.num_regions);
> -    if (!vdev->regions) {
> -            error_report("vfio: Error allocating space for %d regions",
> -                         dev_info.num_regions);
> -            ret = -ENOMEM;
> -            goto error;
> -    }
> -
>      vdev->num_regions = dev_info.num_regions;
>  
>      for (i = 0; i < dev_info.num_regions; i++) {
> @@ -273,6 +502,38 @@ static int vfio_get_device(VFIOGroup *group, const char *name,
>          vdev->regions[i].nr = i;
>      }
>  
> +   /* IRQ */
> +
> +    DPRINTF("Num IRQS: %d\n", dev_info.num_irqs);
> +
> +    vdev->num_irqs = dev_info.num_irqs;
> +
> +    for (i = 0; i < dev_info.num_irqs; i++) {
> +
> +        struct vfio_irq_info irq = { .argsz = sizeof(irq) };
> +
> +        irq.index = i;
> +
> +        DPRINTF("Retrieve IRQ info from vfio platform driver ...\n");
> +
> +        ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);
> +
> +        if (ret) {

why is everything in doublespace here?  group statements by
relevance, e.g., there needs not be a blank line in between a "ret
=" and its successive "if (ret)...".

> +            error_report("vfio: error getting device %s irq info",
> +                         name);
> +            error_printf("vfio: error getting device %s irq info",
> +                         name);

pick one :)

> +        }
> +
> +        DPRINTF("- IRQ index %d: count %d, flags=0x%x\n",
> +                                irq.index,
> +                                irq.count,
> +                                irq.flags);
> +
> +        vfio_enable_intp(vdev, irq.index);
> +
> +    }
> +

that DPRINTF needn't occupy four lines.  Please try and restrict
PRINTFs to approximately the same level as how hw/vfio/pci.c does.
Try to stay close to its coding style, too.

I'm stopping my review here - please clean up and resubmit.

Thanks,

Kim

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [RFC v2 1/6] hw/arm/virt: add a xgmac device
  2014-04-10 13:48     ` Alexander Graf
@ 2014-04-11  5:41       ` Alistair Francis
  2014-04-11  9:21         ` Alexander Graf
  0 siblings, 1 reply; 14+ messages in thread
From: Alistair Francis @ 2014-04-11  5:41 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Peter Crosthwaite, Kim Phillips, eric.auger,
	Eric Auger, Kim Phillips, qemu-devel@nongnu.org Developers,
	Alistair Francis, stuart.yoder, alex.williamson,
	christophe.barnichon, a.motakis, kvmarm, Christoffer Dall

On Thu, Apr 10, 2014 at 11:48 PM, Alexander Graf <agraf@suse.de> wrote:
>
> On 10.04.14 15:26, Peter Crosthwaite wrote:
>>
>> On Thu, Apr 10, 2014 at 1:33 AM, Eric Auger <eric.auger@linaro.org> wrote:
>>>
>>> From: Kim Phillips <kim.phillips@linaro.org>
>>>
>>> This is a hack and only serves as an example of what needs to be
>>> done to make the next RFC - add vfio-platform support - work
>>> for development purposes on a Calxeda Midway system.  We don't want
>>> mach-virt to always create this ethernet device - DO NOT APPLY, etc.
>>>
>>> Initial attempts to convince QEMU to create a memory mapped device
>>> on the command line (e.g., -device vfio-platform,name=fff51000.ethernet)
>>> would fail with "Parameter 'driver' expects pluggable device type".
>>
>> Alistair is working on this. cc.
>
>
> Alaistair, I've had patches tackle this on the mailing list a few months ago
> and received good comments from Anthony on what to change. How far in are
> you already? I'd like to make sure we're on the same page here (and don't
> duplicate work).
>
>
> Alex
>
>

Hey Alex,

I have a patch I'm about to send to the mailing list. It allows an
entire machine to be added by the -device argument on the command
line. It is a similar implementation to my first version (vl.c: Allow
sysbus devices to be attached via commandline), but much more generic.

Could you point me to your patches so I can compare? At the moment I
don't have much feedback to my implementation

Thanks,

Alistair

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [RFC v2 1/6] hw/arm/virt: add a xgmac device
  2014-04-11  5:41       ` Alistair Francis
@ 2014-04-11  9:21         ` Alexander Graf
  0 siblings, 0 replies; 14+ messages in thread
From: Alexander Graf @ 2014-04-11  9:21 UTC (permalink / raw)
  To: Alistair Francis
  Cc: Peter Maydell, Peter Crosthwaite, Kim Phillips, eric.auger,
	Eric Auger, Kim Phillips, qemu-devel@nongnu.org Developers,
	stuart.yoder, alex.williamson, christophe.barnichon, a.motakis,
	kvmarm, Christoffer Dall


On 11.04.14 07:41, Alistair Francis wrote:
> On Thu, Apr 10, 2014 at 11:48 PM, Alexander Graf <agraf@suse.de> wrote:
>> On 10.04.14 15:26, Peter Crosthwaite wrote:
>>> On Thu, Apr 10, 2014 at 1:33 AM, Eric Auger <eric.auger@linaro.org> wrote:
>>>> From: Kim Phillips <kim.phillips@linaro.org>
>>>>
>>>> This is a hack and only serves as an example of what needs to be
>>>> done to make the next RFC - add vfio-platform support - work
>>>> for development purposes on a Calxeda Midway system.  We don't want
>>>> mach-virt to always create this ethernet device - DO NOT APPLY, etc.
>>>>
>>>> Initial attempts to convince QEMU to create a memory mapped device
>>>> on the command line (e.g., -device vfio-platform,name=fff51000.ethernet)
>>>> would fail with "Parameter 'driver' expects pluggable device type".
>>> Alistair is working on this. cc.
>>
>> Alaistair, I've had patches tackle this on the mailing list a few months ago
>> and received good comments from Anthony on what to change. How far in are
>> you already? I'd like to make sure we're on the same page here (and don't
>> duplicate work).
>>
>>
>> Alex
>>
>>
> Hey Alex,
>
> I have a patch I'm about to send to the mailing list. It allows an
> entire machine to be added by the -device argument on the command
> line. It is a similar implementation to my first version (vl.c: Allow
> sysbus devices to be attached via commandline), but much more generic.
>
> Could you point me to your patches so I can compare? At the moment I
> don't have much feedback to my implementation

Sure. It's right here:

https://lists.gnu.org/archive/html/qemu-devel/2013-07/msg03614.html

But as you can see there was quite a bit of discussion on that thread on 
how to do it right.


Alex

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [RFC v2 3/6] vfio: add vfio-platform support
       [not found]   ` <1397832861.3060.93.camel@ul30vt.home>
@ 2014-05-20  6:44     ` Eric Auger
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Auger @ 2014-05-20  6:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: peter.maydell, kim.phillips, eric.auger, Kim Phillips,
	qemu-devel, agraf, stuart.yoder, christophe.barnichon, a.motakis,
	kvmarm, christoffer.dall

On 04/18/2014 04:54 PM, Alex Williamson wrote:
> On Wed, 2014-04-09 at 16:33 +0100, Eric Auger wrote:
>> From: Kim Phillips <kim.phillips@linaro.org>
>>
>> Functions for which PCI and platform device support share are moved
>> into common.c.  The common vfio_{get,put}_group() get an additional
>> argument, a pointer to a vfio_reset_handler(), for which to pass on to
>> qemu_register_reset, but only if it exists (the platform device code
>> currently passes a NULL as its reset_handler).
>>
>> For the platform device code, we basically use SysBusDevice
>> instead of PCIDevice.  Since realize() returns void, unlike
>> PCIDevice's initfn, error codes are moved into the
>> error message text with %m.
>>
>> Currently only MMIO access is supported at this time.
>>
>> The perceived path for future QEMU development is:
>>
>> - add support for interrupts
>> - verify and test platform dev unmap path
>> - test existing PCI path for any regressions
>> - add support for creating platform devices on the qemu command line
>>   - currently device address specification is hardcoded for test
>>     development on Calxeda Midway's fff51000.ethernet device
>> - reset is not supported and registration of reset functions is
>>   bypassed for platform devices.
>>   - there is no standard means of resetting a platform device,
>>     unsure if it suffices to be handled at device--VFIO binding time
>>
>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>>
>> [1] http://www.spinics.net/lists/kvm-arm/msg08195.html
>> ---
>>  hw/vfio/Makefile.objs |   2 +
>>  hw/vfio/common.c      | 486 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/pci.c         | 480 ++-----------------------------------------------
>>  hw/vfio/platform.c    | 381 +++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/vfio-common.h |  55 ++++++
>>  5 files changed, 937 insertions(+), 467 deletions(-)
>>  create mode 100644 hw/vfio/common.c
>>  create mode 100644 hw/vfio/platform.c
>>  create mode 100644 hw/vfio/vfio-common.h
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index 31c7dab..c5c76fe 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -1,3 +1,5 @@
>>  ifeq ($(CONFIG_LINUX), y)
>> +obj-$(CONFIG_SOFTMMU) += common.o
>>  obj-$(CONFIG_PCI) += pci.o
>> +obj-$(CONFIG_SOFTMMU) += platform.o
>>  endif
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> new file mode 100644
>> index 0000000..9d1f723
>> --- /dev/null
>> +++ b/hw/vfio/common.c
>> @@ -0,0 +1,486 @@
>> +/*
>> + * vfio based device assignment support
>> + *
>> + * Copyright Red Hat, Inc. 2012
>> + *
>> + * Authors:
>> + *  Alex Williamson <alex.williamson@redhat.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + *
>> + * Based on qemu-kvm device-assignment:
>> + *  Adapted for KVM by Qumranet.
>> + *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
>> + *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
>> + *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
>> + *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
>> + *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
>> + */
>> +
>> +#include <dirent.h>
>> +#include <linux/vfio.h>
>> +#include <sys/ioctl.h>
>> +#include <sys/mman.h>
>> +#include <sys/stat.h>
>> +#include <sys/types.h>
>> +#include <unistd.h>
>> +
>> +#include "config.h"
>> +#include "exec/address-spaces.h"
>> +#include "exec/memory.h"
>> +#include "hw/pci/msi.h"
>> +#include "hw/pci/msix.h"
>> +#include "hw/pci/pci.h"
> 
> I expect these pci includes aren't really needed

Hi Alex,

thank you for your review and sorry for this reply's delay. I am going
to maintain & update Kim's patch from now on. I am currently preparing
v3 taking into account your comments and Kim's ones.

The headers you mentioned could be removed. Only the following ones are
needed:

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include "sys/mman.h"
#include "exec/address-spaces.h"
#include "qemu/error-report.h"
#include "sysemu/kvm.h"
#include "vfio-common.h"

> 
>> +#include "qemu-common.h"
>> +#include "qemu/error-report.h"
>> +#include "qemu/event_notifier.h"
>> +#include "qemu/queue.h"
>> +#include "qemu/range.h"
>> +#include "sysemu/kvm.h"
>> +#include "sysemu/sysemu.h"
>> +
>> +#include "vfio-common.h"
>> +
>> +#define DEBUG_VFIO
>> +#ifdef DEBUG_VFIO
>> +#define DPRINTF(fmt, ...) \
>> +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
>> +#else
>> +#define DPRINTF(fmt, ...) \
>> +    do { } while (0)
>> +#endif
> 
> DEBUG_VFIO should probably be in vfio-common.h
done
> 
>> +
>> +static QLIST_HEAD(, VFIOContainer)
>> +    container_list = QLIST_HEAD_INITIALIZER(container_list);
>> +
>> +QLIST_HEAD(, VFIOGroup)
>> +    group_list = QLIST_HEAD_INITIALIZER(group_list);
>> +
>> +
>> +struct VFIODevice;
> 
> I don't see where this is needed.
removed
> 
>> +
>> +#ifdef CONFIG_KVM
>> +/*
>> + * We have a single VFIO pseudo device per KVM VM.  Once created it lives
>> + * for the life of the VM.  Closing the file descriptor only drops our
>> + * reference to it and the device's reference to kvm.  Therefore once
>> + * initialized, this file descriptor is only released on QEMU exit and
>> + * we'll re-use it should another vfio device be attached before then.
>> + */
>> +static int vfio_kvm_device_fd = -1;
>> +#endif
>> +
>> +/*
>> + * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>> + */
>> +static int vfio_dma_unmap(VFIOContainer *container,
>> +                          hwaddr iova, ram_addr_t size)
>> +{
>> +    struct vfio_iommu_type1_dma_unmap unmap = {
>> +        .argsz = sizeof(unmap),
>> +        .flags = 0,
>> +        .iova = iova,
>> +        .size = size,
>> +    };
>> +
>> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>> +        DPRINTF("VFIO_UNMAP_DMA: %d\n", -errno);
>> +        return -errno;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>> +                        ram_addr_t size, void *vaddr, bool readonly)
>> +{
>> +    struct vfio_iommu_type1_dma_map map = {
>> +        .argsz = sizeof(map),
>> +        .flags = VFIO_DMA_MAP_FLAG_READ,
>> +        .vaddr = (__u64)(uintptr_t)vaddr,
>> +        .iova = iova,
>> +        .size = size,
>> +    };
>> +
>> +    if (!readonly) {
>> +        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>> +    }
>> +
>> +    /*
>> +     * Try the mapping, if it fails with EBUSY, unmap the region and try
>> +     * again.  This shouldn't be necessary, but we sometimes see it in
>> +     * the the VGA ROM space.
>> +     */
>> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
>> +        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
>> +         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>> +        return 0;
>> +    }
>> +
>> +    DPRINTF("VFIO_MAP_DMA: %d\n", -errno);
>> +    return -errno;
>> +}
>> +
>> +static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>> +{
>> +    return !memory_region_is_ram(section->mr) ||
>> +           /*
>> +            * Sizing an enabled 64-bit BAR can cause spurious mappings to
>> +            * addresses in the upper part of the 64-bit address space.  These
>> +            * are never accessed by the CPU and beyond the address width of
>> +            * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
>> +            */
>> +           section->offset_within_address_space & (1ULL << 63);
>> +}
>> +
>> +static void vfio_listener_region_add(MemoryListener *listener,
>> +                                     MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            iommu_data.type1.listener);
>> +    hwaddr iova, end;
>> +    void *vaddr;
>> +    int ret;
>> +
>> +    assert(!memory_region_is_iommu(section->mr));
>> +
>> +    if (vfio_listener_skipped_section(section)) {
>> +        DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
>> +                section->offset_within_address_space,
>> +                section->offset_within_address_space +
>> +                int128_get64(int128_sub(section->size, int128_one())));
>> +        return;
>> +    }
>> +
>> +    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
>> +                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>> +
>> +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> +    end = (section->offset_within_address_space + int128_get64(section->size)) &
>> +          TARGET_PAGE_MASK;
>> +
>> +    if (iova >= end) {
>> +        return;
>> +    }
>> +
>> +    vaddr = memory_region_get_ram_ptr(section->mr) +
>> +            section->offset_within_region +
>> +            (iova - section->offset_within_address_space);
>> +
>> +    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
>> +            iova, end - 1, vaddr);
>> +
>> +    memory_region_ref(section->mr);
>> +    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
>> +    if (ret) {
>> +        error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>> +                     "0x%"HWADDR_PRIx", %p) = %d (%m)",
>> +                     container, iova, end - iova, vaddr, ret);
>> +
>> +        /*
>> +         * On the initfn path, store the first error in the container so we
>> +         * can gracefully fail.  Runtime, there's not much we can do other
>> +         * than throw a hardware error.
>> +         */
>> +        if (!container->iommu_data.type1.initialized) {
>> +            if (!container->iommu_data.type1.error) {
>> +                container->iommu_data.type1.error = ret;
>> +            }
>> +        } else {
>> +            hw_error("vfio: DMA mapping failed, unable to continue");
>> +        }
>> +    }
>> +}
>> +
>> +static void vfio_listener_region_del(MemoryListener *listener,
>> +                                     MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            iommu_data.type1.listener);
>> +    hwaddr iova, end;
>> +    int ret;
>> +
>> +    if (vfio_listener_skipped_section(section)) {
>> +        DPRINTF("SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n",
>> +                section->offset_within_address_space,
>> +                section->offset_within_address_space +
>> +                int128_get64(int128_sub(section->size, int128_one())));
>> +        return;
>> +    }
>> +
>> +    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
>> +                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>> +
>> +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> +    end = (section->offset_within_address_space + int128_get64(section->size)) &
>> +          TARGET_PAGE_MASK;
>> +
>> +    if (iova >= end) {
>> +        return;
>> +    }
>> +
>> +    DPRINTF("region_del %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>> +            iova, end - 1);
>> +
>> +    ret = vfio_dma_unmap(container, iova, end - iova);
>> +    memory_region_unref(section->mr);
>> +    if (ret) {
>> +        error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> +                     "0x%"HWADDR_PRIx") = %d (%m)",
>> +                     container, iova, end - iova, ret);
>> +    }
>> +}
>> +
>> +static MemoryListener vfio_memory_listener = {
>> +    .region_add = vfio_listener_region_add,
>> +    .region_del = vfio_listener_region_del,
>> +};
>> +
>> +static void vfio_listener_release(VFIOContainer *container)
>> +{
>> +    memory_listener_unregister(&container->iommu_data.type1.listener);
>> +}
>> +
>> +static void vfio_kvm_device_add_group(VFIOGroup *group)
>> +{
>> +#ifdef CONFIG_KVM
>> +    struct kvm_device_attr attr = {
>> +        .group = KVM_DEV_VFIO_GROUP,
>> +        .attr = KVM_DEV_VFIO_GROUP_ADD,
>> +        .addr = (uint64_t)(unsigned long)&group->fd,
>> +    };
>> +
>> +    if (!kvm_enabled()) {
>> +        return;
>> +    }
>> +
>> +    if (vfio_kvm_device_fd < 0) {
>> +        struct kvm_create_device cd = {
>> +            .type = KVM_DEV_TYPE_VFIO,
>> +        };
>> +
>> +        if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
>> +            DPRINTF("KVM_CREATE_DEVICE: %m\n");
>> +            return;
>> +        }
>> +
>> +        vfio_kvm_device_fd = cd.fd;
>> +    }
>> +
>> +    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>> +        error_report("Failed to add group %d to KVM VFIO device: %m",
>> +                     group->groupid);
>> +    }
>> +#endif
>> +}
>> +
>> +static void vfio_kvm_device_del_group(VFIOGroup *group)
>> +{
>> +#ifdef CONFIG_KVM
>> +    struct kvm_device_attr attr = {
>> +        .group = KVM_DEV_VFIO_GROUP,
>> +        .attr = KVM_DEV_VFIO_GROUP_DEL,
>> +        .addr = (uint64_t)(unsigned long)&group->fd,
>> +    };
>> +
>> +    if (vfio_kvm_device_fd < 0) {
>> +        return;
>> +    }
>> +
>> +    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>> +        error_report("Failed to remove group %d from KVM VFIO device: %m",
>> +                     group->groupid);
>> +    }
>> +#endif
>> +}
>> +
>> +static int vfio_connect_container(VFIOGroup *group)
>> +{
>> +    VFIOContainer *container;
>> +    int ret, fd;
>> +
>> +    if (group->container) {
>> +        return 0;
>> +    }
>> +
>> +    QLIST_FOREACH(container, &container_list, next) {
>> +        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> +            group->container = container;
>> +            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> +            return 0;
>> +        }
>> +    }
>> +
>> +    fd = qemu_open("/dev/vfio/vfio", O_RDWR);
>> +    if (fd < 0) {
>> +        error_report("vfio: failed to open /dev/vfio/vfio: %m");
>> +        return -errno;
>> +    }
>> +
>> +    ret = ioctl(fd, VFIO_GET_API_VERSION);
>> +    if (ret != VFIO_API_VERSION) {
>> +        error_report("vfio: supported vfio version: %d, "
>> +                     "reported version: %d", VFIO_API_VERSION, ret);
>> +        close(fd);
>> +        return -EINVAL;
>> +    }
>> +
>> +    container = g_malloc0(sizeof(*container));
>> +    container->fd = fd;
>> +
>> +    if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
>> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>> +        if (ret) {
>> +            error_report("vfio: failed to set group container: %m");
>> +            g_free(container);
>> +            close(fd);
>> +            return -errno;
>> +        }
>> +
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
>> +        if (ret) {
>> +            error_report("vfio: failed to set iommu for container: %m");
>> +            g_free(container);
>> +            close(fd);
>> +            return -errno;
>> +        }
>> +
>> +        container->iommu_data.type1.listener = vfio_memory_listener;
>> +        container->iommu_data.release = vfio_listener_release;
>> +
>> +        memory_listener_register(&container->iommu_data.type1.listener,
>> +                                 &address_space_memory);
>> +
>> +        if (container->iommu_data.type1.error) {
>> +            ret = container->iommu_data.type1.error;
>> +            vfio_listener_release(container);
>> +            g_free(container);
>> +            close(fd);
>> +            error_report("vfio: memory listener initialization failed for container");
>> +            return ret;
>> +        }
>> +
>> +        container->iommu_data.type1.initialized = true;
>> +
>> +    } else {
>> +        error_report("vfio: No available IOMMU models");
>> +        g_free(container);
>> +        close(fd);
>> +        return -EINVAL;
>> +    }
>> +
>> +    QLIST_INIT(&container->group_list);
>> +    QLIST_INSERT_HEAD(&container_list, container, next);
>> +
>> +    group->container = container;
>> +    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> +
>> +    return 0;
>> +}
>> +
>> +static void vfio_disconnect_container(VFIOGroup *group)
>> +{
>> +    VFIOContainer *container = group->container;
>> +
>> +    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
>> +        error_report("vfio: error disconnecting group %d from container",
>> +                     group->groupid);
>> +    }
>> +
>> +    QLIST_REMOVE(group, container_next);
>> +    group->container = NULL;
>> +
>> +    if (QLIST_EMPTY(&container->group_list)) {
>> +        if (container->iommu_data.release) {
>> +            container->iommu_data.release(container);
>> +        }
>> +        QLIST_REMOVE(container, next);
>> +        DPRINTF("vfio_disconnect_container: close container->fd\n");
>> +        close(container->fd);
>> +        g_free(container);
>> +    }
>> +}
>> +
>> +VFIOGroup *vfio_get_group(int groupid, QEMUResetHandler *reset_handler)
>> +{
>> +    VFIOGroup *group;
>> +    char path[32];
>> +    struct vfio_group_status status = { .argsz = sizeof(status) };
>> +
>> +    QLIST_FOREACH(group, &group_list, next) {
>> +        if (group->groupid == groupid) {
>> +            return group;
>> +        }
>> +    }
>> +
>> +    group = g_malloc0(sizeof(*group));
>> +
>> +    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>> +    group->fd = qemu_open(path, O_RDWR);
>> +    if (group->fd < 0) {
>> +        error_report("vfio: error opening %s: %m", path);
>> +        g_free(group);
>> +        return NULL;
>> +    }
>> +
>> +    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
>> +        error_report("vfio: error getting group status: %m");
>> +        close(group->fd);
>> +        g_free(group);
>> +        return NULL;
>> +    }
>> +
>> +    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
>> +        error_report("vfio: error, group %d is not viable, please ensure "
>> +                     "all devices within the iommu_group are bound to their "
>> +                     "vfio bus driver.", groupid);
>> +        close(group->fd);
>> +        g_free(group);
>> +        return NULL;
>> +    }
>> +
>> +    group->groupid = groupid;
>> +    QLIST_INIT(&group->device_list);
>> +
>> +    if (vfio_connect_container(group)) {
>> +        error_report("vfio: failed to setup container for group %d", groupid);
>> +        close(group->fd);
>> +        g_free(group);
>> +        return NULL;
>> +    }
>> +
>> +    if (QLIST_EMPTY(&group_list) && reset_handler) {
>> +        qemu_register_reset(reset_handler, NULL);
>> +    }
>> +
>> +    QLIST_INSERT_HEAD(&group_list, group, next);
>> +
>> +    vfio_kvm_device_add_group(group);
>> +
>> +    return group;
>> +}
>> +
>> +void vfio_put_group(VFIOGroup *group, QEMUResetHandler *reset_handler)
>> +{
>> +    if (!QLIST_EMPTY(&group->device_list)) {
>> +        return;
>> +    }
>> +
>> +    vfio_kvm_device_del_group(group);
>> +    vfio_disconnect_container(group);
>> +    QLIST_REMOVE(group, next);
>> +    DPRINTF("vfio_put_group: close group->fd\n");
>> +    close(group->fd);
>> +    g_free(group);
>> +
>> +    if (QLIST_EMPTY(&group_list) && reset_handler) {
>> +        qemu_unregister_reset(reset_handler, NULL);
>> +    }
> 
> The reset_handler stuff needs work.  We can theoretically support both
> PCI and platform devices simultaneously, but this would only allow one
> to register a reset handler and one to remove it (which may not be the
> same one). 

My understanding is there is a global reset handler for all VFIO
devices, registered before the first group is added and unregistered
when the last group is removed from the group list.

This global reset handler, currently named vfio_pci_reset_handler
handles the reset of all (PCI) devices in all groups.

One possibility is to have a derived reset method for PCI class
(currently vfio_pci_hot_reset_multi) and platform class. As such we
could keep the unique global reset handler while handling both PCI wide
reset handling and platform wide reset handling.

The limitation of this is that user-side can't have platform device
specific reset handlers which may be needed in the future. In case this
were needed we would need to register many reset_handlers, possibly per
VFIO platform device. I suggest this would be the topic of a different
patch. What do you think?

> 
>> +}
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 9cf5b84..9e70d68 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -1,5 +1,5 @@
>>  /*
>> - * vfio based device assignment support
>> + * vfio based device assignment support - PCI devices
>>   *
>>   * Copyright Red Hat, Inc. 2012
>>   *
>> @@ -40,6 +40,8 @@
>>  #include "sysemu/kvm.h"
>>  #include "sysemu/sysemu.h"
> 
> There were probably some includes that could be cleaned out with the
> split.

Done

>  
>> +#include "vfio-common.h"
>> +
>>  /* #define DEBUG_VFIO */
>>  #ifdef DEBUG_VFIO
>>  #define DPRINTF(fmt, ...) \
> 
> DEBUG_VFIO is duplicated here... move to vfio-common.h
ok done
> 
>> @@ -55,6 +57,8 @@
>>  #define VFIO_ALLOW_KVM_MSI 1
>>  #define VFIO_ALLOW_KVM_MSIX 1
>>  
>> +extern QLIST_HEAD(, VFIOGroup) group_list;
> 
> Define in vfio-common.h?
Done
> 
>> +
>>  struct VFIODevice;
>>  
>>  typedef struct VFIOQuirk {
>> @@ -135,25 +139,6 @@ enum {
>>  
>>  struct VFIOGroup;
>>  
>> -typedef struct VFIOType1 {
>> -    MemoryListener listener;
>> -    int error;
>> -    bool initialized;
>> -} VFIOType1;
>> -
>> -typedef struct VFIOContainer {
>> -    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> -    struct {
>> -        /* enable abstraction to support various iommu backends */
>> -        union {
>> -            VFIOType1 type1;
>> -        };
>> -        void (*release)(struct VFIOContainer *);
>> -    } iommu_data;
>> -    QLIST_HEAD(, VFIOGroup) group_list;
>> -    QLIST_ENTRY(VFIOContainer) next;
>> -} VFIOContainer;
>> -
>>  /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
>>  typedef struct VFIOMSIXInfo {
>>      uint8_t table_bar;
>> @@ -200,15 +185,6 @@ typedef struct VFIODevice {
>>      bool rom_read_failed;
>>  } VFIODevice;
>>  
>> -typedef struct VFIOGroup {
>> -    int fd;
>> -    int groupid;
>> -    VFIOContainer *container;
>> -    QLIST_HEAD(, VFIODevice) device_list;
>> -    QLIST_ENTRY(VFIOGroup) next;
>> -    QLIST_ENTRY(VFIOGroup) container_next;
>> -} VFIOGroup;
>> -
>>  typedef struct VFIORomBlacklistEntry {
>>      uint16_t vendor_id;
>>      uint16_t device_id;
>> @@ -234,23 +210,6 @@ static const VFIORomBlacklistEntry romblacklist[] = {
>>  
>>  #define MSIX_CAP_LENGTH 12
>>  
>> -static QLIST_HEAD(, VFIOContainer)
>> -    container_list = QLIST_HEAD_INITIALIZER(container_list);
>> -
>> -static QLIST_HEAD(, VFIOGroup)
>> -    group_list = QLIST_HEAD_INITIALIZER(group_list);
>> -
>> -#ifdef CONFIG_KVM
>> -/*
>> - * We have a single VFIO pseudo device per KVM VM.  Once created it lives
>> - * for the life of the VM.  Closing the file descriptor only drops our
>> - * reference to it and the device's reference to kvm.  Therefore once
>> - * initialized, this file descriptor is only released on QEMU exit and
>> - * we'll re-use it should another vfio device be attached before then.
>> - */
>> -static int vfio_kvm_device_fd = -1;
>> -#endif
>> -
>>  static void vfio_disable_interrupts(VFIODevice *vdev);
>>  static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>>  static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
>> @@ -2180,183 +2139,6 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
>>  }
>>  
>>  /*
>> - * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>> - */
>> -static int vfio_dma_unmap(VFIOContainer *container,
>> -                          hwaddr iova, ram_addr_t size)
>> -{
>> -    struct vfio_iommu_type1_dma_unmap unmap = {
>> -        .argsz = sizeof(unmap),
>> -        .flags = 0,
>> -        .iova = iova,
>> -        .size = size,
>> -    };
>> -
>> -    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>> -        DPRINTF("VFIO_UNMAP_DMA: %d\n", -errno);
>> -        return -errno;
>> -    }
>> -
>> -    return 0;
>> -}
>> -
>> -static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>> -                        ram_addr_t size, void *vaddr, bool readonly)
>> -{
>> -    struct vfio_iommu_type1_dma_map map = {
>> -        .argsz = sizeof(map),
>> -        .flags = VFIO_DMA_MAP_FLAG_READ,
>> -        .vaddr = (__u64)(uintptr_t)vaddr,
>> -        .iova = iova,
>> -        .size = size,
>> -    };
>> -
>> -    if (!readonly) {
>> -        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>> -    }
>> -
>> -    /*
>> -     * Try the mapping, if it fails with EBUSY, unmap the region and try
>> -     * again.  This shouldn't be necessary, but we sometimes see it in
>> -     * the the VGA ROM space.
>> -     */
>> -    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
>> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
>> -         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>> -        return 0;
>> -    }
>> -
>> -    DPRINTF("VFIO_MAP_DMA: %d\n", -errno);
>> -    return -errno;
>> -}
>> -
>> -static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>> -{
>> -    return !memory_region_is_ram(section->mr) ||
>> -           /*
>> -            * Sizing an enabled 64-bit BAR can cause spurious mappings to
>> -            * addresses in the upper part of the 64-bit address space.  These
>> -            * are never accessed by the CPU and beyond the address width of
>> -            * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
>> -            */
>> -           section->offset_within_address_space & (1ULL << 63);
>> -}
>> -
>> -static void vfio_listener_region_add(MemoryListener *listener,
>> -                                     MemoryRegionSection *section)
>> -{
>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>> -                                            iommu_data.type1.listener);
>> -    hwaddr iova, end;
>> -    void *vaddr;
>> -    int ret;
>> -
>> -    assert(!memory_region_is_iommu(section->mr));
>> -
>> -    if (vfio_listener_skipped_section(section)) {
>> -        DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
>> -                section->offset_within_address_space,
>> -                section->offset_within_address_space +
>> -                int128_get64(int128_sub(section->size, int128_one())));
>> -        return;
>> -    }
>> -
>> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
>> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
>> -        error_report("%s received unaligned region", __func__);
>> -        return;
>> -    }
>> -
>> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> -    end = (section->offset_within_address_space + int128_get64(section->size)) &
>> -          TARGET_PAGE_MASK;
>> -
>> -    if (iova >= end) {
>> -        return;
>> -    }
>> -
>> -    vaddr = memory_region_get_ram_ptr(section->mr) +
>> -            section->offset_within_region +
>> -            (iova - section->offset_within_address_space);
>> -
>> -    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
>> -            iova, end - 1, vaddr);
>> -
>> -    memory_region_ref(section->mr);
>> -    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
>> -    if (ret) {
>> -        error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>> -                     "0x%"HWADDR_PRIx", %p) = %d (%m)",
>> -                     container, iova, end - iova, vaddr, ret);
>> -
>> -        /*
>> -         * On the initfn path, store the first error in the container so we
>> -         * can gracefully fail.  Runtime, there's not much we can do other
>> -         * than throw a hardware error.
>> -         */
>> -        if (!container->iommu_data.type1.initialized) {
>> -            if (!container->iommu_data.type1.error) {
>> -                container->iommu_data.type1.error = ret;
>> -            }
>> -        } else {
>> -            hw_error("vfio: DMA mapping failed, unable to continue");
>> -        }
>> -    }
>> -}
>> -
>> -static void vfio_listener_region_del(MemoryListener *listener,
>> -                                     MemoryRegionSection *section)
>> -{
>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>> -                                            iommu_data.type1.listener);
>> -    hwaddr iova, end;
>> -    int ret;
>> -
>> -    if (vfio_listener_skipped_section(section)) {
>> -        DPRINTF("SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n",
>> -                section->offset_within_address_space,
>> -                section->offset_within_address_space +
>> -                int128_get64(int128_sub(section->size, int128_one())));
>> -        return;
>> -    }
>> -
>> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
>> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
>> -        error_report("%s received unaligned region", __func__);
>> -        return;
>> -    }
>> -
>> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> -    end = (section->offset_within_address_space + int128_get64(section->size)) &
>> -          TARGET_PAGE_MASK;
>> -
>> -    if (iova >= end) {
>> -        return;
>> -    }
>> -
>> -    DPRINTF("region_del %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>> -            iova, end - 1);
>> -
>> -    ret = vfio_dma_unmap(container, iova, end - iova);
>> -    memory_region_unref(section->mr);
>> -    if (ret) {
>> -        error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> -                     "0x%"HWADDR_PRIx") = %d (%m)",
>> -                     container, iova, end - iova, ret);
>> -    }
>> -}
>> -
>> -static MemoryListener vfio_memory_listener = {
>> -    .region_add = vfio_listener_region_add,
>> -    .region_del = vfio_listener_region_del,
>> -};
>> -
>> -static void vfio_listener_release(VFIOContainer *container)
>> -{
>> -    memory_listener_unregister(&container->iommu_data.type1.listener);
>> -}
>> -
>> -/*
>>   * Interrupt setup
>>   */
>>  static void vfio_disable_interrupts(VFIODevice *vdev)
>> @@ -3221,244 +3003,8 @@ static void vfio_pci_reset_handler(void *opaque)
>>      }
>>  }
>>  
>> -static void vfio_kvm_device_add_group(VFIOGroup *group)
>> -{
>> -#ifdef CONFIG_KVM
>> -    struct kvm_device_attr attr = {
>> -        .group = KVM_DEV_VFIO_GROUP,
>> -        .attr = KVM_DEV_VFIO_GROUP_ADD,
>> -        .addr = (uint64_t)(unsigned long)&group->fd,
>> -    };
>> -
>> -    if (!kvm_enabled()) {
>> -        return;
>> -    }
>> -
>> -    if (vfio_kvm_device_fd < 0) {
>> -        struct kvm_create_device cd = {
>> -            .type = KVM_DEV_TYPE_VFIO,
>> -        };
>> -
>> -        if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
>> -            DPRINTF("KVM_CREATE_DEVICE: %m\n");
>> -            return;
>> -        }
>> -
>> -        vfio_kvm_device_fd = cd.fd;
>> -    }
>> -
>> -    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>> -        error_report("Failed to add group %d to KVM VFIO device: %m",
>> -                     group->groupid);
>> -    }
>> -#endif
>> -}
>> -
>> -static void vfio_kvm_device_del_group(VFIOGroup *group)
>> -{
>> -#ifdef CONFIG_KVM
>> -    struct kvm_device_attr attr = {
>> -        .group = KVM_DEV_VFIO_GROUP,
>> -        .attr = KVM_DEV_VFIO_GROUP_DEL,
>> -        .addr = (uint64_t)(unsigned long)&group->fd,
>> -    };
>> -
>> -    if (vfio_kvm_device_fd < 0) {
>> -        return;
>> -    }
>> -
>> -    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>> -        error_report("Failed to remove group %d from KVM VFIO device: %m",
>> -                     group->groupid);
>> -    }
>> -#endif
>> -}
>> -
>> -static int vfio_connect_container(VFIOGroup *group)
>> -{
>> -    VFIOContainer *container;
>> -    int ret, fd;
>> -
>> -    if (group->container) {
>> -        return 0;
>> -    }
>> -
>> -    QLIST_FOREACH(container, &container_list, next) {
>> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> -            group->container = container;
>> -            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> -            return 0;
>> -        }
>> -    }
>> -
>> -    fd = qemu_open("/dev/vfio/vfio", O_RDWR);
>> -    if (fd < 0) {
>> -        error_report("vfio: failed to open /dev/vfio/vfio: %m");
>> -        return -errno;
>> -    }
>> -
>> -    ret = ioctl(fd, VFIO_GET_API_VERSION);
>> -    if (ret != VFIO_API_VERSION) {
>> -        error_report("vfio: supported vfio version: %d, "
>> -                     "reported version: %d", VFIO_API_VERSION, ret);
>> -        close(fd);
>> -        return -EINVAL;
>> -    }
>> -
>> -    container = g_malloc0(sizeof(*container));
>> -    container->fd = fd;
>> -
>> -    if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
>> -        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>> -        if (ret) {
>> -            error_report("vfio: failed to set group container: %m");
>> -            g_free(container);
>> -            close(fd);
>> -            return -errno;
>> -        }
>> -
>> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
>> -        if (ret) {
>> -            error_report("vfio: failed to set iommu for container: %m");
>> -            g_free(container);
>> -            close(fd);
>> -            return -errno;
>> -        }
>> -
>> -        container->iommu_data.type1.listener = vfio_memory_listener;
>> -        container->iommu_data.release = vfio_listener_release;
>> -
>> -        memory_listener_register(&container->iommu_data.type1.listener,
>> -                                 &address_space_memory);
>> -
>> -        if (container->iommu_data.type1.error) {
>> -            ret = container->iommu_data.type1.error;
>> -            vfio_listener_release(container);
>> -            g_free(container);
>> -            close(fd);
>> -            error_report("vfio: memory listener initialization failed for container");
>> -            return ret;
>> -        }
>> -
>> -        container->iommu_data.type1.initialized = true;
>> -
>> -    } else {
>> -        error_report("vfio: No available IOMMU models");
>> -        g_free(container);
>> -        close(fd);
>> -        return -EINVAL;
>> -    }
>> -
>> -    QLIST_INIT(&container->group_list);
>> -    QLIST_INSERT_HEAD(&container_list, container, next);
>> -
>> -    group->container = container;
>> -    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> -
>> -    return 0;
>> -}
>> -
>> -static void vfio_disconnect_container(VFIOGroup *group)
>> -{
>> -    VFIOContainer *container = group->container;
>> -
>> -    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
>> -        error_report("vfio: error disconnecting group %d from container",
>> -                     group->groupid);
>> -    }
>> -
>> -    QLIST_REMOVE(group, container_next);
>> -    group->container = NULL;
>> -
>> -    if (QLIST_EMPTY(&container->group_list)) {
>> -        if (container->iommu_data.release) {
>> -            container->iommu_data.release(container);
>> -        }
>> -        QLIST_REMOVE(container, next);
>> -        DPRINTF("vfio_disconnect_container: close container->fd\n");
>> -        close(container->fd);
>> -        g_free(container);
>> -    }
>> -}
>> -
>> -static VFIOGroup *vfio_get_group(int groupid)
>> -{
>> -    VFIOGroup *group;
>> -    char path[32];
>> -    struct vfio_group_status status = { .argsz = sizeof(status) };
>> -
>> -    QLIST_FOREACH(group, &group_list, next) {
>> -        if (group->groupid == groupid) {
>> -            return group;
>> -        }
>> -    }
>> -
>> -    group = g_malloc0(sizeof(*group));
>> -
>> -    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>> -    group->fd = qemu_open(path, O_RDWR);
>> -    if (group->fd < 0) {
>> -        error_report("vfio: error opening %s: %m", path);
>> -        g_free(group);
>> -        return NULL;
>> -    }
>> -
>> -    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
>> -        error_report("vfio: error getting group status: %m");
>> -        close(group->fd);
>> -        g_free(group);
>> -        return NULL;
>> -    }
>> -
>> -    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
>> -        error_report("vfio: error, group %d is not viable, please ensure "
>> -                     "all devices within the iommu_group are bound to their "
>> -                     "vfio bus driver.", groupid);
>> -        close(group->fd);
>> -        g_free(group);
>> -        return NULL;
>> -    }
>> -
>> -    group->groupid = groupid;
>> -    QLIST_INIT(&group->device_list);
>> -
>> -    if (vfio_connect_container(group)) {
>> -        error_report("vfio: failed to setup container for group %d", groupid);
>> -        close(group->fd);
>> -        g_free(group);
>> -        return NULL;
>> -    }
>> -
>> -    if (QLIST_EMPTY(&group_list)) {
>> -        qemu_register_reset(vfio_pci_reset_handler, NULL);
>> -    }
>> -
>> -    QLIST_INSERT_HEAD(&group_list, group, next);
>> -
>> -    vfio_kvm_device_add_group(group);
>> -
>> -    return group;
>> -}
>> -
>> -static void vfio_put_group(VFIOGroup *group)
>> -{
>> -    if (!QLIST_EMPTY(&group->device_list)) {
>> -        return;
>> -    }
>> -
>> -    vfio_kvm_device_del_group(group);
>> -    vfio_disconnect_container(group);
>> -    QLIST_REMOVE(group, next);
>> -    DPRINTF("vfio_put_group: close group->fd\n");
>> -    close(group->fd);
>> -    g_free(group);
>> -
>> -    if (QLIST_EMPTY(&group_list)) {
>> -        qemu_unregister_reset(vfio_pci_reset_handler, NULL);
>> -    }
>> -}
>> -
>> -static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev)
>> +static int vfio_get_device(VFIOGroup *group, const char *name,
>> +                           struct VFIODevice *vdev)
> 
> Why?
ok reverted
> 
>>  {
>>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>      struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
>> @@ -3485,7 +3031,7 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev)
>>          goto error;
>>      }
>>  
>> -    DPRINTF("Device %s flags: %u, regions: %u, irgs: %u\n", name,
>> +    DPRINTF("Device %s flags: %u, regions: %u, irqs: %u\n", name,
> 
> ???
type correction: irgs -> irqs ;-)
> 
>>              dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
>>  
>>      if (!(dev_info.flags & VFIO_DEVICE_FLAGS_PCI)) {
>> @@ -3768,7 +3314,7 @@ static int vfio_initfn(PCIDevice *pdev)
>>      DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
>>              vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
>>  
>> -    group = vfio_get_group(groupid);
>> +    group = vfio_get_group(groupid, vfio_pci_reset_handler);
>>      if (!group) {
>>          error_report("vfio: failed to get group %d", groupid);
>>          return -ENOENT;
>> @@ -3785,7 +3331,7 @@ static int vfio_initfn(PCIDevice *pdev)
>>              pvdev->host.function == vdev->host.function) {
>>  
>>              error_report("vfio: error: device %s is already attached", path);
>> -            vfio_put_group(group);
>> +            vfio_put_group(group, vfio_pci_reset_handler);
>>              return -EBUSY;
>>          }
>>      }
>> @@ -3793,7 +3339,7 @@ static int vfio_initfn(PCIDevice *pdev)
>>      ret = vfio_get_device(group, path, vdev);
>>      if (ret) {
>>          error_report("vfio: failed to get device %s", path);
>> -        vfio_put_group(group);
>> +        vfio_put_group(group, vfio_pci_reset_handler);
>>          return ret;
>>      }
>>  
>> @@ -3879,7 +3425,7 @@ out_teardown:
>>  out_put:
>>      g_free(vdev->emulated_config_bits);
>>      vfio_put_device(vdev);
>> -    vfio_put_group(group);
>> +    vfio_put_group(group, vfio_pci_reset_handler);
>>      return ret;
>>  }
>>  
>> @@ -3899,7 +3445,7 @@ static void vfio_exitfn(PCIDevice *pdev)
>>      g_free(vdev->emulated_config_bits);
>>      g_free(vdev->rom);
>>      vfio_put_device(vdev);
>> -    vfio_put_group(group);
>> +    vfio_put_group(group, vfio_pci_reset_handler);
>>  }
>>  
>>  static void vfio_pci_reset(DeviceState *dev)
>> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
>> new file mode 100644
>> index 0000000..138fb13
>> --- /dev/null
>> +++ b/hw/vfio/platform.c
>> @@ -0,0 +1,381 @@
>> +/*
>> + * vfio based device assignment support - platform devices
>> + *
>> + * Copyright Linaro Limited, 2014
>> + *
>> + * Authors:
>> + *  Kim Phillips <kim.phillips@linaro.org>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + *
>> + * Based on vfio based PCI device assignment support:
>> + *  Copyright Red Hat, Inc. 2012
>> + */
>> +
>> +#include <dirent.h>
>> +#include <linux/vfio.h>
>> +#include <sys/ioctl.h>
>> +#include <sys/mman.h>
>> +#include <sys/stat.h>
>> +#include <sys/types.h>
>> +#include <unistd.h>
>> +
>> +#include "config.h"
>> +#include "exec/address-spaces.h"
>> +#include "exec/memory.h"
>> +#include "qemu-common.h"
>> +#include "qemu/error-report.h"
>> +#include "qemu/event_notifier.h"
>> +#include "qemu/queue.h"
>> +#include "qemu/range.h"
>> +#include "sysemu/kvm.h"
>> +#include "sysemu/sysemu.h"
>> +#include "hw/qdev-properties.h"
>> +#include "migration/vmstate.h"
>> +#include "hw/hw.h"
>> +#include "hw/sysbus.h"
>> +
>> +#include "vfio-common.h"
>> +
>> +#define DEBUG_VFIO
>> +#ifdef DEBUG_VFIO
>> +#define DPRINTF(fmt, ...) \
>> +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
>> +#else
>> +#define DPRINTF(fmt, ...) \
>> +    do { } while (0)
>> +#endif
>> +
>> +/* Extra debugging, trap acceleration paths for more logging */
>> +#define VFIO_ALLOW_MMAP 1
> 
> Maybe there should be a common debug section in vfio-common.h?
> 
done
>> +
>> +#define TYPE_VFIO_PLATFORM "vfio-platform"
>> +
>> +typedef struct VFIORegion {
>> +    off_t fd_offset; /* offset of region within device fd */
>> +    int fd; /* device fd, allows us to pass VFIORegion as opaque data */
>> +    MemoryRegion mem; /* slow, read/write access */
>> +    MemoryRegion mmap_mem; /* direct mapped access */
>> +    void *mmap;
>> +    size_t size;
>> +    uint32_t flags; /* VFIO region flags (rd/wr/mmap) */
>> +    uint8_t nr; /* cache the region number for debug */
>> +} VFIORegion;
>> +
>> +typedef struct VFIODevice {
>> +    SysBusDevice sbdev;
>> +    int fd;
>> +    int num_regions;
>> +    VFIORegion *regions;
>> +    QLIST_ENTRY(VFIODevice) next;
>> +    struct VFIOGroup *group;
>> +    char *name;
>> +} VFIODevice;
> 
> I suspect we really need:
> 
> (in vfio-common.h)
> typedef struct VFIODevice {
> 	QLIST_ENTRY(VFIODevice) next;
> 	struct VFIOGroup *group;
> 	enum VFIODeviceType;
> 	/* maybe some of the other common stuff */
> } VFIODevice;
> 
> (per pci/platform)
> typedef struct VFIOPlatformDevice {
> 	VFIODevice vdev;
> 	/* device specific stuff */
> } VFIOPlatformDevice;
> 
> Otherwise I don't see how the device_list on VFIOGroup works.

I am currently experimenting your idea of building a kind a VFIODevice
base class. Before taking care of interrupts this would look like

typedef struct VFIODevice {
    QLIST_ENTRY(VFIODevice) next;
    struct VFIOGroup *group;
    int num_regions;
    VFIORegion **regions;
    int num_irqs;
    char *name;
    int fd;
    int type;
    bool reset_works;
    VFIODeviceOps *ops;
} VFIODevice;

Now I am stuck by the fact I would need this VFIODevice to be a QOM
object. This is needed because I need to pass the VFIODevice * in
memory_region_init_ram_ptr and memory_region_init as a OBJECT(vdev).
This would make possible to factorize quite a lot of code. However I am
currently doubting this is feasible after reading object.h, ie. I would
need double inheritance
VFIOPCIDevice <- PCIDevice
	      <- VFIODevice
VFIOPlatformDevice <- SysBusDevice
		   <- VFIODevice

Obviously I may do without VFIODevice being a QOM object but this means
less code factorization. If you have guidance/comments on this, I am
definitively interested.

Also I think we might be able to share a VFIORegion as follows.

typedef struct VFIORegion {
    off_t fd_offset; /* offset of region within device fd */
    int fd; /* device fd, allows us to pass VFIORegion as opaque data */
    MemoryRegion mem; /* slow, read/write access */
    MemoryRegion mmap_mem; /* direct mapped access */
    void *mmap;
    size_t size;
    uint32_t flags; /* VFIO region flags (rd/wr/mmap) */
    uint8_t nr; /* cache the region number for debug */
} VFIORegion;

typedef struct VFIOBAR {
    VFIORegion region;
    bool ioport;
    bool mem64;
    QLIST_HEAD(, VFIOQuirk) quirks;
} VFIOBAR;

Overall this makes quite a lot of changes in the pci.c file and
currently I do not know how to test the PCI code changes. This is
another serious issue.

> 
>> +
>> +static int vfio_mmap_region(VFIODevice *vdev, VFIORegion *region,
>> +                         MemoryRegion *mem, MemoryRegion *submem,
>> +                         void **map, size_t size, off_t offset,
>> +                         const char *name)
>> +{
>> +    int ret = 0;
>> +
>> +    if (VFIO_ALLOW_MMAP && size && region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
>> +        int prot = 0;
>> +        ret = 0;
>> +
>> +        if (region->flags & VFIO_REGION_INFO_FLAG_READ) {
>> +            prot |= PROT_READ;
>> +        }
>> +
>> +        if (region->flags & VFIO_REGION_INFO_FLAG_WRITE) {
>> +            prot |= PROT_WRITE;
>> +        }
>> +
>> +        *map = mmap(NULL, size, prot, MAP_SHARED,
>> +                    region->fd, region->fd_offset + offset);
>> +        if (*map == MAP_FAILED) {
>> +            ret = -errno;
>> +            *map = NULL;
>> +            goto error;
>> +        }
>> +
>> +        memory_region_init_ram_ptr(submem, OBJECT(vdev), name, size, *map);
>> +    }
>> +
>> +    memory_region_add_subregion(mem, offset, submem);
>> +
>> +error:
>> +    return ret;
>> +}
>> +
>> +/*
>> + * IO Port/MMIO - Beware of the endians, VFIO is always little endian
>> + */
>> +static void vfio_region_write(void *opaque, hwaddr addr,
>> +                              uint64_t data, unsigned size)
>> +{
>> +    VFIORegion *region = opaque;
>> +    union {
>> +        uint8_t byte;
>> +        uint16_t word;
>> +        uint32_t dword;
>> +        uint64_t qword;
>> +    } buf;
>> +
>> +    switch (size) {
>> +    case 1:
>> +        buf.byte = data;
>> +        break;
>> +    case 2:
>> +        buf.word = data;
>> +        break;
>> +    case 4:
>> +        buf.dword = data;
>> +        break;
>> +    default:
>> +        hw_error("vfio: unsupported write size, %d bytes\n", size);
>> +        break;
>> +    }
>> +
>> +    if (pwrite(region->fd, &buf, size, region->fd_offset + addr) != size) {
>> +        error_report("%s(,0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m",
>> +                     __func__, addr, data, size);
>> +    }
>> +
>> +    DPRINTF("%s(region %d+0x%"HWADDR_PRIx", 0x%"PRIx64", %d)\n",
>> +             __func__, region->nr, addr, data, size);
>> +}
>> +
>> +static uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +    VFIORegion *region = opaque;
>> +    union {
>> +        uint8_t byte;
>> +        uint16_t word;
>> +        uint32_t dword;
>> +        uint64_t qword;
>> +    } buf;
>> +    uint64_t data = 0;
>> +
>> +    if (pread(region->fd, &buf, size, region->fd_offset + addr) != size) {
>> +        error_report("%s(,0x%"HWADDR_PRIx", %d) failed: %m",
>> +                     __func__, addr, size);
>> +        return (uint64_t)-1;
>> +    }
>> +
>> +    switch (size) {
>> +    case 1:
>> +        data = buf.byte;
>> +        break;
>> +    case 2:
>> +        data = buf.word;
>> +        break;
>> +    case 4:
>> +        data = buf.dword;
>> +        break;
>> +    default:
>> +        hw_error("vfio: unsupported read size, %d bytes\n", size);
>> +        break;
>> +    }
>> +
>> +    DPRINTF("%s(region %d+0x%"HWADDR_PRIx", %d) = 0x%"PRIx64"\n",
>> +            __func__, region->nr, addr, size, data);
>> +
>> +    return data;
>> +}
>> +
>> +static const MemoryRegionOps vfio_region_ops = {
>> +    .read = vfio_region_read,
>> +    .write = vfio_region_write,
>> +    .endianness = DEVICE_NATIVE_ENDIAN,
>> +};
>> +
>> +static void vfio_map_region(VFIODevice *vdev, int nr)
>> +{
>> +    VFIORegion *region = &vdev->regions[nr];
>> +    unsigned size = region->size;
>> +    char name[64];
>> +
>> +    snprintf(name, sizeof(name), "VFIO %s region %d", vdev->name, nr);
>> +
>> +    /* A "slow" read/write mapping underlies all regions  */
>> +    memory_region_init_io(&region->mem, OBJECT(vdev), &vfio_region_ops,
>> +                          region, name, size);
>> +
>> +    strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
>> +    if (vfio_mmap_region(vdev, region, &region->mem,
>> +                         &region->mmap_mem, &region->mmap, size, 0, name)) {
>> +        error_report("%s unsupported. Performance may be slow", name);
>> +    }
>> +}
>> +
>> +static int vfio_get_device(VFIOGroup *group, const char *name,
>> +                           struct VFIODevice *vdev)
>> +{
>> +    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>> +    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
>> +    struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
>> +    int ret, i;
>> +
>> +    ret = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>> +    if (ret < 0) {
>> +        error_report("vfio: error getting device %s from group %d: %m",
>> +                     name, group->groupid);
>> +        error_printf("Verify all devices in group %d are bound to the vfio "
>> +                     "platform driver and are not already in use\n",
>> +                     group->groupid);
>> +        return ret;
>> +    }
>> +
>> +    vdev->fd = ret;
>> +    vdev->group = group;
>> +    QLIST_INSERT_HEAD(&group->device_list, vdev, next);
>> +
>> +    /* Sanity check device */
>> +    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_INFO, &dev_info);
>> +    if (ret) {
>> +        error_report("vfio: error getting device info: %m");
>> +        goto error;
>> +    }
>> +
>> +    DPRINTF("Device %s flags: %u, regions: %u, irqs: %u\n", name,
>> +            dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
>> +
>> +    vdev->regions = g_malloc0(sizeof(VFIORegion) * dev_info.num_regions);
>> +    if (!vdev->regions) {
>> +            error_report("vfio: Error allocating space for %d regions",
>> +                         dev_info.num_regions);
>> +            ret = -ENOMEM;
>> +            goto error;
>> +    }
>> +
>> +    vdev->num_regions = dev_info.num_regions;
>> +
>> +    for (i = 0; i < dev_info.num_regions; i++) {
>> +        reg_info.index = i;
>> +
>> +        ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
>> +        if (ret) {
>> +            error_report("vfio: Error getting region %d info: %m", i);
>> +            goto error;
>> +        }
>> +
>> +        DPRINTF("Device %s region %d:\n", name, i);
>> +        DPRINTF("  size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n",
>> +                (unsigned long)reg_info.size, (unsigned long)reg_info.offset,
>> +                (unsigned long)reg_info.flags);
>> +
>> +        vdev->regions[i].flags = reg_info.flags;
>> +        vdev->regions[i].size = reg_info.size;
>> +        vdev->regions[i].fd_offset = reg_info.offset;
>> +        vdev->regions[i].fd = vdev->fd;
>> +        vdev->regions[i].nr = i;
>> +    }
> 
> Perhaps more of this could be in common.c if we had a generic VFIODevice
> embedded in a VFIOPlatformDevice or VFIOPCIDevice.  We could fill in
> high-level things like number of regions/irqs, fd, name, etc, leaving
> interpretation and probing of each region/irq to device specific code.
> 

This is indeed current target. See comment above on QOM inheritance
problematic.

>> +
>> +error:
>> +    if (ret) {
>> +        g_free(vdev->regions);
>> +        QLIST_REMOVE(vdev, next);
>> +        vdev->group = NULL;
>> +        close(vdev->fd);
>> +    }
>> +    return ret;
>> +}
>> +
>> +static void vfio_platform_realize(DeviceState *dev, Error **errp)
>> +{
>> +    SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
>> +    VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, sbdev, sbdev);
>> +    VFIOGroup *group;
>> +    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
>> +    ssize_t len;
>> +    struct stat st;
>> +    int groupid, i, ret;
>> +
>> +    /* TODO: pass device name on command line */
>> +    vdev->name = malloc(PATH_MAX);
>> +    strcpy(vdev->name, "fff51000.ethernet");
>> +
>> +    /* Check that the host device exists */
>> +    snprintf(path, sizeof(path), "/sys/bus/platform/devices/%s/", vdev->name);
>> +    if (stat(path, &st) < 0) {
>> +        error_report("vfio: error: no such host device: %s", path);
>> +        return;
>> +    }
>> +
>> +    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
>> +
>> +    len = readlink(path, iommu_group_path, PATH_MAX);
>> +    if (len <= 0) {
>> +        error_report("vfio: error no iommu_group for device");
>> +        return;
>> +    }
>> +
>> +    iommu_group_path[len] = 0;
>> +    group_name = basename(iommu_group_path);
>> +
>> +    if (sscanf(group_name, "%d", &groupid) != 1) {
>> +        error_report("vfio: error reading %s: %m", path);
>> +        return;
>> +    }
>> +
>> +    DPRINTF("%s(%s) group %d\n", __func__, vdev->name, groupid);
>> +
>> +    group = vfio_get_group(groupid, NULL);
>> +    if (!group) {
>> +        error_report("vfio: failed to get group %d", groupid);
>> +        return;
>> +    }
>> +
>> +    snprintf(path, sizeof(path), "%s", vdev->name);
>> +
>> +    QLIST_FOREACH(pvdev, &group->device_list, next) {
>> +        if (strcmp(pvdev->name, vdev->name) == 0) {
>> +            error_report("vfio: error: device %s is already attached", path);
>> +            vfio_put_group(group, NULL);
>> +            return;
>> +        }
>> +    }
>> +
>> +    ret = vfio_get_device(group, path, vdev);
>> +    if (ret) {
>> +        error_report("vfio: failed to get device %s", path);
>> +        vfio_put_group(group, NULL);
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < vdev->num_regions; i++) {
>> +        vfio_map_region(vdev, i);
>> +        sysbus_init_mmio(sbdev, &vdev->regions[i].mem);
>> +    }
>> +}
>> +
>> +static const VMStateDescription vfio_platform_vmstate = {
>> +    .name = TYPE_VFIO_PLATFORM,
>> +    .unmigratable = 1,
>> +};
>> +
>> +static void vfio_platform_dev_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +
>> +    dc->realize = vfio_platform_realize;
>> +    dc->vmsd = &vfio_platform_vmstate;
>> +    dc->desc = "VFIO-based platform device assignment";
>> +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>> +}
>> +
>> +static const TypeInfo vfio_platform_dev_info = {
>> +    .name = TYPE_VFIO_PLATFORM,
>> +    .parent = TYPE_SYS_BUS_DEVICE,
>> +    .instance_size = sizeof(VFIODevice),
>> +    .class_init = vfio_platform_dev_class_init,
>> +};
>> +
>> +static void register_vfio_platform_dev_type(void)
>> +{
>> +    type_register_static(&vfio_platform_dev_info);
>> +}
>> +
>> +type_init(register_vfio_platform_dev_type)
>> diff --git a/hw/vfio/vfio-common.h b/hw/vfio/vfio-common.h
>> new file mode 100644
>> index 0000000..21148ef
>> --- /dev/null
>> +++ b/hw/vfio/vfio-common.h
>> @@ -0,0 +1,55 @@
>> +/*
>> + * common header for vfio based device assignment support
>> + *
>> + * Copyright Red Hat, Inc. 2012
>> + *
>> + * Authors:
>> + *  Alex Williamson <alex.williamson@redhat.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + *
>> + * Based on qemu-kvm device-assignment:
>> + *  Adapted for KVM by Qumranet.
>> + *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
>> + *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
>> + *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
>> + *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
>> + *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
>> + */
>> +
>> +struct VFIODevice;
>> +
>> +struct VFIOGroup;
>> +
>> +typedef struct VFIOType1 {
>> +    MemoryListener listener;
>> +    int error;
>> +    bool initialized;
>> +} VFIOType1;
>> +
>> +typedef struct VFIOContainer {
>> +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> +    struct {
>> +        /* enable abstraction to support various iommu backends */
>> +        union {
>> +            VFIOType1 type1;
>> +        };
>> +        void (*release)(struct VFIOContainer *);
>> +    } iommu_data;
>> +    QLIST_HEAD(, VFIOGroup) group_list;
>> +    QLIST_ENTRY(VFIOContainer) next;
>> +} VFIOContainer;
>> +
>> +typedef struct VFIOGroup {
>> +    int fd;
>> +    int groupid;
>> +    VFIOContainer *container;
>> +    QLIST_HEAD(, VFIODevice) device_list;
>> +    QLIST_ENTRY(VFIOGroup) next;
>> +    QLIST_ENTRY(VFIOGroup) container_next;
>> +} VFIOGroup;
>> +
>> +
>> +VFIOGroup *vfio_get_group(int groupid, QEMUResetHandler *reset_handler);
>> +void vfio_put_group(VFIOGroup *group, QEMUResetHandler *reset_handler);
> 
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-05-20  6:44 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-09 15:33 [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough Eric Auger
2014-04-09 15:33 ` [Qemu-devel] [RFC v2 1/6] hw/arm/virt: add a xgmac device Eric Auger
2014-04-10 13:26   ` Peter Crosthwaite
2014-04-10 13:48     ` Alexander Graf
2014-04-11  5:41       ` Alistair Francis
2014-04-11  9:21         ` Alexander Graf
2014-04-09 15:33 ` [Qemu-devel] [RFC v2 2/6] vfio: move hw/misc/vfio.c to hw/vfio/pci.c Eric Auger
2014-04-09 15:33 ` [Qemu-devel] [RFC v2 3/6] vfio: add vfio-platform support Eric Auger
     [not found]   ` <1397832861.3060.93.camel@ul30vt.home>
2014-05-20  6:44     ` Eric Auger
2014-04-09 15:33 ` [Qemu-devel] [RFC v2 4/6] vfio: Add initial IRQ support in QEMU platform device Eric Auger
2014-04-11  1:34   ` Kim Phillips
2014-04-09 15:33 ` [Qemu-devel] [RFC v2 5/6] virt: Assign a VFIO platform device to a virt VM in QEMU command line Eric Auger
2014-04-09 15:33 ` [Qemu-devel] [RFC v2 6/6] vfio: add exit function and IRQ disable functions Eric Auger
2014-04-11  1:28 ` [Qemu-devel] [RFC v2 0/6] KVM platform device passthrough Kim Phillips

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.