* [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough
@ 2014-06-02  7:49 Eric Auger
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 01/10] hw/arm/virt: add a xgmac device Eric Auger
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Eric Auger @ 2014-06-02  7:49 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel, kim.phillips, a.rigo
  Cc: peter.maydell, eric.auger, patches, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

This RFC series aims at enabling KVM platform device passthrough.
It implements a VFIO platform device and offers the capability to
instantiate this VFIO device in mach-virt.

The VFIO platform device uses a host VFIO platform driver, which must
be bound to the assigned device before QEMU starts.

- the guest can directly access the device register space
- assigned device IRQs are transparently routed to the guest by
  QEMU/KVM (two methods are currently supported)
- the IOMMU is transparently programmed to prevent the device from
  accessing physical pages outside of the guest address space

The patch series is made up of the following patches:

1) provides a mach_virt implementation where a VFIO device is
   instantiated at a fixed location

2) addresses A. Williamson's comment that the platform device code
   should be separated from the PCI device code. The VFIO device is
   moved into a new directory, hw/vfio/

3) provides a VFIO platform device that supports direct MMIO accesses.
   The VFIO device was reworked to share as much code as possible
   between the PCI device and the platform device.

4) simplifies PCI device trace calls by using a common "name" field

5) provides initial IRQ support. Device IRQs are now routed to the
   guest. IRQ handling is based on eventfds handled in user space.
   End of interrupt is detected by trapping guest accesses to MMIO.
   Functional, but suffers from some performance limitations.

6) enables the QEMU end-user to dynamically assign the device
   from the command line, using the -device option. This requires
   minimal knowledge from the end-user (VFIO driver name and
   compatibility string). From that point on the VFIO platform device
   becomes fully generic. A single compat string and a single MMIO
   region are supported.

7) maps regions in the IOMMU as executable. This feature is needed
   for some DMA devices that fetch code from some regions (typically
   the PL330).

8) adds support for multiple compat strings. This feature is needed
   for PrimeCell devices

9) forces the eventfd notification mechanism

10) introduces a new way of IRQ routing (based on KVM irqfd/GSI
   routing). This method performs far better than the one
   introduced in 5), since eventfds are handled on the host kernel
   side and interrupt completion is trapped at GIC level.
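
As an illustration of the eventfd mechanism that 5) and 10) rely on,
here is a minimal sketch using only the standard Linux eventfd(2)
interface. The helper names demo_irq_trigger() and demo_irq_consume()
are hypothetical and do not appear in the series. In the real setup the
host VFIO driver signals the eventfd, and either QEMU (5) or the host
kernel (10) consumes it; here a single process does both.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Signal the eventfd once, the way an interrupt source would.
 * Returns 0 on success, -1 on failure.
 */
static int demo_irq_trigger(int efd)
{
    uint64_t one = 1;

    /* Each write adds its 8-byte value to the eventfd counter. */
    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consume all pending signals.  Returns how many were pending, or -1
 * if the read failed (for a non-blocking eventfd this includes the
 * case where nothing is pending).
 */
static int64_t demo_irq_consume(int efd)
{
    uint64_t count;

    /* In the default (non-semaphore) mode a read returns the counter
     * and resets it to zero. */
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return -1;
    }
    return (int64_t)count;
}
```

A caller would create the descriptor with eventfd(0, EFD_NONBLOCK),
trigger it, and then poll or read it from its event loop. Several
triggers that arrive before a read are coalesced into one counter value.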

v1 (Kim Phillips):
Initial versions for 1, 2, 3

v1->v2 changes (Kim Phillips, Eric Auger):
- reworked split between PCI and platform (3)
- IRQ initial support (5)
- dynamic instantiation (6)

v2->v3 changes (Alvise Rigo, Eric Auger):
- Following Alex's recommendations, further efforts to factorize the
  code between PCI and platform, using VFIODevice and VFIORegion
  as base classes (3, 4)
- cleanup following Kim's comments
- multiple IRQ support mechanics should be in place, although they
  are not tested
- Better handling of multiple MMIO regions
- New features and fixes by Alvise (7, 8, 9)
- irqfd support (10)

This patch series has the following dependencies on the kernel side:

- [RFC Patch v5 0/11] VFIO support for platform devices
http://www.spinics.net/lists/kvm/msg102309.html
- [Patch] ARM: KVM: Handle IPA unmapping on memory region deletion
https://patches.linaro.org/27691/
- [PATCH v2] ARM: KVM: add irqfd and irq routing support
https://patches.linaro.org/29896/
- [PATCH] ARM: KVM: Enable the KVM-VFIO device
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-March/008629.html
- [PATCH] ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-March/008630.html

The patch series was tested on Calxeda Midway (ARMv7), where one xgmac
is assigned to the KVM host while the second one is assigned to the
guest.

Tentative Plan:
- further IRQ handling optimizations (removal of maintenance IRQ)
- unbind/migration/reset issues
- multi-instantiation testing
- multiple IRQ testing
- management of platform devices with more complex device tree nodes

Here are the instructions to test on a Calxeda Midway:

https://wiki.linaro.org/LEG/Engineering/Virtualization/Platform_Device_Passthrough_on_Midway

git://git.linaro.org/people/eric.auger/linux.git (branch irqfd_integ_v2)
git://git.linaro.org/people/eric.auger/qemu.git (branch vfio-dev-integ-RFCv3)

Best Regards

Eric


Alvise Rigo (3):
  Add EXEC_FLAG to VFIO DMA mappings
  Add AMBA devices support to VFIO
  Always use eventfd as notifying mechanism

Eric Auger (4):
  vfio: simplifed DPRINTF calls using device name
  vfio: Add initial IRQ support in platform device
  virt: Assign a VFIO platform device with -device option
  vfio: Add irqfd support in platform device

Kim Phillips (3):
  hw/arm/virt: add a xgmac device
  vfio: move hw/misc/vfio.c to hw/vfio/pci.c
  vfio: add vfio-platform support

 LICENSE                        |    2 +-
 MAINTAINERS                    |    2 +-
 hw/Makefile.objs               |    1 +
 hw/arm/virt.c                  |  238 +++++-
 hw/intc/arm_gic_kvm.c          |    1 +
 hw/misc/Makefile.objs          |    1 -
 hw/vfio/Makefile.objs          |    5 +
 hw/vfio/common.c               |  854 ++++++++++++++++++++++
 hw/{misc/vfio.c => vfio/pci.c} | 1562 ++++++++++------------------------------
 hw/vfio/platform.c             |  733 +++++++++++++++++++
 hw/vfio/vfio-common.h          |  153 ++++
 linux-headers/linux/vfio.h     |    3 +
 12 files changed, 2378 insertions(+), 1177 deletions(-)
 create mode 100644 hw/vfio/Makefile.objs
 create mode 100644 hw/vfio/common.c
 rename hw/{misc/vfio.c => vfio/pci.c} (65%)
 create mode 100644 hw/vfio/platform.c
 create mode 100644 hw/vfio/vfio-common.h

-- 
1.8.3.2


* [Qemu-devel] [RFC v3 01/10] hw/arm/virt: add a xgmac device
  2014-06-02  7:49 [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough Eric Auger
@ 2014-06-02  7:49 ` Eric Auger
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 02/10] vfio: move hw/misc/vfio.c to hw/vfio/pci.c Eric Auger
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Eric Auger @ 2014-06-02  7:49 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel, kim.phillips, a.rigo
  Cc: peter.maydell, eric.auger, Kim Phillips, patches, agraf,
	stuart.yoder, alex.williamson, christophe.barnichon, a.motakis,
	kvmarm

From: Kim Phillips <kim.phillips@linaro.org>

This is a hack and only serves as an example of what needs to be
done to make the next RFC - add vfio-platform support - work
for development purposes on a Calxeda Midway system.  We don't want
mach-virt to always create this ethernet device - DO NOT APPLY, etc.

Initial attempts to convince QEMU to create a memory mapped device
on the command line (e.g., -device vfio-platform,name=fff51000.ethernet)
would fail with "Parameter 'driver' expects pluggable device type".
Any guidance as to how to overcome this apparent design limitation
is welcome.

RAM is reduced from 30GiB to 1GiB so as not to overlap the xgmac
device's physical address.  Not sure if the 30GiB RAM (or whatever the
user sets it to with -m) could be set up above 0x1_0000_0000, but there
is probably extra work needed to resolve this type of conflict.
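
To make the conflict concrete, here is a small sketch of the overlap
arithmetic, using the base and size values from the a15memmap[] hunk
below. The Range type and ranges_overlap() helper are hypothetical
names used for illustration only; they are not part of hw/arm/virt.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open range [base, base + size) in guest physical address space. */
typedef struct {
    uint64_t base;
    uint64_t size;
} Range;

/* Two ranges overlap when each one starts before the other one ends. */
static bool ranges_overlap(Range a, Range b)
{
    return a.base < b.base + b.size && b.base < a.base + a.size;
}

#define GiB (1024ULL * 1024 * 1024)

/* 30GiB of RAM at 0x40000000 ends at 0x7c0000000, past the xgmac. */
static const Range ram_30g = { 0x40000000ULL, 30 * GiB };
/* 1GiB of RAM at 0x40000000 ends at 0x80000000, below the xgmac. */
static const Range ram_1g  = { 0x40000000ULL, 1 * GiB };
/* The xgmac register window added by this patch. */
static const Range xgmac   = { 0xfff51000ULL, 0x1000 };
```

With these values, ranges_overlap(ram_30g, xgmac) is true and
ranges_overlap(ram_1g, xgmac) is false, which is why the RAM size is
shrunk here.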

note: vfio-platform interrupt support development may want interrupt
property data filled; here it's omitted for the time being.

Not-signed-off-by: Kim Phillips <kim.phillips@linaro.org>
---
 hw/arm/virt.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ea4f02d..becd76b 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -65,6 +65,7 @@ enum {
     VIRT_GIC_CPU,
     VIRT_UART,
     VIRT_MMIO,
+    VIRT_ETHERNET,
 };
 
 typedef struct MemMapEntry {
@@ -104,7 +105,8 @@ static const MemMapEntry a15memmap[] = {
     [VIRT_MMIO] = { 0xa000000, 0x200 },
     /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */
     /* 0x10000000 .. 0x40000000 reserved for PCI */
-    [VIRT_MEM] = { 0x40000000, 30ULL * 1024 * 1024 * 1024 },
+    [VIRT_MEM] = { 0x40000000, 1ULL * 1024 * 1024 * 1024 },
+    [VIRT_ETHERNET] = { 0xfff51000, 0x1000 },
 };
 
 static const int a15irqmap[] = {
@@ -340,6 +342,25 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic)
     g_free(nodename);
 }
 
+static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
+{
+    char *nodename;
+    hwaddr base = vbi->memmap[VIRT_ETHERNET].base;
+    hwaddr size = vbi->memmap[VIRT_ETHERNET].size;
+    const char compat[] = "calxeda,hb-xgmac";
+
+    sysbus_create_simple("vfio-platform", base, NULL);
+
+    nodename = g_strdup_printf("/ethernet@%" PRIx64, base);
+    qemu_fdt_add_subnode(vbi->fdt, nodename);
+
+    /* Note that we can't use setprop_string because of the embedded NUL */
+    qemu_fdt_setprop(vbi->fdt, nodename, "compatible", compat, sizeof(compat));
+    qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", 2, base, 2, size);
+
+    g_free(nodename);
+}
+
 static void create_virtio_devices(const VirtBoardInfo *vbi, qemu_irq *pic)
 {
     int i;
@@ -454,6 +475,7 @@ static void machvirt_init(QEMUMachineInitArgs *args)
     create_gic(vbi, pic);
 
     create_uart(vbi, pic);
+    create_ethernet(vbi, pic);
 
     /* Create mmio transports, so the user can create virtio backends
      * (which will be automatically plugged in to the transports). If
-- 
1.8.3.2


* [Qemu-devel] [RFC v3 02/10] vfio: move hw/misc/vfio.c to hw/vfio/pci.c
  2014-06-02  7:49 [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough Eric Auger
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 01/10] hw/arm/virt: add a xgmac device Eric Auger
@ 2014-06-02  7:49 ` Eric Auger
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 03/10] vfio: add vfio-platform support Eric Auger
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Eric Auger @ 2014-06-02  7:49 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel, kim.phillips, a.rigo
  Cc: peter.maydell, eric.auger, Kim Phillips, patches, agraf,
	stuart.yoder, alex.williamson, christophe.barnichon, a.motakis,
	kvmarm

From: Kim Phillips <kim.phillips@linaro.org>

This is done in preparation for the addition of VFIO platform
device support.

Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
---
 LICENSE                        | 2 +-
 MAINTAINERS                    | 2 +-
 hw/Makefile.objs               | 1 +
 hw/misc/Makefile.objs          | 1 -
 hw/vfio/Makefile.objs          | 3 +++
 hw/{misc/vfio.c => vfio/pci.c} | 0
 6 files changed, 6 insertions(+), 3 deletions(-)
 create mode 100644 hw/vfio/Makefile.objs
 rename hw/{misc/vfio.c => vfio/pci.c} (100%)

diff --git a/LICENSE b/LICENSE
index da70e94..0e0b4b9 100644
--- a/LICENSE
+++ b/LICENSE
@@ -11,7 +11,7 @@ option) any later version.
 
 As of July 2013, contributions under version 2 of the GNU General Public
 License (and no later version) are only accepted for the following files
-or directories: bsd-user/, linux-user/, hw/misc/vfio.c, hw/xen/xen_pt*.
+or directories: bsd-user/, linux-user/, hw/vfio/, hw/xen/xen_pt*.
 
 3) The Tiny Code Generator (TCG) is released under the BSD license
    (see license headers in files).
diff --git a/MAINTAINERS b/MAINTAINERS
index 97c9fa1..0b58184 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -613,7 +613,7 @@ F: tests/usb-hcd-ehci-test.c
 VFIO
 M: Alex Williamson <alex.williamson@redhat.com>
 S: Supported
-F: hw/misc/vfio.c
+F: hw/vfio/*
 
 vhost
 M: Michael S. Tsirkin <mst@redhat.com>
diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index d178b65..b16dada 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -26,6 +26,7 @@ devices-dirs-$(CONFIG_SOFTMMU) += ssi/
 devices-dirs-$(CONFIG_SOFTMMU) += timer/
 devices-dirs-$(CONFIG_TPM) += tpm/
 devices-dirs-$(CONFIG_SOFTMMU) += usb/
+devices-dirs-$(CONFIG_SOFTMMU) += vfio/
 devices-dirs-$(CONFIG_VIRTIO) += virtio/
 devices-dirs-$(CONFIG_SOFTMMU) += watchdog/
 devices-dirs-$(CONFIG_SOFTMMU) += xen/
diff --git a/hw/misc/Makefile.objs b/hw/misc/Makefile.objs
index f674365..656570c 100644
--- a/hw/misc/Makefile.objs
+++ b/hw/misc/Makefile.objs
@@ -21,7 +21,6 @@ common-obj-$(CONFIG_MACIO) += macio/
 
 ifeq ($(CONFIG_PCI), y)
 obj-$(CONFIG_KVM) += ivshmem.o
-obj-$(CONFIG_LINUX) += vfio.o
 endif
 
 obj-$(CONFIG_REALVIEW) += arm_sysctl.o
diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
new file mode 100644
index 0000000..31c7dab
--- /dev/null
+++ b/hw/vfio/Makefile.objs
@@ -0,0 +1,3 @@
+ifeq ($(CONFIG_LINUX), y)
+obj-$(CONFIG_PCI) += pci.o
+endif
diff --git a/hw/misc/vfio.c b/hw/vfio/pci.c
similarity index 100%
rename from hw/misc/vfio.c
rename to hw/vfio/pci.c
-- 
1.8.3.2


* [Qemu-devel] [RFC v3 03/10] vfio: add vfio-platform support
  2014-06-02  7:49 [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough Eric Auger
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 01/10] hw/arm/virt: add a xgmac device Eric Auger
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 02/10] vfio: move hw/misc/vfio.c to hw/vfio/pci.c Eric Auger
@ 2014-06-02  7:49 ` Eric Auger
  2014-06-25 21:21   ` Alexander Graf
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 04/10] vfio: simplifed DPRINTF calls using device name Eric Auger
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Eric Auger @ 2014-06-02  7:49 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel, kim.phillips, a.rigo
  Cc: peter.maydell, eric.auger, Kim Phillips, patches, agraf,
	stuart.yoder, alex.williamson, christophe.barnichon, a.motakis,
	kvmarm

From: Kim Phillips <kim.phillips@linaro.org>

Functions shared by PCI and platform device support are moved
into common.c.  The common vfio_{get,put}_group() get an additional
argument, a pointer to a reset handler, which is passed on to
qemu_register_reset() only if it is non-NULL (the platform device
code currently passes NULL as its reset_handler).

For the platform device code, we basically use SysBusDevice
instead of PCIDevice.  Since realize() returns void, unlike
PCIDevice's initfn, error codes are moved into the
error message text with %m.

Currently only MMIO access is supported.

The perceived path for future QEMU development is:

- add support for interrupts
- verify and test platform dev unmap path
- test existing PCI path for any regressions
- add support for creating platform devices on the qemu command line
  - currently device address specification is hardcoded for test
    development on Calxeda Midway's fff51000.ethernet device
- reset is not supported and registration of reset functions is
  bypassed for platform devices.
  - there is no standard means of resetting a platform device,
    unsure if it suffices to be handled at device--VFIO binding time

[1] http://www.spinics.net/lists/kvm-arm/msg08195.html

Changes (v2 -> v3):
[work done by Eric Auger]

This new version introduces 2 separate VFIO Device objects:
- VFIOPCIDevice
- VFIOPlatformDevice

Both objects share a VFIODevice struct.

Also a VFIORegion shared struct was created. It is embedded in
VFIOBAR struct. VFIOPlatformDevice uses VFIORegion directly.

Introducing those base classes required many small changes in
the PCI code. Theoretically PCI and platform devices can be
supported simultaneously. The PCI modifications are currently
untested.

The VFIODevice is not a QOM object due to the single inheritance
model limitation.

The VFIODevice struct embeds an ops structure which is
specialized in each VFIO leaf device. This makes it possible to
call device-specific functions in common parts, hence achieving
better factorization.
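
The general pattern can be sketched as follows. This is only an
illustration of an embedded base struct with a per-leaf ops table; all
Demo* names and fields are hypothetical and do not reflect the actual
VFIODevice layout or its ops members in this patch.

```c
#include <stddef.h>

typedef struct DemoDevice DemoDevice;

/* Per-leaf callbacks; common code only ever goes through this table. */
typedef struct DemoDeviceOps {
    void (*reset)(DemoDevice *dev);
} DemoDeviceOps;

/* Shared base struct, embedded in (not inherited by) each leaf device. */
struct DemoDevice {
    const char *name;
    const DemoDeviceOps *ops;
};

typedef struct {
    DemoDevice base;
    int bus_resets;
} DemoPCIDevice;

typedef struct {
    DemoDevice base;
    int soft_resets;
} DemoPlatformDevice;

/* Recover the leaf object from a pointer to its embedded base. */
#define DEMO_CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void demo_pci_reset(DemoDevice *dev)
{
    DEMO_CONTAINER_OF(dev, DemoPCIDevice, base)->bus_resets++;
}

static void demo_platform_reset(DemoDevice *dev)
{
    DEMO_CONTAINER_OF(dev, DemoPlatformDevice, base)->soft_resets++;
}

static const DemoDeviceOps demo_pci_ops = { .reset = demo_pci_reset };
static const DemoDeviceOps demo_platform_ops = {
    .reset = demo_platform_reset,
};

/* Common code: one generic handler that works for every leaf type. */
static void demo_reset_all(DemoDevice *const *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (devs[i]->ops && devs[i]->ops->reset) {
            devs[i]->ops->reset(devs[i]);
        }
    }
}
```

The common handler never needs to know which leaf type it is dealing
with; each leaf supplies its own table and recovers its full object
from the embedded base pointer.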

Reset handling is a typical example: a single generic reset handler
(in common.c) is used for both derived classes and calls
device-specific methods.

As in the original contribution, only MMIO is supported in this
patch (in mmap mode). IRQ support comes in a subsequent patch.

Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 hw/vfio/Makefile.objs      |    2 +
 hw/vfio/common.c           |  849 ++++++++++++++++++++++++++++
 hw/vfio/pci.c              | 1316 ++++++++++----------------------------------
 hw/vfio/platform.c         |  267 +++++++++
 hw/vfio/vfio-common.h      |  143 +++++
 linux-headers/linux/vfio.h |    1 +
 6 files changed, 1565 insertions(+), 1013 deletions(-)
 create mode 100644 hw/vfio/common.c
 create mode 100644 hw/vfio/platform.c
 create mode 100644 hw/vfio/vfio-common.h

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index 31c7dab..c5c76fe 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,3 +1,5 @@
 ifeq ($(CONFIG_LINUX), y)
+obj-$(CONFIG_SOFTMMU) += common.o
 obj-$(CONFIG_PCI) += pci.o
+obj-$(CONFIG_SOFTMMU) += platform.o
 endif
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
new file mode 100644
index 0000000..07dc409
--- /dev/null
+++ b/hw/vfio/common.c
@@ -0,0 +1,849 @@
+/*
+ * vfio based device assignment support
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on qemu-kvm device-assignment:
+ *  Adapted for KVM by Qumranet.
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
+ *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
+ */
+
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include "sys/mman.h"
+
+#include "exec/address-spaces.h"
+#include "qemu/error-report.h"
+#include "sysemu/kvm.h"
+
+#include "vfio-common.h"
+
+static QLIST_HEAD(, VFIOContainer)
+    container_list = QLIST_HEAD_INITIALIZER(container_list);
+
+QLIST_HEAD(, VFIOGroup)
+    group_list = QLIST_HEAD_INITIALIZER(group_list);
+
+
+#ifdef CONFIG_KVM
+/*
+ * We have a single VFIO pseudo device per KVM VM.  Once created it lives
+ * for the life of the VM.  Closing the file descriptor only drops our
+ * reference to it and the device's reference to kvm.  Therefore once
+ * initialized, this file descriptor is only released on QEMU exit and
+ * we'll re-use it should another vfio device be attached before then.
+ */
+static int vfio_kvm_device_fd = -1;
+#endif
+
+/*
+ * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
+ */
+static int vfio_dma_unmap(VFIOContainer *container,
+                          hwaddr iova, ram_addr_t size)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = 0,
+        .iova = iova,
+        .size = size,
+    };
+
+    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        DPRINTF("VFIO_UNMAP_DMA: %d\n", -errno);
+        return -errno;
+    }
+
+    return 0;
+}
+
+static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                        ram_addr_t size, void *vaddr, bool readonly)
+{
+    struct vfio_iommu_type1_dma_map map = {
+        .argsz = sizeof(map),
+        .flags = VFIO_DMA_MAP_FLAG_READ,
+        .vaddr = (__u64)(uintptr_t)vaddr,
+        .iova = iova,
+        .size = size,
+    };
+
+    if (!readonly) {
+        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
+    }
+
+    /*
+     * Try the mapping, if it fails with EBUSY, unmap the region and try
+     * again.  This shouldn't be necessary, but we sometimes see it in
+     * the VGA ROM space.
+     */
+    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
+        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
+         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
+        return 0;
+    }
+
+    DPRINTF("VFIO_MAP_DMA: %d\n", -errno);
+    return -errno;
+}
+
+static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+{
+    return !memory_region_is_ram(section->mr) ||
+           /*
+            * Sizing an enabled 64-bit BAR can cause spurious mappings to
+            * addresses in the upper part of the 64-bit address space.  These
+            * are never accessed by the CPU and beyond the address width of
+            * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
+            */
+           section->offset_within_address_space & (1ULL << 63);
+}
+
+static void vfio_listener_region_add(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.type1.listener);
+    hwaddr iova, end;
+    void *vaddr;
+    int ret;
+
+    assert(!memory_region_is_iommu(section->mr));
+
+    if (vfio_listener_skipped_section(section)) {
+        DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
+                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    end = (section->offset_within_address_space + int128_get64(section->size)) &
+          TARGET_PAGE_MASK;
+
+    if (iova >= end) {
+        return;
+    }
+
+    vaddr = memory_region_get_ram_ptr(section->mr) +
+            section->offset_within_region +
+            (iova - section->offset_within_address_space);
+
+    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
+            iova, end - 1, vaddr);
+
+    memory_region_ref(section->mr);
+    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
+    if (ret) {
+        error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+                     "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                     container, iova, end - iova, vaddr, ret);
+
+        /*
+         * On the initfn path, store the first error in the container so we
+         * can gracefully fail.  Runtime, there's not much we can do other
+         * than throw a hardware error.
+         */
+        if (!container->iommu_data.type1.initialized) {
+            if (!container->iommu_data.type1.error) {
+                container->iommu_data.type1.error = ret;
+            }
+        } else {
+            hw_error("vfio: DMA mapping failed, unable to continue");
+        }
+    }
+}
+
+static void vfio_listener_region_del(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.type1.listener);
+    hwaddr iova, end;
+    int ret;
+
+    if (vfio_listener_skipped_section(section)) {
+        DPRINTF("SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n",
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
+                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    end = (section->offset_within_address_space + int128_get64(section->size)) &
+          TARGET_PAGE_MASK;
+
+    if (iova >= end) {
+        return;
+    }
+
+    DPRINTF("region_del %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
+            iova, end - 1);
+
+    ret = vfio_dma_unmap(container, iova, end - iova);
+    memory_region_unref(section->mr);
+    if (ret) {
+        error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                     "0x%"HWADDR_PRIx") = %d (%m)",
+                     container, iova, end - iova, ret);
+    }
+}
+
+static MemoryListener vfio_memory_listener = {
+    .region_add = vfio_listener_region_add,
+    .region_del = vfio_listener_region_del,
+};
+
+static void vfio_listener_release(VFIOContainer *container)
+{
+    memory_listener_unregister(&container->iommu_data.type1.listener);
+}
+
+static void vfio_kvm_device_add_group(VFIOGroup *group)
+{
+#ifdef CONFIG_KVM
+    struct kvm_device_attr attr = {
+        .group = KVM_DEV_VFIO_GROUP,
+        .attr = KVM_DEV_VFIO_GROUP_ADD,
+        .addr = (uint64_t)(unsigned long)&group->fd,
+    };
+
+    if (!kvm_enabled()) {
+        return;
+    }
+
+    if (vfio_kvm_device_fd < 0) {
+        struct kvm_create_device cd = {
+            .type = KVM_DEV_TYPE_VFIO,
+        };
+
+        if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
+            DPRINTF("KVM_CREATE_DEVICE: %m\n");
+            return;
+        }
+
+        vfio_kvm_device_fd = cd.fd;
+    }
+
+    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+        error_report("Failed to add group %d to KVM VFIO device: %m",
+                     group->groupid);
+    }
+#endif
+}
+
+static void vfio_kvm_device_del_group(VFIOGroup *group)
+{
+#ifdef CONFIG_KVM
+    struct kvm_device_attr attr = {
+        .group = KVM_DEV_VFIO_GROUP,
+        .attr = KVM_DEV_VFIO_GROUP_DEL,
+        .addr = (uint64_t)(unsigned long)&group->fd,
+    };
+
+    if (vfio_kvm_device_fd < 0) {
+        return;
+    }
+
+    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+        error_report("Failed to remove group %d from KVM VFIO device: %m",
+                     group->groupid);
+    }
+#endif
+}
+
+static int vfio_connect_container(VFIOGroup *group)
+{
+    VFIOContainer *container;
+    int ret, fd;
+
+    if (group->container) {
+        return 0;
+    }
+
+    QLIST_FOREACH(container, &container_list, next) {
+        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+            group->container = container;
+            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+            return 0;
+        }
+    }
+
+    fd = qemu_open("/dev/vfio/vfio", O_RDWR);
+    if (fd < 0) {
+        error_report("vfio: failed to open /dev/vfio/vfio: %m");
+        return -errno;
+    }
+
+    ret = ioctl(fd, VFIO_GET_API_VERSION);
+    if (ret != VFIO_API_VERSION) {
+        error_report("vfio: supported vfio version: %d, "
+                     "reported version: %d", VFIO_API_VERSION, ret);
+        close(fd);
+        return -EINVAL;
+    }
+
+    container = g_malloc0(sizeof(*container));
+    container->fd = fd;
+
+    if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %m");
+            g_free(container);
+            close(fd);
+            return -errno;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %m");
+            g_free(container);
+            close(fd);
+            return -errno;
+        }
+
+        container->iommu_data.type1.listener = vfio_memory_listener;
+        container->iommu_data.release = vfio_listener_release;
+
+        memory_listener_register(&container->iommu_data.type1.listener,
+                                 &address_space_memory);
+
+        if (container->iommu_data.type1.error) {
+            ret = container->iommu_data.type1.error;
+            vfio_listener_release(container);
+            g_free(container);
+            close(fd);
+            error_report("vfio: memory listener initialization failed"
+                         " for container");
+            return ret;
+        }
+
+        container->iommu_data.type1.initialized = true;
+
+    } else {
+        error_report("vfio: No available IOMMU models");
+        g_free(container);
+        close(fd);
+        return -EINVAL;
+    }
+
+    QLIST_INIT(&container->group_list);
+    QLIST_INSERT_HEAD(&container_list, container, next);
+
+    group->container = container;
+    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+
+    return 0;
+}
+
+static void vfio_disconnect_container(VFIOGroup *group)
+{
+    VFIOContainer *container = group->container;
+
+    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
+        error_report("vfio: error disconnecting group %d from container",
+                     group->groupid);
+    }
+
+    QLIST_REMOVE(group, container_next);
+    group->container = NULL;
+
+    if (QLIST_EMPTY(&container->group_list)) {
+        if (container->iommu_data.release) {
+            container->iommu_data.release(container);
+        }
+        QLIST_REMOVE(container, next);
+        DPRINTF("vfio_disconnect_container: close container->fd\n");
+        close(container->fd);
+        g_free(container);
+    }
+}
+
+VFIOGroup *vfio_get_group(int groupid, QEMUResetHandler *reset_handler)
+{
+    VFIOGroup *group;
+    char path[32];
+    struct vfio_group_status status = { .argsz = sizeof(status) };
+
+    QLIST_FOREACH(group, &group_list, next) {
+        if (group->groupid == groupid) {
+            return group;
+        }
+    }
+
+    group = g_malloc0(sizeof(*group));
+
+    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
+    group->fd = qemu_open(path, O_RDWR);
+    if (group->fd < 0) {
+        error_report("vfio: error opening %s: %m", path);
+        g_free(group);
+        return NULL;
+    }
+
+    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
+        error_report("vfio: error getting group status: %m");
+        close(group->fd);
+        g_free(group);
+        return NULL;
+    }
+
+    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+        error_report("vfio: error, group %d is not viable, please ensure "
+                     "all devices within the iommu_group are bound to their "
+                     "vfio bus driver.", groupid);
+        close(group->fd);
+        g_free(group);
+        return NULL;
+    }
+
+    group->groupid = groupid;
+    QLIST_INIT(&group->device_list);
+
+    if (vfio_connect_container(group)) {
+        error_report("vfio: failed to setup container for group %d", groupid);
+        close(group->fd);
+        g_free(group);
+        return NULL;
+    }
+
+    if (QLIST_EMPTY(&group_list) && reset_handler) {
+        qemu_register_reset(reset_handler, NULL);
+    }
+
+    QLIST_INSERT_HEAD(&group_list, group, next);
+
+    vfio_kvm_device_add_group(group);
+
+    return group;
+}
+
+void vfio_put_group(VFIOGroup *group, QEMUResetHandler *reset_handler)
+{
+    if (!QLIST_EMPTY(&group->device_list)) {
+        return;
+    }
+
+    vfio_kvm_device_del_group(group);
+    vfio_disconnect_container(group);
+    QLIST_REMOVE(group, next);
+    DPRINTF("vfio_put_group: close group->fd\n");
+    close(group->fd);
+    g_free(group);
+
+    if (QLIST_EMPTY(&group_list) && reset_handler) {
+        qemu_unregister_reset(reset_handler, NULL);
+    }
+}
+
+
+void vfio_unmask_irqindex(VFIODevice *vdev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK,
+        .index = index,
+        .start = 0,
+        .count = 1,
+    };
+
+    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+void vfio_disable_irqindex(VFIODevice *vdev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
+        .index = index,
+        .start = 0,
+        .count = 0,
+    };
+
+    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+#ifdef CONFIG_KVM /* Unused outside of CONFIG_KVM code */
+void vfio_mask_int(VFIODevice *vdev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
+        .index = index,
+        .start = 0,
+        .count = 1,
+    };
+
+    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+#endif
+
+int vfio_mmap_region(Object *vdev, VFIORegion *region,
+                     MemoryRegion *mem, MemoryRegion *submem,
+                     void **map, size_t size, off_t offset,
+                     const char *name)
+{
+    int ret = 0;
+
+    if (VFIO_ALLOW_MMAP && size && region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
+        int prot = 0;
+
+        if (region->flags & VFIO_REGION_INFO_FLAG_READ) {
+            prot |= PROT_READ;
+        }
+
+        if (region->flags & VFIO_REGION_INFO_FLAG_WRITE) {
+            prot |= PROT_WRITE;
+        }
+
+        *map = mmap(NULL, size, prot, MAP_SHARED,
+                    region->fd, region->fd_offset + offset);
+        if (*map == MAP_FAILED) {
+            *map = NULL;
+            ret = -errno;
+            goto empty_region;
+        }
+
+        memory_region_init_ram_ptr(submem, vdev, name, size, *map);
+    } else {
+empty_region:
+        /* Create a zero sized sub-region to make cleanup easy. */
+        memory_region_init(submem, vdev, name, 0);
+    }
+
+    memory_region_add_subregion(mem, offset, submem);
+
+    return ret;
+}
+
+/*
+ * IO Port/MMIO - Beware of the endians, VFIO is always little endian
+ */
+void vfio_region_write(void *opaque, hwaddr addr,
+                       uint64_t data, unsigned size)
+{
+    VFIORegion *region = opaque;
+    VFIODevice *vdev = region->vdev;
+    union {
+        uint8_t byte;
+        uint16_t word;
+        uint32_t dword;
+        uint64_t qword;
+    } buf;
+
+    switch (size) {
+    case 1:
+        buf.byte = data;
+        break;
+    case 2:
+        buf.word = cpu_to_le16(data);
+        break;
+    case 4:
+        buf.dword = cpu_to_le32(data);
+        break;
+    default:
+        hw_error("vfio: unsupported write size, %d bytes", size);
+        break;
+    }
+
+    if (pwrite(region->fd, &buf, size, region->fd_offset + addr) != size) {
+        error_report("%s(,0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m",
+                     __func__, addr, data, size);
+    }
+
+#ifdef DEBUG_VFIO
+    {
+        DPRINTF("%s(%s:region%d+0x%"HWADDR_PRIx", 0x%"PRIx64
+                ", %d)\n", __func__, vdev->name,
+                region->nr, addr, data, size);
+    }
+#endif
+
+    /*
+     * A read or write to a region always signals an EOI via the device
+     * specific vfio_eoi callback, which is expected to do nothing if no
+     * interrupt is pending.  We assume that a region access is in response
+     * to an interrupt and that region accesses will service the interrupt.
+     * Unfortunately, we don't know which access will service it, so
+     * we're potentially getting quite a few host interrupts per guest one.
+     */
+    vdev->ops->vfio_eoi(vdev);
+}
+
+uint64_t vfio_region_read(void *opaque,
+                          hwaddr addr, unsigned size)
+{
+    VFIORegion *region = opaque;
+    VFIODevice *vdev = region->vdev;
+    union {
+        uint8_t byte;
+        uint16_t word;
+        uint32_t dword;
+        uint64_t qword;
+    } buf;
+    uint64_t data = 0;
+
+    if (pread(region->fd, &buf, size, region->fd_offset + addr) != size) {
+        error_report("%s(,0x%"HWADDR_PRIx", %d) failed: %m",
+                     __func__, addr, size);
+        return (uint64_t)-1;
+    }
+
+    switch (size) {
+    case 1:
+        data = buf.byte;
+        break;
+    case 2:
+        data = le16_to_cpu(buf.word);
+        break;
+    case 4:
+        data = le32_to_cpu(buf.dword);
+        break;
+    default:
+        hw_error("vfio: unsupported read size, %d bytes", size);
+        break;
+    }
+
+#ifdef DEBUG_VFIO
+    {
+        DPRINTF("%s(%s:region%d+0x%"HWADDR_PRIx", %d) = 0x%"PRIx64"\n",
+                __func__, vdev->name,
+                region->nr, addr, size, data);
+    }
+#endif
+
+    /* Same as write above */
+    vdev->ops->vfio_eoi(vdev);
+
+    return data;
+}
+
+
+int vfio_get_base_device(VFIOGroup *group, const char *name,
+                        struct VFIODevice *vdev)
+{
+    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
+    int ret;
+    int fd;
+
+    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    if (fd < 0) {
+        error_report("vfio: error getting device %s from group %d: %m",
+                     name, group->groupid);
+        error_printf("Verify all devices in group %d are bound to the "
+                     "vfio driver and are not already in use\n",
+                     group->groupid);
+        return fd;
+    }
+
+    vdev->fd = fd;
+    vdev->group = group;
+    QLIST_INSERT_HEAD(&group->device_list, vdev, next);
+
+    /* Sanity check device */
+    ret = ioctl(fd, VFIO_DEVICE_GET_INFO, &dev_info);
+    if (ret) {
+        error_report("vfio: error getting device info: %m");
+        goto error;
+    }
+
+    DPRINTF("Device %s flags: %u, regions: %u, irqs: %u\n", name,
+            dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
+
+    /* Check type consistency */
+    if (dev_info.flags & VFIO_DEVICE_FLAGS_PCI) {
+        if (vdev->type != VFIO_DEVICE_TYPE_PCI) {
+            goto error;
+        }
+    } else if (dev_info.flags & VFIO_DEVICE_FLAGS_PLATFORM) {
+        if (vdev->type != VFIO_DEVICE_TYPE_PLATFORM) {
+            goto error;
+        }
+    } else {
+        goto error;
+    }
+
+    vdev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
+
+    vdev->num_regions = dev_info.num_regions;
+    vdev->num_irqs = dev_info.num_irqs;
+
+    /* call device specific functions */
+    ret = vdev->ops->vfio_check_device(vdev);
+    if (ret < 0) {
+        DPRINTF("%s -- Error when checking device\n", __func__);
+        goto error;
+    }
+    ret = vdev->ops->vfio_get_device_regions(vdev);
+    if (ret < 0) {
+        DPRINTF("%s -- Error when handling regions\n", __func__);
+        goto error;
+    }
+    ret = vdev->ops->vfio_get_device_interrupts(vdev);
+    if (ret < 0) {
+        DPRINTF("%s -- Error when handling interrupts\n", __func__);
+        goto error;
+    }
+
+    return 0;
+
+error:
+    if (!ret) {
+        ret = -EINVAL; /* type mismatch reaches here with ret == 0 */
+    }
+    vfio_put_base_device(vdev);
+    return ret;
+}
+
+
+void vfio_put_base_device(VFIODevice *vdev)
+{
+    QLIST_REMOVE(vdev, next);
+    vdev->group = NULL;
+    DPRINTF("%s: close vdev->fd\n", __func__);
+    close(vdev->fd);
+}
+
+
+int vfio_base_device_init(VFIODevice *vdev, int type)
+{
+    VFIODevice *tmp;
+    VFIOGroup *group;
+    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
+    ssize_t len;
+    struct stat st;
+    int groupid;
+    int ret;
+
+    /* name must be set prior to the call */
+    if (vdev->name == NULL) {
+        return -EINVAL;
+    }
+    /* device specific ops must be set prior to the call */
+    if (vdev->ops == NULL) {
+        return -EINVAL;
+    }
+
+    /* Check that the host device exists */
+    if (type == VFIO_DEVICE_TYPE_PCI) {
+        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/", vdev->name);
+        vdev->type = VFIO_DEVICE_TYPE_PCI;
+    } else if (type == VFIO_DEVICE_TYPE_PLATFORM) {
+        snprintf(path, sizeof(path), "/sys/bus/platform/devices/%s/",
+                vdev->name);
+        vdev->type = VFIO_DEVICE_TYPE_PLATFORM;
+    } else {
+        return -EINVAL;
+    }
+
+    if (stat(path, &st) < 0) {
+        error_report("vfio: error: no such host device: %s", path);
+        return -errno;
+    }
+
+    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
+
+    len = readlink(path, iommu_group_path, sizeof(iommu_group_path));
+    if (len <= 0 || len >= sizeof(iommu_group_path)) {
+        error_report("vfio: error no iommu_group for device");
+        return len < 0 ? -errno : -ENAMETOOLONG;
+    }
+
+    iommu_group_path[len] = 0;
+    group_name = basename(iommu_group_path);
+
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_report("vfio: error reading %s: %m", path);
+        return -EINVAL;
+    }
+
+    DPRINTF("%s(%s) group %d\n", __func__, vdev->name, groupid);
+
+    group = vfio_get_group(groupid, vfio_reset_handler);
+    if (!group) {
+        error_report("vfio: failed to get group %d", groupid);
+        return -ENOENT;
+    }
+
+    snprintf(path, sizeof(path), "%s", vdev->name);
+
+    QLIST_FOREACH(tmp, &group->device_list, next) {
+        if (strcmp(tmp->name, vdev->name) == 0) {
+            error_report("vfio: error: device %s is already attached", path);
+            vfio_put_group(group, vfio_reset_handler);
+            return -EBUSY;
+        }
+    }
+
+    ret = vfio_get_base_device(group, path, vdev);
+    if (ret < 0) {
+        error_report("vfio: failed to get device %s", path);
+        vfio_put_group(group, vfio_reset_handler);
+        return ret;
+    }
+
+    return ret;
+
+}
+
+void print_regions(VFIODevice *vdev)
+{
+    int i;
+    DPRINTF("Device \"%s\" counts %d region(s):\n",
+             vdev->name, vdev->num_regions);
+
+    for (i = 0; i < vdev->num_regions; i++) {
+        DPRINTF("- region %d flags = 0x%lx, size = 0x%lx, "
+                "fd= %d, offset = 0x%lx\n",
+                vdev->regions[i]->nr,
+                (unsigned long)vdev->regions[i]->flags,
+                (unsigned long)vdev->regions[i]->size,
+                vdev->regions[i]->fd,
+                (unsigned long)vdev->regions[i]->fd_offset);
+    }
+}
+
+void vfio_reset_handler(void *opaque)
+{
+    VFIOGroup *group;
+    VFIODevice *vdev;
+
+    QLIST_FOREACH(group, &group_list, next) {
+        QLIST_FOREACH(vdev, &group->device_list, next) {
+            vdev->ops->vfio_compute_needs_reset(vdev);
+        }
+    }
+
+    QLIST_FOREACH(group, &group_list, next) {
+        QLIST_FOREACH(vdev, &group->device_list, next) {
+            if (vdev->needs_reset) {
+                vdev->ops->vfio_hot_reset_multi(vdev);
+            }
+        }
+    }
+}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 9cf5b84..a9e4d97 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1,5 +1,5 @@
 /*
- * vfio based device assignment support
+ * vfio based device assignment support - PCI devices
  *
  * Copyright Red Hat, Inc. 2012
  *
@@ -18,48 +18,23 @@
  *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
  */
 
-#include <dirent.h>
 #include <linux/vfio.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
-#include <sys/stat.h>
-#include <sys/types.h>
-#include <unistd.h>
 
-#include "config.h"
-#include "exec/address-spaces.h"
-#include "exec/memory.h"
 #include "hw/pci/msi.h"
 #include "hw/pci/msix.h"
-#include "hw/pci/pci.h"
-#include "qemu-common.h"
 #include "qemu/error-report.h"
-#include "qemu/event_notifier.h"
-#include "qemu/queue.h"
 #include "qemu/range.h"
-#include "sysemu/kvm.h"
 #include "sysemu/sysemu.h"
 
-/* #define DEBUG_VFIO */
-#ifdef DEBUG_VFIO
-#define DPRINTF(fmt, ...) \
-    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
-#else
-#define DPRINTF(fmt, ...) \
-    do { } while (0)
-#endif
+#include "vfio-common.h"
 
-/* Extra debugging, trap acceleration paths for more logging */
-#define VFIO_ALLOW_MMAP 1
-#define VFIO_ALLOW_KVM_INTX 1
-#define VFIO_ALLOW_KVM_MSI 1
-#define VFIO_ALLOW_KVM_MSIX 1
-
-struct VFIODevice;
+extern QLIST_HEAD(, VFIOGroup) group_list;
 
 typedef struct VFIOQuirk {
     MemoryRegion mem;
-    struct VFIODevice *vdev;
+    struct VFIOPCIDevice *vdev;
     QLIST_ENTRY(VFIOQuirk) next;
     struct {
         uint32_t base_offset:TARGET_PAGE_BITS;
@@ -81,14 +56,8 @@ typedef struct VFIOQuirk {
 } VFIOQuirk;
 
 typedef struct VFIOBAR {
-    off_t fd_offset; /* offset of BAR within device fd */
-    int fd; /* device fd, allows us to pass VFIOBAR as opaque data */
-    MemoryRegion mem; /* slow, read/write access */
-    MemoryRegion mmap_mem; /* direct mapped access */
-    void *mmap;
-    size_t size;
-    uint32_t flags; /* VFIO region flags (rd/wr/mmap) */
-    uint8_t nr; /* cache the BAR number for debug */
+    VFIORegion region;
+
     bool ioport;
     bool mem64;
     QLIST_HEAD(, VFIOQuirk) quirks;
@@ -120,7 +89,7 @@ typedef struct VFIOINTx {
 
 typedef struct VFIOMSIVector {
     EventNotifier interrupt; /* eventfd triggered on interrupt */
-    struct VFIODevice *vdev; /* back pointer to device */
+    struct VFIOPCIDevice *vdev; /* back pointer to device */
     MSIMessage msg; /* cache the MSI message so we know when it changes */
     int virq; /* KVM irqchip route for QEMU bypass */
     bool use;
@@ -133,27 +102,6 @@ enum {
     VFIO_INT_MSIX = 3,
 };
 
-struct VFIOGroup;
-
-typedef struct VFIOType1 {
-    MemoryListener listener;
-    int error;
-    bool initialized;
-} VFIOType1;
-
-typedef struct VFIOContainer {
-    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
-    struct {
-        /* enable abstraction to support various iommu backends */
-        union {
-            VFIOType1 type1;
-        };
-        void (*release)(struct VFIOContainer *);
-    } iommu_data;
-    QLIST_HEAD(, VFIOGroup) group_list;
-    QLIST_ENTRY(VFIOContainer) next;
-} VFIOContainer;
-
 /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
 typedef struct VFIOMSIXInfo {
     uint8_t table_bar;
@@ -165,9 +113,9 @@ typedef struct VFIOMSIXInfo {
     void *mmap;
 } VFIOMSIXInfo;
 
-typedef struct VFIODevice {
+typedef struct VFIOPCIDevice {
+    VFIODevice vdev;
     PCIDevice pdev;
-    int fd;
     VFIOINTx intx;
     unsigned int config_size;
     uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
@@ -183,31 +131,18 @@ typedef struct VFIODevice {
     VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
     VFIOVGA vga; /* 0xa0000, 0x3b0, 0x3c0 */
     PCIHostDeviceAddress host;
-    QLIST_ENTRY(VFIODevice) next;
-    struct VFIOGroup *group;
     EventNotifier err_notifier;
     uint32_t features;
 #define VFIO_FEATURE_ENABLE_VGA_BIT 0
 #define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
     int32_t bootindex;
     uint8_t pm_cap;
-    bool reset_works;
     bool has_vga;
     bool pci_aer;
     bool has_flr;
     bool has_pm_reset;
-    bool needs_reset;
     bool rom_read_failed;
-} VFIODevice;
-
-typedef struct VFIOGroup {
-    int fd;
-    int groupid;
-    VFIOContainer *container;
-    QLIST_HEAD(, VFIODevice) device_list;
-    QLIST_ENTRY(VFIOGroup) next;
-    QLIST_ENTRY(VFIOGroup) container_next;
-} VFIOGroup;
+} VFIOPCIDevice;
 
 typedef struct VFIORomBlacklistEntry {
     uint16_t vendor_id;
@@ -234,75 +169,12 @@ static const VFIORomBlacklistEntry romblacklist[] = {
 
 #define MSIX_CAP_LENGTH 12
 
-static QLIST_HEAD(, VFIOContainer)
-    container_list = QLIST_HEAD_INITIALIZER(container_list);
-
-static QLIST_HEAD(, VFIOGroup)
-    group_list = QLIST_HEAD_INITIALIZER(group_list);
-
-#ifdef CONFIG_KVM
-/*
- * We have a single VFIO pseudo device per KVM VM.  Once created it lives
- * for the life of the VM.  Closing the file descriptor only drops our
- * reference to it and the device's reference to kvm.  Therefore once
- * initialized, this file descriptor is only released on QEMU exit and
- * we'll re-use it should another vfio device be attached before then.
- */
-static int vfio_kvm_device_fd = -1;
-#endif
-
-static void vfio_disable_interrupts(VFIODevice *vdev);
+static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
 static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
                                   uint32_t val, int len);
-static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled);
-
-/*
- * Common VFIO interrupt disable
- */
-static void vfio_disable_irqindex(VFIODevice *vdev, int index)
-{
-    struct vfio_irq_set irq_set = {
-        .argsz = sizeof(irq_set),
-        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
-        .index = index,
-        .start = 0,
-        .count = 0,
-    };
-
-    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
-}
+static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 
-/*
- * INTx
- */
-static void vfio_unmask_intx(VFIODevice *vdev)
-{
-    struct vfio_irq_set irq_set = {
-        .argsz = sizeof(irq_set),
-        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK,
-        .index = VFIO_PCI_INTX_IRQ_INDEX,
-        .start = 0,
-        .count = 1,
-    };
-
-    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
-}
-
-#ifdef CONFIG_KVM /* Unused outside of CONFIG_KVM code */
-static void vfio_mask_intx(VFIODevice *vdev)
-{
-    struct vfio_irq_set irq_set = {
-        .argsz = sizeof(irq_set),
-        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
-        .index = VFIO_PCI_INTX_IRQ_INDEX,
-        .start = 0,
-        .count = 1,
-    };
-
-    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
-}
-#endif
 
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
@@ -321,11 +193,11 @@ static void vfio_mask_intx(VFIODevice *vdev)
  */
 static void vfio_intx_mmap_enable(void *opaque)
 {
-    VFIODevice *vdev = opaque;
+    VFIOPCIDevice *vdev = opaque;
 
     if (vdev->intx.pending) {
         timer_mod(vdev->intx.mmap_timer,
-                       qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout);
+               qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout);
         return;
     }
 
@@ -334,7 +206,7 @@ static void vfio_intx_mmap_enable(void *opaque)
 
 static void vfio_intx_interrupt(void *opaque)
 {
-    VFIODevice *vdev = opaque;
+    VFIOPCIDevice *vdev = opaque;
 
     if (!event_notifier_test_and_clear(&vdev->intx.interrupt)) {
         return;
@@ -349,25 +221,27 @@ static void vfio_intx_interrupt(void *opaque)
     vfio_mmap_set_enabled(vdev, false);
     if (vdev->intx.mmap_timeout) {
         timer_mod(vdev->intx.mmap_timer,
-                       qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout);
+               qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout);
     }
 }
 
-static void vfio_eoi(VFIODevice *vdev)
+static void vfio_pci_eoi(VFIODevice *vdev)
 {
-    if (!vdev->intx.pending) {
+    VFIOPCIDevice *vpcidev = container_of(vdev, VFIOPCIDevice, vdev);
+
+    if (!vpcidev->intx.pending) {
         return;
     }
 
-    DPRINTF("%s(%04x:%02x:%02x.%x) EOI\n", __func__, vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function);
+    DPRINTF("%s(%04x:%02x:%02x.%x) EOI\n", __func__, vpcidev->host.domain,
+            vpcidev->host.bus, vpcidev->host.slot, vpcidev->host.function);
 
-    vdev->intx.pending = false;
-    pci_irq_deassert(&vdev->pdev);
-    vfio_unmask_intx(vdev);
+    vpcidev->intx.pending = false;
+    pci_irq_deassert(&vpcidev->pdev);
+    vfio_unmask_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX);
 }
 
-static void vfio_enable_intx_kvm(VFIODevice *vdev)
+static void vfio_enable_intx_kvm(VFIOPCIDevice *vdev)
 {
 #ifdef CONFIG_KVM
     struct kvm_irqfd irqfd = {
@@ -387,7 +261,7 @@ static void vfio_enable_intx_kvm(VFIODevice *vdev)
 
     /* Get to a known interrupt state */
     qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
-    vfio_mask_intx(vdev);
+    vfio_mask_int(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX);
     vdev->intx.pending = false;
     pci_irq_deassert(&vdev->pdev);
 
@@ -417,7 +291,7 @@ static void vfio_enable_intx_kvm(VFIODevice *vdev)
 
     *pfd = irqfd.resamplefd;
 
-    ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
     g_free(irq_set);
     if (ret) {
         error_report("vfio: Error: Failed to setup INTx unmask fd: %m");
@@ -425,7 +299,7 @@ static void vfio_enable_intx_kvm(VFIODevice *vdev)
     }
 
     /* Let'em rip */
-    vfio_unmask_intx(vdev);
+    vfio_unmask_irqindex(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX);
 
     vdev->intx.kvm_accel = true;
 
@@ -442,11 +316,11 @@ fail_irqfd:
     event_notifier_cleanup(&vdev->intx.unmask);
 fail:
     qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
-    vfio_unmask_intx(vdev);
+    vfio_unmask_irqindex(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX);
 #endif
 }
 
-static void vfio_disable_intx_kvm(VFIODevice *vdev)
+static void vfio_disable_intx_kvm(VFIOPCIDevice *vdev)
 {
 #ifdef CONFIG_KVM
     struct kvm_irqfd irqfd = {
@@ -463,7 +337,7 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev)
      * Get to a known state, hardware masked, QEMU ready to accept new
      * interrupts, QEMU IRQ de-asserted.
      */
-    vfio_mask_intx(vdev);
+    vfio_mask_int(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX);
     vdev->intx.pending = false;
     pci_irq_deassert(&vdev->pdev);
 
@@ -481,7 +355,7 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev)
     vdev->intx.kvm_accel = false;
 
     /* If we've missed an event, let it re-fire through QEMU */
-    vfio_unmask_intx(vdev);
+    vfio_unmask_irqindex(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX);
 
     DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel disabled\n",
             __func__, vdev->host.domain, vdev->host.bus,
@@ -491,14 +365,14 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev)
 
 static void vfio_update_irq(PCIDevice *pdev)
 {
-    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev);
     PCIINTxRoute route;
 
     if (vdev->interrupt != VFIO_INT_INTx) {
         return;
     }
 
-    route = pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin);
+    route = pci_device_route_intx_to_irq(pdev, vdev->intx.pin);
 
     if (!pci_intx_route_changed(&vdev->intx.route, &route)) {
         return; /* Nothing changed */
@@ -519,10 +393,10 @@ static void vfio_update_irq(PCIDevice *pdev)
     vfio_enable_intx_kvm(vdev);
 
     /* Re-enable the interrupt in cased we missed an EOI */
-    vfio_eoi(vdev);
+    vfio_pci_eoi(&vdev->vdev);
 }
 
-static int vfio_enable_intx(VFIODevice *vdev)
+static int vfio_enable_intx(VFIOPCIDevice *vdev)
 {
     uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
     int ret, argsz;
@@ -569,7 +443,7 @@ static int vfio_enable_intx(VFIODevice *vdev)
     *pfd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(*pfd, vfio_intx_interrupt, NULL, vdev);
 
-    ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
     g_free(irq_set);
     if (ret) {
         error_report("vfio: Error: Failed to setup INTx fd: %m");
@@ -588,13 +462,13 @@ static int vfio_enable_intx(VFIODevice *vdev)
     return 0;
 }
 
-static void vfio_disable_intx(VFIODevice *vdev)
+static void vfio_disable_intx(VFIOPCIDevice *vdev)
 {
     int fd;
 
     timer_del(vdev->intx.mmap_timer);
     vfio_disable_intx_kvm(vdev);
-    vfio_disable_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX);
+    vfio_disable_irqindex(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX);
     vdev->intx.pending = false;
     pci_irq_deassert(&vdev->pdev);
     vfio_mmap_set_enabled(vdev, true);
@@ -615,7 +489,7 @@ static void vfio_disable_intx(VFIODevice *vdev)
 static void vfio_msi_interrupt(void *opaque)
 {
     VFIOMSIVector *vector = opaque;
-    VFIODevice *vdev = vector->vdev;
+    VFIOPCIDevice *vdev = vector->vdev;
     int nr = vector - vdev->msi_vectors;
 
     if (!event_notifier_test_and_clear(&vector->interrupt)) {
@@ -647,7 +521,7 @@ static void vfio_msi_interrupt(void *opaque)
     }
 }
 
-static int vfio_enable_vectors(VFIODevice *vdev, bool msix)
+static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
 {
     struct vfio_irq_set *irq_set;
     int ret = 0, i, argsz;
@@ -672,7 +546,7 @@ static int vfio_enable_vectors(VFIODevice *vdev, bool msix)
         fds[i] = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
     }
 
-    ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
 
     g_free(irq_set);
 
@@ -682,7 +556,7 @@ static int vfio_enable_vectors(VFIODevice *vdev, bool msix)
 static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                                    MSIMessage *msg, IOHandler *handler)
 {
-    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev);
     VFIOMSIVector *vector;
     int ret;
 
@@ -723,7 +597,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
      * increase them as needed.
      */
     if (vdev->nr_vectors < nr + 1) {
-        vfio_disable_irqindex(vdev, VFIO_PCI_MSIX_IRQ_INDEX);
+        vfio_disable_irqindex(&vdev->vdev, VFIO_PCI_MSIX_IRQ_INDEX);
         vdev->nr_vectors = nr + 1;
         ret = vfio_enable_vectors(vdev, true);
         if (ret) {
@@ -747,7 +621,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
 
         *pfd = event_notifier_get_fd(&vector->interrupt);
 
-        ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
+        ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
         g_free(irq_set);
         if (ret) {
             error_report("vfio: failed to modify vector, %d", ret);
@@ -765,7 +639,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
 
 static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
 {
-    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev);
     VFIOMSIVector *vector = &vdev->msi_vectors[nr];
     int argsz;
     struct vfio_irq_set *irq_set;
@@ -795,7 +669,7 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
 
     *pfd = -1;
 
-    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
 
     g_free(irq_set);
 
@@ -813,7 +687,7 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
     vector->use = false;
 }
 
-static void vfio_enable_msix(VFIODevice *vdev)
+static void vfio_enable_msix(VFIOPCIDevice *vdev)
 {
     vfio_disable_interrupts(vdev);
 
@@ -846,7 +720,7 @@ static void vfio_enable_msix(VFIODevice *vdev)
             vdev->host.bus, vdev->host.slot, vdev->host.function);
 }
 
-static void vfio_enable_msi(VFIODevice *vdev)
+static void vfio_enable_msi(VFIOPCIDevice *vdev)
 {
     int ret, i;
 
@@ -923,7 +797,7 @@ retry:
             vdev->host.function, vdev->nr_vectors);
 }
 
-static void vfio_disable_msi_common(VFIODevice *vdev)
+static void vfio_disable_msi_common(VFIOPCIDevice *vdev)
 {
     g_free(vdev->msi_vectors);
     vdev->msi_vectors = NULL;
@@ -933,7 +807,7 @@ static void vfio_disable_msi_common(VFIODevice *vdev)
     vfio_enable_intx(vdev);
 }
 
-static void vfio_disable_msix(VFIODevice *vdev)
+static void vfio_disable_msix(VFIOPCIDevice *vdev)
 {
     int i;
 
@@ -950,7 +824,7 @@ static void vfio_disable_msix(VFIODevice *vdev)
     }
 
     if (vdev->nr_vectors) {
-        vfio_disable_irqindex(vdev, VFIO_PCI_MSIX_IRQ_INDEX);
+        vfio_disable_irqindex(&vdev->vdev, VFIO_PCI_MSIX_IRQ_INDEX);
     }
 
     vfio_disable_msi_common(vdev);
@@ -959,11 +833,11 @@ static void vfio_disable_msix(VFIODevice *vdev)
             vdev->host.bus, vdev->host.slot, vdev->host.function);
 }
 
-static void vfio_disable_msi(VFIODevice *vdev)
+static void vfio_disable_msi(VFIOPCIDevice *vdev)
 {
     int i;
 
-    vfio_disable_irqindex(vdev, VFIO_PCI_MSI_IRQ_INDEX);
+    vfio_disable_irqindex(&vdev->vdev, VFIO_PCI_MSI_IRQ_INDEX);
 
     for (i = 0; i < vdev->nr_vectors; i++) {
         VFIOMSIVector *vector = &vdev->msi_vectors[i];
@@ -991,7 +865,7 @@ static void vfio_disable_msi(VFIODevice *vdev)
             vdev->host.bus, vdev->host.slot, vdev->host.function);
 }
 
-static void vfio_update_msi(VFIODevice *vdev)
+static void vfio_update_msi(VFIOPCIDevice *vdev)
 {
     int i;
 
@@ -1018,119 +892,17 @@ static void vfio_update_msi(VFIODevice *vdev)
     }
 }
 
-/*
- * IO Port/MMIO - Beware of the endians, VFIO is always little endian
- */
-static void vfio_bar_write(void *opaque, hwaddr addr,
-                           uint64_t data, unsigned size)
-{
-    VFIOBAR *bar = opaque;
-    union {
-        uint8_t byte;
-        uint16_t word;
-        uint32_t dword;
-        uint64_t qword;
-    } buf;
-
-    switch (size) {
-    case 1:
-        buf.byte = data;
-        break;
-    case 2:
-        buf.word = cpu_to_le16(data);
-        break;
-    case 4:
-        buf.dword = cpu_to_le32(data);
-        break;
-    default:
-        hw_error("vfio: unsupported write size, %d bytes", size);
-        break;
-    }
-
-    if (pwrite(bar->fd, &buf, size, bar->fd_offset + addr) != size) {
-        error_report("%s(,0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m",
-                     __func__, addr, data, size);
-    }
-
-#ifdef DEBUG_VFIO
-    {
-        VFIODevice *vdev = container_of(bar, VFIODevice, bars[bar->nr]);
-
-        DPRINTF("%s(%04x:%02x:%02x.%x:BAR%d+0x%"HWADDR_PRIx", 0x%"PRIx64
-                ", %d)\n", __func__, vdev->host.domain, vdev->host.bus,
-                vdev->host.slot, vdev->host.function, bar->nr, addr,
-                data, size);
-    }
-#endif
-
-    /*
-     * A read or write to a BAR always signals an INTx EOI.  This will
-     * do nothing if not pending (including not in INTx mode).  We assume
-     * that a BAR access is in response to an interrupt and that BAR
-     * accesses will service the interrupt.  Unfortunately, we don't know
-     * which access will service the interrupt, so we're potentially
-     * getting quite a few host interrupts per guest interrupt.
-     */
-    vfio_eoi(container_of(bar, VFIODevice, bars[bar->nr]));
-}
-
-static uint64_t vfio_bar_read(void *opaque,
-                              hwaddr addr, unsigned size)
-{
-    VFIOBAR *bar = opaque;
-    union {
-        uint8_t byte;
-        uint16_t word;
-        uint32_t dword;
-        uint64_t qword;
-    } buf;
-    uint64_t data = 0;
-
-    if (pread(bar->fd, &buf, size, bar->fd_offset + addr) != size) {
-        error_report("%s(,0x%"HWADDR_PRIx", %d) failed: %m",
-                     __func__, addr, size);
-        return (uint64_t)-1;
-    }
-
-    switch (size) {
-    case 1:
-        data = buf.byte;
-        break;
-    case 2:
-        data = le16_to_cpu(buf.word);
-        break;
-    case 4:
-        data = le32_to_cpu(buf.dword);
-        break;
-    default:
-        hw_error("vfio: unsupported read size, %d bytes", size);
-        break;
-    }
-
-#ifdef DEBUG_VFIO
-    {
-        VFIODevice *vdev = container_of(bar, VFIODevice, bars[bar->nr]);
-
-        DPRINTF("%s(%04x:%02x:%02x.%x:BAR%d+0x%"HWADDR_PRIx
-                ", %d) = 0x%"PRIx64"\n", __func__, vdev->host.domain,
-                vdev->host.bus, vdev->host.slot, vdev->host.function,
-                bar->nr, addr, size, data);
-    }
-#endif
-
-    /* Same as write above */
-    vfio_eoi(container_of(bar, VFIODevice, bars[bar->nr]));
-
-    return data;
-}
 
 static const MemoryRegionOps vfio_bar_ops = {
-    .read = vfio_bar_read,
-    .write = vfio_bar_write,
+    .read = vfio_region_read,
+    .write = vfio_region_write,
     .endianness = DEVICE_LITTLE_ENDIAN,
 };
 
-static void vfio_pci_load_rom(VFIODevice *vdev)
+
+/* PCI ONLY FUNCTIONS */
+
+static void vfio_pci_load_rom(VFIOPCIDevice *vdev)
 {
     struct vfio_region_info reg_info = {
         .argsz = sizeof(reg_info),
@@ -1139,8 +911,9 @@ static void vfio_pci_load_rom(VFIODevice *vdev)
     uint64_t size;
     off_t off = 0;
     size_t bytes;
+    int fd = vdev->vdev.fd;
 
-    if (ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info)) {
+    if (ioctl(fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info)) {
         error_report("vfio: Error getting ROM info: %m");
         return;
     }
@@ -1170,7 +943,7 @@ static void vfio_pci_load_rom(VFIODevice *vdev)
     memset(vdev->rom, 0xff, size);
 
     while (size) {
-        bytes = pread(vdev->fd, vdev->rom + off, size, vdev->rom_offset + off);
+        bytes = pread(fd, vdev->rom + off, size, vdev->rom_offset + off);
         if (bytes == 0) {
             break;
         } else if (bytes > 0) {
@@ -1188,7 +961,7 @@ static void vfio_pci_load_rom(VFIODevice *vdev)
 
 static uint64_t vfio_rom_read(void *opaque, hwaddr addr, unsigned size)
 {
-    VFIODevice *vdev = opaque;
+    VFIOPCIDevice *vdev = opaque;
     uint64_t val = ((uint64_t)1 << (size * 8)) - 1;
 
     /* Load the ROM lazily when the guest tries to read it */
@@ -1217,7 +990,7 @@ static const MemoryRegionOps vfio_rom_ops = {
     .endianness = DEVICE_LITTLE_ENDIAN,
 };
 
-static bool vfio_blacklist_opt_rom(VFIODevice *vdev)
+static bool vfio_blacklist_opt_rom(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = &vdev->pdev;
     uint16_t vendor_id, device_id;
@@ -1237,12 +1010,13 @@ static bool vfio_blacklist_opt_rom(VFIODevice *vdev)
     return false;
 }
 
-static void vfio_pci_size_rom(VFIODevice *vdev)
+static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
 {
     uint32_t orig, size = cpu_to_le32((uint32_t)PCI_ROM_ADDRESS_MASK);
     off_t offset = vdev->config_offset + PCI_ROM_ADDRESS;
     DeviceState *dev = DEVICE(vdev);
     char name[32];
+    int fd = vdev->vdev.fd;
 
     if (vdev->pdev.romfile || !vdev->pdev.rom_bar) {
         /* Since pci handles romfile, just print a message and return */
@@ -1261,10 +1035,10 @@ static void vfio_pci_size_rom(VFIODevice *vdev)
      * Use the same size ROM BAR as the physical device.  The contents
      * will get filled in later when the guest tries to read it.
      */
-    if (pread(vdev->fd, &orig, 4, offset) != 4 ||
-        pwrite(vdev->fd, &size, 4, offset) != 4 ||
-        pread(vdev->fd, &size, 4, offset) != 4 ||
-        pwrite(vdev->fd, &orig, 4, offset) != 4) {
+    if (pread(fd, &orig, 4, offset) != 4 ||
+        pwrite(fd, &size, 4, offset) != 4 ||
+        pread(fd, &size, 4, offset) != 4 ||
+        pwrite(fd, &orig, 4, offset) != 4) {
         error_report("%s(%04x:%02x:%02x.%x) failed: %m",
                      __func__, vdev->host.domain, vdev->host.bus,
                      vdev->host.slot, vdev->host.function);
@@ -1416,7 +1190,7 @@ static uint64_t vfio_generic_window_quirk_read(void *opaque,
                                                hwaddr addr, unsigned size)
 {
     VFIOQuirk *quirk = opaque;
-    VFIODevice *vdev = quirk->vdev;
+    VFIOPCIDevice *vdev = quirk->vdev;
     uint64_t data;
 
     if (vfio_flags_enabled(quirk->data.flags, quirk->data.read_flags) &&
@@ -1438,7 +1212,7 @@ static uint64_t vfio_generic_window_quirk_read(void *opaque,
                 vdev->host.bus, vdev->host.slot, vdev->host.function,
                 quirk->data.bar, addr, size, data);
     } else {
-        data = vfio_bar_read(&vdev->bars[quirk->data.bar],
+        data = vfio_region_read(&vdev->bars[quirk->data.bar].region,
                              addr + quirk->data.base_offset, size);
     }
 
@@ -1449,7 +1223,7 @@ static void vfio_generic_window_quirk_write(void *opaque, hwaddr addr,
                                             uint64_t data, unsigned size)
 {
     VFIOQuirk *quirk = opaque;
-    VFIODevice *vdev = quirk->vdev;
+    VFIOPCIDevice *vdev = quirk->vdev;
 
     if (ranges_overlap(addr, size,
                        quirk->data.address_offset, quirk->data.address_size)) {
@@ -1489,7 +1263,7 @@ static void vfio_generic_window_quirk_write(void *opaque, hwaddr addr,
         return;
     }
 
-    vfio_bar_write(&vdev->bars[quirk->data.bar],
+    vfio_region_write(&vdev->bars[quirk->data.bar].region,
                    addr + quirk->data.base_offset, data, size);
 }
 
@@ -1503,7 +1277,7 @@ static uint64_t vfio_generic_quirk_read(void *opaque,
                                         hwaddr addr, unsigned size)
 {
     VFIOQuirk *quirk = opaque;
-    VFIODevice *vdev = quirk->vdev;
+    VFIOPCIDevice *vdev = quirk->vdev;
     hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
     hwaddr offset = quirk->data.address_match & ~TARGET_PAGE_MASK;
     uint64_t data;
@@ -1523,7 +1297,8 @@ static uint64_t vfio_generic_quirk_read(void *opaque,
                 vdev->host.bus, vdev->host.slot, vdev->host.function,
                 quirk->data.bar, addr + base, size, data);
     } else {
-        data = vfio_bar_read(&vdev->bars[quirk->data.bar], addr + base, size);
+        data = vfio_region_read(&vdev->bars[quirk->data.bar].region,
+                                addr + base, size);
     }
 
     return data;
@@ -1533,7 +1308,7 @@ static void vfio_generic_quirk_write(void *opaque, hwaddr addr,
                                      uint64_t data, unsigned size)
 {
     VFIOQuirk *quirk = opaque;
-    VFIODevice *vdev = quirk->vdev;
+    VFIOPCIDevice *vdev = quirk->vdev;
     hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
     hwaddr offset = quirk->data.address_match & ~TARGET_PAGE_MASK;
 
@@ -1552,7 +1327,8 @@ static void vfio_generic_quirk_write(void *opaque, hwaddr addr,
                 vdev->host.domain, vdev->host.bus, vdev->host.slot,
                 vdev->host.function, quirk->data.bar, addr + base, data, size);
     } else {
-        vfio_bar_write(&vdev->bars[quirk->data.bar], addr + base, data, size);
+        vfio_region_write(&vdev->bars[quirk->data.bar].region, addr + base,
+                          data, size);
     }
 }
 
@@ -1578,7 +1354,7 @@ static uint64_t vfio_ati_3c3_quirk_read(void *opaque,
                                         hwaddr addr, unsigned size)
 {
     VFIOQuirk *quirk = opaque;
-    VFIODevice *vdev = quirk->vdev;
+    VFIOPCIDevice *vdev = quirk->vdev;
     uint64_t data = vfio_pci_read_config(&vdev->pdev,
                                          PCI_BASE_ADDRESS_0 + (4 * 4) + 1,
                                          size);
@@ -1592,7 +1368,7 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
     .endianness = DEVICE_LITTLE_ENDIAN,
 };
 
-static void vfio_vga_probe_ati_3c3_quirk(VFIODevice *vdev)
+static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = &vdev->pdev;
     VFIOQuirk *quirk;
@@ -1605,7 +1381,7 @@ static void vfio_vga_probe_ati_3c3_quirk(VFIODevice *vdev)
      * As long as the BAR is >= 256 bytes it will be aligned such that the
      * lower byte is always zero.  Filter out anything else, if it exists.
      */
-    if (!vdev->bars[4].ioport || vdev->bars[4].size < 256) {
+    if (!vdev->bars[4].ioport || vdev->bars[4].region.size < 256) {
         return;
     }
 
@@ -1635,7 +1411,7 @@ static void vfio_vga_probe_ati_3c3_quirk(VFIODevice *vdev)
  * that only read-only access is provided, but we drop writes when the window
  * is enabled to config space nonetheless.
  */
-static void vfio_probe_ati_bar4_window_quirk(VFIODevice *vdev, int nr)
+static void vfio_probe_ati_bar4_window_quirk(VFIOPCIDevice *vdev, int nr)
 {
     PCIDevice *pdev = &vdev->pdev;
     VFIOQuirk *quirk;
@@ -1658,7 +1434,7 @@ static void vfio_probe_ati_bar4_window_quirk(VFIODevice *vdev, int nr)
     memory_region_init_io(&quirk->mem, OBJECT(vdev),
                           &vfio_generic_window_quirk, quirk,
                           "vfio-ati-bar4-window-quirk", 8);
-    memory_region_add_subregion_overlap(&vdev->bars[nr].mem,
+    memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
                           quirk->data.base_offset, &quirk->mem, 1);
 
     QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
@@ -1671,7 +1447,7 @@ static void vfio_probe_ati_bar4_window_quirk(VFIODevice *vdev, int nr)
 /*
  * Trap the BAR2 MMIO window to config space as well.
  */
-static void vfio_probe_ati_bar2_4000_quirk(VFIODevice *vdev, int nr)
+static void vfio_probe_ati_bar2_4000_quirk(VFIOPCIDevice *vdev, int nr)
 {
     PCIDevice *pdev = &vdev->pdev;
     VFIOQuirk *quirk;
@@ -1692,7 +1468,7 @@ static void vfio_probe_ati_bar2_4000_quirk(VFIODevice *vdev, int nr)
     memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk,
                           "vfio-ati-bar2-4000-quirk",
                           TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
-    memory_region_add_subregion_overlap(&vdev->bars[nr].mem,
+    memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
                           quirk->data.address_match & TARGET_PAGE_MASK,
                           &quirk->mem, 1);
 
@@ -1739,7 +1515,7 @@ static uint64_t vfio_nvidia_3d0_quirk_read(void *opaque,
                                            hwaddr addr, unsigned size)
 {
     VFIOQuirk *quirk = opaque;
-    VFIODevice *vdev = quirk->vdev;
+    VFIOPCIDevice *vdev = quirk->vdev;
     PCIDevice *pdev = &vdev->pdev;
     uint64_t data = vfio_vga_read(&vdev->vga.region[QEMU_PCI_VGA_IO_HI],
                                   addr + quirk->data.base_offset, size);
@@ -1758,7 +1534,7 @@ static void vfio_nvidia_3d0_quirk_write(void *opaque, hwaddr addr,
                                         uint64_t data, unsigned size)
 {
     VFIOQuirk *quirk = opaque;
-    VFIODevice *vdev = quirk->vdev;
+    VFIOPCIDevice *vdev = quirk->vdev;
     PCIDevice *pdev = &vdev->pdev;
 
     switch (quirk->data.flags) {
@@ -1805,13 +1581,13 @@ static const MemoryRegionOps vfio_nvidia_3d0_quirk = {
     .endianness = DEVICE_LITTLE_ENDIAN,
 };
 
-static void vfio_vga_probe_nvidia_3d0_quirk(VFIODevice *vdev)
+static void vfio_vga_probe_nvidia_3d0_quirk(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = &vdev->pdev;
     VFIOQuirk *quirk;
 
     if (pci_get_word(pdev->config + PCI_VENDOR_ID) != PCI_VENDOR_ID_NVIDIA ||
-        !vdev->bars[1].size) {
+        !vdev->bars[1].region.size) {
         return;
     }
 
@@ -1897,7 +1673,7 @@ static const MemoryRegionOps vfio_nvidia_bar5_window_quirk = {
     .endianness = DEVICE_LITTLE_ENDIAN,
 };
 
-static void vfio_probe_nvidia_bar5_window_quirk(VFIODevice *vdev, int nr)
+static void vfio_probe_nvidia_bar5_window_quirk(VFIOPCIDevice *vdev, int nr)
 {
     PCIDevice *pdev = &vdev->pdev;
     VFIOQuirk *quirk;
@@ -1919,7 +1695,8 @@ static void vfio_probe_nvidia_bar5_window_quirk(VFIODevice *vdev, int nr)
     memory_region_init_io(&quirk->mem, OBJECT(vdev),
                           &vfio_nvidia_bar5_window_quirk, quirk,
                           "vfio-nvidia-bar5-window-quirk", 16);
-    memory_region_add_subregion_overlap(&vdev->bars[nr].mem, 0, &quirk->mem, 1);
+    memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem, 0,
+                                        &quirk->mem, 1);
 
     QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
 
@@ -1932,7 +1709,7 @@ static void vfio_nvidia_88000_quirk_write(void *opaque, hwaddr addr,
                                           uint64_t data, unsigned size)
 {
     VFIOQuirk *quirk = opaque;
-    VFIODevice *vdev = quirk->vdev;
+    VFIOPCIDevice *vdev = quirk->vdev;
     PCIDevice *pdev = &vdev->pdev;
     hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
 
@@ -1946,7 +1723,8 @@ static void vfio_nvidia_88000_quirk_write(void *opaque, hwaddr addr,
      */
     if ((pdev->cap_present & QEMU_PCI_CAP_MSI) &&
         vfio_range_contained(addr, size, pdev->msi_cap, PCI_MSI_FLAGS)) {
-        vfio_bar_write(&vdev->bars[quirk->data.bar], addr + base, data, size);
+        vfio_region_write(&vdev->bars[quirk->data.bar].region,
+                          addr + base, data, size);
     }
 }
 
@@ -1965,7 +1743,7 @@ static const MemoryRegionOps vfio_nvidia_88000_quirk = {
  *
  * Here's offset 0x88000...
  */
-static void vfio_probe_nvidia_bar0_88000_quirk(VFIODevice *vdev, int nr)
+static void vfio_probe_nvidia_bar0_88000_quirk(VFIOPCIDevice *vdev, int nr)
 {
     PCIDevice *pdev = &vdev->pdev;
     VFIOQuirk *quirk;
@@ -1985,7 +1763,7 @@ static void vfio_probe_nvidia_bar0_88000_quirk(VFIODevice *vdev, int nr)
     memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_nvidia_88000_quirk,
                           quirk, "vfio-nvidia-bar0-88000-quirk",
                           TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
-    memory_region_add_subregion_overlap(&vdev->bars[nr].mem,
+    memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
                           quirk->data.address_match & TARGET_PAGE_MASK,
                           &quirk->mem, 1);
 
@@ -1999,7 +1777,7 @@ static void vfio_probe_nvidia_bar0_88000_quirk(VFIODevice *vdev, int nr)
 /*
  * And here's the same for BAR0 offset 0x1800...
  */
-static void vfio_probe_nvidia_bar0_1800_quirk(VFIODevice *vdev, int nr)
+static void vfio_probe_nvidia_bar0_1800_quirk(VFIOPCIDevice *vdev, int nr)
 {
     PCIDevice *pdev = &vdev->pdev;
     VFIOQuirk *quirk;
@@ -2011,7 +1789,8 @@ static void vfio_probe_nvidia_bar0_1800_quirk(VFIODevice *vdev, int nr)
 
     /* Log the chipset ID */
     DPRINTF("Nvidia NV%02x\n",
-            (unsigned int)(vfio_bar_read(&vdev->bars[0], 0, 4) >> 20) & 0xff);
+            (unsigned int)(vfio_region_read(&vdev->bars[0].region, 0, 4) >> 20)
+            & 0xff);
 
     quirk = g_malloc0(sizeof(*quirk));
     quirk->vdev = vdev;
@@ -2023,7 +1802,7 @@ static void vfio_probe_nvidia_bar0_1800_quirk(VFIODevice *vdev, int nr)
     memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk,
                           "vfio-nvidia-bar0-1800-quirk",
                           TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
-    memory_region_add_subregion_overlap(&vdev->bars[nr].mem,
+    memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
                           quirk->data.address_match & TARGET_PAGE_MASK,
                           &quirk->mem, 1);
 
@@ -2043,13 +1822,13 @@ static void vfio_probe_nvidia_bar0_1800_quirk(VFIODevice *vdev, int nr)
 /*
  * Common quirk probe entry points.
  */
-static void vfio_vga_quirk_setup(VFIODevice *vdev)
+static void vfio_vga_quirk_setup(VFIOPCIDevice *vdev)
 {
     vfio_vga_probe_ati_3c3_quirk(vdev);
     vfio_vga_probe_nvidia_3d0_quirk(vdev);
 }
 
-static void vfio_vga_quirk_teardown(VFIODevice *vdev)
+static void vfio_vga_quirk_teardown(VFIOPCIDevice *vdev)
 {
     int i;
 
@@ -2064,7 +1843,7 @@ static void vfio_vga_quirk_teardown(VFIODevice *vdev)
     }
 }
 
-static void vfio_bar_quirk_setup(VFIODevice *vdev, int nr)
+static void vfio_bar_quirk_setup(VFIOPCIDevice *vdev, int nr)
 {
     vfio_probe_ati_bar4_window_quirk(vdev, nr);
     vfio_probe_ati_bar2_4000_quirk(vdev, nr);
@@ -2073,13 +1852,13 @@ static void vfio_bar_quirk_setup(VFIODevice *vdev, int nr)
     vfio_probe_nvidia_bar0_1800_quirk(vdev, nr);
 }
 
-static void vfio_bar_quirk_teardown(VFIODevice *vdev, int nr)
+static void vfio_bar_quirk_teardown(VFIOPCIDevice *vdev, int nr)
 {
     VFIOBAR *bar = &vdev->bars[nr];
 
     while (!QLIST_EMPTY(&bar->quirks)) {
         VFIOQuirk *quirk = QLIST_FIRST(&bar->quirks);
-        memory_region_del_subregion(&bar->mem, &quirk->mem);
+        memory_region_del_subregion(&bar->region.mem, &quirk->mem);
         memory_region_destroy(&quirk->mem);
         QLIST_REMOVE(quirk, next);
         g_free(quirk);
@@ -2091,7 +1870,7 @@ static void vfio_bar_quirk_teardown(VFIODevice *vdev, int nr)
  */
 static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 {
-    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev);
     uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
 
     memcpy(&emu_bits, vdev->emulated_config_bits + addr, len);
@@ -2104,7 +1883,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
     if (~emu_bits & (0xffffffffU >> (32 - len * 8))) {
         ssize_t ret;
 
-        ret = pread(vdev->fd, &phys_val, len, vdev->config_offset + addr);
+        ret = pread(vdev->vdev.fd, &phys_val, len, vdev->config_offset + addr);
         if (ret != len) {
             error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %m",
                          __func__, vdev->host.domain, vdev->host.bus,
@@ -2126,15 +1905,15 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
                                   uint32_t val, int len)
 {
-    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev);
     uint32_t val_le = cpu_to_le32(val);
 
-    DPRINTF("%s(%04x:%02x:%02x.%x, @0x%x, 0x%x, len=0x%x)\n", __func__,
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function, addr, val, len);
+    DPRINTF("%s(%s, @0x%x, 0x%x, len=0x%x)\n", __func__, vdev->vdev.name,
+            addr, val, len);
 
     /* Write everything to VFIO, let it filter out what we can't write */
-    if (pwrite(vdev->fd, &val_le, len, vdev->config_offset + addr) != len) {
+    if (pwrite(vdev->vdev.fd, &val_le, len,
+               vdev->config_offset + addr) != len) {
         error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x, 0x%x) failed: %m",
                      __func__, vdev->host.domain, vdev->host.bus,
                      vdev->host.slot, vdev->host.function, addr, val, len);
@@ -2180,186 +1959,9 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
 }
 
 /*
- * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
- */
-static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size)
-{
-    struct vfio_iommu_type1_dma_unmap unmap = {
-        .argsz = sizeof(unmap),
-        .flags = 0,
-        .iova = iova,
-        .size = size,
-    };
-
-    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
-        DPRINTF("VFIO_UNMAP_DMA: %d\n", -errno);
-        return -errno;
-    }
-
-    return 0;
-}
-
-static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                        ram_addr_t size, void *vaddr, bool readonly)
-{
-    struct vfio_iommu_type1_dma_map map = {
-        .argsz = sizeof(map),
-        .flags = VFIO_DMA_MAP_FLAG_READ,
-        .vaddr = (__u64)(uintptr_t)vaddr,
-        .iova = iova,
-        .size = size,
-    };
-
-    if (!readonly) {
-        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
-    }
-
-    /*
-     * Try the mapping, if it fails with EBUSY, unmap the region and try
-     * again.  This shouldn't be necessary, but we sometimes see it in
-     * the the VGA ROM space.
-     */
-    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
-         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
-        return 0;
-    }
-
-    DPRINTF("VFIO_MAP_DMA: %d\n", -errno);
-    return -errno;
-}
-
-static bool vfio_listener_skipped_section(MemoryRegionSection *section)
-{
-    return !memory_region_is_ram(section->mr) ||
-           /*
-            * Sizing an enabled 64-bit BAR can cause spurious mappings to
-            * addresses in the upper part of the 64-bit address space.  These
-            * are never accessed by the CPU and beyond the address width of
-            * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
-            */
-           section->offset_within_address_space & (1ULL << 63);
-}
-
-static void vfio_listener_region_add(MemoryListener *listener,
-                                     MemoryRegionSection *section)
-{
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            iommu_data.type1.listener);
-    hwaddr iova, end;
-    void *vaddr;
-    int ret;
-
-    assert(!memory_region_is_iommu(section->mr));
-
-    if (vfio_listener_skipped_section(section)) {
-        DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
-                section->offset_within_address_space,
-                section->offset_within_address_space +
-                int128_get64(int128_sub(section->size, int128_one())));
-        return;
-    }
-
-    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
-                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
-        error_report("%s received unaligned region", __func__);
-        return;
-    }
-
-    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
-    end = (section->offset_within_address_space + int128_get64(section->size)) &
-          TARGET_PAGE_MASK;
-
-    if (iova >= end) {
-        return;
-    }
-
-    vaddr = memory_region_get_ram_ptr(section->mr) +
-            section->offset_within_region +
-            (iova - section->offset_within_address_space);
-
-    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
-            iova, end - 1, vaddr);
-
-    memory_region_ref(section->mr);
-    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
-    if (ret) {
-        error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                     "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                     container, iova, end - iova, vaddr, ret);
-
-        /*
-         * On the initfn path, store the first error in the container so we
-         * can gracefully fail.  Runtime, there's not much we can do other
-         * than throw a hardware error.
-         */
-        if (!container->iommu_data.type1.initialized) {
-            if (!container->iommu_data.type1.error) {
-                container->iommu_data.type1.error = ret;
-            }
-        } else {
-            hw_error("vfio: DMA mapping failed, unable to continue");
-        }
-    }
-}
-
-static void vfio_listener_region_del(MemoryListener *listener,
-                                     MemoryRegionSection *section)
-{
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            iommu_data.type1.listener);
-    hwaddr iova, end;
-    int ret;
-
-    if (vfio_listener_skipped_section(section)) {
-        DPRINTF("SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n",
-                section->offset_within_address_space,
-                section->offset_within_address_space +
-                int128_get64(int128_sub(section->size, int128_one())));
-        return;
-    }
-
-    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
-                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
-        error_report("%s received unaligned region", __func__);
-        return;
-    }
-
-    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
-    end = (section->offset_within_address_space + int128_get64(section->size)) &
-          TARGET_PAGE_MASK;
-
-    if (iova >= end) {
-        return;
-    }
-
-    DPRINTF("region_del %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
-            iova, end - 1);
-
-    ret = vfio_dma_unmap(container, iova, end - iova);
-    memory_region_unref(section->mr);
-    if (ret) {
-        error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                     "0x%"HWADDR_PRIx") = %d (%m)",
-                     container, iova, end - iova, ret);
-    }
-}
-
-static MemoryListener vfio_memory_listener = {
-    .region_add = vfio_listener_region_add,
-    .region_del = vfio_listener_region_del,
-};
-
-static void vfio_listener_release(VFIOContainer *container)
-{
-    memory_listener_unregister(&container->iommu_data.type1.listener);
-}
-
-/*
  * Interrupt setup
  */
-static void vfio_disable_interrupts(VFIODevice *vdev)
+static void vfio_disable_interrupts(VFIOPCIDevice *vdev)
 {
     switch (vdev->interrupt) {
     case VFIO_INT_INTx:
@@ -2374,13 +1976,13 @@ static void vfio_disable_interrupts(VFIODevice *vdev)
     }
 }
 
-static int vfio_setup_msi(VFIODevice *vdev, int pos)
+static int vfio_setup_msi(VFIOPCIDevice *vdev, int pos)
 {
     uint16_t ctrl;
     bool msi_64bit, msi_maskbit;
     int ret, entries;
 
-    if (pread(vdev->fd, &ctrl, sizeof(ctrl),
+    if (pread(vdev->vdev.fd, &ctrl, sizeof(ctrl),
               vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
         return -errno;
     }
@@ -2414,28 +2016,29 @@ static int vfio_setup_msi(VFIODevice *vdev, int pos)
  * need to first look for where the MSI-X table lives.  So we
  * unfortunately split MSI-X setup across two functions.
  */
-static int vfio_early_setup_msix(VFIODevice *vdev)
+static int vfio_early_setup_msix(VFIOPCIDevice *vdev)
 {
     uint8_t pos;
     uint16_t ctrl;
     uint32_t table, pba;
+    int fd = vdev->vdev.fd;
 
     pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSIX);
     if (!pos) {
         return 0;
     }
 
-    if (pread(vdev->fd, &ctrl, sizeof(ctrl),
+    if (pread(fd, &ctrl, sizeof(ctrl),
               vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
         return -errno;
     }
 
-    if (pread(vdev->fd, &table, sizeof(table),
+    if (pread(fd, &table, sizeof(table),
               vdev->config_offset + pos + PCI_MSIX_TABLE) != sizeof(table)) {
         return -errno;
     }
 
-    if (pread(vdev->fd, &pba, sizeof(pba),
+    if (pread(fd, &pba, sizeof(pba),
               vdev->config_offset + pos + PCI_MSIX_PBA) != sizeof(pba)) {
         return -errno;
     }
@@ -2460,14 +2063,14 @@ static int vfio_early_setup_msix(VFIODevice *vdev)
     return 0;
 }
 
-static int vfio_setup_msix(VFIODevice *vdev, int pos)
+static int vfio_setup_msix(VFIOPCIDevice *vdev, int pos)
 {
     int ret;
 
     ret = msix_init(&vdev->pdev, vdev->msix->entries,
-                    &vdev->bars[vdev->msix->table_bar].mem,
+                    &vdev->bars[vdev->msix->table_bar].region.mem,
                     vdev->msix->table_bar, vdev->msix->table_offset,
-                    &vdev->bars[vdev->msix->pba_bar].mem,
+                    &vdev->bars[vdev->msix->pba_bar].region.mem,
                     vdev->msix->pba_bar, vdev->msix->pba_offset, pos);
     if (ret < 0) {
         if (ret == -ENOTSUP) {
@@ -2480,102 +2083,64 @@ static int vfio_setup_msix(VFIODevice *vdev, int pos)
     return 0;
 }
 
-static void vfio_teardown_msi(VFIODevice *vdev)
+static void vfio_teardown_msi(VFIOPCIDevice *vdev)
 {
     msi_uninit(&vdev->pdev);
 
     if (vdev->msix) {
-        msix_uninit(&vdev->pdev, &vdev->bars[vdev->msix->table_bar].mem,
-                    &vdev->bars[vdev->msix->pba_bar].mem);
+        msix_uninit(&vdev->pdev, &vdev->bars[vdev->msix->table_bar].region.mem,
+                    &vdev->bars[vdev->msix->pba_bar].region.mem);
     }
 }
 
 /*
  * Resource setup
  */
-static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled)
+static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled)
 {
     int i;
 
     for (i = 0; i < PCI_ROM_SLOT; i++) {
         VFIOBAR *bar = &vdev->bars[i];
 
-        if (!bar->size) {
+        if (!bar->region.size) {
             continue;
         }
 
-        memory_region_set_enabled(&bar->mmap_mem, enabled);
+        memory_region_set_enabled(&bar->region.mmap_mem, enabled);
         if (vdev->msix && vdev->msix->table_bar == i) {
             memory_region_set_enabled(&vdev->msix->mmap_mem, enabled);
         }
     }
 }
 
-static void vfio_unmap_bar(VFIODevice *vdev, int nr)
+static void vfio_unmap_bar(VFIOPCIDevice *vdev, int nr)
 {
     VFIOBAR *bar = &vdev->bars[nr];
 
-    if (!bar->size) {
+    if (!bar->region.size) {
         return;
     }
 
     vfio_bar_quirk_teardown(vdev, nr);
 
-    memory_region_del_subregion(&bar->mem, &bar->mmap_mem);
-    munmap(bar->mmap, memory_region_size(&bar->mmap_mem));
-    memory_region_destroy(&bar->mmap_mem);
+    memory_region_del_subregion(&bar->region.mem, &bar->region.mmap_mem);
+    munmap(bar->region.mmap, memory_region_size(&bar->region.mmap_mem));
+    memory_region_destroy(&bar->region.mmap_mem);
 
     if (vdev->msix && vdev->msix->table_bar == nr) {
-        memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
+        memory_region_del_subregion(&bar->region.mem, &vdev->msix->mmap_mem);
         munmap(vdev->msix->mmap, memory_region_size(&vdev->msix->mmap_mem));
         memory_region_destroy(&vdev->msix->mmap_mem);
     }
 
-    memory_region_destroy(&bar->mem);
+    memory_region_destroy(&bar->region.mem);
 }
 
-static int vfio_mmap_bar(VFIODevice *vdev, VFIOBAR *bar,
-                         MemoryRegion *mem, MemoryRegion *submem,
-                         void **map, size_t size, off_t offset,
-                         const char *name)
-{
-    int ret = 0;
-
-    if (VFIO_ALLOW_MMAP && size && bar->flags & VFIO_REGION_INFO_FLAG_MMAP) {
-        int prot = 0;
-
-        if (bar->flags & VFIO_REGION_INFO_FLAG_READ) {
-            prot |= PROT_READ;
-        }
-
-        if (bar->flags & VFIO_REGION_INFO_FLAG_WRITE) {
-            prot |= PROT_WRITE;
-        }
-
-        *map = mmap(NULL, size, prot, MAP_SHARED,
-                    bar->fd, bar->fd_offset + offset);
-        if (*map == MAP_FAILED) {
-            *map = NULL;
-            ret = -errno;
-            goto empty_region;
-        }
-
-        memory_region_init_ram_ptr(submem, OBJECT(vdev), name, size, *map);
-    } else {
-empty_region:
-        /* Create a zero sized sub-region to make cleanup easy. */
-        memory_region_init(submem, OBJECT(vdev), name, 0);
-    }
-
-    memory_region_add_subregion(mem, offset, submem);
-
-    return ret;
-}
-
-static void vfio_map_bar(VFIODevice *vdev, int nr)
+static void vfio_map_bar(VFIOPCIDevice *vdev, int nr)
 {
     VFIOBAR *bar = &vdev->bars[nr];
-    unsigned size = bar->size;
+    unsigned size = bar->region.size;
     char name[64];
     uint32_t pci_bar;
     uint8_t type;
@@ -2591,7 +2156,7 @@ static void vfio_map_bar(VFIODevice *vdev, int nr)
              vdev->host.function, nr);
 
     /* Determine what type of BAR this is for registration */
-    ret = pread(vdev->fd, &pci_bar, sizeof(pci_bar),
+    ret = pread(vdev->vdev.fd, &pci_bar, sizeof(pci_bar),
                 vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
     if (ret != sizeof(pci_bar)) {
         error_report("vfio: Failed to read BAR %d (%m)", nr);
@@ -2605,9 +2170,9 @@ static void vfio_map_bar(VFIODevice *vdev, int nr)
                                     ~PCI_BASE_ADDRESS_MEM_MASK);
 
     /* A "slow" read/write mapping underlies all BARs */
-    memory_region_init_io(&bar->mem, OBJECT(vdev), &vfio_bar_ops,
-                          bar, name, size);
-    pci_register_bar(&vdev->pdev, nr, type, &bar->mem);
+    memory_region_init_io(&bar->region.mem, OBJECT(vdev), &vfio_bar_ops,
+                          &bar->region, name, size);
+    pci_register_bar(&vdev->pdev, nr, type, &bar->region.mem);
 
     /*
      * We can't mmap areas overlapping the MSIX vector table, so we
@@ -2618,8 +2183,9 @@ static void vfio_map_bar(VFIODevice *vdev, int nr)
     }
 
     strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
-    if (vfio_mmap_bar(vdev, bar, &bar->mem,
-                      &bar->mmap_mem, &bar->mmap, size, 0, name)) {
+    if (vfio_mmap_region(OBJECT(vdev), &bar->region, &bar->region.mem,
+                         &bar->region.mmap_mem, &bar->region.mmap,
+                         size, 0, name)) {
         error_report("%s unsupported. Performance may be slow", name);
     }
 
@@ -2629,11 +2195,12 @@ static void vfio_map_bar(VFIODevice *vdev, int nr)
         start = HOST_PAGE_ALIGN(vdev->msix->table_offset +
                                 (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE));
 
-        size = start < bar->size ? bar->size - start : 0;
+        size = start < bar->region.size ? bar->region.size - start : 0;
         strncat(name, " msix-hi", sizeof(name) - strlen(name) - 1);
         /* VFIOMSIXInfo contains another MemoryRegion for this mapping */
-        if (vfio_mmap_bar(vdev, bar, &bar->mem, &vdev->msix->mmap_mem,
-                          &vdev->msix->mmap, size, start, name)) {
+        if (vfio_mmap_region(OBJECT(vdev), &bar->region,
+                            &bar->region.mem, &vdev->msix->mmap_mem,
+                            &vdev->msix->mmap, size, start, name)) {
             error_report("%s unsupported. Performance may be slow", name);
         }
     }
@@ -2641,7 +2208,7 @@ static void vfio_map_bar(VFIODevice *vdev, int nr)
     vfio_bar_quirk_setup(vdev, nr);
 }
 
-static void vfio_map_bars(VFIODevice *vdev)
+static void vfio_map_bars(VFIOPCIDevice *vdev)
 {
     int i;
 
@@ -2673,7 +2240,7 @@ static void vfio_map_bars(VFIODevice *vdev)
     }
 }
 
-static void vfio_unmap_bars(VFIODevice *vdev)
+static void vfio_unmap_bars(VFIOPCIDevice *vdev)
 {
     int i;
 
@@ -2712,7 +2279,7 @@ static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
     pci_set_word(buf, (pci_get_word(buf) & ~mask) | val);
 }
 
-static void vfio_add_emulated_word(VFIODevice *vdev, int pos,
+static void vfio_add_emulated_word(VFIOPCIDevice *vdev, int pos,
                                    uint16_t val, uint16_t mask)
 {
     vfio_set_word_bits(vdev->pdev.config + pos, val, mask);
@@ -2725,7 +2292,7 @@ static void vfio_set_long_bits(uint8_t *buf, uint32_t val, uint32_t mask)
     pci_set_long(buf, (pci_get_long(buf) & ~mask) | val);
 }
 
-static void vfio_add_emulated_long(VFIODevice *vdev, int pos,
+static void vfio_add_emulated_long(VFIOPCIDevice *vdev, int pos,
                                    uint32_t val, uint32_t mask)
 {
     vfio_set_long_bits(vdev->pdev.config + pos, val, mask);
@@ -2733,7 +2300,7 @@ static void vfio_add_emulated_long(VFIODevice *vdev, int pos,
     vfio_set_long_bits(vdev->emulated_config_bits + pos, mask, mask);
 }
 
-static int vfio_setup_pcie_cap(VFIODevice *vdev, int pos, uint8_t size)
+static int vfio_setup_pcie_cap(VFIOPCIDevice *vdev, int pos, uint8_t size)
 {
     uint16_t flags;
     uint8_t type;
@@ -2825,7 +2392,7 @@ static int vfio_setup_pcie_cap(VFIODevice *vdev, int pos, uint8_t size)
     return pos;
 }
 
-static void vfio_check_pcie_flr(VFIODevice *vdev, uint8_t pos)
+static void vfio_check_pcie_flr(VFIOPCIDevice *vdev, uint8_t pos)
 {
     uint32_t cap = pci_get_long(vdev->pdev.config + pos + PCI_EXP_DEVCAP);
 
@@ -2837,7 +2404,7 @@ static void vfio_check_pcie_flr(VFIODevice *vdev, uint8_t pos)
     }
 }
 
-static void vfio_check_pm_reset(VFIODevice *vdev, uint8_t pos)
+static void vfio_check_pm_reset(VFIOPCIDevice *vdev, uint8_t pos)
 {
     uint16_t csr = pci_get_word(vdev->pdev.config + pos + PCI_PM_CTRL);
 
@@ -2849,7 +2416,7 @@ static void vfio_check_pm_reset(VFIODevice *vdev, uint8_t pos)
     }
 }
 
-static void vfio_check_af_flr(VFIODevice *vdev, uint8_t pos)
+static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
 {
     uint8_t cap = pci_get_byte(vdev->pdev.config + pos + PCI_AF_CAP);
 
@@ -2861,7 +2428,7 @@ static void vfio_check_af_flr(VFIODevice *vdev, uint8_t pos)
     }
 }
 
-static int vfio_add_std_cap(VFIODevice *vdev, uint8_t pos)
+static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
 {
     PCIDevice *pdev = &vdev->pdev;
     uint8_t cap_id, next, size;
@@ -2936,7 +2503,7 @@ static int vfio_add_std_cap(VFIODevice *vdev, uint8_t pos)
     return 0;
 }
 
-static int vfio_add_capabilities(VFIODevice *vdev)
+static int vfio_add_capabilities(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = &vdev->pdev;
 
@@ -2948,7 +2515,7 @@ static int vfio_add_capabilities(VFIODevice *vdev)
     return vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]);
 }
 
-static void vfio_pci_pre_reset(VFIODevice *vdev)
+static void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = &vdev->pdev;
     uint16_t cmd;
@@ -2985,7 +2552,7 @@ static void vfio_pci_pre_reset(VFIODevice *vdev)
     vfio_pci_write_config(pdev, PCI_COMMAND, cmd, 2);
 }
 
-static void vfio_pci_post_reset(VFIODevice *vdev)
+static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
 {
     vfio_enable_intx(vdev);
 }
@@ -2997,7 +2564,7 @@ static bool vfio_pci_host_match(PCIHostDeviceAddress *host1,
             host1->slot == host2->slot && host1->function == host2->function);
 }
 
-static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
+static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
 {
     VFIOGroup *group;
     struct vfio_pci_hot_reset_info *info;
@@ -3006,18 +2573,19 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
     int32_t *fds;
     int ret, i, count;
     bool multi = false;
+    int fd = vdev->vdev.fd;
 
     DPRINTF("%s(%04x:%02x:%02x.%x) %s\n", __func__, vdev->host.domain,
             vdev->host.bus, vdev->host.slot, vdev->host.function,
             single ? "one" : "multi");
 
     vfio_pci_pre_reset(vdev);
-    vdev->needs_reset = false;
+    vdev->vdev.needs_reset = false;
 
     info = g_malloc0(sizeof(*info));
     info->argsz = sizeof(*info);
 
-    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
+    ret = ioctl(fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
     if (ret && errno != ENOSPC) {
         ret = -errno;
         if (!vdev->has_pm_reset) {
@@ -3033,7 +2601,7 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
     info->argsz = sizeof(*info) + (count * sizeof(*devices));
     devices = &info->devices[0];
 
-    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
+    ret = ioctl(fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
     if (ret) {
         ret = -errno;
         error_report("vfio: hot reset info failed: %m");
@@ -3048,6 +2616,7 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
     for (i = 0; i < info->count; i++) {
         PCIHostDeviceAddress host;
         VFIODevice *tmp;
+        VFIOPCIDevice *vpcidev;
 
         host.domain = devices[i].segment;
         host.bus = devices[i].bus;
@@ -3080,7 +2649,11 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
 
         /* Prep dependent devices for reset and clear our marker. */
         QLIST_FOREACH(tmp, &group->device_list, next) {
-            if (vfio_pci_host_match(&host, &tmp->host)) {
+            if (tmp->type != VFIO_DEVICE_TYPE_PCI) {
+                continue;
+            }
+            vpcidev = container_of(tmp, VFIOPCIDevice, vdev);
+            if (vfio_pci_host_match(&host, &vpcidev->host)) {
                 if (single) {
                     DPRINTF("vfio: found another in-use device "
                             "%04x:%02x:%02x.%x\n", host.domain, host.bus,
@@ -3088,8 +2661,8 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
                     ret = -EINVAL;
                     goto out_single;
                 }
-                vfio_pci_pre_reset(tmp);
-                tmp->needs_reset = false;
+                vfio_pci_pre_reset(vpcidev);
+                vpcidev->vdev.needs_reset = false;
                 multi = true;
                 break;
             }
@@ -3128,7 +2701,7 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
     }
 
     /* Bus reset! */
-    ret = ioctl(vdev->fd, VFIO_DEVICE_PCI_HOT_RESET, reset);
+    ret = ioctl(fd, VFIO_DEVICE_PCI_HOT_RESET, reset);
     g_free(reset);
 
     DPRINTF("%04x:%02x:%02x.%x hot reset: %s\n", vdev->host.domain,
@@ -3140,6 +2713,7 @@ out:
     for (i = 0; i < info->count; i++) {
         PCIHostDeviceAddress host;
         VFIODevice *tmp;
+        VFIOPCIDevice *vpcidev;
 
         host.domain = devices[i].segment;
         host.bus = devices[i].bus;
@@ -3161,8 +2735,12 @@ out:
         }
 
         QLIST_FOREACH(tmp, &group->device_list, next) {
-            if (vfio_pci_host_match(&host, &tmp->host)) {
-                vfio_pci_post_reset(tmp);
+            if (tmp->type != VFIO_DEVICE_TYPE_PCI) {
+                continue;
+            }
+            vpcidev = container_of(tmp, VFIOPCIDevice, vdev);
+            if (vfio_pci_host_match(&host, &vpcidev->host)) {
+                vfio_pci_post_reset(vpcidev);
                 break;
             }
         }
@@ -3178,7 +2756,7 @@ out_single:
  * We want to differentiate hot reset of mulitple in-use devices vs hot reset
  * of a single in-use device.  VFIO_DEVICE_RESET will already handle the case
  * of doing hot resets when there is only a single device per bus.  The in-use
- * here refers to how many VFIODevices are affected.  A hot reset that affects
+ * here refers to how many VFIOPCIDevices are affected. A hot reset that affects
  * multiple devices, but only a single in-use device, means that we can call
  * it from our bus ->reset() callback since the extent is effectively a single
  * device.  This allows us to make use of it in the hotplug path.  When there
@@ -3189,354 +2767,99 @@ out_single:
  * _one() will only do a hot reset for the one in-use devices case, calling
  * _multi() will do nothing if a _one() would have been sufficient.
  */
-static int vfio_pci_hot_reset_one(VFIODevice *vdev)
+static int vfio_pci_hot_reset_one(VFIOPCIDevice *vdev)
 {
     return vfio_pci_hot_reset(vdev, true);
 }
 
 static int vfio_pci_hot_reset_multi(VFIODevice *vdev)
 {
-    return vfio_pci_hot_reset(vdev, false);
-}
-
-static void vfio_pci_reset_handler(void *opaque)
-{
-    VFIOGroup *group;
-    VFIODevice *vdev;
-
-    QLIST_FOREACH(group, &group_list, next) {
-        QLIST_FOREACH(vdev, &group->device_list, next) {
-            if (!vdev->reset_works || (!vdev->has_flr && vdev->has_pm_reset)) {
-                vdev->needs_reset = true;
-            }
-        }
-    }
-
-    QLIST_FOREACH(group, &group_list, next) {
-        QLIST_FOREACH(vdev, &group->device_list, next) {
-            if (vdev->needs_reset) {
-                vfio_pci_hot_reset_multi(vdev);
-            }
-        }
-    }
-}
-
-static void vfio_kvm_device_add_group(VFIOGroup *group)
-{
-#ifdef CONFIG_KVM
-    struct kvm_device_attr attr = {
-        .group = KVM_DEV_VFIO_GROUP,
-        .attr = KVM_DEV_VFIO_GROUP_ADD,
-        .addr = (uint64_t)(unsigned long)&group->fd,
-    };
-
-    if (!kvm_enabled()) {
-        return;
-    }
-
-    if (vfio_kvm_device_fd < 0) {
-        struct kvm_create_device cd = {
-            .type = KVM_DEV_TYPE_VFIO,
-        };
-
-        if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
-            DPRINTF("KVM_CREATE_DEVICE: %m\n");
-            return;
-        }
-
-        vfio_kvm_device_fd = cd.fd;
-    }
-
-    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-        error_report("Failed to add group %d to KVM VFIO device: %m",
-                     group->groupid);
-    }
-#endif
-}
-
-static void vfio_kvm_device_del_group(VFIOGroup *group)
-{
-#ifdef CONFIG_KVM
-    struct kvm_device_attr attr = {
-        .group = KVM_DEV_VFIO_GROUP,
-        .attr = KVM_DEV_VFIO_GROUP_DEL,
-        .addr = (uint64_t)(unsigned long)&group->fd,
-    };
-
-    if (vfio_kvm_device_fd < 0) {
-        return;
-    }
-
-    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-        error_report("Failed to remove group %d from KVM VFIO device: %m",
-                     group->groupid);
-    }
-#endif
+    VFIOPCIDevice *vpcidev = container_of(vdev, VFIOPCIDevice, vdev);
+    return vfio_pci_hot_reset(vpcidev, false);
 }
 
-static int vfio_connect_container(VFIOGroup *group)
+static bool vfio_pci_compute_needs_reset(VFIODevice *vdev)
 {
-    VFIOContainer *container;
-    int ret, fd;
-
-    if (group->container) {
-        return 0;
-    }
-
-    QLIST_FOREACH(container, &container_list, next) {
-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
-            group->container = container;
-            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-            return 0;
-        }
-    }
-
-    fd = qemu_open("/dev/vfio/vfio", O_RDWR);
-    if (fd < 0) {
-        error_report("vfio: failed to open /dev/vfio/vfio: %m");
-        return -errno;
-    }
-
-    ret = ioctl(fd, VFIO_GET_API_VERSION);
-    if (ret != VFIO_API_VERSION) {
-        error_report("vfio: supported vfio version: %d, "
-                     "reported version: %d", VFIO_API_VERSION, ret);
-        close(fd);
-        return -EINVAL;
-    }
-
-    container = g_malloc0(sizeof(*container));
-    container->fd = fd;
-
-    if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
-        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
-        if (ret) {
-            error_report("vfio: failed to set group container: %m");
-            g_free(container);
-            close(fd);
-            return -errno;
-        }
-
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
-        if (ret) {
-            error_report("vfio: failed to set iommu for container: %m");
-            g_free(container);
-            close(fd);
-            return -errno;
-        }
-
-        container->iommu_data.type1.listener = vfio_memory_listener;
-        container->iommu_data.release = vfio_listener_release;
-
-        memory_listener_register(&container->iommu_data.type1.listener,
-                                 &address_space_memory);
-
-        if (container->iommu_data.type1.error) {
-            ret = container->iommu_data.type1.error;
-            vfio_listener_release(container);
-            g_free(container);
-            close(fd);
-            error_report("vfio: memory listener initialization failed for container");
-            return ret;
-        }
-
-        container->iommu_data.type1.initialized = true;
-
-    } else {
-        error_report("vfio: No available IOMMU models");
-        g_free(container);
-        close(fd);
-        return -EINVAL;
-    }
-
-    QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&container_list, container, next);
-
-    group->container = container;
-    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-
-    return 0;
-}
-
-static void vfio_disconnect_container(VFIOGroup *group)
-{
-    VFIOContainer *container = group->container;
-
-    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
-        error_report("vfio: error disconnecting group %d from container",
-                     group->groupid);
-    }
-
-    QLIST_REMOVE(group, container_next);
-    group->container = NULL;
-
-    if (QLIST_EMPTY(&container->group_list)) {
-        if (container->iommu_data.release) {
-            container->iommu_data.release(container);
-        }
-        QLIST_REMOVE(container, next);
-        DPRINTF("vfio_disconnect_container: close container->fd\n");
-        close(container->fd);
-        g_free(container);
+    VFIOPCIDevice *vpcidev = container_of(vdev, VFIOPCIDevice, vdev);
+    if (!vdev->reset_works || (!vpcidev->has_flr && vpcidev->has_pm_reset)) {
+        vdev->needs_reset = true;
     }
+    return vdev->needs_reset;
 }
 
-static VFIOGroup *vfio_get_group(int groupid)
+static int vfio_pci_check_device(VFIODevice *vbasedev)
 {
-    VFIOGroup *group;
-    char path[32];
-    struct vfio_group_status status = { .argsz = sizeof(status) };
 
-    QLIST_FOREACH(group, &group_list, next) {
-        if (group->groupid == groupid) {
-            return group;
-        }
-    }
-
-    group = g_malloc0(sizeof(*group));
-
-    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open(path, O_RDWR);
-    if (group->fd < 0) {
-        error_report("vfio: error opening %s: %m", path);
-        g_free(group);
-        return NULL;
-    }
-
-    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
-        error_report("vfio: error getting group status: %m");
-        close(group->fd);
-        g_free(group);
-        return NULL;
-    }
-
-    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
-        error_report("vfio: error, group %d is not viable, please ensure "
-                     "all devices within the iommu_group are bound to their "
-                     "vfio bus driver.", groupid);
-        close(group->fd);
-        g_free(group);
-        return NULL;
-    }
-
-    group->groupid = groupid;
-    QLIST_INIT(&group->device_list);
-
-    if (vfio_connect_container(group)) {
-        error_report("vfio: failed to setup container for group %d", groupid);
-        close(group->fd);
-        g_free(group);
-        return NULL;
+    if (vbasedev->num_regions < VFIO_PCI_CONFIG_REGION_INDEX + 1) {
+        error_report("vfio: unexpected number of io regions %u",
+                     vbasedev->num_regions);
+        goto error;
     }
 
-    if (QLIST_EMPTY(&group_list)) {
-        qemu_register_reset(vfio_pci_reset_handler, NULL);
+    if (vbasedev->num_irqs >= VFIO_PCI_MSIX_IRQ_INDEX + 1) {
+        return 0;
     }
 
-    QLIST_INSERT_HEAD(&group_list, group, next);
-
-    vfio_kvm_device_add_group(group);
-
-    return group;
+    error_report("vfio: unexpected number of irqs %u", vbasedev->num_irqs);
+error:
+    vfio_put_base_device(vbasedev);
+    return -EINVAL;
 }
 
-static void vfio_put_group(VFIOGroup *group)
-{
-    if (!QLIST_EMPTY(&group->device_list)) {
-        return;
-    }
-
-    vfio_kvm_device_del_group(group);
-    vfio_disconnect_container(group);
-    QLIST_REMOVE(group, next);
-    DPRINTF("vfio_put_group: close group->fd\n");
-    close(group->fd);
-    g_free(group);
-
-    if (QLIST_EMPTY(&group_list)) {
-        qemu_unregister_reset(vfio_pci_reset_handler, NULL);
-    }
-}
 
-static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev)
+static int vfio_pci_get_device_regions(VFIODevice *vbasedev)
 {
-    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
-    struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
-    int ret, i;
-
-    ret = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
-    if (ret < 0) {
-        error_report("vfio: error getting device %s from group %d: %m",
-                     name, group->groupid);
-        error_printf("Verify all devices in group %d are bound to vfio-pci "
-                     "or pci-stub and not already in use\n", group->groupid);
-        return ret;
-    }
-
-    vdev->fd = ret;
-    vdev->group = group;
-    QLIST_INSERT_HEAD(&group->device_list, vdev, next);
-
-    /* Sanity check device */
-    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_INFO, &dev_info);
-    if (ret) {
-        error_report("vfio: error getting device info: %m");
-        goto error;
-    }
-
-    DPRINTF("Device %s flags: %u, regions: %u, irgs: %u\n", name,
-            dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
-
-    if (!(dev_info.flags & VFIO_DEVICE_FLAGS_PCI)) {
-        error_report("vfio: Um, this isn't a PCI device");
-        goto error;
-    }
-
-    vdev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
-
-    if (dev_info.num_regions < VFIO_PCI_CONFIG_REGION_INDEX + 1) {
-        error_report("vfio: unexpected number of io regions %u",
-                     dev_info.num_regions);
-        goto error;
+    int i, ret;
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vdev);
+
+    vbasedev->regions = g_malloc0(sizeof(VFIORegion *) *
+                                  vbasedev->num_regions);
+    if (!vbasedev->regions) {
+        error_report("vfio: Error allocating space for %u regions",
+                     vbasedev->num_regions);
+        ret = -ENOMEM;
+        goto error;
     }
 
-    if (dev_info.num_irqs < VFIO_PCI_MSIX_IRQ_INDEX + 1) {
-        error_report("vfio: unexpected number of irqs %u", dev_info.num_irqs);
-        goto error;
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        vbasedev->regions[i] = &vdev->bars[i].region;
     }
 
     for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
         reg_info.index = i;
 
-        ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
         if (ret) {
             error_report("vfio: Error getting region %d info: %m", i);
             goto error;
         }
 
-        DPRINTF("Device %s region %d:\n", name, i);
+        DPRINTF("Device %s region %d:\n", vbasedev->name, i);
         DPRINTF("  size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n",
                 (unsigned long)reg_info.size, (unsigned long)reg_info.offset,
                 (unsigned long)reg_info.flags);
 
-        vdev->bars[i].flags = reg_info.flags;
-        vdev->bars[i].size = reg_info.size;
-        vdev->bars[i].fd_offset = reg_info.offset;
-        vdev->bars[i].fd = vdev->fd;
-        vdev->bars[i].nr = i;
+        vbasedev->regions[i]->flags = reg_info.flags;
+        vbasedev->regions[i]->size = reg_info.size;
+        vbasedev->regions[i]->fd_offset = reg_info.offset;
+        vbasedev->regions[i]->fd = vbasedev->fd;
+        vbasedev->regions[i]->nr = i;
+        vbasedev->regions[i]->vdev = vbasedev;
+
         QLIST_INIT(&vdev->bars[i].quirks);
     }
 
+
     reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX;
 
-    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
     if (ret) {
         error_report("vfio: Error getting config info: %m");
         goto error;
     }
 
-    DPRINTF("Device %s config:\n", name);
+    DPRINTF("Device %s config:\n", vbasedev->name);
     DPRINTF("  size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n",
             (unsigned long)reg_info.size, (unsigned long)reg_info.offset,
             (unsigned long)reg_info.flags);
@@ -3548,13 +2871,13 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev)
     vdev->config_offset = reg_info.offset;
 
     if ((vdev->features & VFIO_FEATURE_ENABLE_VGA) &&
-        dev_info.num_regions > VFIO_PCI_VGA_REGION_INDEX) {
+        vbasedev->num_regions > VFIO_PCI_VGA_REGION_INDEX) {
         struct vfio_region_info vga_info = {
             .argsz = sizeof(vga_info),
             .index = VFIO_PCI_VGA_REGION_INDEX,
          };
 
-        ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &vga_info);
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &vga_info);
         if (ret) {
             error_report(
                 "vfio: Device does not support requested feature x-vga");
@@ -3571,7 +2894,7 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev)
         }
 
         vdev->vga.fd_offset = vga_info.offset;
-        vdev->vga.fd = vdev->fd;
+        vdev->vga.fd = vbasedev->fd;
 
         vdev->vga.region[QEMU_PCI_VGA_MEM].offset = QEMU_PCI_VGA_MEM_BASE;
         vdev->vga.region[QEMU_PCI_VGA_MEM].nr = QEMU_PCI_VGA_MEM;
@@ -3587,9 +2910,26 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev)
 
         vdev->has_vga = true;
     }
+
+    return ret;
+
+error:
+    if (ret) {
+        vfio_put_base_device(vbasedev);
+    }
+    return ret;
+
+}
+
+static int vfio_pci_get_device_interrupts(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vdev);
+    int ret;
+
+    struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
     irq_info.index = VFIO_PCI_ERR_IRQ_INDEX;
 
-    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
     if (ret) {
         /* This can fail for an old kernel or legacy PCI dev */
         DPRINTF("VFIO_DEVICE_GET_IRQ_INFO failure: %m\n");
@@ -3597,27 +2937,18 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev)
     } else if (irq_info.count == 1) {
         vdev->pci_aer = true;
     } else {
-        error_report("vfio: %04x:%02x:%02x.%x "
-                     "Could not enable error recovery for the device",
-                     vdev->host.domain, vdev->host.bus, vdev->host.slot,
-                     vdev->host.function);
+        error_report("vfio: %s Could not enable error recovery for the device",
+                     vdev->vdev.name);
     }
 
-error:
-    if (ret) {
-        QLIST_REMOVE(vdev, next);
-        vdev->group = NULL;
-        close(vdev->fd);
-    }
+
     return ret;
+
 }
 
-static void vfio_put_device(VFIODevice *vdev)
+static void vfio_put_device(VFIOPCIDevice *vdev)
 {
-    QLIST_REMOVE(vdev, next);
-    vdev->group = NULL;
-    DPRINTF("vfio_put_device: close vdev->fd\n");
-    close(vdev->fd);
+    vfio_put_base_device(&vdev->vdev);
     if (vdev->msix) {
         g_free(vdev->msix);
         vdev->msix = NULL;
@@ -3626,7 +2957,7 @@ static void vfio_put_device(VFIODevice *vdev)
 
 static void vfio_err_notifier_handler(void *opaque)
 {
-    VFIODevice *vdev = opaque;
+    VFIOPCIDevice *vdev = opaque;
 
     if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
         return;
@@ -3641,10 +2972,9 @@ static void vfio_err_notifier_handler(void *opaque)
      * guest to contain the error.
      */
 
-    error_report("%s(%04x:%02x:%02x.%x) Unrecoverable error detected.  "
+    error_report("%s(%s) Unrecoverable error detected.  "
                  "Please collect any data possible and then kill the guest",
-                 __func__, vdev->host.domain, vdev->host.bus,
-                 vdev->host.slot, vdev->host.function);
+                 __func__, vdev->vdev.name);
 
     vm_stop(RUN_STATE_IO_ERROR);
 }
@@ -3655,7 +2985,7 @@ static void vfio_err_notifier_handler(void *opaque)
  * and continue after disabling error recovery support for the
  * device.
  */
-static void vfio_register_err_notifier(VFIODevice *vdev)
+static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
 {
     int ret;
     int argsz;
@@ -3686,7 +3016,7 @@ static void vfio_register_err_notifier(VFIODevice *vdev)
     *pfd = event_notifier_get_fd(&vdev->err_notifier);
     qemu_set_fd_handler(*pfd, vfio_err_notifier_handler, NULL, vdev);
 
-    ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
     if (ret) {
         error_report("vfio: Failed to set up error notification");
         qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
@@ -3696,7 +3026,7 @@ static void vfio_register_err_notifier(VFIODevice *vdev)
     g_free(irq_set);
 }
 
-static void vfio_unregister_err_notifier(VFIODevice *vdev)
+static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
 {
     int argsz;
     struct vfio_irq_set *irq_set;
@@ -3719,7 +3049,7 @@ static void vfio_unregister_err_notifier(VFIODevice *vdev)
     pfd = (int32_t *)&irq_set->data;
     *pfd = -1;
 
-    ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
     if (ret) {
         error_report("vfio: Failed to de-assign error fd: %m");
     }
@@ -3729,76 +3059,36 @@ static void vfio_unregister_err_notifier(VFIODevice *vdev)
     event_notifier_cleanup(&vdev->err_notifier);
 }
 
+
+static VFIODeviceOps vfio_pci_ops = {
+    .vfio_eoi = vfio_pci_eoi,
+    .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
+    .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
+    .vfio_check_device = vfio_pci_check_device,
+    .vfio_get_device_regions = vfio_pci_get_device_regions,
+    .vfio_get_device_interrupts = vfio_pci_get_device_interrupts,
+};
+
 static int vfio_initfn(PCIDevice *pdev)
 {
-    VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
-    VFIOGroup *group;
-    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
-    ssize_t len;
-    struct stat st;
-    int groupid;
+    VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev);
+    VFIODevice *vbasedev = &vdev->vdev;
     int ret;
 
-    /* Check that the host device exists */
-    snprintf(path, sizeof(path),
-             "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/",
-             vdev->host.domain, vdev->host.bus, vdev->host.slot,
-             vdev->host.function);
-    if (stat(path, &st) < 0) {
-        error_report("vfio: error: no such host device: %s", path);
-        return -errno;
-    }
-
-    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
-
-    len = readlink(path, iommu_group_path, sizeof(path));
-    if (len <= 0 || len >= sizeof(path)) {
-        error_report("vfio: error no iommu_group for device");
-        return len < 0 ? -errno : ENAMETOOLONG;
-    }
-
-    iommu_group_path[len] = 0;
-    group_name = basename(iommu_group_path);
-
-    if (sscanf(group_name, "%d", &groupid) != 1) {
-        error_report("vfio: error reading %s: %m", path);
-        return -errno;
-    }
-
-    DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
-
-    group = vfio_get_group(groupid);
-    if (!group) {
-        error_report("vfio: failed to get group %d", groupid);
-        return -ENOENT;
-    }
-
-    snprintf(path, sizeof(path), "%04x:%02x:%02x.%01x",
+    vbasedev->name = g_malloc(PATH_MAX);
+    snprintf(vbasedev->name, PATH_MAX, "%04x:%02x:%02x.%01x",
             vdev->host.domain, vdev->host.bus, vdev->host.slot,
             vdev->host.function);
 
-    QLIST_FOREACH(pvdev, &group->device_list, next) {
-        if (pvdev->host.domain == vdev->host.domain &&
-            pvdev->host.bus == vdev->host.bus &&
-            pvdev->host.slot == vdev->host.slot &&
-            pvdev->host.function == vdev->host.function) {
+    vbasedev->ops = &vfio_pci_ops;
 
-            error_report("vfio: error: device %s is already attached", path);
-            vfio_put_group(group);
-            return -EBUSY;
-        }
-    }
-
-    ret = vfio_get_device(group, path, vdev);
-    if (ret) {
-        error_report("vfio: failed to get device %s", path);
-        vfio_put_group(group);
+    ret = vfio_base_device_init(vbasedev, VFIO_DEVICE_TYPE_PCI);
+    if (ret < 0) {
         return ret;
     }
 
     /* Get a copy of config space */
-    ret = pread(vdev->fd, vdev->pdev.config,
+    ret = pread(vbasedev->fd, vdev->pdev.config,
                 MIN(pci_config_size(&vdev->pdev), vdev->config_size),
                 vdev->config_offset);
     if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
@@ -3879,14 +3169,14 @@ out_teardown:
 out_put:
     g_free(vdev->emulated_config_bits);
     vfio_put_device(vdev);
-    vfio_put_group(group);
+    vfio_put_group(vbasedev->group, vfio_reset_handler);
     return ret;
 }
 
 static void vfio_exitfn(PCIDevice *pdev)
 {
-    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
-    VFIOGroup *group = vdev->group;
+    VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev);
+    VFIOGroup *group = vdev->vdev.group;
 
     vfio_unregister_err_notifier(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
@@ -3899,21 +3189,22 @@ static void vfio_exitfn(PCIDevice *pdev)
     g_free(vdev->emulated_config_bits);
     g_free(vdev->rom);
     vfio_put_device(vdev);
-    vfio_put_group(group);
+    vfio_put_group(group, vfio_reset_handler);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
 {
     PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev);
-    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev);
+    int fd = vdev->vdev.fd;
 
     DPRINTF("%s(%04x:%02x:%02x.%x)\n", __func__, vdev->host.domain,
             vdev->host.bus, vdev->host.slot, vdev->host.function);
 
     vfio_pci_pre_reset(vdev);
 
-    if (vdev->reset_works && (vdev->has_flr || !vdev->has_pm_reset) &&
-        !ioctl(vdev->fd, VFIO_DEVICE_RESET)) {
+    if (vdev->vdev.reset_works && (vdev->has_flr || !vdev->has_pm_reset) &&
+        !ioctl(vdev->vdev.fd, VFIO_DEVICE_RESET)) {
         DPRINTF("%04x:%02x:%02x.%x FLR/VFIO_DEVICE_RESET\n", vdev->host.domain,
             vdev->host.bus, vdev->host.slot, vdev->host.function);
         goto post_reset;
@@ -3925,10 +3216,9 @@ static void vfio_pci_reset(DeviceState *dev)
     }
 
     /* If nothing else works and the device supports PM reset, use it */
-    if (vdev->reset_works && vdev->has_pm_reset &&
-        !ioctl(vdev->fd, VFIO_DEVICE_RESET)) {
-        DPRINTF("%04x:%02x:%02x.%x PCI PM Reset\n", vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function);
+    if (vdev->vdev.reset_works && vdev->has_pm_reset &&
+        !ioctl(fd, VFIO_DEVICE_RESET)) {
+        DPRINTF("%s PCI PM Reset\n", vdev->vdev.name);
         goto post_reset;
     }
 
@@ -3937,16 +3227,16 @@ post_reset:
 }
 
 static Property vfio_pci_dev_properties[] = {
-    DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIODevice, host),
-    DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIODevice,
+    DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
+    DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice,
                        intx.mmap_timeout, 1100),
-    DEFINE_PROP_BIT("x-vga", VFIODevice, features,
+    DEFINE_PROP_BIT("x-vga", VFIOPCIDevice, features,
                     VFIO_FEATURE_ENABLE_VGA_BIT, false),
-    DEFINE_PROP_INT32("bootindex", VFIODevice, bootindex, -1),
+    DEFINE_PROP_INT32("bootindex", VFIOPCIDevice, bootindex, -1),
     /*
      * TODO - support passed fds... is this necessary?
-     * DEFINE_PROP_STRING("vfiofd", VFIODevice, vfiofd_name),
-     * DEFINE_PROP_STRING("vfiogroupfd, VFIODevice, vfiogroupfd_name),
+     * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
+     * DEFINE_PROP_STRING("vfiogroupfd, VFIOPCIDevice, vfiogroupfd_name),
      */
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -3976,7 +3266,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 static const TypeInfo vfio_pci_dev_info = {
     .name = "vfio-pci",
     .parent = TYPE_PCI_DEVICE,
-    .instance_size = sizeof(VFIODevice),
+    .instance_size = sizeof(VFIOPCIDevice),
     .class_init = vfio_pci_dev_class_init,
 };
 
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
new file mode 100644
index 0000000..646aa53
--- /dev/null
+++ b/hw/vfio/platform.c
@@ -0,0 +1,267 @@
+/*
+ * vfio based device assignment support - platform devices
+ *
+ * Copyright Linaro Limited, 2014
+ *
+ * Authors:
+ *  Kim Phillips <kim.phillips@linaro.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on vfio based PCI device assignment support:
+ *  Copyright Red Hat, Inc. 2012
+ */
+
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+#include "qemu/error-report.h"
+#include "qemu/range.h"
+#include "sysemu/sysemu.h"
+#include "hw/sysbus.h"
+
+#include "vfio-common.h"
+
+
+typedef struct VFIOPlatformDevice {
+    SysBusDevice sbdev;
+    VFIODevice vdev; /* not a QOM object */
+    /* interrupts to come later on */
+} VFIOPlatformDevice;
+
+
+static const MemoryRegionOps vfio_region_ops = {
+    .read = vfio_region_read,
+    .write = vfio_region_write,
+    .endianness = DEVICE_NATIVE_ENDIAN,
+};
+
+/*
+ * It is mandatory to pass a VFIOPlatformDevice since VFIODevice
+ * is not an Object and cannot be passed to memory region functions
+ */
+
+static void vfio_map_region(VFIOPlatformDevice *vdev, int nr)
+{
+    VFIORegion *region = vdev->vdev.regions[nr];
+    unsigned size = region->size;
+    char name[64];
+
+    snprintf(name, sizeof(name), "VFIO %s region %d", vdev->vdev.name, nr);
+
+    /* A "slow" read/write mapping underlies all regions  */
+    memory_region_init_io(&region->mem, OBJECT(vdev), &vfio_region_ops,
+                          region, name, size);
+
+    strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
+
+    if (vfio_mmap_region(OBJECT(vdev), region, &region->mem,
+                         &region->mmap_mem, &region->mmap, size, 0, name)) {
+        error_report("%s unsupported. Performance may be slow", name);
+    }
+}
+
+
+static void vfio_unmap_region(VFIODevice *vdev, int nr)
+{
+    VFIORegion *region = vdev->regions[nr];
+
+    if (!region->size) {
+        return;
+    }
+
+    memory_region_del_subregion(&region->mem, &region->mmap_mem);
+    munmap(region->mmap, memory_region_size(&region->mmap_mem));
+    memory_region_destroy(&region->mmap_mem);
+
+    memory_region_destroy(&region->mem);
+}
+
+static void vfio_unmap_regions(VFIODevice *vdev)
+{
+    int i;
+    for (i = 0; i < vdev->num_regions; i++) {
+        vfio_unmap_region(vdev, i);
+    }
+}
+
+
+static int vfio_platform_get_device_regions(VFIODevice *vbasedev)
+{
+    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
+    int i, ret = 0;
+
+    vbasedev->regions = g_malloc0(sizeof(VFIORegion *) * vbasedev->num_regions);
+
+    for (i = 0; i < vbasedev->num_regions; i++) {
+        vbasedev->regions[i] = g_malloc0(sizeof(VFIORegion));
+
+        reg_info.index = i;
+
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+        if (ret) {
+            error_report("vfio: Error getting region %d info: %m", i);
+            goto error;
+        }
+
+        vbasedev->regions[i]->flags = reg_info.flags;
+        vbasedev->regions[i]->size = reg_info.size;
+        vbasedev->regions[i]->fd_offset = reg_info.offset;
+        vbasedev->regions[i]->fd = vbasedev->fd;
+        vbasedev->regions[i]->nr = i;
+        vbasedev->regions[i]->vdev = vbasedev;
+    }
+
+    print_regions(vbasedev);
+
+    return ret;
+
+error:
+    for (i = 0; i < vbasedev->num_regions; i++) {
+        g_free(vbasedev->regions[i]);
+    }
+    g_free(vbasedev->regions);
+    vfio_put_base_device(vbasedev);
+    return ret;
+}
+
+
+/* not implemented yet */
+static int vfio_platform_check_device(VFIODevice *vdev)
+{
+    return 0;
+}
+
+/* not implemented yet */
+static bool vfio_platform_compute_needs_reset(VFIODevice *vdev)
+{
+    return false;
+}
+
+static int vfio_platform_hot_reset_multi(VFIODevice *vdev)
+{
+    return 0;
+}
+
+
+/* not implemented yet */
+static int vfio_platform_get_device_interrupts(VFIODevice *vdev)
+{
+    return 0;
+}
+
+/* not implemented yet */
+static void vfio_platform_eoi(VFIODevice *vdev)
+{
+}
+
+static VFIODeviceOps vfio_platform_ops = {
+    .vfio_eoi = vfio_platform_eoi,
+    .vfio_compute_needs_reset = vfio_platform_compute_needs_reset,
+    .vfio_hot_reset_multi = vfio_platform_hot_reset_multi,
+    .vfio_check_device = vfio_platform_check_device,
+    .vfio_get_device_regions = vfio_platform_get_device_regions,
+    .vfio_get_device_interrupts = vfio_platform_get_device_interrupts,
+};
+
+
+static void vfio_platform_realize(DeviceState *dev, Error **errp)
+{
+    SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
+    VFIOPlatformDevice *vdev = container_of(sbdev, VFIOPlatformDevice, sbdev);
+    VFIODevice *vbasedev = &vdev->vdev;
+    int i, ret;
+
+    vbasedev->ops = &vfio_platform_ops;
+
+    /* TODO: pass device name on command line */
+    vbasedev->name = malloc(PATH_MAX);
+    snprintf(vbasedev->name, PATH_MAX, "%s", "fff51000.ethernet");
+
+    ret = vfio_base_device_init(vbasedev, VFIO_DEVICE_TYPE_PLATFORM);
+    if (ret < 0) {
+        return;
+    }
+
+    for (i = 0; i < vbasedev->num_regions; i++) {
+        vfio_map_region(vdev, i);
+        sysbus_init_mmio(sbdev, &vbasedev->regions[i]->mem);
+    }
+}
+
+static void vfio_platform_unrealize(DeviceState *dev, Error **errp)
+{
+    int i;
+    SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
+    VFIOPlatformDevice *vdev = container_of(sbdev, VFIOPlatformDevice, sbdev);
+    VFIODevice *vbasedev = &vdev->vdev;
+    VFIOGroup *group = vbasedev->group;
+    /*
+     * placeholder for
+     * vfio_unregister_err_notifier(vdev)
+     * vfio_disable_interrupts(vdev);
+     * timer free
+     * g_free vdev dynamic fields
+     */
+    vfio_unmap_regions(vbasedev);
+
+    for (i = 0; i < vbasedev->num_regions; i++) {
+        g_free(vbasedev->regions[i]);
+    }
+    g_free(vbasedev->regions);
+
+    vfio_put_base_device(vbasedev);
+    vfio_put_group(group, vfio_reset_handler);
+
+}
+
+static const VMStateDescription vfio_platform_vmstate = {
+    .name = TYPE_VFIO_PLATFORM,
+    .unmigratable = 1,
+};
+
+typedef struct VFIOPlatformDeviceClass {
+    DeviceClass parent_class;
+
+    int (*init)(VFIODevice *dev);
+} VFIOPlatformDeviceClass;
+
+#define VFIO_PLATFORM_DEVICE(obj) \
+     OBJECT_CHECK(VFIOPlatformDevice, (obj), TYPE_VFIO_PLATFORM)
+#define VFIO_PLATFORM_DEVICE_CLASS(klass) \
+     OBJECT_CLASS_CHECK(VFIOPlatformDeviceClass, (klass), TYPE_VFIO_PLATFORM)
+#define VFIO_PLATFORM_DEVICE_GET_CLASS(obj) \
+     OBJECT_GET_CLASS(VFIOPlatformDeviceClass, (obj), TYPE_VFIO_PLATFORM)
+
+
+
+static void vfio_platform_dev_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    VFIOPlatformDeviceClass *vdc = VFIO_PLATFORM_DEVICE_CLASS(klass);
+
+    dc->realize = vfio_platform_realize;
+    dc->unrealize = vfio_platform_unrealize;
+    dc->vmsd = &vfio_platform_vmstate;
+    dc->desc = "VFIO-based platform device assignment";
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+
+    vdc->init = NULL;
+}
+
+static const TypeInfo vfio_platform_dev_info = {
+    .name = TYPE_VFIO_PLATFORM,
+    .parent = TYPE_SYS_BUS_DEVICE,
+    .instance_size = sizeof(VFIOPlatformDevice),
+    .class_init = vfio_platform_dev_class_init,
+    .class_size = sizeof(VFIOPlatformDeviceClass),
+};
+
+static void register_vfio_platform_dev_type(void)
+{
+    type_register_static(&vfio_platform_dev_info);
+}
+
+type_init(register_vfio_platform_dev_type)
diff --git a/hw/vfio/vfio-common.h b/hw/vfio/vfio-common.h
new file mode 100644
index 0000000..2699fba
--- /dev/null
+++ b/hw/vfio/vfio-common.h
@@ -0,0 +1,143 @@
+/*
+ * common header for vfio based device assignment support
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on qemu-kvm device-assignment:
+ *  Adapted for KVM by Qumranet.
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
+ *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
+ */
+
+#include "hw/hw.h"
+
+/*#define DEBUG_VFIO*/
+#ifdef DEBUG_VFIO
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+/* Extra debugging, trap acceleration paths for more logging */
+#define VFIO_ALLOW_MMAP 1
+#define VFIO_ALLOW_KVM_INTX 1
+#define VFIO_ALLOW_KVM_MSI 1
+#define VFIO_ALLOW_KVM_MSIX 1
+
+#define TYPE_VFIO_PLATFORM "vfio-platform"
+
+enum {
+    VFIO_DEVICE_TYPE_PCI = 0,
+    VFIO_DEVICE_TYPE_PLATFORM = 1,
+};
+
+struct VFIOGroup;
+struct VFIODevice;
+
+typedef struct VFIODeviceOps VFIODeviceOps;
+
+/* Base Class for a VFIO Region */
+
+typedef struct VFIORegion {
+    struct VFIODevice *vdev;
+    off_t fd_offset; /* offset of region within device fd */
+    int fd; /* device fd, allows us to pass VFIORegion as opaque data */
+    MemoryRegion mem; /* slow, read/write access */
+    MemoryRegion mmap_mem; /* direct mapped access */
+    void *mmap;
+    size_t size;
+    uint32_t flags; /* VFIO region flags (rd/wr/mmap) */
+    uint8_t nr; /* cache the region number for debug */
+} VFIORegion;
+
+
+/* Base Class for a VFIO device */
+
+typedef struct VFIODevice {
+    QLIST_ENTRY(VFIODevice) next;
+    struct VFIOGroup *group;
+    unsigned int num_regions;
+    VFIORegion **regions;
+    unsigned int num_irqs;
+    char *name;
+    int fd;
+    int type;
+    bool reset_works;
+    bool needs_reset;
+    VFIODeviceOps *ops;
+} VFIODevice;
+
+
+typedef struct VFIOType1 {
+    MemoryListener listener;
+    int error;
+    bool initialized;
+} VFIOType1;
+
+typedef struct VFIOContainer {
+    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
+    struct {
+        /* enable abstraction to support various iommu backends */
+        union {
+            VFIOType1 type1;
+        };
+        void (*release)(struct VFIOContainer *);
+    } iommu_data;
+    QLIST_HEAD(, VFIOGroup) group_list;
+    QLIST_ENTRY(VFIOContainer) next;
+} VFIOContainer;
+
+typedef struct VFIOGroup {
+    int fd;
+    int groupid;
+    VFIOContainer *container;
+    QLIST_HEAD(, VFIODevice) device_list;
+    QLIST_ENTRY(VFIOGroup) next;
+    QLIST_ENTRY(VFIOGroup) container_next;
+} VFIOGroup;
+
+
+struct VFIODeviceOps {
+    bool (*vfio_compute_needs_reset)(VFIODevice *vdev);
+    int (*vfio_hot_reset_multi)(VFIODevice *vdev);
+    void (*vfio_eoi)(VFIODevice *vdev);
+    int (*vfio_check_device)(VFIODevice *vdev);
+    int (*vfio_get_device_regions)(VFIODevice *vdev);
+    int (*vfio_get_device_interrupts)(VFIODevice *vdev);
+};
+
+
+
+VFIOGroup *vfio_get_group(int groupid, QEMUResetHandler *reset_handler);
+void vfio_put_group(VFIOGroup *group, QEMUResetHandler *reset_handler);
+
+void vfio_reset_handler(void *opaque);
+
+void vfio_unmask_irqindex(VFIODevice *vdev, int index);
+void vfio_disable_irqindex(VFIODevice *vdev, int index);
+void vfio_mask_int(VFIODevice *vdev, int index);
+
+void vfio_region_write(void *opaque, hwaddr addr, uint64_t data, unsigned size);
+uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size);
+
+int vfio_get_base_device(VFIOGroup *group, const char *name,
+                        struct VFIODevice *vdev);
+void vfio_put_base_device(VFIODevice *vdev);
+int vfio_base_device_init(VFIODevice *vdev, int type);
+void print_regions(VFIODevice *vdev);
+
+int vfio_mmap_region(Object *vdev, VFIORegion *region,
+                     MemoryRegion *mem, MemoryRegion *submem,
+                     void **map, size_t size, off_t offset,
+                     const char *name);
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 26c218e..ef4815d 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -154,6 +154,7 @@ struct vfio_device_info {
 	__u32	flags;
 #define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
 #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
+#define VFIO_DEVICE_FLAGS_PLATFORM	(1 << 2)	/* vfio-platform device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
-- 
1.8.3.2


* [Qemu-devel] [RFC v3 04/10] vfio: simplified DPRINTF calls using device name
  2014-06-02  7:49 [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough Eric Auger
                   ` (2 preceding siblings ...)
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 03/10] vfio: add vfio-platform support Eric Auger
@ 2014-06-02  7:49 ` Eric Auger
  2014-06-25 21:22   ` Alexander Graf
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 05/10] vfio: Add initial IRQ support in platform device Eric Auger
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Eric Auger @ 2014-06-02  7:49 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel, kim.phillips, a.rigo
  Cc: peter.maydell, eric.auger, patches, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

This patch takes advantage of the new VFIODevice name field.

Occurrences of
DPRINTF("%s(%04x:%02x:%02x.%x) ...", __func__, vdev->host.domain,
        vdev->host.bus, vdev->host.slot, vdev->host.function, ...)
are replaced by
DPRINTF("%s(%s) ...", __func__, vdev->vdev.name, ...).

The same substitution is applied to the error_report() and
error_printf() calls that print the host address.

The name is built using the "%04x:%02x:%02x.%01x" format string.
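
For illustration only, here is a minimal standalone sketch of the idea:
format the host address once into a name buffer, then reuse that string
in every trace line. ExampleHostAddr, example_build_name() and
example_trace_line() are hypothetical names invented for this sketch and
are not part of the patch; the sketch writes into a caller-supplied
buffer instead of using QEMU's DPRINTF so that it can be compiled alone.

```c
#include <stdio.h>

/* Hypothetical stand-in for the host address fields (assumption). */
typedef struct ExampleHostAddr {
    unsigned int domain;
    unsigned int bus;
    unsigned int slot;
    unsigned int function;
} ExampleHostAddr;

/*
 * Build the device name once, with the format string quoted in the
 * commit message. Returns the value returned by snprintf().
 */
int example_build_name(char *buf, size_t len, const ExampleHostAddr *host)
{
    return snprintf(buf, len, "%04x:%02x:%02x.%01x",
                    host->domain, host->bus, host->slot, host->function);
}

/*
 * Compose a trace line in the "func(name) message" shape used after
 * the conversion. Only the cached name is needed, not the four fields.
 */
int example_trace_line(char *out, size_t len, const char *func,
                       const char *name, const char *msg)
{
    return snprintf(out, len, "vfio: %s(%s) %s\n", func, name, msg);
}
```

A caller would fill the name at device init time and pass it to each
trace call, which is what shortens the DPRINTF call sites in the diff
below.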

Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 hw/vfio/pci.c | 227 ++++++++++++++++++++--------------------------------------
 1 file changed, 78 insertions(+), 149 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a9e4d97..ad0c2a0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -212,9 +212,7 @@ static void vfio_intx_interrupt(void *opaque)
         return;
     }
 
-    DPRINTF("%s(%04x:%02x:%02x.%x) Pin %c\n", __func__, vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function,
-            'A' + vdev->intx.pin);
+    DPRINTF("%s(%s) Pin %c\n", __func__, vdev->vdev.name, 'A' + vdev->intx.pin);
 
     vdev->intx.pending = true;
     pci_irq_assert(&vdev->pdev);
@@ -233,8 +231,7 @@ static void vfio_pci_eoi(VFIODevice *vdev)
         return;
     }
 
-    DPRINTF("%s(%04x:%02x:%02x.%x) EOI\n", __func__, vpcidev->host.domain,
-            vpcidev->host.bus, vpcidev->host.slot, vpcidev->host.function);
+    DPRINTF("%s(%s) EOI\n", __func__, vdev->name);
 
     vpcidev->intx.pending = false;
     pci_irq_deassert(&vpcidev->pdev);
@@ -303,9 +300,7 @@ static void vfio_enable_intx_kvm(VFIOPCIDevice *vdev)
 
     vdev->intx.kvm_accel = true;
 
-    DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel enabled\n",
-            __func__, vdev->host.domain, vdev->host.bus,
-            vdev->host.slot, vdev->host.function);
+    DPRINTF("%s(%s) KVM INTx accel enabled\n", __func__, vdev->vdev.name);
 
     return;
 
@@ -357,9 +352,7 @@ static void vfio_disable_intx_kvm(VFIOPCIDevice *vdev)
     /* If we've missed an event, let it re-fire through QEMU */
     vfio_unmask_irqindex(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX);
 
-    DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel disabled\n",
-            __func__, vdev->host.domain, vdev->host.bus,
-            vdev->host.slot, vdev->host.function);
+    DPRINTF("%s(%s) KVM INTx accel disabled\n", __func__, vdev->vdev.name);
 #endif
 }
 
@@ -378,9 +371,8 @@ static void vfio_update_irq(PCIDevice *pdev)
         return; /* Nothing changed */
     }
 
-    DPRINTF("%s(%04x:%02x:%02x.%x) IRQ moved %d -> %d\n", __func__,
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function, vdev->intx.route.irq, route.irq);
+    DPRINTF("%s(%s) IRQ moved %d -> %d\n", __func__, vdev->vdev.name,
+            vdev->intx.route.irq, route.irq);
 
     vfio_disable_intx_kvm(vdev);
 
@@ -456,8 +448,7 @@ static int vfio_enable_intx(VFIOPCIDevice *vdev)
 
     vdev->interrupt = VFIO_INT_INTx;
 
-    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __func__, vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function);
+    DPRINTF("%s(%s)\n", __func__, vdev->vdev.name);
 
     return 0;
 }
@@ -479,8 +470,7 @@ static void vfio_disable_intx(VFIOPCIDevice *vdev)
 
     vdev->interrupt = VFIO_INT_NONE;
 
-    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __func__, vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function);
+    DPRINTF("%s(%s)\n", __func__, vdev->vdev.name);
 }
 
 /*
@@ -507,9 +497,8 @@ static void vfio_msi_interrupt(void *opaque)
         abort();
     }
 
-    DPRINTF("%s(%04x:%02x:%02x.%x) vector %d 0x%"PRIx64"/0x%x\n", __func__,
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function, nr, msg.address, msg.data);
+    DPRINTF("%s(%s) vector %d 0x%"PRIx64"/0x%x\n", __func__,
+            vdev->vdev.name, nr, msg.address, msg.data);
 #endif
 
     if (vdev->interrupt == VFIO_INT_MSIX) {
@@ -560,9 +549,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     VFIOMSIVector *vector;
     int ret;
 
-    DPRINTF("%s(%04x:%02x:%02x.%x) vector %d used\n", __func__,
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function, nr);
+    DPRINTF("%s(%s) vector %d used\n", __func__, vdev->vdev.name, nr);
 
     vector = &vdev->msi_vectors[nr];
     vector->vdev = vdev;
@@ -645,9 +632,7 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
     struct vfio_irq_set *irq_set;
     int32_t *pfd;
 
-    DPRINTF("%s(%04x:%02x:%02x.%x) vector %d released\n", __func__,
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function, nr);
+    DPRINTF("%s(%s) vector %d released\n", __func__, vdev->vdev.name, nr);
 
     /*
      * XXX What's the right thing to do here?  This turns off the interrupt
@@ -716,8 +701,7 @@ static void vfio_enable_msix(VFIOPCIDevice *vdev)
         error_report("vfio: msix_set_vector_notifiers failed");
     }
 
-    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __func__, vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function);
+    DPRINTF("%s(%s)\n", __func__, vdev->vdev.name);
 }
 
 static void vfio_enable_msi(VFIOPCIDevice *vdev)
@@ -792,9 +776,8 @@ retry:
 
     vdev->interrupt = VFIO_INT_MSI;
 
-    DPRINTF("%s(%04x:%02x:%02x.%x) Enabled %d MSI vectors\n", __func__,
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function, vdev->nr_vectors);
+    DPRINTF("%s(%s) Enabled %d MSI vectors\n", __func__, vdev->vdev.name,
+            vdev->nr_vectors);
 }
 
 static void vfio_disable_msi_common(VFIOPCIDevice *vdev)
@@ -829,8 +812,7 @@ static void vfio_disable_msix(VFIOPCIDevice *vdev)
 
     vfio_disable_msi_common(vdev);
 
-    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __func__, vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function);
+    DPRINTF("%s(%s)\n", __func__, vdev->vdev.name);
 }
 
 static void vfio_disable_msi(VFIOPCIDevice *vdev)
@@ -861,8 +843,7 @@ static void vfio_disable_msi(VFIOPCIDevice *vdev)
 
     vfio_disable_msi_common(vdev);
 
-    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __func__, vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function);
+    DPRINTF("%s(%s)\n", __func__, vdev->vdev.name);
 }
 
 static void vfio_update_msi(VFIOPCIDevice *vdev)
@@ -882,9 +863,8 @@ static void vfio_update_msi(VFIOPCIDevice *vdev)
         if (msg.address != vector->msg.address ||
             msg.data != vector->msg.data) {
 
-            DPRINTF("%s(%04x:%02x:%02x.%x) MSI vector %d changed\n",
-                    __func__, vdev->host.domain, vdev->host.bus,
-                    vdev->host.slot, vdev->host.function, i);
+            DPRINTF("%s(%s) MSI vector %d changed\n", __func__,
+                    vdev->vdev.name, i);
 
             kvm_irqchip_update_msi_route(kvm_state, vector->virq, msg);
             vector->msg = msg;
@@ -918,8 +898,7 @@ static void vfio_pci_load_rom(VFIOPCIDevice *vdev)
         return;
     }
 
-    DPRINTF("Device %04x:%02x:%02x.%x ROM:\n", vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function);
+    DPRINTF("Device %s ROM:\n", vdev->vdev.name);
     DPRINTF("  size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n",
             (unsigned long)reg_info.size, (unsigned long)reg_info.offset,
             (unsigned long)reg_info.flags);
@@ -929,10 +908,7 @@ static void vfio_pci_load_rom(VFIOPCIDevice *vdev)
 
     if (!vdev->rom_size) {
         vdev->rom_read_failed = true;
-        error_report("vfio-pci: Cannot read device rom at "
-                    "%04x:%02x:%02x.%x",
-                    vdev->host.domain, vdev->host.bus, vdev->host.slot,
-                    vdev->host.function);
+        error_report("vfio-pci: Cannot read device rom at %s", vdev->vdev.name);
         error_printf("Device option ROM contents are probably invalid "
                     "(check dmesg).\nSkip option ROM probe with rombar=0, "
                     "or load from file with romfile=\n");
@@ -972,9 +948,8 @@ static uint64_t vfio_rom_read(void *opaque, hwaddr addr, unsigned size)
     memcpy(&val, vdev->rom + addr,
            (addr < vdev->rom_size) ? MIN(size, vdev->rom_size - addr) : 0);
 
-    DPRINTF("%s(%04x:%02x:%02x.%x, 0x%"HWADDR_PRIx", 0x%x) = 0x%"PRIx64"\n",
-            __func__, vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function, addr, size, val);
+    DPRINTF("%s(%s, 0x%"HWADDR_PRIx", 0x%x) = 0x%"PRIx64"\n",
+            __func__, vdev->vdev.name, addr, size, val);
 
     return val;
 }
@@ -1021,12 +996,11 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
     if (vdev->pdev.romfile || !vdev->pdev.rom_bar) {
         /* Since pci handles romfile, just print a message and return */
         if (vfio_blacklist_opt_rom(vdev) && vdev->pdev.romfile) {
-            error_printf("Warning : Device at %04x:%02x:%02x.%x "
+            error_printf("Warning : Device at %s "
                          "is known to cause system instability issues during "
                          "option rom execution. "
                          "Proceeding anyway since user specified romfile\n",
-                         vdev->host.domain, vdev->host.bus, vdev->host.slot,
-                         vdev->host.function);
+                         vdev->vdev.name);
         }
         return;
     }
@@ -1039,9 +1013,7 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
         pwrite(fd, &size, 4, offset) != 4 ||
         pread(fd, &size, 4, offset) != 4 ||
         pwrite(fd, &orig, 4, offset) != 4) {
-        error_report("%s(%04x:%02x:%02x.%x) failed: %m",
-                     __func__, vdev->host.domain, vdev->host.bus,
-                     vdev->host.slot, vdev->host.function);
+        error_report("%s(%s) failed: %m", __func__, vdev->vdev.name);
         return;
     }
 
@@ -1053,30 +1025,25 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
 
     if (vfio_blacklist_opt_rom(vdev)) {
         if (dev->opts && qemu_opt_get(dev->opts, "rombar")) {
-            error_printf("Warning : Device at %04x:%02x:%02x.%x "
+            error_printf("Warning : Device at %s "
                          "is known to cause system instability issues during "
                          "option rom execution. "
                          "Proceeding anyway since user specified non zero value for "
                          "rombar\n",
-                         vdev->host.domain, vdev->host.bus, vdev->host.slot,
-                         vdev->host.function);
+                         vdev->vdev.name);
         } else {
             error_printf("Warning : Rom loading for device at "
-                         "%04x:%02x:%02x.%x has been disabled due to "
+                         "%s has been disabled due to "
                          "system instability issues. "
                          "Specify rombar=1 or romfile to force\n",
-                         vdev->host.domain, vdev->host.bus, vdev->host.slot,
-                         vdev->host.function);
+                         vdev->vdev.name);
             return;
         }
     }
 
-    DPRINTF("%04x:%02x:%02x.%x ROM size 0x%x\n", vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function, size);
+    DPRINTF("%s ROM size 0x%x\n", vdev->vdev.name, size);
 
-    snprintf(name, sizeof(name), "vfio[%04x:%02x:%02x.%x].rom",
-             vdev->host.domain, vdev->host.bus, vdev->host.slot,
-             vdev->host.function);
+    snprintf(name, sizeof(name), "vfio[%s].rom", vdev->vdev.name);
 
     memory_region_init_io(&vdev->pdev.rom, OBJECT(vdev),
                           &vfio_rom_ops, vdev, name, size);
@@ -1207,9 +1174,8 @@ static uint64_t vfio_generic_window_quirk_read(void *opaque,
         data = vfio_pci_read_config(&vdev->pdev,
                                     quirk->data.address_val + offset, size);
 
-        DPRINTF("%s read(%04x:%02x:%02x.%x:BAR%d+0x%"HWADDR_PRIx", %d) = 0x%"
-                PRIx64"\n", memory_region_name(&quirk->mem), vdev->host.domain,
-                vdev->host.bus, vdev->host.slot, vdev->host.function,
+        DPRINTF("%s read(%s:BAR%d+0x%"HWADDR_PRIx", %d) = 0x%"
+                PRIx64"\n", memory_region_name(&quirk->mem), vdev->vdev.name,
                 quirk->data.bar, addr, size, data);
     } else {
         data = vfio_region_read(&vdev->bars[quirk->data.bar].region,
@@ -1256,10 +1222,9 @@ static void vfio_generic_window_quirk_write(void *opaque, hwaddr addr,
 
         vfio_pci_write_config(&vdev->pdev,
                               quirk->data.address_val + offset, data, size);
-        DPRINTF("%s write(%04x:%02x:%02x.%x:BAR%d+0x%"HWADDR_PRIx", 0x%"
+        DPRINTF("%s write(%s:BAR%d+0x%"HWADDR_PRIx", 0x%"
                 PRIx64", %d)\n", memory_region_name(&quirk->mem),
-                vdev->host.domain, vdev->host.bus, vdev->host.slot,
-                vdev->host.function, quirk->data.bar, addr, data, size);
+                vdev->vdev.name, quirk->data.bar, addr, data, size);
         return;
     }
 
@@ -1292,9 +1257,8 @@ static uint64_t vfio_generic_quirk_read(void *opaque,
 
         data = vfio_pci_read_config(&vdev->pdev, addr - offset, size);
 
-        DPRINTF("%s read(%04x:%02x:%02x.%x:BAR%d+0x%"HWADDR_PRIx", %d) = 0x%"
-                PRIx64"\n", memory_region_name(&quirk->mem), vdev->host.domain,
-                vdev->host.bus, vdev->host.slot, vdev->host.function,
+        DPRINTF("%s read(%s:BAR%d+0x%"HWADDR_PRIx", %d) = 0x%"
+                PRIx64"\n", memory_region_name(&quirk->mem), vdev->vdev.name,
                 quirk->data.bar, addr + base, size, data);
     } else {
         data = vfio_region_read(&vdev->bars[quirk->data.bar].region,
@@ -1322,10 +1286,9 @@ static void vfio_generic_quirk_write(void *opaque, hwaddr addr,
 
         vfio_pci_write_config(&vdev->pdev, addr - offset, data, size);
 
-        DPRINTF("%s write(%04x:%02x:%02x.%x:BAR%d+0x%"HWADDR_PRIx", 0x%"
+        DPRINTF("%s write(%s:BAR%d+0x%"HWADDR_PRIx", 0x%"
                 PRIx64", %d)\n", memory_region_name(&quirk->mem),
-                vdev->host.domain, vdev->host.bus, vdev->host.slot,
-                vdev->host.function, quirk->data.bar, addr + base, data, size);
+                vdev->vdev.name, quirk->data.bar, addr + base, data, size);
     } else {
         vfio_region_write(&vdev->bars[quirk->data.bar].region, addr + base,
                             data, size);
@@ -1396,9 +1359,7 @@ static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
     QLIST_INSERT_HEAD(&vdev->vga.region[QEMU_PCI_VGA_IO_HI].quirks,
                       quirk, next);
 
-    DPRINTF("Enabled ATI/AMD quirk 0x3c3 BAR4for device %04x:%02x:%02x.%x\n",
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function);
+    DPRINTF("Enabled ATI/AMD quirk 0x3c3 BAR4 for device %s\n", vdev->vdev.name);
 }
 
 /*
@@ -1439,9 +1400,8 @@ static void vfio_probe_ati_bar4_window_quirk(VFIOPCIDevice *vdev, int nr)
 
     QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
 
-    DPRINTF("Enabled ATI/AMD BAR4 window quirk for device %04x:%02x:%02x.%x\n",
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function);
+    DPRINTF("Enabled ATI/AMD BAR4 window quirk for device %s\n",
+            vdev->vdev.name);
 }
 
 /*
@@ -1474,9 +1434,8 @@ static void vfio_probe_ati_bar2_4000_quirk(VFIOPCIDevice *vdev, int nr)
 
     QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
 
-    DPRINTF("Enabled ATI/AMD BAR2 0x4000 quirk for device %04x:%02x:%02x.%x\n",
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function);
+    DPRINTF("Enabled ATI/AMD BAR2 0x4000 quirk for device %s\n",
+            vdev->vdev.name);
 }
 
 /*
@@ -1609,9 +1568,7 @@ static void vfio_vga_probe_nvidia_3d0_quirk(VFIOPCIDevice *vdev)
     QLIST_INSERT_HEAD(&vdev->vga.region[QEMU_PCI_VGA_IO_HI].quirks,
                       quirk, next);
 
-    DPRINTF("Enabled NVIDIA VGA 0x3d0 quirk for device %04x:%02x:%02x.%x\n",
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function);
+    DPRINTF("Enabled NVIDIA VGA 0x3d0 quirk for device %s\n", vdev->vdev.name);
 }
 
 /*
@@ -1700,9 +1657,8 @@ static void vfio_probe_nvidia_bar5_window_quirk(VFIOPCIDevice *vdev, int nr)
 
     QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
 
-    DPRINTF("Enabled NVIDIA BAR5 window quirk for device %04x:%02x:%02x.%x\n",
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function);
+    DPRINTF("Enabled NVIDIA BAR5 window quirk for device %s\n",
+            vdev->vdev.name);
 }
 
 static void vfio_nvidia_88000_quirk_write(void *opaque, hwaddr addr,
@@ -1769,9 +1725,8 @@ static void vfio_probe_nvidia_bar0_88000_quirk(VFIOPCIDevice *vdev, int nr)
 
     QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
 
-    DPRINTF("Enabled NVIDIA BAR0 0x88000 quirk for device %04x:%02x:%02x.%x\n",
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function);
+    DPRINTF("Enabled NVIDIA BAR0 0x88000 quirk for device %s\n",
+            vdev->vdev.name);
 }
 
 /*
@@ -1808,9 +1763,8 @@ static void vfio_probe_nvidia_bar0_1800_quirk(VFIOPCIDevice *vdev, int nr)
 
     QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
 
-    DPRINTF("Enabled NVIDIA BAR0 0x1800 quirk for device %04x:%02x:%02x.%x\n",
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function);
+    DPRINTF("Enabled NVIDIA BAR0 0x1800 quirk for device %s\n",
+            vdev->vdev.name);
 }
 
 /*
@@ -1885,9 +1839,8 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 
         ret = pread(vdev->vdev.fd, &phys_val, len, vdev->config_offset + addr);
         if (ret != len) {
-            error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %m",
-                         __func__, vdev->host.domain, vdev->host.bus,
-                         vdev->host.slot, vdev->host.function, addr, len);
+            error_report("%s(%s, 0x%x, 0x%x) failed: %m",
+                         __func__, vdev->vdev.name, addr, len);
             return -errno;
         }
         phys_val = le32_to_cpu(phys_val);
@@ -1895,9 +1848,8 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 
     val = (emu_val & emu_bits) | (phys_val & ~emu_bits);
 
-    DPRINTF("%s(%04x:%02x:%02x.%x, @0x%x, len=0x%x) %x\n", __func__,
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function, addr, len, val);
+    DPRINTF("%s(%s, @0x%x, len=0x%x) %x\n", __func__, vdev->vdev.name,
+            addr, len, val);
 
     return val;
 }
@@ -1912,11 +1864,10 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
             addr, val, len);
 
     /* Write everything to VFIO, let it filter out what we can't write */
-    if (pwrite(vdev->vdev.fd, &val_le, len, 
+    if (pwrite(vdev->vdev.fd, &val_le, len,
                 vdev->config_offset + addr) != len) {
-        error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x, 0x%x) failed: %m",
-                     __func__, vdev->host.domain, vdev->host.bus,
-                     vdev->host.slot, vdev->host.function, addr, val, len);
+        error_report("%s(%s, 0x%x, 0x%x, 0x%x) failed: %m",
+                    __func__, vdev->vdev.name, addr, val, len);
     }
 
     /* MSI/MSI-X Enabling/Disabling */
@@ -1992,8 +1943,7 @@ static int vfio_setup_msi(VFIOPCIDevice *vdev, int pos)
     msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
     entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
 
-    DPRINTF("%04x:%02x:%02x.%x PCI MSI CAP @0x%x\n", vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function, pos);
+    DPRINTF("%s PCI MSI CAP @0x%x\n", vdev->vdev.name, pos);
 
     ret = msi_init(&vdev->pdev, pos, entries, msi_64bit, msi_maskbit);
     if (ret < 0) {
@@ -2054,10 +2004,8 @@ static int vfio_early_setup_msix(VFIOPCIDevice *vdev)
     vdev->msix->pba_offset = pba & ~PCI_MSIX_FLAGS_BIRMASK;
     vdev->msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1;
 
-    DPRINTF("%04x:%02x:%02x.%x "
-            "PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, entries %d\n",
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function, pos, vdev->msix->table_bar,
+    DPRINTF("%s PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, entries %d\n",
+            vdev->vdev.name, pos, vdev->msix->table_bar,
             vdev->msix->table_offset, vdev->msix->entries);
 
     return 0;
@@ -2151,9 +2099,7 @@ static void vfio_map_bar(VFIOPCIDevice *vdev, int nr)
         return;
     }
 
-    snprintf(name, sizeof(name), "VFIO %04x:%02x:%02x.%x BAR %d",
-             vdev->host.domain, vdev->host.bus, vdev->host.slot,
-             vdev->host.function, nr);
+    snprintf(name, sizeof(name), "VFIO %s BAR %d", vdev->vdev.name, nr);
 
     /* Determine what type of BAR this is for registration */
     ret = pread(vdev->vdev.fd, &pci_bar, sizeof(pci_bar),
@@ -2397,9 +2343,7 @@ static void vfio_check_pcie_flr(VFIOPCIDevice *vdev, uint8_t pos)
     uint32_t cap = pci_get_long(vdev->pdev.config + pos + PCI_EXP_DEVCAP);
 
     if (cap & PCI_EXP_DEVCAP_FLR) {
-        DPRINTF("%04x:%02x:%02x.%x Supports FLR via PCIe cap\n",
-                vdev->host.domain, vdev->host.bus, vdev->host.slot,
-                vdev->host.function);
+        DPRINTF("%s Supports FLR via PCIe cap\n", vdev->vdev.name);
         vdev->has_flr = true;
     }
 }
@@ -2409,9 +2353,7 @@ static void vfio_check_pm_reset(VFIOPCIDevice *vdev, uint8_t pos)
     uint16_t csr = pci_get_word(vdev->pdev.config + pos + PCI_PM_CTRL);
 
     if (!(csr & PCI_PM_CTRL_NO_SOFT_RESET)) {
-        DPRINTF("%04x:%02x:%02x.%x Supports PM reset\n",
-                vdev->host.domain, vdev->host.bus, vdev->host.slot,
-                vdev->host.function);
+        DPRINTF("%s Supports PM reset\n", vdev->vdev.name);
         vdev->has_pm_reset = true;
     }
 }
@@ -2421,9 +2363,7 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
     uint8_t cap = pci_get_byte(vdev->pdev.config + pos + PCI_AF_CAP);
 
     if ((cap & PCI_AF_CAP_TP) && (cap & PCI_AF_CAP_FLR)) {
-        DPRINTF("%04x:%02x:%02x.%x Supports FLR via AF cap\n",
-                vdev->host.domain, vdev->host.bus, vdev->host.slot,
-                vdev->host.function);
+        DPRINTF("%s Supports FLR via AF cap\n", vdev->vdev.name);
         vdev->has_flr = true;
     }
 }
@@ -2493,9 +2433,8 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
     }
 
     if (ret < 0) {
-        error_report("vfio: %04x:%02x:%02x.%x Error adding PCI capability "
-                     "0x%x[0x%x]@0x%x: %d", vdev->host.domain,
-                     vdev->host.bus, vdev->host.slot, vdev->host.function,
+        error_report("vfio: %s Error adding PCI capability "
+                     "0x%x[0x%x]@0x%x: %d", vdev->vdev.name,
                      cap_id, size, pos, ret);
         return ret;
     }
@@ -2575,9 +2514,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
     bool multi = false;
     int fd = vdev->vdev.fd;
 
-    DPRINTF("%s(%04x:%02x:%02x.%x) %s\n", __func__, vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function,
-            single ? "one" : "multi");
+    DPRINTF("%s(%s) %s\n", __func__, vdev->vdev.name, single ? "one" : "multi");
 
     vfio_pci_pre_reset(vdev);
     vdev->vdev.needs_reset = false;
@@ -2589,9 +2526,8 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
     if (ret && errno != ENOSPC) {
         ret = -errno;
         if (!vdev->has_pm_reset) {
-            error_report("vfio: Cannot reset device %04x:%02x:%02x.%x, "
-                         "no available reset mechanism.", vdev->host.domain,
-                         vdev->host.bus, vdev->host.slot, vdev->host.function);
+            error_report("vfio: Cannot reset device %s, "
+                         "no available reset mechanism.", vdev->vdev.name);
         }
         goto out_single;
     }
@@ -2608,9 +2544,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
         goto out_single;
     }
 
-    DPRINTF("%04x:%02x:%02x.%x: hot reset dependent devices:\n",
-            vdev->host.domain, vdev->host.bus, vdev->host.slot,
-            vdev->host.function);
+    DPRINTF("%s: hot reset dependent devices:\n", vdev->vdev.name);
 
     /* Verify that we have all the groups required */
     for (i = 0; i < info->count; i++) {
@@ -2638,10 +2572,9 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
 
         if (!group) {
             if (!vdev->has_pm_reset) {
-                error_report("vfio: Cannot reset device %04x:%02x:%02x.%x, "
+                error_report("vfio: Cannot reset device %s, "
                              "depends on group %d which is not owned.",
-                             vdev->host.domain, vdev->host.bus, vdev->host.slot,
-                             vdev->host.function, devices[i].group_id);
+                             vdev->vdev.name, devices[i].group_id);
             }
             ret = -EPERM;
             goto out;
@@ -2704,9 +2637,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
     ret = ioctl(fd, VFIO_DEVICE_PCI_HOT_RESET, reset);
     g_free(reset);
 
-    DPRINTF("%04x:%02x:%02x.%x hot reset: %s\n", vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function,
-            ret ? "%m" : "Success");
+    DPRINTF("%s hot reset: %s\n", vdev->vdev.name, ret ? "%m" : "Success");
 
 out:
     /* Re-enable INTx on affected devices */
@@ -3198,15 +3129,13 @@ static void vfio_pci_reset(DeviceState *dev)
     VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev);
     int fd = vdev->vdev.fd;
 
-    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __func__, vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function);
+    DPRINTF("%s(%s)\n", __func__, vdev->vdev.name);
 
     vfio_pci_pre_reset(vdev);
 
     if (vdev->vdev.reset_works && (vdev->has_flr || !vdev->has_pm_reset) &&
-        !ioctl(vdev->vdev.fd, VFIO_DEVICE_RESET)) {
-        DPRINTF("%04x:%02x:%02x.%x FLR/VFIO_DEVICE_RESET\n", vdev->host.domain,
-            vdev->host.bus, vdev->host.slot, vdev->host.function);
+        !ioctl(fd, VFIO_DEVICE_RESET)) {
+        DPRINTF("%s FLR/VFIO_DEVICE_RESET\n", vdev->vdev.name);
         goto post_reset;
     }
 
-- 
1.8.3.2


* [Qemu-devel] [RFC v3 05/10] vfio: Add initial IRQ support in platform device
  2014-06-02  7:49 [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough Eric Auger
                   ` (3 preceding siblings ...)
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 04/10] vfio: simplifed DPRINTF calls using device name Eric Auger
@ 2014-06-02  7:49 ` Eric Auger
  2014-06-25 21:28   ` Alexander Graf
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option Eric Auger
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Eric Auger @ 2014-06-02  7:49 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel, kim.phillips, a.rigo
  Cc: peter.maydell, eric.auger, patches, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

This patch adds initial support for device IRQ assignment to a
KVM guest. The code is inspired by the PCI INTx code.

General principle of IRQ handling:

When a physical IRQ occurs, the VFIO driver signals an eventfd that
was registered by the QEMU VFIO platform device. The eventfd handler
(vfio_intp_interrupt) injects the IRQ through QEMU/KVM and also
disables the MMIO region fast path (where MMIO regions are mapped as
RAM). The purpose is to trap the guest's reset of the IRQ status
register. The physical interrupt is unmasked on the first read/write
to any MMIO region. It was masked in the VFIO driver at the instant
it signaled the eventfd.

Only a single IRQ can be forwarded to the guest at a time, i.e. before
a new virtual IRQ is injected, the previously active one must have
completed.
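
The one-active-IRQ rule can be illustrated with a small standalone
model. This is only a sketch under stated assumptions: the three states
mirror the VFIO_IRQ_* enum added by this patch, but irq_model,
model_init, model_trigger, model_eoi and MAX_IRQS are hypothetical
names that do not exist in QEMU, and no eventfd or KVM call is made.

```c
#include <assert.h>
#include <stddef.h>

/* States mirror the patch's VFIO_IRQ_* enum; all other names are hypothetical. */
enum irq_state { IRQ_INACTIVE, IRQ_PENDING, IRQ_ACTIVE };

#define MAX_IRQS 8

struct irq_model {
    enum irq_state state[MAX_IRQS];
    int pending[MAX_IRQS];  /* FIFO of IRQ indexes waiting for injection */
    size_t head, count;     /* ring buffer read position and fill level */
    int active;             /* index of the active IRQ, or -1 if none */
};

static void model_init(struct irq_model *m)
{
    for (size_t i = 0; i < MAX_IRQS; i++) {
        m->state[i] = IRQ_INACTIVE;
    }
    m->head = 0;
    m->count = 0;
    m->active = -1;
}

/*
 * A physical IRQ fired. Inject it if no IRQ is active, otherwise
 * queue it. Returns 1 when injected, 0 when queued.
 */
static int model_trigger(struct irq_model *m, int irq)
{
    assert(irq >= 0 && irq < MAX_IRQS);
    if (m->active >= 0) {
        /* an IRQ is masked while outstanding, so it is queued at most once */
        assert(m->count < MAX_IRQS);
        m->state[irq] = IRQ_PENDING;
        m->pending[(m->head + m->count) % MAX_IRQS] = irq;
        m->count++;
        return 0;
    }
    m->state[irq] = IRQ_ACTIVE;
    m->active = irq;
    return 1;
}

/*
 * The guest completed the active IRQ. Complete it, then inject the
 * oldest pending IRQ, if any. Returns the newly active index or -1.
 */
static int model_eoi(struct irq_model *m)
{
    if (m->active >= 0) {
        m->state[m->active] = IRQ_INACTIVE;
        m->active = -1;
    }
    if (m->count > 0) {
        int next = m->pending[m->head];
        m->head = (m->head + 1) % MAX_IRQS;
        m->count--;
        model_trigger(m, next);
        return next;
    }
    return -1;
}
```

A fixed-size ring is enough in this model because each IRQ index can
sit in the queue at most once: the physical IRQ stays masked until it
has been completed by the guest.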

When no IRQ is pending anymore, the fast path can be restored. This is
done when the mmap_timer expires.
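
The slow/fast path switch can be sketched as a tiny state machine.
This is a simplified model, not the QEMU code: path_model and the
path_on_* functions are hypothetical names, time is passed in
explicitly instead of reading QEMU_CLOCK_VIRTUAL, and a single
irq_active flag stands in for the walk over the IRQ list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the mmap_timer logic; not a QEMU structure. */
struct path_model {
    bool fast_path;       /* true when MMIO regions are mmapped */
    bool irq_active;      /* true between injection and guest EOI */
    int64_t deadline_ms;  /* next timer expiry, or -1 when not armed */
    uint32_t timeout_ms;  /* plays the role of mmap_timeout */
};

/* IRQ injected: switch to the slow path and arm the timer. */
static void path_on_irq(struct path_model *p, int64_t now_ms)
{
    p->irq_active = true;
    p->fast_path = false;
    if (p->timeout_ms) {
        p->deadline_ms = now_ms + p->timeout_ms;
    }
}

/* Guest access trapped on the slow path: the IRQ is completed. */
static void path_on_eoi(struct path_model *p)
{
    p->irq_active = false;
}

/* Timer expired: re-arm while an IRQ is active, else restore fast path. */
static void path_on_timer(struct path_model *p, int64_t now_ms)
{
    /* the timer is only ever armed with a non-zero timeout */
    assert(p->timeout_ms > 0);
    if (p->irq_active) {
        p->deadline_ms = now_ms + p->timeout_ms;
        return;
    }
    p->deadline_ms = -1;
    p->fast_path = true;
}
```

In this model the fast path is never restored from the EOI itself,
only from the timer, which avoids toggling the mapping on every
interrupt when IRQs arrive in bursts.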

irqfd support will be added in a subsequent patch. irqfd brings a
framework where the eventfd is handled on the kernel side instead of
on the user side as currently done, hence improving performance.

Although the code is prepared to support multiple IRQs, this has not
been tested at this stage.

Tested on Calxeda Midway xgmac which can be directly assigned to one
guest (unfortunately only the main IRQ is exercised). A KVM patch is
required to invalidate stage2 entries on RAM memory region destruction
(https://patches.linaro.org/27691/). Without that patch, slow/fast path
switch cannot work.

change v2 -> v3:

- Move mmap_timer and mmap_timeout in new VFIODevice struct as
  PCI/platform factorization.
- multiple IRQ handling (a pending IRQ queue is added) - not tested -
- create vfio_mmap_set_enabled as in PCI code
- name of irq changed in virt

Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 hw/arm/virt.c         |  13 +-
 hw/vfio/pci.c         |  22 ++--
 hw/vfio/platform.c    | 323 ++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/vfio/vfio-common.h |  10 +-
 4 files changed, 346 insertions(+), 22 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index becd76b..f5693aa 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -112,6 +112,7 @@ static const MemMapEntry a15memmap[] = {
 static const int a15irqmap[] = {
     [VIRT_UART] = 1,
     [VIRT_MMIO] = 16, /* ...to 16 + NUM_VIRTIO_TRANSPORTS - 1 */
+    [VIRT_ETHERNET] = 77,
 };
 
 static VirtBoardInfo machines[] = {
@@ -348,8 +349,14 @@ static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
     hwaddr base = vbi->memmap[VIRT_ETHERNET].base;
     hwaddr size = vbi->memmap[VIRT_ETHERNET].size;
     const char compat[] = "calxeda,hb-xgmac";
+    int main_irq = vbi->irqmap[VIRT_ETHERNET];
+    int power_irq = main_irq+1;
+    int low_power_irq = main_irq+2;
 
-    sysbus_create_simple("vfio-platform", base, NULL);
+    sysbus_create_varargs("vfio-platform", base,
+                          pic[main_irq],
+                          pic[power_irq],
+                          pic[low_power_irq], NULL);
 
     nodename = g_strdup_printf("/ethernet@%" PRIx64, base);
     qemu_fdt_add_subnode(vbi->fdt, nodename);
@@ -357,6 +364,10 @@ static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
     /* Note that we can't use setprop_string because of the embedded NUL */
     qemu_fdt_setprop(vbi->fdt, nodename, "compatible", compat, sizeof(compat));
     qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", 2, base, 2, size);
+    qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts",
+                                0x0, main_irq, 0x4,
+                                0x0, power_irq, 0x4,
+                                0x0, low_power_irq, 0x4);
 
     g_free(nodename);
 }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ad0c2a0..1b49205 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -83,8 +83,6 @@ typedef struct VFIOINTx {
     EventNotifier interrupt; /* eventfd triggered on interrupt */
     EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
     PCIINTxRoute route; /* routing info for QEMU bypass */
-    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
-    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
 } VFIOINTx;
 
 typedef struct VFIOMSIVector {
@@ -196,8 +194,8 @@ static void vfio_intx_mmap_enable(void *opaque)
     VFIOPCIDevice *vdev = opaque;
 
     if (vdev->intx.pending) {
-        timer_mod(vdev->intx.mmap_timer,
-               qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout);
+        timer_mod(vdev->vdev.mmap_timer,
+               qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->vdev.mmap_timeout);
         return;
     }
 
@@ -217,9 +215,9 @@ static void vfio_intx_interrupt(void *opaque)
     vdev->intx.pending = true;
     pci_irq_assert(&vdev->pdev);
     vfio_mmap_set_enabled(vdev, false);
-    if (vdev->intx.mmap_timeout) {
-        timer_mod(vdev->intx.mmap_timer,
-               qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout);
+    if (vdev->vdev.mmap_timeout) {
+        timer_mod(vdev->vdev.mmap_timer,
+               qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->vdev.mmap_timeout);
     }
 }
 
@@ -457,7 +455,7 @@ static void vfio_disable_intx(VFIOPCIDevice *vdev)
 {
     int fd;
 
-    timer_del(vdev->intx.mmap_timer);
+    timer_del(vdev->vdev.mmap_timer);
     vfio_disable_intx_kvm(vdev);
     vfio_disable_irqindex(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX);
     vdev->intx.pending = false;
@@ -3079,7 +3077,7 @@ static int vfio_initfn(PCIDevice *pdev)
     }
 
     if (vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
-        vdev->intx.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+        vdev->vdev.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
                                                   vfio_intx_mmap_enable, vdev);
         pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_update_irq);
         ret = vfio_enable_intx(vdev);
@@ -3112,8 +3110,8 @@ static void vfio_exitfn(PCIDevice *pdev)
     vfio_unregister_err_notifier(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     vfio_disable_interrupts(vdev);
-    if (vdev->intx.mmap_timer) {
-        timer_free(vdev->intx.mmap_timer);
+    if (vdev->vdev.mmap_timer) {
+        timer_free(vdev->vdev.mmap_timer);
     }
     vfio_teardown_msi(vdev);
     vfio_unmap_bars(vdev);
@@ -3158,7 +3156,7 @@ post_reset:
 static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
     DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice,
-                       intx.mmap_timeout, 1100),
+                       vdev.mmap_timeout, 1100),
     DEFINE_PROP_BIT("x-vga", VFIOPCIDevice, features,
                     VFIO_FEATURE_ENABLE_VGA_BIT, false),
     DEFINE_PROP_INT32("bootindex", VFIOPCIDevice, bootindex, -1),
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 646aa53..5b9451f 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -24,11 +24,25 @@
 
 #include "vfio-common.h"
 
+typedef struct VFIOINTp {
+    QLIST_ENTRY(VFIOINTp) next; /* entry for IRQ list */
+    QSIMPLEQ_ENTRY(VFIOINTp) pqnext; /* entry for pending IRQ queue */
+    EventNotifier interrupt; /* eventfd triggered on interrupt */
+    EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
+    qemu_irq qemuirq;
+    struct VFIOPlatformDevice *vdev; /* back pointer to device */
+    int state; /* inactive, pending, active */
+    bool kvm_accel; /* set when QEMU bypass through KVM enabled */
+    uint8_t pin; /* index */
+} VFIOINTp;
+
 
 typedef struct VFIOPlatformDevice {
     SysBusDevice sbdev;
     VFIODevice vdev; /* not a QOM object */
-/* interrupts to come later on */
+    QLIST_HEAD(, VFIOINTp) intp_list; /* list of IRQ */
+    /* queue of pending IRQ */
+    QSIMPLEQ_HEAD(pending_intp_queue, VFIOINTp) pending_intp_queue;
 } VFIOPlatformDevice;
 
 
@@ -38,9 +52,11 @@ static const MemoryRegionOps vfio_region_ops = {
     .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
+static void vfio_intp_interrupt(void *opaque);
+
 /*
  * It is mandatory to pass a VFIOPlatformDevice since VFIODevice
- * is not an Object and cannot be passed to memory region functions
+ * is not a QOM Object and cannot be passed to memory region functions
 */
 
 static void vfio_map_region(VFIOPlatformDevice *vdev, int nr)
@@ -51,7 +67,7 @@ static void vfio_map_region(VFIOPlatformDevice *vdev, int nr)
 
     snprintf(name, sizeof(name), "VFIO %s region %d", vdev->vdev.name, nr);
 
-    /* A "slow" read/write mapping underlies all regions  */
+    /* A "slow" read/write mapping underlies all regions */
     memory_region_init_io(&region->mem, OBJECT(vdev), &vfio_region_ops,
                           region, name, size);
 
@@ -145,18 +161,292 @@ static int vfio_platform_hot_reset_multi(VFIODevice *vdev)
 return 0;
 }
 
+/*
+ * eoi function is called on the first access to any MMIO region
+ * after an IRQ was triggered. It is assumed this access corresponds
+ * to the IRQ status register reset.
+ * With such a mechanism, a single IRQ can be handled at a time since
+ * there is no way to know which IRQ was completed by the guest.
+ * (we would need additional details about the IRQ status register mask)
+ */
+
+static void vfio_platform_eoi(VFIODevice *vdev)
+{
+    VFIOINTp *intp;
+    VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
+    bool eoi_done = false;
+
+    QLIST_FOREACH(intp, &vplatdev->intp_list, next) {
+        if (intp->state == VFIO_IRQ_ACTIVE) {
+            if (eoi_done) {
+                error_report("several IRQ pending: "
+                             "this case should not happen!\n");
+            }
+            DPRINTF("EOI IRQ #%d fd=%d\n",
+                    intp->pin, event_notifier_get_fd(&intp->interrupt));
+            intp->state = VFIO_IRQ_INACTIVE;
+
+            /* deassert the virtual IRQ and unmask physical one */
+            qemu_set_irq(intp->qemuirq, 0);
+            vfio_unmask_irqindex(vdev, intp->pin);
+            eoi_done = true;
+        }
+    }
+
+    /*
+     * in case there are pending IRQs, handle them one at a time */
+     if (!QSIMPLEQ_EMPTY(&vplatdev->pending_intp_queue)) {
+            intp = QSIMPLEQ_FIRST(&vplatdev->pending_intp_queue);
+            vfio_intp_interrupt(intp);
+            QSIMPLEQ_REMOVE_HEAD(&vplatdev->pending_intp_queue, pqnext);
+     }
+
+    return;
+}
+
+/*
+ * enable/disable the fast path mode
+ * fast path = MMIO region is mmaped (no KVM TRAP)
+ * slow path = MMIO region is trapped and region callbacks are called
+ * slow path enables to trap the IRQ status register guest reset
+*/
+
+static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled)
+{
+    VFIORegion *region;
+    int i;
+
+    DPRINTF("fast path = %d\n", enabled);
+
+    for (i = 0; i < vdev->num_regions; i++) {
+        region = vdev->regions[i];
+
+        /* register space is unmapped to trap EOI */
+        memory_region_set_enabled(&region->mmap_mem, enabled);
+    }
+}
+
+/*
+ * Checks whether the IRQ is still pending. In the negative
+ * the fast path mode (where reg space is mmaped) can be restored.
+ * if the IRQ is still pending, we must keep on trapping IRQ status
+ * register reset with mmap disabled (slow path).
+ * the function is called on mmap_timer event.
+ * by construction a single fd is handled at a time. See EOI comment
+ * for additional details.
+ */
+
+
+static void vfio_intp_mmap_enable(void *opaque)
+{
+    VFIOINTp *tmp;
+    VFIODevice *vdev = (VFIODevice *)opaque;
+    VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
+    bool one_active_irq = false;
+
+    QLIST_FOREACH(tmp, &vplatdev->intp_list, next) {
+        if (tmp->state == VFIO_IRQ_ACTIVE) {
+            if (one_active_irq) {
+                error_report("several active IRQ: "
+                             "this case should not happen!\n");
+            }
+            DPRINTF("IRQ #%d still pending, stay in slow path\n",
+                    tmp->pin);
+            timer_mod(vdev->mmap_timer,
+                          qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
+                          vdev->mmap_timeout);
+            one_active_irq = true;
+        }
+    }
+    if (one_active_irq) {
+        return;
+    }
+
+    DPRINTF("no pending IRQ, restore fast path\n");
+    vfio_mmap_set_enabled(vdev, true);
+}
+
+/*
+ * The fd handler
+ */
+
+static void vfio_intp_interrupt(void *opaque)
+{
+    int ret;
+    VFIOINTp *tmp, *intp = (VFIOINTp *)opaque;
+    VFIOPlatformDevice *vplatdev = intp->vdev;
+    VFIODevice *vdev = &vplatdev->vdev;
+    bool one_active_irq = false;
+
+    /*
+     * first check whether there is a pending IRQ
+     * in the positive the new IRQ cannot be handled until the
+     * active one is not completed.
+     * by construction the same IRQ as the pending one cannot hit
+     * since the physical IRQ was disabled by the VFIO driver
+     */
+    QLIST_FOREACH(tmp, &vplatdev->intp_list, next) {
+        if (tmp->state == VFIO_IRQ_ACTIVE) {
+            one_active_irq = true;
+        }
+    }
+    if (one_active_irq) {
+        /*
+         * the new IRQ gets a pending status and is pushed in
+         * the pending queue
+         */
+        intp->state = VFIO_IRQ_PENDING;
+        QSIMPLEQ_INSERT_TAIL(&vplatdev->pending_intp_queue,
+                             intp, pqnext);
+        return;
+    }
+
+    /* no active IRQ, the new IRQ can be forwarded to guest */
+    DPRINTF("Handle IRQ #%d (fd = %d)\n",
+            intp->pin, event_notifier_get_fd(&intp->interrupt));
+
+    ret = event_notifier_test_and_clear(&intp->interrupt);
+    if (!ret) {
+        DPRINTF("Error when clearing fd=%d\n",
+                event_notifier_get_fd(&intp->interrupt));
+    }
+
+    intp->state = VFIO_IRQ_ACTIVE;
+
+    /* sets slow path */
+    vfio_mmap_set_enabled(vdev, false);
+
+    /* trigger the virtual IRQ */
+    qemu_set_irq(intp->qemuirq, 1);
+
+    /* schedule the mmap timer which will restore mmap path after EOI*/
+    if (vdev->mmap_timeout) {
+        timer_mod(vdev->mmap_timer,
+                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->mmap_timeout);
+    }
+
+}
+
+static int vfio_enable_intp(VFIODevice *vdev, unsigned int index)
+{
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+    int ret, argsz;
+    int device = vdev->fd;
+    VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
+    SysBusDevice *sbdev = SYS_BUS_DEVICE(vplatdev);
+
+    /* allocate and populate a new VFIOINTp structure put in a queue list */
+    VFIOINTp *intp = g_malloc0(sizeof(*intp));
+    intp->vdev = vplatdev;
+    intp->pin = index;
+    intp->state = VFIO_IRQ_INACTIVE;
+
+    sysbus_init_irq(sbdev, &intp->qemuirq);
+
+    ret = event_notifier_init(&intp->interrupt, 0);
+    if (ret) {
+        error_report("vfio: Error: event_notifier_init failed ");
+        return ret;
+    }
+    /* build the irq_set to be passed to the vfio kernel driver */
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = index;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = event_notifier_get_fd(&intp->interrupt);
+
+    DPRINTF("register fd=%d/irq index=%d to kernel\n", *pfd, index);
+
+    qemu_set_fd_handler(*pfd, vfio_intp_interrupt, NULL, intp);
+
+    /*
+     * pass the index/fd binding to the kernel driver so that it
+     * triggers this fd on HW IRQ
+     */
+    ret = ioctl(device, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to pass IRQ fd to the driver: %m");
+        qemu_set_fd_handler(*pfd, NULL, NULL, NULL);
+        close(*pfd); /* TO DO : replace by event_notifier_cleanup */
+        return -errno;
+    }
+
+    /* store the new intp in qlist */
+
+    QLIST_INSERT_HEAD(&vplatdev->intp_list, intp, next);
+
+    return 0;
+}
+
 
-/* not implemented yet */
 static int vfio_platform_get_device_interrupts(VFIODevice *vdev)
 {
+    struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+    int i, ret;
+    VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
+
+    /*
+     * mmap timeout = 1100 ms, PCI default value
+     * this will become a user-defined value in subsequent patch
+     */
+    vdev->mmap_timeout = 1100;
+    vdev->mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+                                    vfio_intp_mmap_enable, vdev);
+
+    QSIMPLEQ_INIT(&vplatdev->pending_intp_queue);
+
+    for (i = 0; i < vdev->num_irqs; i++) {
+        irq.index = i;
+
+        DPRINTF("Retrieve IRQ info from vfio platform driver ...\n");
+
+        ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);
+        if (ret) {
+            error_printf("vfio: error getting device %s irq info",
+                         vdev->name);
+        }
+        DPRINTF("- IRQ index %d: count %d, flags=0x%x\n",
+                irq.index, irq.count, irq.flags);
+
+        vfio_enable_intp(vdev, irq.index);
+    }
     return 0;
 }
 
-/* not implemented yet */
-static void vfio_platform_eoi(VFIODevice *vdev)
+
+static void vfio_disable_intp(VFIODevice *vdev)
 {
+    VFIOINTp *intp;
+    VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
+    int fd;
+
+    QLIST_FOREACH(intp, &vplatdev->intp_list, next) {
+        fd = event_notifier_get_fd(&intp->interrupt);
+        DPRINTF("close IRQ pin=%d fd=%d\n", intp->pin, fd);
+
+        vfio_disable_irqindex(vdev, intp->pin);
+        intp->state = VFIO_IRQ_INACTIVE;
+        qemu_set_irq(intp->qemuirq, 0);
+
+        qemu_set_fd_handler(fd, NULL, NULL, NULL);
+        event_notifier_cleanup(&intp->interrupt);
+    }
+
+    /* restore fast path */
+    vfio_mmap_set_enabled(vdev, true);
+
 }
 
+
 static VFIODeviceOps vfio_platform_ops = {
     .vfio_eoi = vfio_platform_eoi,
     .vfio_compute_needs_reset = vfio_platform_compute_needs_reset,
@@ -194,9 +484,11 @@ static void vfio_platform_realize(DeviceState *dev, Error **errp)
 static void vfio_platform_unrealize(DeviceState *dev, Error **errp)
 {
     int i;
+    VFIOINTp *intp, *next_intp;
     SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
-    VFIOPlatformDevice *vdev = container_of(sbdev, VFIOPlatformDevice, sbdev);
-    VFIODevice *vbasedev = &vdev->vdev;
+    VFIOPlatformDevice *vplatdev = container_of(sbdev,
+                                                VFIOPlatformDevice, sbdev);
+    VFIODevice *vbasedev = &vplatdev->vdev;
     VFIOGroup *group = vbasedev->group;
     /*
      * placeholder for
@@ -205,6 +497,21 @@ static void vfio_platform_unrealize(DeviceState *dev, Error **errp)
      * timer free
      * g_free vdev dynamic fields
     */
+    vfio_disable_intp(vbasedev);
+
+    while (!QSIMPLEQ_EMPTY(&vplatdev->pending_intp_queue)) {
+            QSIMPLEQ_REMOVE_HEAD(&vplatdev->pending_intp_queue, pqnext);
+     }
+
+    QLIST_FOREACH_SAFE(intp, &vplatdev->intp_list, next, next_intp) {
+        QLIST_REMOVE(intp, next);
+        g_free(intp);
+    }
+
+    if (vbasedev->mmap_timer) {
+        timer_free(vbasedev->mmap_timer);
+    }
+
     vfio_unmap_regions(vbasedev);
 
     for (i = 0; i < vbasedev->num_regions; i++) {
diff --git a/hw/vfio/vfio-common.h b/hw/vfio/vfio-common.h
index 2699fba..7139d81 100644
--- a/hw/vfio/vfio-common.h
+++ b/hw/vfio/vfio-common.h
@@ -42,6 +42,13 @@ enum {
     VFIO_DEVICE_TYPE_PLATFORM = 1,
 };
 
+enum {
+    VFIO_IRQ_INACTIVE = 0,
+    VFIO_IRQ_PENDING = 1,
+    VFIO_IRQ_ACTIVE = 2,
+    /* VFIO_IRQ_ACTIVE_AND_PENDING cannot happen with VFIO */
+};
+
 struct VFIOGroup;
 struct VFIODevice;
 
@@ -61,7 +68,6 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
-
 /* Base Class for a VFIO device */
 
 typedef struct VFIODevice {
@@ -75,6 +81,8 @@ typedef struct VFIODevice {
     int type;
     bool reset_works;
     bool needs_reset;
+    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
+    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
     VFIODeviceOps *ops;
 } VFIODevice;
 
-- 
1.8.3.2


* [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option
  2014-06-02  7:49 [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough Eric Auger
                   ` (4 preceding siblings ...)
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 05/10] vfio: Add initial IRQ support in platform device Eric Auger
@ 2014-06-02  7:49 ` Eric Auger
  2014-06-25 21:30   ` Alexander Graf
  2014-06-25 22:28   ` Peter Maydell
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 07/10] Add EXEC_FLAG to VFIO DMA mappings Eric Auger
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 28+ messages in thread
From: Eric Auger @ 2014-06-02  7:49 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel, kim.phillips, a.rigo
  Cc: peter.maydell, eric.auger, patches, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

This patch allows the end user to specify, on the QEMU command line,
the device to be directly assigned to the mach-virt guest.

The QEMU platform device becomes generic.

The current choice is to reuse the "-device" option.

For example, to assign the Calxeda Midway xgmac device, this option
is used:
-device vfio-platform,vfio_device="fff51000.ethernet",\
compat="calxeda/hb-xgmac",mmap-timeout-ms=1000

where
- fff51000.ethernet is the name of the device in
 /sys/bus/platform/devices/
- calxeda/hb-xgmac is the compatibility string, where the standard comma
  separator is replaced by "/" since the comma is used as a separator
  by the QEMU command line parser
- mmap-timeout-ms is the minimal amount of time (ms) during which
  accesses to the device register space are trapped (mmap disabled)
  after an IRQ triggers, in order to trap the end of interrupt (EOI).
  This is an optional parameter (default value set to 1100 ms).

mach-virt was modified to interpret this line and automatically

- map the device at a chosen guest physical address in
  [0xa004000, 0x10000000],
- map the device IRQs starting at 48,
- create the associated guest device tree with the provided
  compatibility.

The underlying implementation of the "-device" option is not standard,
which is debatable. Normally the option causes the QEMU device realize
function to be called once, after the virtual machine init has
executed.

In vl.c the following sequence is implemented:
1) machine init
   machine_class->init(&current_machine->init_args);
2) init of devices added with -device option
   qemu_opts_foreach(qemu_find_opts("device"),
                     device_init_func, NULL, 1)

The issue with that sequence is that the device tree is built in
mach-virt and at that stage we miss information about the VFIO device
(number of IRQs, number of regions and their sizes). That information
is only collected when the VFIO device is realized.

For that reason it is decided to interpret the -device option line in
mach-virt and also to call the VFIO realize function there.

Since vl.c is not changed by this patch, the VFIO realize function is
called twice, once from mach-virt and once from vl.c.
The second call returns immediately since the QEMU device is recognized
as already attached to the group.
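
The "already attached" test that lets the second realize call return
early amounts to a name lookup in the group's device list. A minimal
sketch, assuming a plain singly linked list as a stand-in for the QLIST
used in the patch; the Dev type and the function name are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a VFIODevice entry in a group's device list. */
typedef struct Dev {
    const char *name;
    struct Dev *next;
} Dev;

/* Return true when a device with the same name is already in the list. */
static bool is_already_attached(const Dev *head, const char *name)
{
    for (const Dev *d = head; d != NULL; d = d->next) {
        if (strcmp(d->name, name) == 0) {
            return true;
        }
    }
    return false;
}
```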

changes v2 -> v3
- retrieve device properties through standard functions
- support of multiple "reg" tuples (not tested)

Known limitations:
- a single compatibility value is currently supported
- more complex device nodes will require specialized code
- cases where multiple VFIO devices are assigned could not be tested
---
 hw/arm/virt.c         | 222 +++++++++++++++++++++++++++++++++++++++++---------
 hw/vfio/common.c      |  10 +--
 hw/vfio/pci.c         |  17 ++++
 hw/vfio/platform.c    |  39 ++++++---
 hw/vfio/vfio-common.h |   1 +
 5 files changed, 233 insertions(+), 56 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index f5693aa..8de6d1a 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -40,6 +40,8 @@
 #include "exec/address-spaces.h"
 #include "qemu/bitops.h"
 #include "qemu/error-report.h"
+#include "monitor/qdev.h"
+#include "qemu/config-file.h"
 
 #define NUM_VIRTIO_TRANSPORTS 32
 
@@ -65,7 +67,7 @@ enum {
     VIRT_GIC_CPU,
     VIRT_UART,
     VIRT_MMIO,
-    VIRT_ETHERNET,
+    VIRT_VFIO,
 };
 
 typedef struct MemMapEntry {
@@ -77,7 +79,10 @@ typedef struct VirtBoardInfo {
     struct arm_boot_info bootinfo;
     const char *cpu_model;
     const MemMapEntry *memmap;
+    qemu_irq pic[NUM_IRQS];
     const int *irqmap;
+    hwaddr avail_vfio_base;
+    int avail_vfio_irq;
     int smp_cpus;
     void *fdt;
     int fdt_size;
@@ -103,16 +108,16 @@ static const MemMapEntry a15memmap[] = {
     [VIRT_GIC_CPU] = { 0x8010000, 0x10000 },
     [VIRT_UART] = { 0x9000000, 0x1000 },
     [VIRT_MMIO] = { 0xa000000, 0x200 },
+    [VIRT_VFIO] = { 0xa004000, 0x0 }, /* size is dynamically populated */
     /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */
     /* 0x10000000 .. 0x40000000 reserved for PCI */
-    [VIRT_MEM] = { 0x40000000, 1ULL * 1024 * 1024 * 1024 },
-    [VIRT_ETHERNET] = { 0xfff51000, 0x1000 },
+    [VIRT_MEM] = { 0x40000000, 30ULL * 1024 * 1024 * 1024 },
 };
 
 static const int a15irqmap[] = {
     [VIRT_UART] = 1,
     [VIRT_MMIO] = 16, /* ...to 16 + NUM_VIRTIO_TRANSPORTS - 1 */
-    [VIRT_ETHERNET] = 77,
+    [VIRT_VFIO] = 48,
 };
 
 static VirtBoardInfo machines[] = {
@@ -265,7 +270,7 @@ static void fdt_add_gic_node(const VirtBoardInfo *vbi)
     qemu_fdt_setprop_cell(vbi->fdt, "/intc", "phandle", gic_phandle);
 }
 
-static void create_gic(const VirtBoardInfo *vbi, qemu_irq *pic)
+static void create_gic(VirtBoardInfo *vbi)
 {
     /* We create a standalone GIC v2 */
     DeviceState *gicdev;
@@ -309,13 +314,13 @@ static void create_gic(const VirtBoardInfo *vbi, qemu_irq *pic)
     }
 
     for (i = 0; i < NUM_IRQS; i++) {
-        pic[i] = qdev_get_gpio_in(gicdev, i);
+        vbi->pic[i] = qdev_get_gpio_in(gicdev, i);
     }
 
     fdt_add_gic_node(vbi);
 }
 
-static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic)
+static void create_uart(const VirtBoardInfo *vbi)
 {
     char *nodename;
     hwaddr base = vbi->memmap[VIRT_UART].base;
@@ -324,7 +329,7 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic)
     const char compat[] = "arm,pl011\0arm,primecell";
     const char clocknames[] = "uartclk\0apb_pclk";
 
-    sysbus_create_simple("pl011", base, pic[irq]);
+    sysbus_create_simple("pl011", base, vbi->pic[irq]);
 
     nodename = g_strdup_printf("/pl011@%" PRIx64, base);
     qemu_fdt_add_subnode(vbi->fdt, nodename);
@@ -343,36 +348,177 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic)
     g_free(nodename);
 }
 
-static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
+/*
+ * Function called for each vfio-platform device option found in the
+ * qemu user command line:
+ * -device vfio-platform,vfio_device="<device>",compat="<compat>"
+ * for instance <device> can be fff51000.ethernet (device unbound from
+ * original driver and bound to vfio driver)
+ * for instance <compat> can be calxeda/hb-xgmac
+ * note "/" replaces normal ",". Indeed "," would be interpreted by QEMU as
+ * a separator
+ */
+
+static int vfio_init_func(QemuOpts *opts, void *opaque)
 {
+    const char *driver;
+    DeviceState *dev;
+    SysBusDevice *s;
+    VirtBoardInfo *vbi = (VirtBoardInfo *)opaque;
+    driver = qemu_opt_get(opts, "driver");
+    int irq_start = vbi->avail_vfio_irq;
+    hwaddr vfio_base = vbi->avail_vfio_base;
     char *nodename;
-    hwaddr base = vbi->memmap[VIRT_ETHERNET].base;
-    hwaddr size = vbi->memmap[VIRT_ETHERNET].size;
-    const char compat[] = "calxeda,hb-xgmac";
-    int main_irq = vbi->irqmap[VIRT_ETHERNET];
-    int power_irq = main_irq+1;
-    int low_power_irq = main_irq+2;
-
-    sysbus_create_varargs("vfio-platform", base,
-                          pic[main_irq],
-                          pic[power_irq],
-                          pic[low_power_irq], NULL);
-
-    nodename = g_strdup_printf("/ethernet@%" PRIx64, base);
-    qemu_fdt_add_subnode(vbi->fdt, nodename);
+    char *corrected_compat, *compat, *name;
+    int num_irqs, num_regions;
+    MemoryRegion *mr;
+    int i, ret;
+    uint32_t *irq_attr;
+    uint64_t *reg_attr;
+    uint64_t size;
+    Error *errp = NULL;
+
+    if (!driver) {
+        qerror_report(QERR_MISSING_PARAMETER, "driver");
+        return -1 ;
+    }
 
-    /* Note that we can't use setprop_string because of the embedded NUL */
-    qemu_fdt_setprop(vbi->fdt, nodename, "compatible", compat, sizeof(compat));
-    qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", 2, base, 2, size);
-    qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts",
-                                0x0, main_irq, 0x4,
-                                0x0, power_irq, 0x4,
-                                0x0, low_power_irq, 0x4);
+    if (strcasecmp(driver, "vfio-platform") == 0) {
+        dev = qdev_device_add(opts);
+        if (!dev) {
+            return -1;
+        }
+        s = SYS_BUS_DEVICE(dev);
 
-    g_free(nodename);
+        name = object_property_get_str(OBJECT(s), "vfio_device", &errp);
+        if (errp != NULL || (name == NULL)) {
+            error_report("Couldn't retrieve vfio device name: %s\n",
+                         error_get_pretty(errp));
+            exit(1);
+           }
+        compat = object_property_get_str(OBJECT(s), "compat", &errp);
+        if ((errp != NULL) || (compat == NULL)) {
+            error_report("Couldn't retrieve VFIO device compat: %s\n",
+                         error_get_pretty(errp));
+            exit(1);
+           }
+        num_irqs = object_property_get_int(OBJECT(s), "num_irqs", &errp);
+        if (errp != NULL) {
+            error_report("Couldn't retrieve VFIO IRQ number: %s\n",
+                         error_get_pretty(errp));
+            exit(1);
+           }
+        num_regions = object_property_get_int(OBJECT(s), "num_regions", &errp);
+        if ((errp != NULL) || (num_regions == 0)) {
+            error_report("Couldn't retrieve VFIO region number: %s\n",
+                         error_get_pretty(errp));
+            exit(1);
+           }
+
+        /*
+         * collect region info and build reg property as tuplets
+         * 2 base 2 size
+         * 2 being the number of cells for base and size
+         */
+        reg_attr = g_new(uint64_t, num_regions*4);
+
+        for (i = 0; i < num_regions; i++) {
+            mr = sysbus_mmio_get_region(s, i);
+            size = memory_region_size(mr);
+            reg_attr[4*i] = 2;
+            reg_attr[4*i+1] = vbi->avail_vfio_base;
+            reg_attr[4*i+2] = 2;
+            reg_attr[4*i+3] = size;
+            vbi->avail_vfio_base += size;
+        }
+
+        if (vbi->avail_vfio_base >= 0x10000000) {
+            /* VFIO region size exceeds remaining VFIO space */
+            qerror_report(QERR_DEVICE_INIT_FAILED, name);
+        } else if (irq_start + num_irqs >= NUM_IRQS) {
+            /* VFIO IRQ number exceeded */
+            qerror_report(QERR_DEVICE_INIT_FAILED, name);
+        }
+
+        /*
+         * process compatibility property string passed by end-user
+         * replaces / by ,
+         * currently a single property compatibility value is supported!
+         */
+        corrected_compat = g_strdup(compat);
+        char *slash = strchr(corrected_compat, '/');
+        if (slash != NULL) {
+            *slash = ',';
+        } else {
+            error_report("Wrong compat string %s, should contain a /\n",
+                         compat);
+            exit(1);
+        }
+
+        sysbus_mmio_map(s, 0, vfio_base);
+        nodename = g_strdup_printf("/%s@%" PRIx64,
+                                   name, vfio_base);
+
+        qemu_fdt_add_subnode(vbi->fdt, nodename);
+
+        qemu_fdt_setprop(vbi->fdt, nodename, "compatible",
+                             corrected_compat, strlen(corrected_compat));
+
+        ret = qemu_fdt_setprop_sized_cells_from_array(vbi->fdt, nodename, "reg",
+                         num_regions*2, reg_attr);
+        if (ret < 0) {
+            error_report("could not set reg property of node %s", nodename);
+        }
+
+        irq_attr = g_new(uint32_t, num_irqs*3);
+        for (i = 0; i < num_irqs; i++) {
+            sysbus_connect_irq(s, i, vbi->pic[irq_start+i]);
+
+            irq_attr[3*i] = cpu_to_be32(0);
+            irq_attr[3*i+1] = cpu_to_be32(irq_start+i);
+            irq_attr[3*i+2] = cpu_to_be32(0x4);
+        }
+
+        ret = qemu_fdt_setprop(vbi->fdt, nodename, "interrupts",
+                         irq_attr, num_irqs*3*sizeof(uint32_t));
+        if (ret < 0) {
+            error_report("could not set interrupts property of node %s",
+                         nodename);
+        }
+
+        vbi->avail_vfio_irq += num_irqs;
+
+        g_free(nodename);
+        g_free(corrected_compat);
+        g_free(irq_attr);
+        g_free(reg_attr);
+
+        object_unref(OBJECT(dev));
+
+    }
+
+  return 0;
 }
 
-static void create_virtio_devices(const VirtBoardInfo *vbi, qemu_irq *pic)
+/*
+ * Parse the command line options and look for -device options;
+ * vfio_init_func is called for each of them.
+ * The latter only acts on -device vfio-platform ones.
+ */
+
+static void create_vfio_devices(VirtBoardInfo *vbi)
+{
+    vbi->avail_vfio_base = vbi->memmap[VIRT_VFIO].base;
+    vbi->avail_vfio_irq =  vbi->irqmap[VIRT_VFIO];
+
+    if (qemu_opts_foreach(qemu_find_opts("device"),
+                        vfio_init_func, (void *)vbi, 1) != 0) {
+        exit(1);
+    }
+}
+
+
+static void create_virtio_devices(const VirtBoardInfo *vbi)
 {
     int i;
     hwaddr size = vbi->memmap[VIRT_MMIO].size;
@@ -386,7 +532,7 @@ static void create_virtio_devices(const VirtBoardInfo *vbi, qemu_irq *pic)
         int irq = vbi->irqmap[VIRT_MMIO] + i;
         hwaddr base = vbi->memmap[VIRT_MMIO].base + i * size;
 
-        sysbus_create_simple("virtio-mmio", base, pic[irq]);
+        sysbus_create_simple("virtio-mmio", base, vbi->pic[irq]);
     }
 
     for (i = NUM_VIRTIO_TRANSPORTS - 1; i >= 0; i--) {
@@ -417,7 +563,6 @@ static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size)
 
 static void machvirt_init(QEMUMachineInitArgs *args)
 {
-    qemu_irq pic[NUM_IRQS];
     MemoryRegion *sysmem = get_system_memory();
     int n;
     MemoryRegion *ram = g_new(MemoryRegion, 1);
@@ -483,16 +628,17 @@ static void machvirt_init(QEMUMachineInitArgs *args)
     vmstate_register_ram_global(ram);
     memory_region_add_subregion(sysmem, vbi->memmap[VIRT_MEM].base, ram);
 
-    create_gic(vbi, pic);
+    create_gic(vbi);
+
+    create_uart(vbi);
 
-    create_uart(vbi, pic);
-    create_ethernet(vbi, pic);
+    create_vfio_devices(vbi);
 
     /* Create mmio transports, so the user can create virtio backends
      * (which will be automatically plugged in to the transports). If
      * no backend is created the transport will just sit harmlessly idle.
      */
-    create_virtio_devices(vbi, pic);
+    create_virtio_devices(vbi);
 
     vbi->bootinfo.ram_size = args->ram_size;
     vbi->bootinfo.kernel_filename = args->kernel_filename;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 07dc409..28d29de 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -732,7 +732,6 @@ void vfio_put_base_device(VFIODevice *vdev)
 
 int vfio_base_device_init(VFIODevice *vdev, int type)
 {
-    VFIODevice *tmp;
     VFIOGroup *group;
     char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
     ssize_t len;
@@ -792,12 +791,9 @@ int vfio_base_device_init(VFIODevice *vdev, int type)
 
     snprintf(path, sizeof(path), "%s", vdev->name);
 
-    QLIST_FOREACH(tmp, &group->device_list, next) {
-        if (strcmp(tmp->name, vdev->name) == 0) {
-            error_report("vfio: error: device %s is already attached", path);
-            vfio_put_group(group, vfio_reset_handler);
-            return -EBUSY;
-        }
+    if (vdev->ops->vfio_is_device_already_attached(vdev, group)) {
+        vfio_put_group(group, vfio_reset_handler);
+        return -EBUSY;
     }
 
     ret = vfio_get_base_device(group, path, vdev);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1b49205..c86bef9 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2988,6 +2988,22 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
     event_notifier_cleanup(&vdev->err_notifier);
 }
 
+static bool vfio_pci_is_device_already_attached(VFIODevice *vdev,
+                                                VFIOGroup *group)
+{
+    VFIODevice *tmp;
+
+    QLIST_FOREACH(tmp, &group->device_list, next) {
+        if (strcmp(tmp->name, vdev->name) == 0) {
+            error_report("vfio: error: device %s is already attached",
+                         vdev->name);
+            return true;
+        }
+    }
+    return false;
+}
+
+
 
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_eoi = vfio_pci_eoi,
@@ -2996,6 +3012,7 @@ static VFIODeviceOps vfio_pci_ops = {
     .vfio_check_device = vfio_pci_check_device,
     .vfio_get_device_regions = vfio_pci_get_device_regions,
     .vfio_get_device_interrupts = vfio_pci_get_device_interrupts,
+    .vfio_is_device_already_attached = vfio_pci_is_device_already_attached,
 };
 
 static int vfio_initfn(PCIDevice *pdev)
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 5b9451f..377783b 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -43,6 +43,7 @@ typedef struct VFIOPlatformDevice {
     QLIST_HEAD(, VFIOINTp) intp_list; /* list of IRQ */
     /* queue of pending IRQ */
     QSIMPLEQ_HEAD(pending_intp_queue, VFIOINTp) pending_intp_queue;
+    char *compat; /* compatibility string */
 } VFIOPlatformDevice;
 
 
@@ -394,11 +395,6 @@ static int vfio_platform_get_device_interrupts(VFIODevice *vdev)
     int i, ret;
     VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
 
-    /*
-     * mmap timeout = 1100 ms, PCI default value
-     * this will become a user-defined value in subsequent patch
-     */
-    vdev->mmap_timeout = 1100;
     vdev->mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
                                     vfio_intp_mmap_enable, vdev);
 
@@ -446,6 +442,19 @@ static void vfio_disable_intp(VFIODevice *vdev)
 
 }
 
+static bool vfio_platform_is_device_already_attached(VFIODevice *vdev,
+                                                     VFIOGroup *group)
+{
+    VFIODevice *tmp;
+
+    QLIST_FOREACH(tmp, &group->device_list, next) {
+        if (strcmp(tmp->name, vdev->name) == 0) {
+            return true;
+        }
+    }
+    return false;
+}
+
 
 static VFIODeviceOps vfio_platform_ops = {
     .vfio_eoi = vfio_platform_eoi,
@@ -454,6 +463,7 @@ static VFIODeviceOps vfio_platform_ops = {
     .vfio_check_device = vfio_platform_check_device,
     .vfio_get_device_regions = vfio_platform_get_device_regions,
     .vfio_get_device_interrupts = vfio_platform_get_device_interrupts,
+    .vfio_is_device_already_attached = vfio_platform_is_device_already_attached,
 };
 
 
@@ -466,9 +476,7 @@ static void vfio_platform_realize(DeviceState *dev, Error **errp)
 
     vbasedev->ops = &vfio_platform_ops;
 
-    /* TODO: pass device name on command line */
-    vbasedev->name = malloc(PATH_MAX);
-    snprintf(vbasedev->name, PATH_MAX, "%s", "fff51000.ethernet");
+    DPRINTF("vfio device %s, compat = %s\n", vbasedev->name, vdev->compat);
 
     ret = vfio_base_device_init(vbasedev, VFIO_DEVICE_TYPE_PLATFORM);
     if (ret < 0) {
@@ -531,8 +539,8 @@ static const VMStateDescription vfio_platform_vmstate = {
 
 typedef struct VFIOPlatformDeviceClass {
     DeviceClass parent_class;
+    void (*init)(VFIOPlatformDevice *vdev);
 
-    int (*init)(VFIODevice *dev);
 } VFIOPlatformDeviceClass;
 
 #define VFIO_PLATFORM_DEVICE(obj) \
@@ -542,19 +550,28 @@ typedef struct VFIOPlatformDeviceClass {
 #define VFIO_PLATFORM_DEVICE_GET_CLASS(obj) \
      OBJECT_GET_CLASS(VFIOPlatformDeviceClass, (obj), TYPE_VFIO_PLATFORM)
 
+static Property vfio_platform_dev_properties[] = {
+DEFINE_PROP_STRING("vfio_device", VFIOPlatformDevice, vdev.name),
+DEFINE_PROP_STRING("compat", VFIOPlatformDevice, compat),
+DEFINE_PROP_UINT32("mmap-timeout-ms", VFIOPlatformDevice,
+                   vdev.mmap_timeout, 1100),
+DEFINE_PROP_UINT32("num_irqs", VFIOPlatformDevice, vdev.num_irqs, 0),
+DEFINE_PROP_UINT32("num_regions", VFIOPlatformDevice, vdev.num_regions, 0),
+DEFINE_PROP_END_OF_LIST(),
+};
 
 
 static void vfio_platform_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
     VFIOPlatformDeviceClass *vdc = VFIO_PLATFORM_DEVICE_CLASS(klass);
-
+    dc->props = vfio_platform_dev_properties;
     dc->realize = vfio_platform_realize;
     dc->unrealize = vfio_platform_unrealize;
     dc->vmsd = &vfio_platform_vmstate;
     dc->desc = "VFIO-based platform device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
-
+    dc->cannot_instantiate_with_device_add_yet = false;
     vdc->init = NULL;
 }
 
diff --git a/hw/vfio/vfio-common.h b/hw/vfio/vfio-common.h
index 7139d81..5412acd 100644
--- a/hw/vfio/vfio-common.h
+++ b/hw/vfio/vfio-common.h
@@ -123,6 +123,7 @@ struct VFIODeviceOps {
     int (*vfio_check_device)(VFIODevice *vdev);
     int (*vfio_get_device_regions)(VFIODevice *vdev);
     int (*vfio_get_device_interrupts)(VFIODevice *vdev);
+    bool (*vfio_is_device_already_attached)(VFIODevice *vdev, VFIOGroup*);
 };
 
 
-- 
1.8.3.2

* [Qemu-devel] [RFC v3 07/10] Add EXEC_FLAG to VFIO DMA mappings
  2014-06-02  7:49 [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough Eric Auger
                   ` (5 preceding siblings ...)
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option Eric Auger
@ 2014-06-02  7:49 ` Eric Auger
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 08/10] Add AMBA devices support to VFIO Eric Auger
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Eric Auger @ 2014-06-02  7:49 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel, kim.phillips, a.rigo
  Cc: peter.maydell, eric.auger, patches, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

From: Alvise Rigo <a.rigo@virtualopensystems.com>

The flag is mandatory for the ARM SMMU so we always add it if the IOMMU
handles it.
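
The flag selection can be sketched as follows. The flag values are
copied from the linux-headers hunk of this patch (the EXEC flag is the
one proposed here); the dma_map_flags() helper and its parameters are
hypothetical and only illustrate the logic added to vfio_dma_map():

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in the linux-headers/linux/vfio.h hunk of this patch. */
#define VFIO_DMA_MAP_FLAG_READ  (1u << 0) /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1u << 1) /* writable from device */
#define VFIO_DMA_MAP_FLAG_EXEC  (1u << 2) /* executable from device */

/*
 * Hypothetical helper: compute the DMA map flags for one mapping.
 * has_exec_cap models the result of probing the IOMMU once, when the
 * container is connected.
 */
static uint32_t dma_map_flags(bool readonly, bool has_exec_cap)
{
    uint32_t flags = VFIO_DMA_MAP_FLAG_READ;

    if (!readonly) {
        flags |= VFIO_DMA_MAP_FLAG_WRITE;
    }
    if (has_exec_cap) {
        flags |= VFIO_DMA_MAP_FLAG_EXEC;
    }
    return flags;
}
```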

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 hw/vfio/common.c           | 9 +++++++++
 hw/vfio/vfio-common.h      | 1 +
 linux-headers/linux/vfio.h | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 28d29de..8b25380 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -82,6 +82,11 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
     }
 
+    /* add exec flag */
+    if (container->iommu_data.has_exec_cap) {
+        map.flags |= VFIO_DMA_MAP_FLAG_EXEC;
+    }
+
     /*
      * Try the mapping, if it fails with EBUSY, unmap the region and try
      * again.  This shouldn't be necessary, but we sometimes see it in
@@ -327,6 +332,10 @@ static int vfio_connect_container(VFIOGroup *group)
             return -errno;
         }
 
+        if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_IOMMU_PROT_EXEC)) {
+            container->iommu_data.has_exec_cap = true;
+        }
+
         container->iommu_data.type1.listener = vfio_memory_listener;
         container->iommu_data.release = vfio_listener_release;
 
diff --git a/hw/vfio/vfio-common.h b/hw/vfio/vfio-common.h
index 5412acd..dcd7ddd 100644
--- a/hw/vfio/vfio-common.h
+++ b/hw/vfio/vfio-common.h
@@ -100,6 +100,7 @@ typedef struct VFIOContainer {
         union {
             VFIOType1 type1;
         };
+        bool has_exec_cap; /* support of exec capability by the IOMMU */
         void (*release)(struct VFIOContainer *);
     } iommu_data;
     QLIST_HEAD(, VFIOGroup) group_list;
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index ef4815d..e96e14d 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -30,6 +30,7 @@
  */
 #define VFIO_DMA_CC_IOMMU		4
 
+#define VFIO_IOMMU_PROT_EXEC		5
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -399,6 +400,7 @@ struct vfio_iommu_type1_dma_map {
 	__u32	flags;
 #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
 #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+#define VFIO_DMA_MAP_FLAG_EXEC (1 << 2)	/* executable from device */
 	__u64	vaddr;				/* Process virtual address */
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
-- 
1.8.3.2

* [Qemu-devel] [RFC v3 08/10] Add AMBA devices support to VFIO
  2014-06-02  7:49 [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough Eric Auger
                   ` (6 preceding siblings ...)
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 07/10] Add EXEC_FLAG to VFIO DMA mappings Eric Auger
@ 2014-06-02  7:49 ` Eric Auger
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 09/10] Always use eventfd as notifying mechanism Eric Auger
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 10/10] vfio: Add irqfd support in platform device Eric Auger
  9 siblings, 0 replies; 28+ messages in thread
From: Eric Auger @ 2014-06-02  7:49 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel, kim.phillips, a.rigo
  Cc: peter.maydell, eric.auger, patches, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

From: Alvise Rigo <a.rigo@virtualopensystems.com>

Since only one compatibility property value could be added to the
device tree node, AMBA devices could not be bound.
Now an arbitrary number of compatibility property values can be given,
separated by the character ";".

If the compatibility string contains the substring "arm,primecell", a
clock property will be added to the device tree node in order to allow the AMBA
bus code to probe the device.
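
The conversion of the command line value into the device tree
"compatible" property can be sketched as a standalone function. This is
an illustration only: compat_to_fdt() is a hypothetical helper, it uses
malloc() rather than the glib allocators of the patch, and a value such
as "arm/pl330;arm/primecell" is just an example input:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: turn a command line compat value into a device
 * tree "compatible" blob, i.e. a sequence of NUL-terminated strings.
 * On success returns a malloc'd buffer, stores its length (including
 * the final NUL) in *len and sets *is_amba when "arm,primecell" is
 * one of the values. Returns NULL on allocation failure.
 */
static char *compat_to_fdt(const char *arg, size_t *len, bool *is_amba)
{
    size_t n = strlen(arg) + 1; /* the final NUL is part of the property */
    char *buf = malloc(n);

    if (buf == NULL) {
        return NULL;
    }
    memcpy(buf, arg, n);

    /* "/" stands for ",", which the QEMU option parser reserves */
    for (char *p = buf; *p != '\0'; p++) {
        if (*p == '/') {
            *p = ',';
        }
    }

    /* look for the AMBA marker while the buffer is still one C string */
    *is_amba = strstr(buf, "arm,primecell") != NULL;

    /* ";" separates the values: each one becomes its own C string */
    for (size_t i = 0; i + 1 < n; i++) {
        if (buf[i] == ';') {
            buf[i] = '\0';
        }
    }

    *len = n;
    return buf;
}
```

The caller owns the returned buffer and releases it with free() once the
property has been written to the device tree.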

[Eric Auger]
put str_ptr in the declaration part and rename pcompat into compat

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 hw/arm/virt.c | 45 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 8de6d1a..bc561b5 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -369,6 +369,7 @@ static int vfio_init_func(QemuOpts *opts, void *opaque)
     int irq_start = vbi->avail_vfio_irq;
     hwaddr vfio_base = vbi->avail_vfio_base;
     char *nodename;
+    char *str_ptr;
     char *corrected_compat, *compat, *name;
     int num_irqs, num_regions;
     MemoryRegion *mr;
@@ -377,6 +378,8 @@ static int vfio_init_func(QemuOpts *opts, void *opaque)
     uint64_t *reg_attr;
     uint64_t size;
     Error *errp = NULL;
+    bool is_amba = false;
+    int compat_str_len;
 
     if (!driver) {
         qerror_report(QERR_MISSING_PARAMETER, "driver");
@@ -442,17 +445,30 @@ static int vfio_init_func(QemuOpts *opts, void *opaque)
 
         /*
          * process compatibility property string passed by end-user
-         * replaces / by ,
-         * currently a single property compatibility value is supported!
+         * replaces / by , and ; by NUL character
          */
         corrected_compat = g_strdup(compat);
-        char *slash = strchr(corrected_compat, '/');
-        if (slash != NULL) {
-            *slash = ',';
-        } else {
-            error_report("Wrong compat string %s, should contain a /\n",
-                         compat);
-            exit(1);
+        /*
+         * the total length of the string has to include also the last
+         * NUL char.
+         */
+        compat_str_len = strlen(corrected_compat) + 1;
+
+        str_ptr = corrected_compat;
+        while ((str_ptr = strchr(str_ptr, '/')) != NULL) {
+            *str_ptr = ',';
+        }
+
+        /* check if is an AMBA device */
+        str_ptr = corrected_compat;
+        if (strstr(str_ptr, "arm,primecell") != NULL) {
+            is_amba = true;
+        }
+
+        /* substitute ";" with the NUL char */
+        str_ptr = corrected_compat;
+        while ((str_ptr = strchr(str_ptr, ';')) != NULL) {
+            *str_ptr = '\0';
         }
 
         sysbus_mmio_map(s, 0, vfio_base);
@@ -462,7 +478,7 @@ static int vfio_init_func(QemuOpts *opts, void *opaque)
         qemu_fdt_add_subnode(vbi->fdt, nodename);
 
         qemu_fdt_setprop(vbi->fdt, nodename, "compatible",
-                             corrected_compat, strlen(corrected_compat));
+                             corrected_compat, compat_str_len);
 
         ret = qemu_fdt_setprop_sized_cells_from_array(vbi->fdt, nodename, "reg",
                          num_regions*2, reg_attr);
@@ -471,6 +487,15 @@ static int vfio_init_func(QemuOpts *opts, void *opaque)
         }
 
         irq_attr = g_new(uint32_t, num_irqs*3);
+
+        if (is_amba) {
+            qemu_fdt_setprop_cells(vbi->fdt, nodename, "clocks",
+                                   vbi->clock_phandle);
+            char clock_names[] = "apb_pclk";
+            qemu_fdt_setprop(vbi->fdt, nodename, "clock-names", clock_names,
+                                                       sizeof(clock_names));
+        }
+
         for (i = 0; i < num_irqs; i++) {
             sysbus_connect_irq(s, i, vbi->pic[irq_start+i]);
 
-- 
1.8.3.2

* [Qemu-devel] [RFC v3 09/10] Always use eventfd as notifying mechanism
  2014-06-02  7:49 [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough Eric Auger
                   ` (7 preceding siblings ...)
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 08/10] Add AMBA devices support to VFIO Eric Auger
@ 2014-06-02  7:49 ` Eric Auger
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 10/10] vfio: Add irqfd support in platform device Eric Auger
  9 siblings, 0 replies; 28+ messages in thread
From: Eric Auger @ 2014-06-02  7:49 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel, kim.phillips, a.rigo
  Cc: peter.maydell, eric.auger, patches, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

From: Alvise Rigo <a.rigo@virtualopensystems.com>

When eventfd is not configured, event_notifier_init falls back
to the pipe/pipe2 system call, causing an error in VFIO_DEVICE_SET_IRQS
since we pass to the kernel a file descriptor which was not created by
eventfd.
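
The detection relies on the fact that an eventfd-backed notifier uses
the same file descriptor for reading and writing, whereas a pipe has two
distinct ends. A minimal sketch, assuming a simplified Notifier struct
as a stand-in for QEMU's EventNotifier; the function name is
hypothetical:

```c
#include <stdbool.h>

/* Simplified stand-in for EventNotifier: read and write descriptors. */
typedef struct {
    int rfd;
    int wfd;
} Notifier;

/*
 * An eventfd uses a single descriptor for both directions; differing
 * descriptors indicate that the notifier fell back to a pipe.
 */
static bool notifier_is_eventfd(const Notifier *n)
{
    return n->rfd == n->wfd;
}
```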

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 hw/vfio/platform.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 377783b..56dde5f 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -346,6 +346,11 @@ static int vfio_enable_intp(VFIODevice *vdev, unsigned int index)
     sysbus_init_irq(sbdev, &intp->qemuirq);
 
     ret = event_notifier_init(&intp->interrupt, 0);
+    if (!ret && (intp->interrupt.rfd != intp->interrupt.wfd)) {
+        /* event_notifier_init created a pipe instead of eventfd */
+        ret = -1;
+    }
+
     if (ret) {
         error_report("vfio: Error: event_notifier_init failed ");
         return ret;
-- 
1.8.3.2

* [Qemu-devel] [RFC v3 10/10] vfio: Add irqfd support in platform device
  2014-06-02  7:49 [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough Eric Auger
                   ` (8 preceding siblings ...)
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 09/10] Always use eventfd as notifying mechanism Eric Auger
@ 2014-06-02  7:49 ` Eric Auger
  2014-06-25 21:35   ` Alexander Graf
  9 siblings, 1 reply; 28+ messages in thread
From: Eric Auger @ 2014-06-02  7:49 UTC (permalink / raw)
  To: eric.auger, christoffer.dall, qemu-devel, kim.phillips, a.rigo
  Cc: peter.maydell, eric.auger, patches, agraf, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

This patch optimizes IRQ handling using the irqfd framework.
It brings a significant performance improvement over the "traditional"
IRQ handling introduced in:
"vfio: Add initial IRQ support in platform device".

This new IRQ handling method depends on kernel KVM irqfd/GSI routing
capability.

The IRQ handling method can be dynamically chosen (default is irqfd,
if kernel supports it obviously).  For example to disable irqfd
handling, use:

-device vfio-platform,vfio_device="fff51000.ethernet",\
compat="calxeda/hb-xgmac",mmap-timeout-ms=110,irqfd=false

Performance improves for the following reasons:
- eventfds signalled by the VFIO platform driver are handled on the
  kernel side by the KVM irqfd framework.
- the end of interrupt (EOI) is trapped at GIC level rather than at
  MMIO region level. As a reminder, with traditional IRQ handling QEMU
  assumes that the first guest access to a device MMIO region after an
  IRQ hit is the reset of the IRQ status register. This trap is
  approximate and forces a switch to the slow path after each IRQ hit.
  An mmap timer then switches back to the fast path once the mmap
  period has elapsed, which adds complexity. With irqfd, the GIC
  detects the completion of the virtual IRQ and signals a resampler
  eventfd on the maintenance IRQ. The corresponding handler re-enables
  the physical IRQ.
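
The trigger/resample handshake described above can be modelled in user
space with two eventfds. This is only a sketch under stated
assumptions: the struct and the function names are hypothetical, the
"kernel" and "guest" steps are plain function calls, and no KVM or VFIO
ioctl is issued. It shows the ordering of events, not the real
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical model of one level-sensitive assigned IRQ. */
struct irq_model {
    int trigger_fd;     /* signalled by the (modelled) VFIO driver */
    int resample_fd;    /* signalled by the (modelled) GIC on EOI */
    bool host_masked;   /* physical IRQ currently masked on the host */
    bool guest_pending; /* virtual IRQ injected and not yet completed */
};

static int irq_model_init(struct irq_model *m)
{
    m->trigger_fd = eventfd(0, EFD_NONBLOCK);
    m->resample_fd = eventfd(0, EFD_NONBLOCK);
    m->host_masked = false;
    m->guest_pending = false;
    return (m->trigger_fd < 0 || m->resample_fd < 0) ? -1 : 0;
}

/* Drain an eventfd; returns true if it had been signalled. */
static bool consume(int fd)
{
    uint64_t count;

    return read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count);
}

static void signal_fd(int fd)
{
    uint64_t one = 1;
    ssize_t n = write(fd, &one, sizeof(one));

    assert(n == (ssize_t)sizeof(one));
    (void)n;
}

/* Host: the physical IRQ fires; the driver masks it and signals. */
static void physical_irq_fires(struct irq_model *m)
{
    m->host_masked = true;
    signal_fd(m->trigger_fd);
}

/* Modelled irqfd: inject the virtual IRQ without going through QEMU. */
static void irqfd_inject(struct irq_model *m)
{
    if (consume(m->trigger_fd)) {
        m->guest_pending = true;
    }
}

/* Guest completes the virtual IRQ; the resample eventfd is signalled. */
static void guest_eoi(struct irq_model *m)
{
    m->guest_pending = false;
    signal_fd(m->resample_fd);
}

/* User-space resampler handler: re-enable (unmask) the physical IRQ. */
static void resampler_handler(struct irq_model *m)
{
    if (consume(m->resample_fd)) {
        m->host_masked = false;
    }
}
```

The physical IRQ stays masked from the moment it fires until the
resampler handler runs, so no guess about guest MMIO accesses is needed.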

The next optimization step is to try to remove the EOI trap (i.e. the
maintenance IRQ). This should be covered by another patch in the near
future.

This work was tested with Calxeda Midway xgmac.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 hw/arm/virt.c         |  14 +++++
 hw/intc/arm_gic_kvm.c |   1 +
 hw/vfio/platform.c    | 165 +++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 166 insertions(+), 14 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index bc561b5..de1b885 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -43,6 +43,9 @@
 #include "monitor/qdev.h"
 #include "qemu/config-file.h"
 
+#define ENABLE_IRQFD 1
+void vfio_setup_irqfd(SysBusDevice *s, int index, int virq);
+
 #define NUM_VIRTIO_TRANSPORTS 32
 
 /* Number of external interrupt lines to configure the GIC with */
@@ -380,6 +383,7 @@ static int vfio_init_func(QemuOpts *opts, void *opaque)
     Error *errp = NULL;
     bool is_amba = false;
     int compat_str_len;
+    bool irqfd_allowed;
 
     if (!driver) {
         qerror_report(QERR_MISSING_PARAMETER, "driver");
@@ -417,6 +421,13 @@ static int vfio_init_func(QemuOpts *opts, void *opaque)
                          error_get_pretty(errp));
             exit(1);
            }
+        irqfd_allowed = object_property_get_bool(OBJECT(s), "irqfd", &errp);
+        if (errp != NULL) {
+            error_report("Couldn't retrieve irqfd flag: %s\n",
+                         error_get_pretty(errp));
+            exit(1);
+           }
+
 
         /*
          * collect region info and build reg property as tuplets
@@ -502,6 +513,9 @@ static int vfio_init_func(QemuOpts *opts, void *opaque)
             irq_attr[3*i] = cpu_to_be32(0);
             irq_attr[3*i+1] = cpu_to_be32(irq_start+i);
             irq_attr[3*i+2] = cpu_to_be32(0x4);
+            if (irqfd_allowed) {
+                vfio_setup_irqfd(s, i, irq_start+i);
+            }
         }
 
         ret = qemu_fdt_setprop(vbi->fdt, nodename, "interrupts",
diff --git a/hw/intc/arm_gic_kvm.c b/hw/intc/arm_gic_kvm.c
index 5038885..18a6204 100644
--- a/hw/intc/arm_gic_kvm.c
+++ b/hw/intc/arm_gic_kvm.c
@@ -576,6 +576,7 @@ static void kvm_arm_gic_realize(DeviceState *dev, Error **errp)
                             KVM_DEV_ARM_VGIC_GRP_ADDR,
                             KVM_VGIC_V2_ADDR_TYPE_CPU,
                             s->dev_fd);
+    kvm_irqfds_allowed = true;
 }
 
 static void kvm_arm_gic_class_init(ObjectClass *klass, void *data)
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 56dde5f..3e7089a 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -23,6 +23,7 @@
 #include "hw/sysbus.h"
 
 #include "vfio-common.h"
+#include "sysemu/kvm.h"
 
 typedef struct VFIOINTp {
     QLIST_ENTRY(VFIOINTp) next; /* entry for IRQ list */
@@ -34,6 +35,7 @@ typedef struct VFIOINTp {
     int state; /* inactive, pending, active */
     bool kvm_accel; /* set when QEMU bypass through KVM enabled */
     uint8_t pin; /* index */
+    uint8_t virtualID; /* virtual IRQ */
 } VFIOINTp;
 
 
@@ -44,6 +46,7 @@ typedef struct VFIOPlatformDevice {
     /* queue of pending IRQ */
     QSIMPLEQ_HEAD(pending_intp_queue, VFIOINTp) pending_intp_queue;
     char *compat; /* compatibility string */
+    bool irqfd_allowed;
 } VFIOPlatformDevice;
 
 
@@ -54,6 +57,7 @@ static const MemoryRegionOps vfio_region_ops = {
 };
 
 static void vfio_intp_interrupt(void *opaque);
+void vfio_setup_irqfd(SysBusDevice *s, int index, int virq);
 
 /*
  * It is mandatory to pass a VFIOPlatformDevice since VFIODevice
@@ -424,29 +428,138 @@ static int vfio_platform_get_device_interrupts(VFIODevice *vdev)
 }
 
 
-static void vfio_disable_intp(VFIODevice *vdev)
+static void vfio_disable_intp(VFIOINTp *intp)
 {
+    int fd = event_notifier_get_fd(&intp->interrupt);
+    DPRINTF("close IRQ pin=%d fd=%d\n", intp->pin, fd);
+
+    /* remove the IRQ handler */
+    vfio_disable_irqindex(&intp->vdev->vdev, intp->pin);
+    intp->state = VFIO_IRQ_INACTIVE;
+    qemu_set_irq(intp->qemuirq, 0);
+    qemu_set_fd_handler(fd, NULL, NULL, NULL);
+    event_notifier_cleanup(&intp->interrupt);
+
+}
+
+
+/* IRQFD */
+
+static void resampler_handler(void *opaque)
+{
+    VFIOINTp *intp = (VFIOINTp *)opaque;
+    DPRINTF("%s index %d virtual ID = %d fd = %d\n",
+            __func__,
+            intp->pin, intp->virtualID,
+            event_notifier_get_fd(&intp->unmask));
+    vfio_unmask_irqindex(&intp->vdev->vdev, intp->pin);
+}
+
+
+static void vfio_enable_intp_kvm(VFIOINTp *intp)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&intp->interrupt),
+        .gsi = intp->virtualID,
+        .flags = KVM_IRQFD_FLAG_RESAMPLE,
+    };
+
+    if (!kvm_irqfds_enabled() ||
+        !kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
+        return;
+    }
+
+    /* Get to a known interrupt state */
+    qemu_set_fd_handler(irqfd.fd, NULL, NULL, NULL);
+    intp->state = VFIO_IRQ_INACTIVE;
+    qemu_set_irq(intp->qemuirq, 0);
+
+    /* Get an eventfd for resample/unmask */
+    if (event_notifier_init(&intp->unmask, 0)) {
+        error_report("vfio: Error: event_notifier_init failed eoi");
+        goto fail;
+    }
+
+    /* KVM triggers it, VFIO listens for it */
+    irqfd.resamplefd = event_notifier_get_fd(&intp->unmask);
+    qemu_set_fd_handler(irqfd.resamplefd, resampler_handler, NULL, intp);
+
+
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to setup resample irqfd: %m");
+        goto fail_irqfd;
+    }
+    intp->kvm_accel = true;
+
+    DPRINTF("%s irqfd pin=%d to virtID = %d fd=%d, resamplefd=%d)\n",
+            __func__, intp->pin, intp->virtualID,
+            irqfd.fd, irqfd.resamplefd);
+
+    return;
+
+fail_irqfd:
+    event_notifier_cleanup(&intp->unmask);
+fail:
+    qemu_set_fd_handler(irqfd.fd, vfio_intp_interrupt, NULL, intp);
+    vfio_unmask_irqindex(&intp->vdev->vdev, intp->pin);
+#endif
+}
+
+void vfio_setup_irqfd(SysBusDevice *s, int index, int virq)
+{
+    VFIOPlatformDevice *vdev = container_of(s, VFIOPlatformDevice, sbdev);
     VFIOINTp *intp;
-    VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
-    int fd;
+    QLIST_FOREACH(intp, &vdev->intp_list, next) {
+        if (intp->pin == index) {
+            intp->virtualID = virq;
+            vfio_enable_intp_kvm(intp);
+        }
+    }
+}
 
-    QLIST_FOREACH(intp, &vplatdev->intp_list, next) {
-        fd = event_notifier_get_fd(&intp->interrupt);
-        DPRINTF("close IRQ pin=%d fd=%d\n", intp->pin, fd);
+static void vfio_disable_intp_kvm(VFIOINTp *intp)
+{
+#ifdef CONFIG_KVM
 
-        vfio_disable_irqindex(vdev, intp->pin);
-        intp->state = VFIO_IRQ_INACTIVE;
-        qemu_set_irq(intp->qemuirq, 0);
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&intp->interrupt),
+        .gsi = intp->virtualID,
+        .flags = KVM_IRQFD_FLAG_DEASSIGN,
+        };
 
-        qemu_set_fd_handler(fd, NULL, NULL, NULL);
-        event_notifier_cleanup(&intp->interrupt);
+    if (!intp->kvm_accel) {
+        return;
     }
 
-    /* restore fast path */
-    vfio_mmap_set_enabled(vdev, true);
+    /*
+     * Get to a known state, hardware masked, QEMU ready to accept new
+     * interrupts, QEMU IRQ de-asserted.
+     */
+    intp->state = VFIO_IRQ_INACTIVE;
+    /* Tell KVM to stop listening for an INTp irqfd */
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to disable INTx irqfd: %m");
+    }
+
+    /* We only need to close the eventfd for VFIO to cleanup the kernel side */
+    event_notifier_cleanup(&intp->unmask);
+
+    /* QEMU starts listening for interrupt events. */
+    qemu_set_fd_handler(irqfd.fd, vfio_intp_interrupt, NULL, intp);
+
+    intp->kvm_accel = false;
 
+    /* If we've missed an event, let it re-fire through QEMU */
+    vfio_unmask_irqindex(&intp->vdev->vdev, intp->pin);
+
+    DPRINTF("%s: KVM INTx accel disabled\n", __func__);
+#endif
 }
 
+
+
+
 static bool vfio_platform_is_device_already_attached(VFIODevice *vdev,
                                                      VFIOGroup *group)
 {
@@ -494,6 +607,25 @@ static void vfio_platform_realize(DeviceState *dev, Error **errp)
     }
 }
 
+
+static void vfio_disable_all_intp(VFIODevice *vdev)
+{
+    VFIOINTp *intp;
+    VFIOPlatformDevice *vplatdev =
+        container_of(vdev, VFIOPlatformDevice, vdev);
+
+    QLIST_FOREACH(intp, &vplatdev->intp_list, next) {
+        /* first disable IRQFD handled IRQ and turn them in QEMU handled ones */
+        vfio_disable_intp_kvm(intp);
+        /* actually disable IRQ */
+        vfio_disable_intp(intp);
+    }
+
+    /* restore fast path */
+    vfio_mmap_set_enabled(vdev, true);
+
+}
+
 static void vfio_platform_unrealize(DeviceState *dev, Error **errp)
 {
     int i;
@@ -510,7 +642,7 @@ static void vfio_platform_unrealize(DeviceState *dev, Error **errp)
      * timer free
      * g_free vdev dynamic fields
     */
-    vfio_disable_intp(vbasedev);
+    vfio_disable_all_intp(vbasedev);
 
     while (!QSIMPLEQ_EMPTY(&vplatdev->pending_intp_queue)) {
             QSIMPLEQ_REMOVE_HEAD(&vplatdev->pending_intp_queue, pqnext);
@@ -537,6 +669,10 @@ static void vfio_platform_unrealize(DeviceState *dev, Error **errp)
 
 }
 
+
+
+
+
 static const VMStateDescription vfio_platform_vmstate = {
     .name = TYPE_VFIO_PLATFORM,
     .unmigratable = 1,
@@ -562,6 +698,7 @@ DEFINE_PROP_UINT32("mmap-timeout-ms", VFIOPlatformDevice,
                    vdev.mmap_timeout, 1100),
 DEFINE_PROP_UINT32("num_irqs", VFIOPlatformDevice, vdev.num_irqs, 0),
 DEFINE_PROP_UINT32("num_regions", VFIOPlatformDevice, vdev.num_regions, 0),
+DEFINE_PROP_BOOL("irqfd", VFIOPlatformDevice, irqfd_allowed, true),
 DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
1.8.3.2


* Re: [Qemu-devel] [RFC v3 03/10] vfio: add vfio-platform support
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 03/10] vfio: add vfio-platform support Eric Auger
@ 2014-06-25 21:21   ` Alexander Graf
  2014-06-26  7:47     ` Eric Auger
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Graf @ 2014-06-25 21:21 UTC (permalink / raw)
  To: Eric Auger, eric.auger, christoffer.dall, qemu-devel,
	kim.phillips, a.rigo
  Cc: peter.maydell, patches, Kim Phillips, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm


On 02.06.14 09:49, Eric Auger wrote:
> From: Kim Phillips <kim.phillips@linaro.org>
>
> Functions for which PCI and platform device support share are moved
> into common.c.  The common vfio_{get,put}_group() get an additional
> argument, a pointer to a vfio_reset_handler(), for which to pass on
> to qemu_register_reset, but only if it exists (the platform device
> code currently passes a NULL as its reset_handler).
>
> For the platform device code, we basically use SysBusDevice
> instead of PCIDevice.  Since realize() returns void, unlike
> PCIDevice's initfn, error codes are moved into the
> error message text with %m.
>
> Currently only MMIO access is supported at this time.
>
> The perceived path for future QEMU development is:
>
> - add support for interrupts
> - verify and test platform dev unmap path
> - test existing PCI path for any regressions
> - add support for creating platform devices on the qemu command line
>    - currently device address specification is hardcoded for test
>      development on Calxeda Midway's fff51000.ethernet device
> - reset is not supported and registration of reset functions is
>    bypassed for platform devices.
>    - there is no standard means of resetting a platform device,
>      unsure if it suffices to be handled at device--VFIO binding time
>
> [1] http://www.spinics.net/lists/kvm-arm/msg08195.html
>
> Changes (v2 -> v3):
> [work done by Eric Auger]
>
> This new version introduces 2 separate VFIO Device objects:
> - VFIOPCIDevice
> - VFIOPlatformDevice
>
> Both objects share a VFIODevice struct.
>
> Also a VFIORegion shared struct was created. It is embedded in
> VFIOBAR struct. VFIOPlatformDevice uses VFIORegion directly.
>
> Introducing those base classes induced quite a lot of tiny
> changes in the PCI code. Theoretically PCI and platform
> devices can be supported simultaneously. PCI modifications
> currently are not tested.
>
> The VFIODevice is not a QOM object due to the single inheritance
> model limitation.
>
> The VFIODevice struct embeds an ops structure which is
> specialized in each VFIO leaf device. This makes possible to call
> device specific functions in common parts, hence achieving better
> factorization.
>
> Reset handling typically is handled that way where a unique
> generic ResetHandler (in common.c) is used for both derived
> classes. It calls device specific methods.
>
> As in the original contribution, only MMIO is supported in that
> patch file (in mmap mode). IRQ support comes in a subsequent patch.
>
> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
> Signed-off-by: Eric Auger <eric.auger@linaro.org>
> ---
>   hw/vfio/Makefile.objs      |    2 +
>   hw/vfio/common.c           |  849 ++++++++++++++++++++++++++++
>   hw/vfio/pci.c              | 1316 ++++++++++----------------------------------
>   hw/vfio/platform.c         |  267 +++++++++
>   hw/vfio/vfio-common.h      |  143 +++++
>   linux-headers/linux/vfio.h |    1 +
>   6 files changed, 1565 insertions(+), 1013 deletions(-)

This patch is impossible to review. Please split it into one part (or
maybe even multiple patches) that separates the PCI-specific VFIO
support into its own file, and then another set of patches implementing
platform support.


Alex


* Re: [Qemu-devel] [RFC v3 04/10] vfio: simplifed DPRINTF calls using device name
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 04/10] vfio: simplifed DPRINTF calls using device name Eric Auger
@ 2014-06-25 21:22   ` Alexander Graf
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Graf @ 2014-06-25 21:22 UTC (permalink / raw)
  To: Eric Auger, eric.auger, christoffer.dall, qemu-devel,
	kim.phillips, a.rigo
  Cc: peter.maydell, patches, stuart.yoder, alex.williamson,
	christophe.barnichon, a.motakis, kvmarm


On 02.06.14 09:49, Eric Auger wrote:
> This patch gets benefit from the new VFIODevice name field.
>
> Occurences of
> DPRINTF("%s(%04x:%02x:%02x.%x) ...", __func__, vdev->host.domain,
>          vdev->host.bus, vdev->host.slot, vdev->host.function, ...)
> are replaced by
> DPRINTF("%s(%s ...", __func__, vdev->vdev.name, ...).
>
> name is built using "%04x:%02x:%02x.%01x" format string.
>
> Signed-off-by: Eric Auger <eric.auger@linaro.org>

Just convert them to trace points?


Alex


* Re: [Qemu-devel] [RFC v3 05/10] vfio: Add initial IRQ support in platform device
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 05/10] vfio: Add initial IRQ support in platform device Eric Auger
@ 2014-06-25 21:28   ` Alexander Graf
  2014-06-25 21:40     ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Graf @ 2014-06-25 21:28 UTC (permalink / raw)
  To: Eric Auger, eric.auger, christoffer.dall, qemu-devel,
	kim.phillips, a.rigo
  Cc: peter.maydell, patches, stuart.yoder, alex.williamson,
	christophe.barnichon, a.motakis, kvmarm


On 02.06.14 09:49, Eric Auger wrote:
> This patch brings a first support for device IRQ assignment to a
> KVM guest. Code is inspired of PCI INTx code.
>
> General principle of IRQ handling:
>
> when a physical IRQ occurs, VFIO driver signals an eventfd that was
> registered by the QEMU VFIO platform device. The eventfd handler
> (vfio_intp_interrupt) injects the IRQ through QEMU/KVM and also
> disables MMIO region fast path (where MMIO regions are mapped as
> RAM). The purpose is to trap the IRQ status register guest reset.
> The physical interrupt is unmasked on the first read/write in any
> MMIO region. It was masked in the VFIO driver at the instant it
> signaled the eventfd.

This doesn't sound like a very promising generic scheme to me. I can 
easily see devices requiring 2 or 3 or more accesses until they're 
pulling down the IRQ line. During that time interrupts will keep firing, 
queue up in the irqfd and get at us as spurious interrupts.

Can't we handle it like PCI where we require devices to not share an 
interrupt line? Then we can just wait until the EOI in the interrupt 
controller.


Alex

>
> A single IRQ can be forwarded to the guest at a time, ie. before a
> new virtual IRQ to be injected, the previous active one must have
> completed.
>
> When no IRQ is pending anymore, fast path can be restored. This is
> done on mmap_timer scheduling.
>
> irqfd support will be added in a subsequent patch. irqfd brings a
> framework where the eventfd is handled on kernel side instead of in
> user-side as currently done, hence improving the performance.
>
> Although the code is prepared to support multiple IRQs, this is not
> tested at that stage.
>
> Tested on Calxeda Midway xgmac which can be directly assigned to one
> guest (unfortunately only the main IRQ is exercised). A KVM patch is
> required to invalidate stage2 entries on RAM memory region destruction
> (https://patches.linaro.org/27691/). Without that patch, slow/fast path
> switch cannot work.
>
> change v2 -> v3:
>
> - Move mmap_timer and mmap_timeout in new VFIODevice struct as
>    PCI/platform factorization.
> - multiple IRQ handling (a pending IRQ queue is added) - not tested -
> - create vfio_mmap_set_enabled as in PCI code
> - name of irq changed in virt
>
> Signed-off-by: Eric Auger <eric.auger@linaro.org>


* Re: [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option Eric Auger
@ 2014-06-25 21:30   ` Alexander Graf
  2014-06-26  8:53     ` Eric Auger
  2014-06-25 22:28   ` Peter Maydell
  1 sibling, 1 reply; 28+ messages in thread
From: Alexander Graf @ 2014-06-25 21:30 UTC (permalink / raw)
  To: Eric Auger, eric.auger, christoffer.dall, qemu-devel,
	kim.phillips, a.rigo
  Cc: peter.maydell, patches, stuart.yoder, alex.williamson,
	christophe.barnichon, a.motakis, kvmarm


On 02.06.14 09:49, Eric Auger wrote:
> This patch aims at allowing the end-user to specify the device he
> wants to directly assign to his mach-virt guest in the QEMU command
> line.
>
> The QEMU platform device becomes generic.
>
> Current choice is to reuse the "-device" option.
>
> For example when assigning Calxeda Midway xgmac device this option
> is used:
> -device vfio-platform,vfio_device="fff51000.ethernet",\
> compat="calxeda/hb-xgmac",mmap-timeout-ms=1000

I think we're heading in the right direction, but there is one major
nit I have. I don't think we should have a -device vfio-platform. I
think we should have a -device vfio-xgmac that maybe inherits from an
abstract vfio-platform class.

That way machine code can assemble the device tree according to the 
device and you can also implement hardware specific hacks or 
dependencies if you need them - for example the MMIO masking to find an 
EOI you did earlier.


Alex
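
For illustration only, the layering suggested above (an abstract
platform base plus a device-specific leaf type that machine code queries
through hooks) could look roughly like the plain C sketch below. It does
not use QOM; every type and function name here is hypothetical, and the
compatible string is simply the one from the command line example in
this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct platform_dev;

/* Hooks the machine code can call without knowing the leaf type. */
struct platform_dev_ops {
    /* Write the device tree "compatible" value into buf. */
    int (*dt_compatible)(const struct platform_dev *dev,
                         char *buf, size_t len);
};

/* Hypothetical abstract base: never instantiated on its own. */
struct platform_dev {
    const struct platform_dev_ops *ops;
    const char *host_name; /* host device, e.g. "fff51000.ethernet" */
};

/* Hypothetical leaf type for the Calxeda xgmac. */
struct xgmac_dev {
    struct platform_dev parent; /* first member: base pointer is castable */
};

static int xgmac_dt_compatible(const struct platform_dev *dev,
                               char *buf, size_t len)
{
    int n;

    (void)dev;
    n = snprintf(buf, len, "calxeda/hb-xgmac");
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

static const struct platform_dev_ops xgmac_ops = {
    .dt_compatible = xgmac_dt_compatible,
};

static void xgmac_init(struct xgmac_dev *x, const char *host_name)
{
    x->parent.ops = &xgmac_ops;
    x->parent.host_name = host_name;
}

/* Machine-side helper: works on the base type only. */
static int machine_describe(const struct platform_dev *dev,
                            char *buf, size_t len)
{
    assert(dev != NULL && dev->ops != NULL);
    return dev->ops->dt_compatible(dev, buf, len);
}
```

With this split, device-specific knowledge such as the compatible string
or an EOI quirk lives in the leaf type rather than in the board code.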


* Re: [Qemu-devel] [RFC v3 10/10] vfio: Add irqfd support in platform device
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 10/10] vfio: Add irqfd support in platform device Eric Auger
@ 2014-06-25 21:35   ` Alexander Graf
  2014-06-25 21:54     ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Graf @ 2014-06-25 21:35 UTC (permalink / raw)
  To: Eric Auger, eric.auger, christoffer.dall, qemu-devel,
	kim.phillips, a.rigo
  Cc: peter.maydell, patches, stuart.yoder, alex.williamson,
	christophe.barnichon, a.motakis, kvmarm


On 02.06.14 09:49, Eric Auger wrote:
> This patch aims at optimizing IRQ handling using irqfd framework.
> It brings significant performance improvement over "traditional" IRQ
> handling introduced in :
> "vfio: Add initial IRQ support in platform device".
>
> This new IRQ handling method depends on kernel KVM irqfd/GSI routing
> capability.
>
> The IRQ handling method can be dynamically chosen (default is irqfd,
> if kernel supports it obviously).  For example to disable irqfd
> handling, use:
>
> -device vfio-platform,vfio_device="fff51000.ethernet",\
> compat="calxeda/hb-xgmac",mmap-timeout-ms=110,irqfd=false\
>
> Performances are improved for the following reasons:
> - eventfds signalled by the VFIO platform driver are handled on
>    kernel side by the KVM irqfd framework.
> - the end of interrupt(EOI) is trapped at GIC level and not at MMIO
>    region level. As a reminder, in traditional IRQ handling QEMU
>    assumed the first guest access to a device MMIO region after IRQ
>    hit was the IRQ status register reset. This trap was approximate
>    and obliged to swap to slow path after IRQ hit. A mmap timer
>    mechanism enabled to swap back to fast path after the mmap period
>    introducing extra complexity. Now GIC detects the completion of
>    the virtual IRQ and signals a resampler eventfd on maintenance
>    IRQ. The corresponding handler re-enables the physical IRQ.

Ah, so if you're using irqfd you do unmask the interrupt on EOI. Why not 
without irqfd? And if the answer is "because it's too difficult" - why 
support VFIO without irqfd at all then?


Alex


* Re: [Qemu-devel] [RFC v3 05/10] vfio: Add initial IRQ support in platform device
  2014-06-25 21:28   ` Alexander Graf
@ 2014-06-25 21:40     ` Alex Williamson
  2014-06-26  8:41       ` Eric Auger
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2014-06-25 21:40 UTC (permalink / raw)
  To: Alexander Graf
  Cc: peter.maydell, kim.phillips, eric.auger, Eric Auger, patches,
	a.rigo, qemu-devel, stuart.yoder, christophe.barnichon,
	a.motakis, kvmarm, christoffer.dall

On Wed, 2014-06-25 at 23:28 +0200, Alexander Graf wrote:
> On 02.06.14 09:49, Eric Auger wrote:
> > This patch brings a first support for device IRQ assignment to a
> > KVM guest. Code is inspired of PCI INTx code.
> >
> > General principle of IRQ handling:
> >
> > when a physical IRQ occurs, VFIO driver signals an eventfd that was
> > registered by the QEMU VFIO platform device. The eventfd handler
> > (vfio_intp_interrupt) injects the IRQ through QEMU/KVM and also
> > disables MMIO region fast path (where MMIO regions are mapped as
> > RAM). The purpose is to trap the IRQ status register guest reset.
> > The physical interrupt is unmasked on the first read/write in any
> > MMIO region. It was masked in the VFIO driver at the instant it
> > signaled the eventfd.
> 
> This doesn't sound like a very promising generic scheme to me. I can 
> easily see devices requiring 2 or 3 or more accesses until they're 
> pulling down the IRQ line. During that time interrupts will keep firing, 
> queue up in the irqfd and get at us as spurious interrupts.
> 
> Can't we handle it like PCI where we require devices to not share an 
> interrupt line? Then we can just wait until the EOI in the interrupt 
> controller.

QEMU's interrupt abstraction makes this really difficult and something
that's not generally necessary outside of device assignment.  I spent a
long time trying to figure out how we'd do it for PCI before I came up
with this super generic hack that works surprisingly well.  Yes, we may
get additional spurious interrupts, but from a host perspective they're
rate limited by the guest poking hardware, so there's a feedback loop.
Also note that assuming this is the same approach we take for PCI, this
mode is only used for the non-KVM accelerated path.  When we have a KVM
irqchip that supports a resampling irqfd then we can get an eventfd
signal back at the point when we should unmask the interrupt on the
host.  Creating a cross-architecture QEMU interface to give you a
callback when the architecture's notion of a resampling event occurs is
not a trivial undertaking.  Thanks,

Alex
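
The access-trapping scheme discussed above can be summarised as a small
state machine. The sketch below is a simplified model with hypothetical
names, not the QEMU code: it treats any trapped guest access as the EOI,
and it counts a re-fire whenever the device still asserts its line at
that point, which is the spurious-interrupt case raised in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the non-irqfd ("EOI on device access") path. */
struct trap_model {
    bool mmap_enabled; /* fast path: MMIO mapped straight into the guest */
    bool host_masked;  /* physical IRQ masked on the host */
    bool virq_active;  /* virtual IRQ injected, EOI not yet inferred */
    unsigned refires;  /* IRQs that fired again right after an unmask */
};

static void trap_model_init(struct trap_model *m)
{
    m->mmap_enabled = true;
    m->host_masked = false;
    m->virq_active = false;
    m->refires = 0;
}

/* eventfd handler: inject the IRQ and switch to the slow path. */
static void on_physical_irq(struct trap_model *m)
{
    m->host_masked = true;
    m->virq_active = true;
    m->mmap_enabled = false; /* the next MMIO access will trap */
}

/*
 * A trapped guest access is taken as the EOI, so the physical IRQ is
 * unmasked. If the device has not yet dropped its line, it fires again
 * immediately.
 */
static void on_trapped_access(struct trap_model *m, bool line_still_asserted)
{
    if (!m->virq_active) {
        return;
    }
    assert(!m->mmap_enabled);
    m->virq_active = false;
    m->host_masked = false;
    if (line_still_asserted) {
        m->refires++;
        on_physical_irq(m);
    }
}

/* mmap timer: restore the fast path once no IRQ is outstanding. */
static void on_mmap_timer(struct trap_model *m)
{
    if (!m->virq_active) {
        m->mmap_enabled = true;
    }
}
```

For a device that needs three accesses before it drops its line, the
model records two re-fires, each one gated by a further guest access.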


* Re: [Qemu-devel] [RFC v3 10/10] vfio: Add irqfd support in platform device
  2014-06-25 21:35   ` Alexander Graf
@ 2014-06-25 21:54     ` Alex Williamson
  2014-06-25 22:02       ` Alexander Graf
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2014-06-25 21:54 UTC (permalink / raw)
  To: Alexander Graf
  Cc: peter.maydell, kim.phillips, eric.auger, Eric Auger, patches,
	a.rigo, qemu-devel, stuart.yoder, christophe.barnichon,
	a.motakis, kvmarm, christoffer.dall

On Wed, 2014-06-25 at 23:35 +0200, Alexander Graf wrote:
> On 02.06.14 09:49, Eric Auger wrote:
> > This patch aims at optimizing IRQ handling using irqfd framework.
> > It brings significant performance improvement over "traditional" IRQ
> > handling introduced in :
> > "vfio: Add initial IRQ support in platform device".
> >
> > This new IRQ handling method depends on kernel KVM irqfd/GSI routing
> > capability.
> >
> > The IRQ handling method can be dynamically chosen (default is irqfd,
> > if kernel supports it obviously).  For example to disable irqfd
> > handling, use:
> >
> > -device vfio-platform,vfio_device="fff51000.ethernet",\
> > compat="calxeda/hb-xgmac",mmap-timeout-ms=110,irqfd=false\
> >
> > Performances are improved for the following reasons:
> > - eventfds signalled by the VFIO platform driver are handled on
> >    kernel side by the KVM irqfd framework.
> > - the end of interrupt(EOI) is trapped at GIC level and not at MMIO
> >    region level. As a reminder, in traditional IRQ handling QEMU
> >    assumed the first guest access to a device MMIO region after IRQ
> >    hit was the IRQ status register reset. This trap was approximate
> >    and obliged to swap to slow path after IRQ hit. A mmap timer
> >    mechanism enabled to swap back to fast path after the mmap period
> >    introducing extra complexity. Now GIC detects the completion of
> >    the virtual IRQ and signals a resampler eventfd on maintenance
> >    IRQ. The corresponding handler re-enables the physical IRQ.
> 
> Ah, so if you're using irqfd you do unmask the interrupt on EOI. Why not 
> without irqfd? And if the answer is "because it's too difficult" - why 
> support VFIO without irqfd at all then?

Yes, it's too difficult, or at least it doesn't have sufficient ROI to
try to plumb it through QEMU.  What the hack in patch 5 enables is IRQ
support that doesn't rely on a KVM irqchip.  By not supporting it, we
drop any hope of running tcg targets.  I'll admit that a tcg target with
an assigned device is mostly an unsupportable toy, but it's potentially
useful for development and things like driver or even hardware
debugging.  IMHO, the EOI on device access is a nice, self contained
solution, even if it ends up being rather sloppy with interrupts.
Thanks,

Alex


* Re: [Qemu-devel] [RFC v3 10/10] vfio: Add irqfd support in platform device
  2014-06-25 21:54     ` Alex Williamson
@ 2014-06-25 22:02       ` Alexander Graf
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Graf @ 2014-06-25 22:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: peter.maydell, kim.phillips, eric.auger, Eric Auger, patches,
	a.rigo, qemu-devel, stuart.yoder, christophe.barnichon,
	a.motakis, kvmarm, christoffer.dall


On 25.06.14 23:54, Alex Williamson wrote:
> On Wed, 2014-06-25 at 23:35 +0200, Alexander Graf wrote:
>> On 02.06.14 09:49, Eric Auger wrote:
>>> This patch aims at optimizing IRQ handling using irqfd framework.
>>> It brings significant performance improvement over "traditional" IRQ
>>> handling introduced in :
>>> "vfio: Add initial IRQ support in platform device".
>>>
>>> This new IRQ handling method depends on kernel KVM irqfd/GSI routing
>>> capability.
>>>
>>> The IRQ handling method can be dynamically chosen (default is irqfd,
>>> if kernel supports it obviously).  For example to disable irqfd
>>> handling, use:
>>>
>>> -device vfio-platform,vfio_device="fff51000.ethernet",\
>>> compat="calxeda/hb-xgmac",mmap-timeout-ms=110,irqfd=false\
>>>
>>> Performances are improved for the following reasons:
>>> - eventfds signalled by the VFIO platform driver are handled on
>>>     kernel side by the KVM irqfd framework.
>>> - the end of interrupt(EOI) is trapped at GIC level and not at MMIO
>>>     region level. As a reminder, in traditional IRQ handling QEMU
>>>     assumed the first guest access to a device MMIO region after IRQ
>>>     hit was the IRQ status register reset. This trap was approximate
>>>     and obliged to swap to slow path after IRQ hit. A mmap timer
>>>     mechanism enabled to swap back to fast path after the mmap period
>>>     introducing extra complexity. Now GIC detects the completion of
>>>     the virtual IRQ and signals a resampler eventfd on maintenance
>>>     IRQ. The corresponding handler re-enables the physical IRQ.
>> Ah, so if you're using irqfd you do unmask the interrupt on EOI. Why not
>> without irqfd? And if the answer is "because it's too difficult" - why
>> support VFIO without irqfd at all then?
> Yes, it's too difficult, or at least it doesn't have sufficient ROI to
> try to plumb it through QEMU.  What the hack in patch 5 enables is IRQ
> support that doesn't rely on a KVM irqchip.  By not supporting it, we
> drop any hope of running tcg targets.  I'll admit that a tcg target with
> an assigned device is mostly an unsupportable toy, but it's potentially
> useful for development and things like driver or even hardware
> debugging.  IMHO, the EOI on device access is a nice, self contained
> solution, even if it ends up being rather sloppy with interrupts.
> Thanks,

Yeah, I didn't realize that this is the same mechanism we use for PCI. 
You convinced me :).


Alex


* Re: [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option
  2014-06-02  7:49 ` [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option Eric Auger
  2014-06-25 21:30   ` Alexander Graf
@ 2014-06-25 22:28   ` Peter Maydell
  2014-06-25 22:28     ` Alexander Graf
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2014-06-25 22:28 UTC (permalink / raw)
  To: Eric Auger
  Cc: Alexander Graf, Kim Phillips, eric.auger, Patch Tracking,
	QEMU Developers, Alvise Rigo, Alex Williamson,
	christophe.barnichon, Stuart Yoder, Antonios Motakis, kvmarm,
	Christoffer Dall

On 2 June 2014 08:49, Eric Auger <eric.auger@linaro.org> wrote:
> This patch aims at allowing the end-user to specify the device he
> wants to directly assign to his mach-virt guest in the QEMU command
> line.

>  hw/arm/virt.c         | 222 +++++++++++++++++++++++++++++++++++++++++---------

This is way too much code to be adding to virt.c. I really don't
want to be dealing with VFIO related code in board models
beyond an absolute minimal "go do vfio stuff if the user asked
for it" set of hooks.

thanks
-- PMM


* Re: [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option
  2014-06-25 22:28   ` Peter Maydell
@ 2014-06-25 22:28     ` Alexander Graf
  2014-06-26  7:39       ` Eric Auger
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Graf @ 2014-06-25 22:28 UTC (permalink / raw)
  To: Peter Maydell, Eric Auger
  Cc: Kim Phillips, eric.auger, Patch Tracking, QEMU Developers,
	Alvise Rigo, Alex Williamson, christophe.barnichon, Stuart Yoder,
	Antonios Motakis, kvmarm, Christoffer Dall


On 26.06.14 00:28, Peter Maydell wrote:
> On 2 June 2014 08:49, Eric Auger <eric.auger@linaro.org> wrote:
>> This patch aims at allowing the end-user to specify the device he
>> wants to directly assign to his mach-virt guest in the QEMU command
>> line.
>>   hw/arm/virt.c         | 222 +++++++++++++++++++++++++++++++++++++++++---------
> This is way too much code to be adding to virt.c. I really don't
> want to be dealing with VFIO related code in board models
> beyond an absolute minimal "go do vfio stuff if the user asked
> for it" set of hooks.

And device tree chunks :).


Alex


* Re: [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option
  2014-06-25 22:28     ` Alexander Graf
@ 2014-06-26  7:39       ` Eric Auger
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Auger @ 2014-06-26  7:39 UTC (permalink / raw)
  To: Alexander Graf, Peter Maydell
  Cc: Kim Phillips, eric.auger, Patch Tracking, QEMU Developers,
	Alvise Rigo, Alex Williamson, christophe.barnichon, Stuart Yoder,
	Antonios Motakis, kvmarm, Christoffer Dall

On 06/26/2014 12:28 AM, Alexander Graf wrote:
> 
> On 26.06.14 00:28, Peter Maydell wrote:
>> On 2 June 2014 08:49, Eric Auger <eric.auger@linaro.org> wrote:
>>> This patch aims at allowing the end-user to specify the device he
>>> wants to directly assign to his mach-virt guest in the QEMU command
>>> line.
>>>   hw/arm/virt.c         | 222 +++++++++++++++++++++++++++++++++++++++++---------
>> This is way too much code to be adding to virt.c. I really don't
>> want to be dealing with VFIO related code in board models
>> beyond an absolute minimal "go do vfio stuff if the user asked
>> for it" set of hooks.
Hi Alex, Peter,

Thanks for your comments. I am currently preparing v4, where I use the
same technique as Alex did in
[PATCH 4/5] PPC: e500: Support platform devices
(http://lists.gnu.org/archive/html/qemu-devel/2014-06/msg00847.html), i.e.
using
- qemu_add_machine_init_done_notifier
- qemu_register_reset,
which seems to be the clean way to do what I tried to achieve here.
Alex, I am actually reusing the code you put in e500, moved into a
separate helper file. That way I will be able to keep very little
VFIO-specific code in the machine file.
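
A minimal sketch of that pattern, in plain C and outside of QEMU. It
only illustrates the idea of a helper file that registers one
"machine init done" callback and one reset callback, so that the board
file makes a single call. Every sketch_* name is hypothetical: the two
callback lists are stand-ins for qemu_add_machine_init_done_notifier
and qemu_register_reset, not their real implementations or signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in callback registry (NOT QEMU code). */
typedef void (*sketch_cb)(void *opaque);

#define SKETCH_MAX_CB 8

struct sketch_cb_list {
    sketch_cb fn[SKETCH_MAX_CB];
    void *opaque[SKETCH_MAX_CB];
    size_t n;
};

static struct sketch_cb_list init_done_list; /* run once all devices exist */
static struct sketch_cb_list reset_list;     /* run on every system reset */

static int sketch_list_add(struct sketch_cb_list *l, sketch_cb fn, void *opaque)
{
    assert(l != NULL && fn != NULL);
    if (l->n == SKETCH_MAX_CB) {
        return -1;
    }
    l->fn[l->n] = fn;
    l->opaque[l->n] = opaque;
    l->n++;
    return 0;
}

static void sketch_list_run(const struct sketch_cb_list *l)
{
    for (size_t i = 0; i < l->n; i++) {
        l->fn[i](l->opaque[i]);
    }
}

/* Hypothetical state owned by the helper file, not by the board. */
struct sketch_platform_bus {
    int devices_realized;  /* incremented as -device options are processed */
    int dt_nodes_emitted;  /* device tree nodes written for those devices */
    int resets;
};

/* Runs after all devices exist, so the helper can see every one of them. */
static void sketch_bus_init_done(void *opaque)
{
    struct sketch_platform_bus *bus = opaque;
    bus->dt_nodes_emitted = bus->devices_realized;
}

static void sketch_bus_reset(void *opaque)
{
    struct sketch_platform_bus *bus = opaque;
    bus->resets++;
}

/* The only call the board file would have to make. */
int sketch_platform_bus_attach(struct sketch_platform_bus *bus)
{
    if (sketch_list_add(&init_done_list, sketch_bus_init_done, bus) < 0) {
        return -1;
    }
    return sketch_list_add(&reset_list, sketch_bus_reset, bus);
}

/* Stand-ins for the points where the machine would fire the callbacks. */
void sketch_machine_init_done(void) { sketch_list_run(&init_done_list); }
void sketch_system_reset(void) { sketch_list_run(&reset_list); }
```

The point of the split is that the board only knows about
sketch_platform_bus_attach(); what happens at init-done and at reset
time stays in the helper.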

Best Regards

Eric

> 
> And device tree chunks :).
> 
> 
> Alex
> 


* Re: [Qemu-devel] [RFC v3 03/10] vfio: add vfio-platform support
  2014-06-25 21:21   ` Alexander Graf
@ 2014-06-26  7:47     ` Eric Auger
  2014-06-26  9:56       ` Alexander Graf
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Auger @ 2014-06-26  7:47 UTC (permalink / raw)
  To: Alexander Graf, eric.auger, christoffer.dall, qemu-devel,
	kim.phillips, a.rigo
  Cc: peter.maydell, patches, Kim Phillips, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm

On 06/25/2014 11:21 PM, Alexander Graf wrote:
> 
> On 02.06.14 09:49, Eric Auger wrote:
>> From: Kim Phillips <kim.phillips@linaro.org>
>>
>> Functions for which PCI and platform device support share are moved
>> into common.c.  The common vfio_{get,put}_group() get an additional
>> argument, a pointer to a vfio_reset_handler(), for which to pass on
>> to qemu_register_reset, but only if it exists (the platform device
>> code currently passes a NULL as its reset_handler).
>>
>> For the platform device code, we basically use SysBusDevice
>> instead of PCIDevice.  Since realize() returns void, unlike
>> PCIDevice's initfn, error codes are moved into the
>> error message text with %m.
>>
>> Currently only MMIO access is supported at this time.
>>
>> The perceived path for future QEMU development is:
>>
>> - add support for interrupts
>> - verify and test platform dev unmap path
>> - test existing PCI path for any regressions
>> - add support for creating platform devices on the qemu command line
>>    - currently device address specification is hardcoded for test
>>      development on Calxeda Midway's fff51000.ethernet device
>> - reset is not supported and registration of reset functions is
>>    bypassed for platform devices.
>>    - there is no standard means of resetting a platform device,
>>      unsure if it suffices to be handled at device--VFIO binding time
>>
>> [1] http://www.spinics.net/lists/kvm-arm/msg08195.html
>>
>> Changes (v2 -> v3):
>> [work done by Eric Auger]
>>
>> This new version introduces 2 separate VFIO Device objects:
>> - VFIOPCIDevice
>> - VFIOPlatformDevice
>>
>> Both objects share a VFIODevice struct.
>>
>> Also a VFIORegion shared struct was created. It is embedded in
>> VFIOBAR struct. VFIOPlatformDevice uses VFIORegion directly.
>>
>> Introducing those base classes induced quite a lot of tiny
>> changes in the PCI code. Theoretically PCI and platform
>> devices can be supported simultaneously. PCI modifications
>> currently are not tested.
>>
>> The VFIODevice is not a QOM object due to the single inheritance
>> model limitation.
>>
>> The VFIODevice struct embeds an ops structure which is
>> specialized in each VFIO leaf device. This makes possible to call
>> device specific functions in common parts, hence achieving better
>> factorization.
>>
>> Reset handling typically is handled that way where a unique
>> generic ResetHandler (in common.c) is used for both derived
>> classes. It calls device specific methods.
>>
>> As in the original contribution, only MMIO is supported in that
>> patch file (in mmap mode). IRQ support comes in a subsequent patch.
>>
>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>> ---
>>   hw/vfio/Makefile.objs      |    2 +
>>   hw/vfio/common.c           |  849 ++++++++++++++++++++++++++++
>>   hw/vfio/pci.c              | 1316 ++++++++++----------------------------------
>>   hw/vfio/platform.c         |  267 +++++++++
>>   hw/vfio/vfio-common.h      |  143 +++++
>>   linux-headers/linux/vfio.h |    1 +
>>   6 files changed, 1565 insertions(+), 1013 deletions(-)
> 
> This patch is impossible to review. Please split it out into a part (or
> maybe even multiple patches) that separate pci specific VFIO support
> into its own file and then another set of patches implementing platform
> support.
Hi Alex,

OK I will reorganize the patch as you suggest.

Alex (both of you), do you think the general split direction looks
reasonable, i.e. having this VFIODevice struct and VFIORegion struct? Or
do you think it induces too many (and possibly cumbersome) changes in
the PCI code?
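
To make the question concrete, here is a minimal sketch of that split
in plain C. It is not the patch code: all sketch_* names are
hypothetical, and leaving the platform reset hook NULL is an assumption
made for illustration. It shows a shared base struct with an ops table,
embedded in two leaf devices, and one generic reset routine that calls
the device-specific method when there is one.

```c
#include <assert.h>
#include <stddef.h>

/* Recover the leaf device from a pointer to its embedded base struct. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_vfio_device;

/* Device-specific methods, specialized by each leaf device. */
struct sketch_vfio_device_ops {
    void (*reset)(struct sketch_vfio_device *vdev); /* may be NULL */
};

/* Shared base: a plain struct, not an object of its own. */
struct sketch_vfio_device {
    const char *name;
    const struct sketch_vfio_device_ops *ops;
};

struct sketch_vfio_pci_device {
    struct sketch_vfio_device vdev;
    int reset_count;
};

struct sketch_vfio_platform_device {
    struct sketch_vfio_device vdev;
};

static void sketch_pci_reset(struct sketch_vfio_device *vdev)
{
    struct sketch_vfio_pci_device *pdev =
        sketch_container_of(vdev, struct sketch_vfio_pci_device, vdev);
    pdev->reset_count++;
}

const struct sketch_vfio_device_ops sketch_pci_ops = {
    .reset = sketch_pci_reset,
};

/* Assumption: no platform reset method exists yet, so the hook is NULL. */
const struct sketch_vfio_device_ops sketch_platform_ops = {
    .reset = NULL,
};

/* Generic handler in the common code; returns how many devices it reset. */
int sketch_vfio_reset_all(struct sketch_vfio_device *const *devs, size_t n)
{
    int handled = 0;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (devs[i]->ops != NULL && devs[i]->ops->reset != NULL) {
            devs[i]->ops->reset(devs[i]);
            handled++;
        }
    }
    return handled;
}
```

The common code only ever sees struct sketch_vfio_device, which is what
keeps PCI and platform specifics out of it.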

wrt VFIOPlatformDevice, I changed the code to inherit from PlatformDevice
instead of SysBusDevice (maybe too early, but the future will tell).

Best Regards

Eric
> 
> 
> Alex
> 


* Re: [Qemu-devel] [RFC v3 05/10] vfio: Add initial IRQ support in platform device
  2014-06-25 21:40     ` Alex Williamson
@ 2014-06-26  8:41       ` Eric Auger
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Auger @ 2014-06-26  8:41 UTC (permalink / raw)
  To: Alex Williamson, Alexander Graf
  Cc: peter.maydell, kim.phillips, eric.auger, patches, qemu-devel,
	a.rigo, stuart.yoder, christophe.barnichon, a.motakis, kvmarm,
	christoffer.dall

On 06/25/2014 11:40 PM, Alex Williamson wrote:
> On Wed, 2014-06-25 at 23:28 +0200, Alexander Graf wrote:
>> On 02.06.14 09:49, Eric Auger wrote:
>>> This patch brings a first support for device IRQ assignment to a
>>> KVM guest. Code is inspired of PCI INTx code.
>>>
>>> General principle of IRQ handling:
>>>
>>> when a physical IRQ occurs, VFIO driver signals an eventfd that was
>>> registered by the QEMU VFIO platform device. The eventfd handler
>>> (vfio_intp_interrupt) injects the IRQ through QEMU/KVM and also
>>> disables MMIO region fast path (where MMIO regions are mapped as
>>> RAM). The purpose is to trap the IRQ status register guest reset.
>>> The physical interrupt is unmasked on the first read/write in any
>>> MMIO region. It was masked in the VFIO driver at the instant it
>>> signaled the eventfd.
>>
>> This doesn't sound like a very promising generic scheme to me. I can 
>> easily see devices requiring 2 or 3 or more accesses until they're 
>> pulling down the IRQ line. During that time interrupts will keep firing, 
>> queue up in the irqfd and get at us as spurious interrupts.
>>
>> Can't we handle it like PCI where we require devices to not share an 
>> interrupt line? Then we can just wait until the EOI in the interrupt 
>> controller.
Hi Alex,

Actually I transposed what was done for PCI INTx. The virtual IRQ
completion instant is certainly not precise but, as Alex says later on,
irqfd should be used whenever possible for both precision and
performance. Given the performance of this legacy solution for
IRQ-intensive IP, I would discourage using that mode anyway. This is why
I did not plan to invest more in this mode.
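
The legacy mode can be pictured as a small state machine. The sketch
below is plain C and not the patch code: all sketch_* names are
hypothetical, and the separate timeout step that re-enables the fast
path is an assumption (the mmap-timeout-ms option hints at a timer, but
its exact behaviour is not described in this thread).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_intp {
    bool host_irq_masked;  /* masked by the host driver when it signals */
    bool mmio_fast_path;   /* true: regions mapped as RAM, no trapping */
    bool virq_level;       /* virtual IRQ currently asserted to the guest */
    unsigned injected;     /* number of virtual IRQs injected so far */
};

void sketch_intp_init(struct sketch_intp *s)
{
    assert(s != NULL);
    s->host_irq_masked = false;
    s->mmio_fast_path = true;
    s->virq_level = false;
    s->injected = 0;
}

/* Physical IRQ fires: the line is masked, the IRQ is injected, and the
 * MMIO fast path is disabled so that the next guest access traps. */
void sketch_intp_interrupt(struct sketch_intp *s)
{
    assert(s != NULL);
    if (s->host_irq_masked) {
        return; /* a masked line cannot signal again */
    }
    s->host_irq_masked = true;
    s->mmio_fast_path = false;
    s->virq_level = true;
    s->injected++;
}

/* Guest MMIO access. Returns true if it was trapped (slow path). The
 * first trapped access after an IRQ is treated as the end of interrupt. */
bool sketch_intp_mmio_access(struct sketch_intp *s)
{
    assert(s != NULL);
    if (s->mmio_fast_path) {
        return false;
    }
    if (s->virq_level) {
        s->virq_level = false;
        s->host_irq_masked = false; /* unmask the physical IRQ */
    }
    return true;
}

/* Assumed timer callback: restore the fast path once no IRQ is pending. */
void sketch_intp_mmap_timeout(struct sketch_intp *s)
{
    assert(s != NULL);
    if (!s->virq_level) {
        s->mmio_fast_path = true;
    }
}
```

If the device needs several register accesses before it deasserts its
line, the physical IRQ fires again right after the first access, which
is the spurious-interrupt case raised above.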
> 
> QEMU's interrupt abstraction makes this really difficult and something
> that's not generally necessary outside of device assignment.  I spent a
> long time trying to figure out how we'd do it for PCI before I came up
> with this super generic hack that works surprisingly well.  Yes, we may
> get additional spurious interrupts, but from a host perspective they're
> rate limited by the guest poking hardware, so there's a feedback loop.
> Also note that assuming this is the same approach we take for PCI, this
> mode is only used for the non-KVM accelerated path.
Yes this is again exactly the same approach as for PCI. We now have full
irqfd + resamplefd support.

Best Regards

Eric
>  When we have a KVM
> irqchip that supports a resampling irqfd then we can get an eventfd
> signal back at the point when we should unmask the interrupt on the
> host.  Creating a cross-architecture QEMU interface to give you a
> callback when the architecture's notion of a resampling event occurs is
> not a trivial undertaking.  Thanks,
> 
> Alex
> 


* Re: [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option
  2014-06-25 21:30   ` Alexander Graf
@ 2014-06-26  8:53     ` Eric Auger
  2014-06-26  9:25       ` Alexander Graf
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Auger @ 2014-06-26  8:53 UTC (permalink / raw)
  To: Alexander Graf, eric.auger, christoffer.dall, qemu-devel,
	kim.phillips, a.rigo
  Cc: peter.maydell, patches, stuart.yoder, alex.williamson,
	christophe.barnichon, a.motakis, kvmarm

On 06/25/2014 11:30 PM, Alexander Graf wrote:
> 
> On 02.06.14 09:49, Eric Auger wrote:
>> This patch aims at allowing the end-user to specify the device he
>> wants to directly assign to his mach-virt guest in the QEMU command
>> line.
>>
>> The QEMU platform device becomes generic.
>>
>> Current choice is to reuse the "-device" option.
>>
>> For example when assigning Calxeda Midway xgmac device this option
>> is used:
>> -device vfio-platform,vfio_device="fff51000.ethernet",\
>> compat="calxeda/hb-xgmac",mmap-timeout-ms=1000
> 
> I think we're walking into the right direction, but there is one major
> nit I have. I don't think we should have a -device vfio-platform. I
> think we should have a -device vfio-xgmac that maybe inherits from an
> abstract vfio-platform class.
> 
> That way machine code can assemble the device tree according to the
> device and you can also implement hardware specific hacks or
> dependencies if you need them - for example the MMIO masking to find an
> EOI you did earlier.
I must admit I lack experience with devices other than my dear
xgmac. I can just say that for the time being the approach seems to fit
some ARM AMBA devices like the PL330 DMA controller. We need to go
further to identify the limits of this generic approach.

Best Regards

Eric
> 
> 
> Alex
> 


* Re: [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option
  2014-06-26  8:53     ` Eric Auger
@ 2014-06-26  9:25       ` Alexander Graf
  2014-06-26  9:30         ` Eric Auger
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Graf @ 2014-06-26  9:25 UTC (permalink / raw)
  To: Eric Auger, eric.auger, christoffer.dall, qemu-devel,
	kim.phillips, a.rigo
  Cc: peter.maydell, patches, stuart.yoder, alex.williamson,
	christophe.barnichon, a.motakis, kvmarm


On 26.06.14 10:53, Eric Auger wrote:
> On 06/25/2014 11:30 PM, Alexander Graf wrote:
>> On 02.06.14 09:49, Eric Auger wrote:
>>> This patch aims at allowing the end-user to specify the device he
>>> wants to directly assign to his mach-virt guest in the QEMU command
>>> line.
>>>
>>> The QEMU platform device becomes generic.
>>>
>>> Current choice is to reuse the "-device" option.
>>>
>>> For example when assigning Calxeda Midway xgmac device this option
>>> is used:
>>> -device vfio-platform,vfio_device="fff51000.ethernet",\
>>> compat="calxeda/hb-xgmac",mmap-timeout-ms=1000
>> I think we're walking into the right direction, but there is one major
>> nit I have. I don't think we should have a -device vfio-platform. I
>> think we should have a -device vfio-xgmac that maybe inherits from an
>> abstract vfio-platform class.
>>
>> That way machine code can assemble the device tree according to the
>> device and you can also implement hardware specific hacks or
>> dependencies if you need them - for example the MMIO masking to find an
>> EOI you did earlier.
> I must admit I am lacking experience of other devices than my dear
> xgmac. I can just say that for the time being the approach seems to fit
> some ARM Amba devices like PL330 DMA. We need to go further to identify
> the limits of this generic approach.

No, I think we're better off not faking anything generic at all, because 
I'm 99.9% sure it will never be generic in real-world device cases.

And if we start doing things generically, people will soon want to have 
other mad additions to do device specific things in generic code, such 
as "take the device tree from the host, but modify property x, y and z". 
Better be clear about our limits from the beginning :).

Imagine vfio-platform as a transport, similar to TCP. We have ports and 
moving data from left to right is always the same, but whether you need 
to open 2 ports to get a working FTP data transfer is up to the 
implementation of the protocol above. Same thing here.
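
A minimal sketch of that layering in plain C, not QEMU code: the
generic transport is abstract and cannot be instantiated, while each
concrete device type carries its own device tree information and
quirks. All sketch_* names and the quirk flag are hypothetical; the
type name and the compat string are spelled as they appear in this
thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* What a concrete device type adds on top of the generic transport. */
struct sketch_platform_class {
    const char *type_name;  /* name accepted by -device */
    const char *compat;     /* compat string for the guest device tree */
    int trap_mmio_for_eoi;  /* device-specific quirk (hypothetical flag) */
};

/* Only concrete types are listed; the abstract "vfio-platform" base is
 * deliberately absent, so it cannot be created from the command line. */
static const struct sketch_platform_class sketch_classes[] = {
    { "vfio-xgmac", "calxeda/hb-xgmac", 1 },
};

/* Returns the class for a -device type name, or NULL if unknown. */
const struct sketch_platform_class *sketch_find_class(const char *type_name)
{
    size_t count = sizeof(sketch_classes) / sizeof(sketch_classes[0]);

    assert(type_name != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(sketch_classes[i].type_name, type_name) == 0) {
            return &sketch_classes[i];
        }
    }
    return NULL;
}
```

Machine code would then build the device tree node from the class it
found, without a generic "take it from the host" path.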


Alex


* Re: [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option
  2014-06-26  9:25       ` Alexander Graf
@ 2014-06-26  9:30         ` Eric Auger
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Auger @ 2014-06-26  9:30 UTC (permalink / raw)
  To: Alexander Graf, eric.auger, christoffer.dall, qemu-devel,
	kim.phillips, a.rigo
  Cc: peter.maydell, patches, stuart.yoder, alex.williamson,
	christophe.barnichon, a.motakis, kvmarm

On 06/26/2014 11:25 AM, Alexander Graf wrote:
> 
> On 26.06.14 10:53, Eric Auger wrote:
>> On 06/25/2014 11:30 PM, Alexander Graf wrote:
>>> On 02.06.14 09:49, Eric Auger wrote:
>>>> This patch aims at allowing the end-user to specify the device he
>>>> wants to directly assign to his mach-virt guest in the QEMU command
>>>> line.
>>>>
>>>> The QEMU platform device becomes generic.
>>>>
>>>> Current choice is to reuse the "-device" option.
>>>>
>>>> For example when assigning Calxeda Midway xgmac device this option
>>>> is used:
>>>> -device vfio-platform,vfio_device="fff51000.ethernet",\
>>>> compat="calxeda/hb-xgmac",mmap-timeout-ms=1000
>>> I think we're walking into the right direction, but there is one major
>>> nit I have. I don't think we should have a -device vfio-platform. I
>>> think we should have a -device vfio-xgmac that maybe inherits from an
>>> abstract vfio-platform class.
>>>
>>> That way machine code can assemble the device tree according to the
>>> device and you can also implement hardware specific hacks or
>>> dependencies if you need them - for example the MMIO masking to find an
>>> EOI you did earlier.
>> I must admit I am lacking experience of other devices than my dear
>> xgmac. I can just say that for the time being the approach seems to fit
>> some ARM Amba devices like PL330 DMA. We need to go further to identify
>> the limits of this generic approach.
> 
> No, I think we're better off not faking anything generic at all, because
> I'm 99.9% sure it will never be generic in real-world device cases.
> 
> And if we start doing things generically, people will soon want to have
> other mad additions to do device specific things in generic code, such
> as "take the device tree from the host, but modify property x, y and z".
> Better be clear about our limits from the beginning :).
> 
> Imagine vfio-platform as a transport, similar to TCP. We have ports and
> moving data from left to right is always the same, but whether you need
> to open 2 ports to get a working FTP data transfer is up to the
> implementation of the protocol above. Same thing here.
OK you convinced me. I will investigate that way then.
Eric
> 
> 
> Alex
> 


* Re: [Qemu-devel] [RFC v3 03/10] vfio: add vfio-platform support
  2014-06-26  7:47     ` Eric Auger
@ 2014-06-26  9:56       ` Alexander Graf
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Graf @ 2014-06-26  9:56 UTC (permalink / raw)
  To: Eric Auger, eric.auger, christoffer.dall, qemu-devel,
	kim.phillips, a.rigo
  Cc: peter.maydell, patches, Kim Phillips, stuart.yoder,
	alex.williamson, christophe.barnichon, a.motakis, kvmarm


On 26.06.14 09:47, Eric Auger wrote:
> On 06/25/2014 11:21 PM, Alexander Graf wrote:
>> On 02.06.14 09:49, Eric Auger wrote:
>>> From: Kim Phillips <kim.phillips@linaro.org>
>>>
>>> Functions for which PCI and platform device support share are moved
>>> into common.c.  The common vfio_{get,put}_group() get an additional
>>> argument, a pointer to a vfio_reset_handler(), for which to pass on
>>> to qemu_register_reset, but only if it exists (the platform device
>>> code currently passes a NULL as its reset_handler).
>>>
>>> For the platform device code, we basically use SysBusDevice
>>> instead of PCIDevice.  Since realize() returns void, unlike
>>> PCIDevice's initfn, error codes are moved into the
>>> error message text with %m.
>>>
>>> Currently only MMIO access is supported at this time.
>>>
>>> The perceived path for future QEMU development is:
>>>
>>> - add support for interrupts
>>> - verify and test platform dev unmap path
>>> - test existing PCI path for any regressions
>>> - add support for creating platform devices on the qemu command line
>>>     - currently device address specification is hardcoded for test
>>>       development on Calxeda Midway's fff51000.ethernet device
>>> - reset is not supported and registration of reset functions is
>>>     bypassed for platform devices.
>>>     - there is no standard means of resetting a platform device,
>>>       unsure if it suffices to be handled at device--VFIO binding time
>>>
>>> [1] http://www.spinics.net/lists/kvm-arm/msg08195.html
>>>
>>> Changes (v2 -> v3):
>>> [work done by Eric Auger]
>>>
>>> This new version introduces 2 separate VFIO Device objects:
>>> - VFIOPCIDevice
>>> - VFIOPlatformDevice
>>>
>>> Both objects share a VFIODevice struct.
>>>
>>> Also a VFIORegion shared struct was created. It is embedded in
>>> VFIOBAR struct. VFIOPlatformDevice uses VFIORegion directly.
>>>
>>> Introducing those base classes induced quite a lot of tiny
>>> changes in the PCI code. Theoretically PCI and platform
>>> devices can be supported simultaneously. PCI modifications
>>> currently are not tested.
>>>
>>> The VFIODevice is not a QOM object due to the single inheritance
>>> model limitation.
>>>
>>> The VFIODevice struct embeds an ops structure which is
>>> specialized in each VFIO leaf device. This makes possible to call
>>> device specific functions in common parts, hence achieving better
>>> factorization.
>>>
>>> Reset handling typically is handled that way where a unique
>>> generic ResetHandler (in common.c) is used for both derived
>>> classes. It calls device specific methods.
>>>
>>> As in the original contribution, only MMIO is supported in that
>>> patch file (in mmap mode). IRQ support comes in a subsequent patch.
>>>
>>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>> ---
>>>    hw/vfio/Makefile.objs      |    2 +
>>>    hw/vfio/common.c           |  849 ++++++++++++++++++++++++++++
>>>    hw/vfio/pci.c              | 1316 ++++++++++----------------------------------
>>>    hw/vfio/platform.c         |  267 +++++++++
>>>    hw/vfio/vfio-common.h      |  143 +++++
>>>    linux-headers/linux/vfio.h |    1 +
>>>    6 files changed, 1565 insertions(+), 1013 deletions(-)
>> This patch is impossible to review. Please split it out into a part (or
>> maybe even multiple patches) that separate pci specific VFIO support
>> into its own file and then another set of patches implementing platform
>> support.
> Hi Alex,
>
> OK I will reorganize the patch as you suggest.
>
> Alex (both of you), do you think the general split direction looks
> reasonable, i.e. having this VFIODevice struct and VFIORegion struct? Or
> do you think it induces too many (and possibly cumbersome) changes in
> the PCI code?

I don't think we'll be able to tell before the implications are visible 
- and that basically means splitting the patch :(.

> wrt VFIOPlatformDevice, I changed the code to inherit from PlatformDevice
> instead of SysBusDevice (maybe too early, but the future will tell).

I'll check out which way to go forward there today. But either way, the 
two aren't incredibly different.


Alex



Thread overview: 28+ messages
2014-06-02  7:49 [Qemu-devel] [RFC v3 00/10] KVM platform device passthrough Eric Auger
2014-06-02  7:49 ` [Qemu-devel] [RFC v3 01/10] hw/arm/virt: add a xgmac device Eric Auger
2014-06-02  7:49 ` [Qemu-devel] [RFC v3 02/10] vfio: move hw/misc/vfio.c to hw/vfio/pci.c Eric Auger
2014-06-02  7:49 ` [Qemu-devel] [RFC v3 03/10] vfio: add vfio-platform support Eric Auger
2014-06-25 21:21   ` Alexander Graf
2014-06-26  7:47     ` Eric Auger
2014-06-26  9:56       ` Alexander Graf
2014-06-02  7:49 ` [Qemu-devel] [RFC v3 04/10] vfio: simplifed DPRINTF calls using device name Eric Auger
2014-06-25 21:22   ` Alexander Graf
2014-06-02  7:49 ` [Qemu-devel] [RFC v3 05/10] vfio: Add initial IRQ support in platform device Eric Auger
2014-06-25 21:28   ` Alexander Graf
2014-06-25 21:40     ` Alex Williamson
2014-06-26  8:41       ` Eric Auger
2014-06-02  7:49 ` [Qemu-devel] [RFC v3 06/10] virt: Assign a VFIO platform device with -device option Eric Auger
2014-06-25 21:30   ` Alexander Graf
2014-06-26  8:53     ` Eric Auger
2014-06-26  9:25       ` Alexander Graf
2014-06-26  9:30         ` Eric Auger
2014-06-25 22:28   ` Peter Maydell
2014-06-25 22:28     ` Alexander Graf
2014-06-26  7:39       ` Eric Auger
2014-06-02  7:49 ` [Qemu-devel] [RFC v3 07/10] Add EXEC_FLAG to VFIO DMA mappings Eric Auger
2014-06-02  7:49 ` [Qemu-devel] [RFC v3 08/10] Add AMBA devices support to VFIO Eric Auger
2014-06-02  7:49 ` [Qemu-devel] [RFC v3 09/10] Always use eventfd as notifying mechanism Eric Auger
2014-06-02  7:49 ` [Qemu-devel] [RFC v3 10/10] vfio: Add irqfd support in platform device Eric Auger
2014-06-25 21:35   ` Alexander Graf
2014-06-25 21:54     ` Alex Williamson
2014-06-25 22:02       ` Alexander Graf
