* [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited
@ 2013-05-13 10:54 David Gibson
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 1/8] iommu: Fix compile error in ioapic.c David Gibson
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: David Gibson @ 2013-05-13 10:54 UTC (permalink / raw)
  To: alex.williamson, pbonzini; +Cc: aik, qemu-devel, agraf, mst

Hi,

Here's another spin on my patches working towards integrating vfio
with guest visible iommu support.  These are on top of Paolo Bonzini's
iommu branch at:
	git://github.com/bonzini/qemu.git

This new spin starts with some extensions to the pci / memory API
core, then some VFIO updates to use those to start breaking vfio's
assumption of mapping system memory directly.

It doesn't actually support guest IOMMUs with vfio yet, but it does
support simpler non-identity DMA mappings between PCI space and main
memory, such as RAM being mapped at a non-zero offset within PCI space
(this is configurable on many embedded host bridges).  This new spin
no longer adds vfio specific hooks into the memory core, which turns
out to be less messy than I anticipated, though it may yet get worse
with actual guest IOMMU support, which I'm still working on.

Aside: I realised there's another problem with assignment of DMA
address spaces in the current DMAContext scheme which is still there
with the iommu rework, but it's more or less orthogonal to the changes
here, so I've left it for now.  Specifically, the way the iommu is
determined from a callback in the PCIBus means that it won't be
assigned for devices under a PCI-PCI bridge.
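
The "RAM at a non-zero offset within PCI space" case mentioned above can be
sketched as a simple linear translation.  This is an illustration only, with
invented names (DmaWindow, dma_translate): QEMU models this with a
MemoryRegion alias inside the bus's DMA address space rather than an explicit
translate function.

```c
#include <stdbool.h>
#include <stdint.h>

/* One linear DMA window: system RAM that a host bridge exposes at a
 * non-zero base address within PCI (bus) address space. */
typedef struct {
    uint64_t pci_base;  /* where RAM appears in PCI address space */
    uint64_t ram_base;  /* where it actually lives in system memory */
    uint64_t size;
} DmaWindow;

/* Translate a bus address to a system memory address; returns false
 * when the access falls outside the window. */
static bool dma_translate(const DmaWindow *w, uint64_t pci_addr,
                          uint64_t *ram_addr)
{
    if (pci_addr < w->pci_base || pci_addr - w->pci_base >= w->size) {
        return false;
    }
    *ram_addr = w->ram_base + (pci_addr - w->pci_base);
    return true;
}
```

An identity mapping is simply the special case pci_base == ram_base; the
point of the series is that vfio must stop assuming that special case.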

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Qemu-devel] [PATCH 1/8] iommu: Fix compile error in ioapic.c
  2013-05-13 10:54 [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited David Gibson
@ 2013-05-13 10:54 ` David Gibson
  2013-05-13 12:14   ` Paolo Bonzini
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 2/8] pci: Don't del_subregion on a non-subregion David Gibson
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2013-05-13 10:54 UTC (permalink / raw)
  To: alex.williamson, pbonzini; +Cc: aik, qemu-devel, David Gibson, agraf, mst

The iommu tree introduced a build bug into hw/i386/kvm/ioapic.c.  It looks
like a messed-up copy and paste.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/i386/kvm/ioapic.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/kvm/ioapic.c b/hw/i386/kvm/ioapic.c
index e3c29da..fe24d26 100644
--- a/hw/i386/kvm/ioapic.c
+++ b/hw/i386/kvm/ioapic.c
@@ -108,7 +108,7 @@ static void kvm_ioapic_put(IOAPICCommonState *s)
         abort();
     }
 
-    memory_region_unref(mrs.mr);
+    memory_region_unref(&s->io_memory);
 }
 
 static void kvm_ioapic_reset(DeviceState *dev)
-- 
1.7.10.4


* [Qemu-devel] [PATCH 2/8] pci: Don't del_subregion on a non-subregion
  2013-05-13 10:54 [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited David Gibson
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 1/8] iommu: Fix compile error in ioapic.c David Gibson
@ 2013-05-13 10:54 ` David Gibson
  2013-05-13 12:14   ` Paolo Bonzini
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 3/8] pci: Rework PCI iommu lifetime assumptions David Gibson
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2013-05-13 10:54 UTC (permalink / raw)
  To: alex.williamson, pbonzini; +Cc: aik, qemu-devel, David Gibson, agraf, mst

Currently do_pci_unregister_device() calls memory_region_del_subregion()
on the device's bus_master_enable_region and the device's iommu region.
But the bus_master_enable_region is an _alias_ of the iommu region; the
iommu region is _not_ a subregion of it.  I suspect this has slipped by
only because PCI hot unplug has not been tested with the new PCI DMA
address space handling.  This patch removes the bogus call.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/pci/pci.c |    1 -
 1 file changed, 1 deletion(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 0ba39e6..58d3f69 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -875,7 +875,6 @@ static void do_pci_unregister_device(PCIDevice *pci_dev)
     pci_config_free(pci_dev);
 
     address_space_destroy(&pci_dev->bus_master_as);
-    memory_region_del_subregion(&pci_dev->bus_master_enable_region, pci_dev->iommu);
     pci_dev->bus->iommu_dtor_fn(pci_dev->iommu);
     memory_region_destroy(&pci_dev->bus_master_enable_region);
 }
-- 
1.7.10.4


* [Qemu-devel] [PATCH 3/8] pci: Rework PCI iommu lifetime assumptions
  2013-05-13 10:54 [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited David Gibson
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 1/8] iommu: Fix compile error in ioapic.c David Gibson
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 2/8] pci: Don't del_subregion on a non-subregion David Gibson
@ 2013-05-13 10:54 ` David Gibson
  2013-05-13 12:15   ` Paolo Bonzini
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 4/8] pci: Use AddressSpace rather than MemoryRegion to represent PCI DMA space David Gibson
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2013-05-13 10:54 UTC (permalink / raw)
  To: alex.williamson, pbonzini; +Cc: aik, qemu-devel, David Gibson, agraf, mst

The current bus callbacks to assign and destroy an iommu memory region for
a PCI device tacitly assume that the lifetime of that address space is
tied to the lifetime of the PCI device.  Although that's certainly
possible, it's much more likely that the region will be (at least
potentially) shared between several devices and have a lifetime instead
tied to the PCI host bridge or the IOMMU itself.  This is demonstrated by
the fact that there are no existing users of the destructor hook.

This patch simplifies the code by moving to the more likely use model.
This means we can eliminate the destructor hook entirely, and the iommu_fn
hook is now assumed to inform us of the device's pre-existing DMA address
space, rather than creating or assigning it.  That in turn means we have
no need to keep a reference to the region around in the PCIDevice
structure, which saves a little space.
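
The ownership model this patch moves to can be shown in miniature: the bus
keeps only a lookup callback plus an opaque pointer, and the callback returns
a DMA space owned elsewhere (by the host bridge here), so no per-device
destructor is needed.  All names below (Bus, HostBridge, bus_setup_iommu,
etc.) are invented for this sketch; they mirror but are not QEMU's API.

```c
#include <stddef.h>

typedef struct DmaSpace { const char *name; } DmaSpace;

typedef DmaSpace *(*IommuFn)(void *opaque, int devfn);

typedef struct {
    IommuFn iommu_fn;
    void *iommu_opaque;
} Bus;

typedef struct {
    DmaSpace dma;  /* the bridge owns one space shared by its devices */
} HostBridge;

/* The callback only reports the pre-existing space; it does not create
 * anything, so nothing needs destroying per device. */
static DmaSpace *bridge_iommu_fn(void *opaque, int devfn)
{
    HostBridge *hb = opaque;
    (void)devfn;   /* one shared space regardless of slot/function */
    return &hb->dma;
}

static void bus_setup_iommu(Bus *bus, IommuFn fn, void *opaque)
{
    bus->iommu_fn = fn;
    bus->iommu_opaque = opaque;
}
```

Because every device on the bus gets back the same bridge-owned space, the
device need not cache the pointer in its own structure either, which is the
space saving the patch mentions.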

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/pci/pci.c             |   16 +++++-----------
 hw/ppc/spapr_pci.c       |    2 +-
 include/hw/pci/pci.h     |    5 +----
 include/hw/pci/pci_bus.h |    1 -
 4 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 58d3f69..3c947b3 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -285,10 +285,6 @@ static MemoryRegion *pci_default_iommu(PCIBus *bus, void *opaque, int devfn)
     return get_system_memory();
 }
 
-static void pci_default_iommu_dtor(MemoryRegion *mr)
-{
-}
-
 static void pci_bus_init(PCIBus *bus, DeviceState *parent,
                          const char *name,
                          MemoryRegion *address_space_mem,
@@ -299,7 +295,7 @@ static void pci_bus_init(PCIBus *bus, DeviceState *parent,
     bus->devfn_min = devfn_min;
     bus->address_space_mem = address_space_mem;
     bus->address_space_io = address_space_io;
-    pci_setup_iommu(bus, pci_default_iommu, NULL, NULL);
+    pci_setup_iommu(bus, pci_default_iommu, NULL);
 
     /* host bridge */
     QLIST_INIT(&bus->child);
@@ -797,6 +793,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
     PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
     PCIConfigReadFunc *config_read = pc->config_read;
     PCIConfigWriteFunc *config_write = pc->config_write;
+    MemoryRegion *dma_mr;
 
     if (devfn < 0) {
         for(devfn = bus->devfn_min ; devfn < ARRAY_SIZE(bus->devices);
@@ -814,9 +811,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
     }
 
     pci_dev->bus = bus;
-    pci_dev->iommu = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
+    dma_mr = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
     memory_region_init_alias(&pci_dev->bus_master_enable_region, "bus master",
-                             pci_dev->iommu, 0, memory_region_size(pci_dev->iommu));
+                             dma_mr, 0, memory_region_size(dma_mr));
     memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
     address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_enable_region,
                        name);
@@ -875,7 +872,6 @@ static void do_pci_unregister_device(PCIDevice *pci_dev)
     pci_config_free(pci_dev);
 
     address_space_destroy(&pci_dev->bus_master_as);
-    pci_dev->bus->iommu_dtor_fn(pci_dev->iommu);
     memory_region_destroy(&pci_dev->bus_master_enable_region);
 }
 
@@ -2237,11 +2233,9 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
     k->props = pci_props;
 }
 
-void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, PCIIOMMUDestructorFunc dtor,
-                     void *opaque)
+void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
 {
     bus->iommu_fn = fn;
-    bus->iommu_dtor_fn = dtor ? dtor : pci_default_iommu_dtor;
     bus->iommu_opaque = opaque;
 }
 
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 7014b61..eb1d9e7 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -650,7 +650,7 @@ static int spapr_phb_init(SysBusDevice *s)
         fprintf(stderr, "Unable to create TCE table for %s\n", sphb->dtbusname);
         return -1;
     }
-    pci_setup_iommu(bus, spapr_pci_dma_iommu, NULL, sphb);
+    pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb);
 
     QLIST_INSERT_HEAD(&spapr->phbs, sphb, list);
 
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 400e772..61fe51e 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -242,7 +242,6 @@ struct PCIDevice {
     PCIIORegion io_regions[PCI_NUM_REGIONS];
     AddressSpace bus_master_as;
     MemoryRegion bus_master_enable_region;
-    MemoryRegion *iommu;
 
     /* do not access the following fields */
     PCIConfigReadFunc *config_read;
@@ -402,10 +401,8 @@ int pci_read_devaddr(Monitor *mon, const char *addr, int *domp, int *busp,
 void pci_device_deassert_intx(PCIDevice *dev);
 
 typedef MemoryRegion *(*PCIIOMMUFunc)(PCIBus *, void *, int);
-typedef void (*PCIIOMMUDestructorFunc)(MemoryRegion *mr);
 
-void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, PCIIOMMUDestructorFunc dtor,
-                     void *opaque);
+void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index fada8f5..66762f6 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -11,7 +11,6 @@
 struct PCIBus {
     BusState qbus;
     PCIIOMMUFunc iommu_fn;
-    PCIIOMMUDestructorFunc iommu_dtor_fn;
     void *iommu_opaque;
     uint8_t devfn_min;
     pci_set_irq_fn set_irq;
-- 
1.7.10.4


* [Qemu-devel] [PATCH 4/8] pci: Use AddressSpace rather than MemoryRegion to represent PCI DMA space
  2013-05-13 10:54 [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited David Gibson
                   ` (2 preceding siblings ...)
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 3/8] pci: Rework PCI iommu lifetime assumptions David Gibson
@ 2013-05-13 10:54 ` David Gibson
  2013-05-13 12:16   ` Paolo Bonzini
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 5/8] pci: Introduce helper to retrieve a PCI device's DMA address space David Gibson
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2013-05-13 10:54 UTC (permalink / raw)
  To: alex.williamson, pbonzini; +Cc: aik, qemu-devel, David Gibson, agraf, mst

Currently the PCI iommu_fn hook returns a MemoryRegion * to represent the
DMA address space of this bus's IOMMU, although that MemoryRegion does
have to be a root MemoryRegion.  Several upcoming users of this need the extra
features of an AddressSpace object, rather than a MemoryRegion, and while
they could each construct their own AddressSpace wrapper for the iommu
MemoryRegion, that leads to unnecessary proliferation of essentially
identical AddressSpace objects.  This patch avoids that by instead having
iommu_fn return an AddressSpace *, assuming the referenced AS's lifetime
is managed somewhere else (probably the PCI host bridge).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/pci/pci.c                |   10 +++++-----
 hw/ppc/spapr_pci.c          |    6 ++++--
 include/hw/pci-host/spapr.h |    1 +
 include/hw/pci/pci.h        |    2 +-
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 3c947b3..39085d8 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -279,10 +279,10 @@ int pci_find_domain(const PCIBus *bus)
     return -1;
 }
 
-static MemoryRegion *pci_default_iommu(PCIBus *bus, void *opaque, int devfn)
+static AddressSpace *pci_default_iommu(PCIBus *bus, void *opaque, int devfn)
 {
     /* FIXME: inherit memory region from bus creator */
-    return get_system_memory();
+    return &address_space_memory;
 }
 
 static void pci_bus_init(PCIBus *bus, DeviceState *parent,
@@ -793,7 +793,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
     PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
     PCIConfigReadFunc *config_read = pc->config_read;
     PCIConfigWriteFunc *config_write = pc->config_write;
-    MemoryRegion *dma_mr;
+    AddressSpace *dma_as;
 
     if (devfn < 0) {
         for(devfn = bus->devfn_min ; devfn < ARRAY_SIZE(bus->devices);
@@ -811,9 +811,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
     }
 
     pci_dev->bus = bus;
-    dma_mr = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
+    dma_as = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
     memory_region_init_alias(&pci_dev->bus_master_enable_region, "bus master",
-                             dma_mr, 0, memory_region_size(dma_mr));
+                             dma_as->root, 0, memory_region_size(dma_as->root));
     memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
     address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_enable_region,
                        name);
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index eb1d9e7..762db62 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -506,11 +506,11 @@ static const MemoryRegionOps spapr_msi_ops = {
 /*
  * PHB PCI device
  */
-static MemoryRegion *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
+static AddressSpace *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 {
     sPAPRPHBState *phb = opaque;
 
-    return spapr_tce_get_iommu(phb->tcet);
+    return &phb->iommu_as;
 }
 
 static int spapr_phb_init(SysBusDevice *s)
@@ -650,6 +650,8 @@ static int spapr_phb_init(SysBusDevice *s)
         fprintf(stderr, "Unable to create TCE table for %s\n", sphb->dtbusname);
         return -1;
     }
+    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
+                       sphb->dtbusname);
     pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb);
 
     QLIST_INSERT_HEAD(&spapr->phbs, sphb, list);
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 653dd40..1e23dbf 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -50,6 +50,7 @@ typedef struct sPAPRPHBState {
     uint64_t dma_window_start;
     uint64_t dma_window_size;
     sPAPRTCETable *tcet;
+    AddressSpace iommu_as;
 
     struct {
         uint32_t irq;
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 61fe51e..6ef1f97 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -400,7 +400,7 @@ int pci_read_devaddr(Monitor *mon, const char *addr, int *domp, int *busp,
 
 void pci_device_deassert_intx(PCIDevice *dev);
 
-typedef MemoryRegion *(*PCIIOMMUFunc)(PCIBus *, void *, int);
+typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
 
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 
-- 
1.7.10.4


* [Qemu-devel] [PATCH 5/8] pci: Introduce helper to retrieve a PCI device's DMA address space
  2013-05-13 10:54 [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited David Gibson
                   ` (3 preceding siblings ...)
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 4/8] pci: Use AddressSpace rather than MemoryRegion to represent PCI DMA space David Gibson
@ 2013-05-13 10:54 ` David Gibson
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 6/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace David Gibson
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2013-05-13 10:54 UTC (permalink / raw)
  To: alex.williamson, pbonzini; +Cc: aik, qemu-devel, David Gibson, agraf, mst

A PCI device's DMA address space (possibly an IOMMU) is returned by a
method on the PCIBus.  At the moment that only has one caller, so the
method is simply open coded.  We'll need another caller for VFIO, so
this patch introduces a helper/wrapper function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/pci/pci.c         |    9 ++++++++-
 include/hw/pci/pci.h |    1 +
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 39085d8..e7a9735 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -811,7 +811,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
     }
 
     pci_dev->bus = bus;
-    dma_as = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
+    dma_as = pci_iommu_as(pci_dev);
     memory_region_init_alias(&pci_dev->bus_master_enable_region, "bus master",
                              dma_as->root, 0, memory_region_size(dma_as->root));
     memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
@@ -2233,6 +2233,13 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
     k->props = pci_props;
 }
 
+AddressSpace *pci_iommu_as(PCIDevice *dev)
+{
+    PCIBus *bus = PCI_BUS(dev->bus);
+
+    return bus->iommu_fn(bus, bus->iommu_opaque, dev->devfn);
+}
+
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
 {
     bus->iommu_fn = fn;
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6ef1f97..21e2132 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -402,6 +402,7 @@ void pci_device_deassert_intx(PCIDevice *dev);
 
 typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
 
+AddressSpace *pci_iommu_as(PCIDevice *dev);
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 
 static inline void
-- 
1.7.10.4


* [Qemu-devel] [PATCH 6/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace
  2013-05-13 10:54 [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited David Gibson
                   ` (4 preceding siblings ...)
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 5/8] pci: Introduce helper to retrieve a PCI device's DMA address space David Gibson
@ 2013-05-13 10:54 ` David Gibson
  2013-05-13 11:10   ` Peter Maydell
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 7/8] vfio: Introduce VFIO address spaces David Gibson
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2013-05-13 10:54 UTC (permalink / raw)
  To: alex.williamson, pbonzini; +Cc: aik, qemu-devel, David Gibson, agraf, mst

At the moment, most AddressSpace objects last as long as the guest system
in practice, but that could well change in future.  In addition, for VFIO
we will be introducing some private per-AddressSpace information, which must
be disposed of before the AddressSpace itself is destroyed.

To reduce the chances of subtle bugs in this area, this patch adds
assertions to ensure that when an AddressSpace is destroyed, there are no
remaining MemoryListeners using that AS as a filter.
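
The check itself is a straight walk over the global listener list at destroy
time.  The sketch below uses a plain singly linked list instead of QEMU's
QTAILQ and invented names (Listener, as_safe_to_destroy); where QEMU
assert()s, this version returns a flag so the behaviour is visible.

```c
#include <stddef.h>

typedef struct AddressSpace { int dummy; } AddressSpace;

typedef struct Listener {
    AddressSpace *filter;  /* like MemoryListener's address_space_filter */
    struct Listener *next;
} Listener;

static Listener *listeners;  /* global registry, like memory_listeners */

static void listener_register(Listener *l, AddressSpace *filter)
{
    l->filter = filter;
    l->next = listeners;
    listeners = l;
}

/* Returns 1 if destroying 'as' now would be safe: no registered
 * listener still filters on it. */
static int as_safe_to_destroy(const AddressSpace *as)
{
    for (Listener *l = listeners; l; l = l->next) {
        if (l->filter == as) {
            return 0;  /* the patch assert()s at this point */
        }
    }
    return 1;
}
```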

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 memory.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/memory.c b/memory.c
index 1a86607..c409ee5 100644
--- a/memory.c
+++ b/memory.c
@@ -1727,11 +1727,18 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
 
 void address_space_destroy(AddressSpace *as)
 {
+    MemoryListener *listener;
+
     /* Flush out anything from MemoryListeners listening in on this */
     memory_region_transaction_begin();
     as->root = NULL;
     memory_region_transaction_commit();
     QTAILQ_REMOVE(&address_spaces, as, address_spaces_link);
+
+    QTAILQ_FOREACH(listener, &memory_listeners, link) {
+        assert(listener->address_space_filter != as);
+    }
+
     address_space_destroy_dispatch(as);
     flatview_unref(as->current_map);
     g_free(as->name);
-- 
1.7.10.4


* [Qemu-devel] [PATCH 7/8] vfio: Introduce VFIO address spaces
  2013-05-13 10:54 [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited David Gibson
                   ` (5 preceding siblings ...)
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 6/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace David Gibson
@ 2013-05-13 10:54 ` David Gibson
  2013-05-13 21:32   ` Alex Williamson
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 8/8] vfio: Create VFIOAddressSpace objects as needed David Gibson
  2013-05-13 12:23 ` [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited Paolo Bonzini
  8 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2013-05-13 10:54 UTC (permalink / raw)
  To: alex.williamson, pbonzini; +Cc: aik, qemu-devel, David Gibson, agraf, mst

The only model so far supported for VFIO passthrough devices is the model
usually used on x86, where all of the guest's RAM is mapped into the
(host) IOMMU and there is no IOMMU visible in the guest.

This patch begins to relax this model, introducing the notion of a
VFIOAddressSpace.  This represents a logical DMA address space which will
be visible to one or more VFIO devices by appropriate mapping in the (host)
IOMMU.  Thus the currently global list of containers becomes local to
a VFIOAddressSpace, and we verify that we don't attempt to add a VFIO
group to multiple address spaces.

For now, only one VFIOAddressSpace is created and used, corresponding to
main system memory; that will change in future patches.
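
The invariant the patch verifies ("we don't attempt to add a VFIO group to
multiple address spaces") reduces to one pointer comparison.  A minimal
sketch, with invented names (Vas, Container, Group, group_check_vas) standing
in for the VFIO structures:

```c
#include <errno.h>
#include <stddef.h>

typedef struct Vas { int dummy; } Vas;
typedef struct Container { Vas *vas; } Container;
typedef struct Group { Container *container; } Group;

/* Returns 0 if the group may be (re)used with 'vas': either it is not
 * yet bound to a container, or its container already belongs to the
 * same VFIO address space.  Returns -EBUSY otherwise, matching the
 * patch's error path. */
static int group_check_vas(const Group *g, const Vas *vas)
{
    if (g->container && g->container->vas != vas) {
        return -EBUSY;
    }
    return 0;
}
```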

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/misc/vfio.c |   67 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 50 insertions(+), 17 deletions(-)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index c4a8853..b1e9220 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -113,9 +113,17 @@ enum {
     VFIO_INT_MSIX = 3,
 };
 
+typedef struct VFIOAddressSpace {
+    AddressSpace *as;
+    QLIST_HEAD(, VFIOContainer) containers;
+} VFIOAddressSpace;
+
+static VFIOAddressSpace vfio_address_space_memory;
+
 struct VFIOGroup;
 
 typedef struct VFIOContainer {
+    VFIOAddressSpace *vas;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     struct {
         /* enable abstraction to support various iommu backends */
@@ -178,9 +186,6 @@ typedef struct VFIOGroup {
 
 #define MSIX_CAP_LENGTH 12
 
-static QLIST_HEAD(, VFIOContainer)
-    container_list = QLIST_HEAD_INITIALIZER(container_list);
-
 static QLIST_HEAD(, VFIOGroup)
     group_list = QLIST_HEAD_INITIALIZER(group_list);
 
@@ -2624,16 +2629,28 @@ static int vfio_load_rom(VFIODevice *vdev)
     return 0;
 }
 
-static int vfio_connect_container(VFIOGroup *group)
+static void vfio_address_space_init(VFIOAddressSpace *vas, AddressSpace *as)
+{
+    vas->as = as;
+    QLIST_INIT(&vas->containers);
+}
+
+static int vfio_connect(VFIOGroup *group, VFIOAddressSpace *vas)
 {
     VFIOContainer *container;
     int ret, fd;
 
     if (group->container) {
-        return 0;
+        if (group->container->vas == vas) {
+            return 0;
+        } else {
+            error_report("vfio: group %d used in multiple address spaces",
+                         group->groupid);
+            return -EBUSY;
+        }
     }
 
-    QLIST_FOREACH(container, &container_list, next) {
+    QLIST_FOREACH(container, &vas->containers, next) {
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
             group->container = container;
             QLIST_INSERT_HEAD(&container->group_list, group, container_next);
@@ -2656,6 +2673,7 @@ static int vfio_connect_container(VFIOGroup *group)
     }
 
     container = g_malloc0(sizeof(*container));
+    container->vas = vas;
     container->fd = fd;
 
     if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
@@ -2678,7 +2696,8 @@ static int vfio_connect_container(VFIOGroup *group)
         container->iommu_data.listener = vfio_memory_listener;
         container->iommu_data.release = vfio_listener_release;
 
-        memory_listener_register(&container->iommu_data.listener, &address_space_memory);
+        memory_listener_register(&container->iommu_data.listener,
+                                 container->vas->as);
     } else {
         error_report("vfio: No available IOMMU models");
         g_free(container);
@@ -2687,7 +2706,7 @@ static int vfio_connect_container(VFIOGroup *group)
     }
 
     QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&container_list, container, next);
+    QLIST_INSERT_HEAD(&vas->containers, container, next);
 
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
@@ -2695,12 +2714,12 @@ static int vfio_connect_container(VFIOGroup *group)
     return 0;
 }
 
-static void vfio_disconnect_container(VFIOGroup *group)
+static void vfio_disconnect(VFIOGroup *group)
 {
     VFIOContainer *container = group->container;
 
     if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
-        error_report("vfio: error disconnecting group %d from container",
+        error_report("vfio: error disconnecting group %d from context",
                      group->groupid);
     }
 
@@ -2712,13 +2731,13 @@ static void vfio_disconnect_container(VFIOGroup *group)
             container->iommu_data.release(container);
         }
         QLIST_REMOVE(container, next);
-        DPRINTF("vfio_disconnect_container: close container->fd\n");
+        DPRINTF("vfio_disconnect: close container->fd\n");
         close(container->fd);
         g_free(container);
     }
 }
 
-static VFIOGroup *vfio_get_group(int groupid)
+static VFIOGroup *vfio_get_group(int groupid, VFIOAddressSpace *vas)
 {
     VFIOGroup *group;
     char path[32];
@@ -2726,7 +2745,15 @@ static VFIOGroup *vfio_get_group(int groupid)
 
     QLIST_FOREACH(group, &group_list, next) {
         if (group->groupid == groupid) {
-            return group;
+            /* Found it.  Now is it already in the right context? */
+            assert(group->container);
+            if (group->container->vas == vas) {
+                return group;
+            } else {
+                error_report("vfio: group %d used in multiple address spaces",
+                             group->groupid);
+                return NULL;
+            }
         }
     }
 
@@ -2759,8 +2786,8 @@ static VFIOGroup *vfio_get_group(int groupid)
     group->groupid = groupid;
     QLIST_INIT(&group->device_list);
 
-    if (vfio_connect_container(group)) {
-        error_report("vfio: failed to setup container for group %d", groupid);
+    if (vfio_connect(group, vas)) {
+        error_report("vfio: failed to setup context for group %d", groupid);
         close(group->fd);
         g_free(group);
         return NULL;
@@ -2777,7 +2804,7 @@ static void vfio_put_group(VFIOGroup *group)
         return;
     }
 
-    vfio_disconnect_container(group);
+    vfio_disconnect(group);
     QLIST_REMOVE(group, next);
     DPRINTF("vfio_put_group: close group->fd\n");
     close(group->fd);
@@ -2992,7 +3019,12 @@ static int vfio_initfn(PCIDevice *pdev)
     DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
             vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
 
-    group = vfio_get_group(groupid);
+    if (pci_iommu_as(pdev) != &address_space_memory) {
+        error_report("vfio: DMA address space must be system memory");
+        return -ENXIO;
+    }
+
+    group = vfio_get_group(groupid, &vfio_address_space_memory);
     if (!group) {
         error_report("vfio: failed to get group %d", groupid);
         return -ENOENT;
@@ -3212,6 +3244,7 @@ static const TypeInfo vfio_pci_dev_info = {
 
 static void register_vfio_pci_dev_type(void)
 {
+    vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
     type_register_static(&vfio_pci_dev_info);
 }
 
-- 
1.7.10.4


* [Qemu-devel] [PATCH 8/8] vfio: Create VFIOAddressSpace objects as needed
  2013-05-13 10:54 [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited David Gibson
                   ` (6 preceding siblings ...)
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 7/8] vfio: Introduce VFIO address spaces David Gibson
@ 2013-05-13 10:54 ` David Gibson
  2013-05-13 21:33   ` Alex Williamson
  2013-05-13 12:23 ` [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited Paolo Bonzini
  8 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2013-05-13 10:54 UTC (permalink / raw)
  To: alex.williamson, pbonzini; +Cc: aik, qemu-devel, David Gibson, agraf, mst

So far, VFIO has a notion of different logical DMA address spaces, but
only ever uses one (system memory).  This patch extends this, creating
new VFIOAddressSpace objects as necessary, according to the AddressSpace
reported by the PCI subsystem for this device's DMAs.

This isn't enough yet to support guest side IOMMUs with VFIO, but it does
mean we could now support VFIO devices on, for example, a guest side PCI
host bridge which maps system memory somewhere other than 0 in PCI
space.
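
The core of the patch is a lookup-or-create over the list of
VFIOAddressSpace wrappers.  The same pattern in miniature, using a plain
singly linked list and invented names (VasNode, vas_get) rather than QEMU's
QLIST macros:

```c
#include <stdlib.h>

typedef struct AddressSpace { int dummy; } AddressSpace;

typedef struct VasNode {
    AddressSpace *as;
    struct VasNode *next;
} VasNode;

static VasNode *vas_list;  /* like the global vfio_address_spaces list */

/* Return the wrapper for 'as', creating and linking one on first use.
 * Error handling is elided for the sketch. */
static VasNode *vas_get(AddressSpace *as)
{
    VasNode *v;

    for (v = vas_list; v; v = v->next) {
        if (v->as == as) {
            return v;              /* reuse the existing wrapper */
        }
    }
    v = calloc(1, sizeof(*v));     /* no match: create and link a new one */
    v->as = as;
    v->next = vas_list;
    vas_list = v;
    return v;
}
```

The patch's teardown path is the mirror image: when a wrapper's container
list empties, it is unlinked and freed, so wrappers exist only while some
group needs them.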

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/misc/vfio.c |   36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index b1e9220..3850d39 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -116,9 +116,10 @@ enum {
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
+    QLIST_ENTRY(VFIOAddressSpace) list;
 } VFIOAddressSpace;
 
-static VFIOAddressSpace vfio_address_space_memory;
+QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
 
 struct VFIOGroup;
 
@@ -2635,6 +2636,23 @@ static void vfio_address_space_init(VFIOAddressSpace *vas, AddressSpace *as)
     QLIST_INIT(&vas->containers);
 }
 
+static VFIOAddressSpace *vfio_address_space_get(AddressSpace *as)
+{
+    VFIOAddressSpace *vas;
+
+    QLIST_FOREACH(vas, &vfio_address_spaces, list) {
+        if (vas->as == as)
+            return vas;
+    }
+
+    /* No suitable VFIOAddressSpace, create a new one */
+    vas = g_malloc0(sizeof(*vas));
+    vfio_address_space_init(vas, as);
+    QLIST_INSERT_HEAD(&vfio_address_spaces, vas, list);
+
+    return vas;
+}
+
 static int vfio_connect(VFIOGroup *group, VFIOAddressSpace *vas)
 {
     VFIOContainer *container;
@@ -2727,6 +2745,8 @@ static void vfio_disconnect(VFIOGroup *group)
     group->container = NULL;
 
     if (QLIST_EMPTY(&container->group_list)) {
+        VFIOAddressSpace *vas = container->vas;
+
         if (container->iommu_data.release) {
             container->iommu_data.release(container);
         }
@@ -2734,6 +2754,11 @@ static void vfio_disconnect(VFIOGroup *group)
         DPRINTF("vfio_disconnect: close container->fd\n");
         close(container->fd);
         g_free(container);
+
+        if (QLIST_EMPTY(&vas->containers)) {
+            QLIST_REMOVE(vas, list);
+            g_free(vas);
+        }
     }
 }
 
@@ -2984,6 +3009,7 @@ static int vfio_initfn(PCIDevice *pdev)
 {
     VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
     VFIOGroup *group;
+    VFIOAddressSpace *vas;
     char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
     ssize_t len;
     struct stat st;
@@ -3019,12 +3045,9 @@ static int vfio_initfn(PCIDevice *pdev)
     DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
             vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
 
-    if (pci_iommu_as(pdev) != &address_space_memory) {
-        error_report("vfio: DMA address space must be system memory");
-        return -ENXIO;
-    }
+    vas = vfio_address_space_get(pci_iommu_as(pdev));
 
-    group = vfio_get_group(groupid, &vfio_address_space_memory);
+    group = vfio_get_group(groupid, vas);
     if (!group) {
         error_report("vfio: failed to get group %d", groupid);
         return -ENOENT;
@@ -3244,7 +3267,6 @@ static const TypeInfo vfio_pci_dev_info = {
 
 static void register_vfio_pci_dev_type(void)
 {
-    vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
     type_register_static(&vfio_pci_dev_info);
 }
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH 6/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 6/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace David Gibson
@ 2013-05-13 11:10   ` Peter Maydell
  2013-05-13 11:48     ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2013-05-13 11:10 UTC (permalink / raw)
  To: David Gibson; +Cc: mst, aik, qemu-devel, agraf, alex.williamson, pbonzini

On 13 May 2013 11:54, David Gibson <david@gibson.dropbear.id.au> wrote:
> At the moment, most AddressSpace objects last as long as the guest system
> in practice, but that could well change in future.  In addition, for VFIO
> we will be introducing some private per-AddressSpace information, which must
> be disposed of before the AddressSpace itself is destroyed.
>
> To reduce the chances of subtle bugs in this area, this patch adds
> assertions to ensure that when an AddressSpace is destroyed, there are no
> remaining MemoryListeners using that AS as a filter.

Hmm, is this the ideal semantics? Typically the owner of the
MemoryListener isn't the owner of the AddressSpace so it isn't
necessarily in a position to guarantee that it can unregister
the listener before the address space is destroyed. In fact
as the listener API is currently documented, the filter
argument is just an optimisation to save the callbacks having
to filter out irrelevant information themselves.

Perhaps an "address space being destroyed" callback for
listeners would be better? Then VFIO could just do the
necessary disposal operations automatically when the AS
goes away.

thanks
-- PMM

* Re: [Qemu-devel] [PATCH 6/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace
  2013-05-13 11:10   ` Peter Maydell
@ 2013-05-13 11:48     ` David Gibson
  2013-05-13 12:07       ` Peter Maydell
  0 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2013-05-13 11:48 UTC (permalink / raw)
  To: Peter Maydell; +Cc: mst, aik, qemu-devel, agraf, alex.williamson, pbonzini

On Mon, May 13, 2013 at 12:10:10PM +0100, Peter Maydell wrote:
> On 13 May 2013 11:54, David Gibson <david@gibson.dropbear.id.au> wrote:
> > At the moment, most AddressSpace objects last as long as the guest system
> > in practice, but that could well change in future.  In addition, for VFIO
> > we will be introducing some private per-AddressSpace information, which must
> > be disposed of before the AddressSpace itself is destroyed.
> >
> > To reduce the chances of subtle bugs in this area, this patch adds
> > assertions to ensure that when an AddressSpace is destroyed, there are no
> > remaining MemoryListeners using that AS as a filter.
> 
> Hmm, is this the ideal semantics? Typically the owner of the
> MemoryListener isn't the owner of the AddressSpace so it isn't
> necessarily in a position to guarantee that it can unregister
> the listener before the address space is destroyed. In fact
> as the listener API is currently documented, the filter
> argument is just an optimisation to save the callbacks having
> to filter out irrelevant information themselves.

If so, then it's broken by design.  There's no guarantee that after an
AddressSpace is destroyed another one won't be created at the same
address (in fact, depending on your malloc() implementation, it could
be very likely).  So references by pointer to an object *must* be
removed before the object itself is freed.

> Perhaps an "address space being destroyed" callback for
> listeners would be better? Then VFIO could just do the
> necessary disposal operations automatically when the AS
> goes away.

I was originally going to do that instead, but none of the things I
have planned immediately need it, since I think other lifetime
constraints will sort things out (chiefly that PCI devices must be
removed before their host bridge can, which will control the
AddressSpace lifetime).  This patch is so that if I'm wrong about
that, it will show up clearly, at which point we can implement such a
callback.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH 6/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace
  2013-05-13 11:48     ` David Gibson
@ 2013-05-13 12:07       ` Peter Maydell
  2013-05-13 12:19         ` Paolo Bonzini
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2013-05-13 12:07 UTC (permalink / raw)
  To: David Gibson; +Cc: mst, aik, qemu-devel, agraf, alex.williamson, pbonzini

On 13 May 2013 12:48, David Gibson <david@gibson.dropbear.id.au> wrote:
> On Mon, May 13, 2013 at 12:10:10PM +0100, Peter Maydell wrote:
>> Hmm, is this the ideal semantics? Typically the owner of the
>> MemoryListener isn't the owner of the AddressSpace so it isn't
>> necessarily in a position to guarantee that it can unregister
>> the listener before the address space is destroyed. In fact
>> as the listener API is currently documented, the filter
>> argument is just an optimisation to save the callbacks having
>> to filter out irrelevant information themselves.
>
> If so, then it's broken by design.  There's no guarantee that after an
> AddressSpace is destroyed another one won't be created at the same
> address (in fact, depending on your malloc() implementation, it could
> be very likely).  So references by pointer to an object *must* be
> removed before the object itself is freed.

Mmm. Looking through the code it turns out we don't actually
make use of the ability to pass NULL as a filter (except in
target-arm/kvm.c which was just me being lazy and not passing
in the system address space). Perhaps we should just drop that
capability, at which point you have a clearer "you are listening
on one AS and you must make sure you arrange to unregister before
that AS goes away" API definition?

-- PMM

* Re: [Qemu-devel] [PATCH 2/8] pci: Don't del_subgregion on a non subregion
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 2/8] pci: Don't del_subgregion on a non subregion David Gibson
@ 2013-05-13 12:14   ` Paolo Bonzini
  0 siblings, 0 replies; 28+ messages in thread
From: Paolo Bonzini @ 2013-05-13 12:14 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, alex.williamson, mst, qemu-devel, agraf

Il 13/05/2013 12:54, David Gibson ha scritto:
> Currently do_pci_unregister_device() calls memory_region_del_subregion()
> on the device's bus_master_enable_region and the device's iommu region.
> But the bus_master_enable_region is an _alias_ to the iommu region, the
> iommu region is _not_ a subregion of it.  I suspect this has slipped by
> only because PCI hot unplug has not been tested with the new PCI DMA
> address space handling.  This patch removes the bogus call.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/pci/pci.c |    1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 0ba39e6..58d3f69 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -875,7 +875,6 @@ static void do_pci_unregister_device(PCIDevice *pci_dev)
>      pci_config_free(pci_dev);
>  
>      address_space_destroy(&pci_dev->bus_master_as);
> -    memory_region_del_subregion(&pci_dev->bus_master_enable_region, pci_dev->iommu);
>      pci_dev->bus->iommu_dtor_fn(pci_dev->iommu);
>      memory_region_destroy(&pci_dev->bus_master_enable_region);
>  }
> 

Applied to iommu branch, thanks.

Paolo

* Re: [Qemu-devel] [PATCH 1/8] iommu: Fix compile error in ioapic.c
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 1/8] iommu: Fix compile error in ioapic.c David Gibson
@ 2013-05-13 12:14   ` Paolo Bonzini
  0 siblings, 0 replies; 28+ messages in thread
From: Paolo Bonzini @ 2013-05-13 12:14 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, alex.williamson, mst, qemu-devel, agraf

Il 13/05/2013 12:54, David Gibson ha scritto:
> The iommu tree introduced a build bug into hw/i386/kvm/ioapic.c.  Looks
> like a messed up copy and paste.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/i386/kvm/ioapic.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/i386/kvm/ioapic.c b/hw/i386/kvm/ioapic.c
> index e3c29da..fe24d26 100644
> --- a/hw/i386/kvm/ioapic.c
> +++ b/hw/i386/kvm/ioapic.c
> @@ -108,7 +108,7 @@ static void kvm_ioapic_put(IOAPICCommonState *s)
>          abort();
>      }
>  
> -    memory_region_unref(mrs.mr);
> +    memory_region_unref(&s->io_memory);
>  }
>  
>  static void kvm_ioapic_reset(DeviceState *dev)
> 

Squashed, thanks.

Paolo

* Re: [Qemu-devel] [PATCH 3/8] pci: Rework PCI iommu lifetime assumptions
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 3/8] pci: Rework PCI iommu lifetime assumptions David Gibson
@ 2013-05-13 12:15   ` Paolo Bonzini
  0 siblings, 0 replies; 28+ messages in thread
From: Paolo Bonzini @ 2013-05-13 12:15 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, alex.williamson, mst, qemu-devel, agraf

Il 13/05/2013 12:54, David Gibson ha scritto:
> The current bus callbacks to assign and destroy an iommu memory region for
> a PCI device tacitly assume that the lifetime of that address space is
> tied to the lifetime of the PCI device.  Although that's certainly
> possible, it's much more likely that the region will be (at least
> potentially) shared between several devices and have a lifetime instead
> tied to the PCI host bridge, or the IOMMU itself.  This is demonstrated in
> the fact that there are no existing users of the destructor hook.
> 
> This patch simplifies the code by moving to the more likely use model.
> This means we can eliminate the destructor hook entirely, and the iommu_fn
> hook is now assumed to inform us of the device's pre-existing DMA address
> space, rather than creating or assigning it.  That in turn means we have
> no need to keep a reference to the region around in the PCIDevice
> structure, which saves a little space.

Good idea, applying this too.

Paolo

> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/pci/pci.c             |   16 +++++-----------
>  hw/ppc/spapr_pci.c       |    2 +-
>  include/hw/pci/pci.h     |    5 +----
>  include/hw/pci/pci_bus.h |    1 -
>  4 files changed, 7 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 58d3f69..3c947b3 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -285,10 +285,6 @@ static MemoryRegion *pci_default_iommu(PCIBus *bus, void *opaque, int devfn)
>      return get_system_memory();
>  }
>  
> -static void pci_default_iommu_dtor(MemoryRegion *mr)
> -{
> -}
> -
>  static void pci_bus_init(PCIBus *bus, DeviceState *parent,
>                           const char *name,
>                           MemoryRegion *address_space_mem,
> @@ -299,7 +295,7 @@ static void pci_bus_init(PCIBus *bus, DeviceState *parent,
>      bus->devfn_min = devfn_min;
>      bus->address_space_mem = address_space_mem;
>      bus->address_space_io = address_space_io;
> -    pci_setup_iommu(bus, pci_default_iommu, NULL, NULL);
> +    pci_setup_iommu(bus, pci_default_iommu, NULL);
>  
>      /* host bridge */
>      QLIST_INIT(&bus->child);
> @@ -797,6 +793,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
>      PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
>      PCIConfigReadFunc *config_read = pc->config_read;
>      PCIConfigWriteFunc *config_write = pc->config_write;
> +    MemoryRegion *dma_mr;
>  
>      if (devfn < 0) {
>          for(devfn = bus->devfn_min ; devfn < ARRAY_SIZE(bus->devices);
> @@ -814,9 +811,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
>      }
>  
>      pci_dev->bus = bus;
> -    pci_dev->iommu = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
> +    dma_mr = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
>      memory_region_init_alias(&pci_dev->bus_master_enable_region, "bus master",
> -                             pci_dev->iommu, 0, memory_region_size(pci_dev->iommu));
> +                             dma_mr, 0, memory_region_size(dma_mr));
>      memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
>      address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_enable_region,
>                         name);
> @@ -875,7 +872,6 @@ static void do_pci_unregister_device(PCIDevice *pci_dev)
>      pci_config_free(pci_dev);
>  
>      address_space_destroy(&pci_dev->bus_master_as);
> -    pci_dev->bus->iommu_dtor_fn(pci_dev->iommu);
>      memory_region_destroy(&pci_dev->bus_master_enable_region);
>  }
>  
> @@ -2237,11 +2233,9 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
>      k->props = pci_props;
>  }
>  
> -void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, PCIIOMMUDestructorFunc dtor,
> -                     void *opaque)
> +void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
>  {
>      bus->iommu_fn = fn;
> -    bus->iommu_dtor_fn = dtor ? dtor : pci_default_iommu_dtor;
>      bus->iommu_opaque = opaque;
>  }
>  
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 7014b61..eb1d9e7 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -650,7 +650,7 @@ static int spapr_phb_init(SysBusDevice *s)
>          fprintf(stderr, "Unable to create TCE table for %s\n", sphb->dtbusname);
>          return -1;
>      }
> -    pci_setup_iommu(bus, spapr_pci_dma_iommu, NULL, sphb);
> +    pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb);
>  
>      QLIST_INSERT_HEAD(&spapr->phbs, sphb, list);
>  
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 400e772..61fe51e 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -242,7 +242,6 @@ struct PCIDevice {
>      PCIIORegion io_regions[PCI_NUM_REGIONS];
>      AddressSpace bus_master_as;
>      MemoryRegion bus_master_enable_region;
> -    MemoryRegion *iommu;
>  
>      /* do not access the following fields */
>      PCIConfigReadFunc *config_read;
> @@ -402,10 +401,8 @@ int pci_read_devaddr(Monitor *mon, const char *addr, int *domp, int *busp,
>  void pci_device_deassert_intx(PCIDevice *dev);
>  
>  typedef MemoryRegion *(*PCIIOMMUFunc)(PCIBus *, void *, int);
> -typedef void (*PCIIOMMUDestructorFunc)(MemoryRegion *mr);
>  
> -void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, PCIIOMMUDestructorFunc dtor,
> -                     void *opaque);
> +void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>  
>  static inline void
>  pci_set_byte(uint8_t *config, uint8_t val)
> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
> index fada8f5..66762f6 100644
> --- a/include/hw/pci/pci_bus.h
> +++ b/include/hw/pci/pci_bus.h
> @@ -11,7 +11,6 @@
>  struct PCIBus {
>      BusState qbus;
>      PCIIOMMUFunc iommu_fn;
> -    PCIIOMMUDestructorFunc iommu_dtor_fn;
>      void *iommu_opaque;
>      uint8_t devfn_min;
>      pci_set_irq_fn set_irq;
> 

* Re: [Qemu-devel] [PATCH 4/8] pci: Use AddressSpace rather than MemoryRegion to represent PCI DMA space
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 4/8] pci: Use AddressSpace rather than MemoryRegion to represent PCI DMA space David Gibson
@ 2013-05-13 12:16   ` Paolo Bonzini
  0 siblings, 0 replies; 28+ messages in thread
From: Paolo Bonzini @ 2013-05-13 12:16 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, alex.williamson, mst, qemu-devel, agraf

Il 13/05/2013 12:54, David Gibson ha scritto:
> Currently the PCI iommu_fn hook returns a MemoryRegion * to represent the
> DMA address space of this bus's IOMMU, although that MemoryRegion does have to
> be a root MemoryRegion.  Several upcoming users of this need the extra
> features of an AddressSpace object, rather than a MemoryRegion, and while
> they could each construct their own AddressSpace wrapper for the iommu
> MemoryRegion, that leads to unnecessary proliferation of essentially
> identical AddressSpace objects.  This patch avoids that, by instead having
> iommu_fn return an AddressSpace *, assuming the referenced AS's lifetime
> is managed somewhere else (probably the PCI host bridge).

Makes sense too.

Paolo

> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/pci/pci.c                |   10 +++++-----
>  hw/ppc/spapr_pci.c          |    6 ++++--
>  include/hw/pci-host/spapr.h |    1 +
>  include/hw/pci/pci.h        |    2 +-
>  4 files changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 3c947b3..39085d8 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -279,10 +279,10 @@ int pci_find_domain(const PCIBus *bus)
>      return -1;
>  }
>  
> -static MemoryRegion *pci_default_iommu(PCIBus *bus, void *opaque, int devfn)
> +static AddressSpace *pci_default_iommu(PCIBus *bus, void *opaque, int devfn)
>  {
>      /* FIXME: inherit memory region from bus creator */
> -    return get_system_memory();
> +    return &address_space_memory;
>  }
>  
>  static void pci_bus_init(PCIBus *bus, DeviceState *parent,
> @@ -793,7 +793,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
>      PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
>      PCIConfigReadFunc *config_read = pc->config_read;
>      PCIConfigWriteFunc *config_write = pc->config_write;
> -    MemoryRegion *dma_mr;
> +    AddressSpace *dma_as;
>  
>      if (devfn < 0) {
>          for(devfn = bus->devfn_min ; devfn < ARRAY_SIZE(bus->devices);
> @@ -811,9 +811,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
>      }
>  
>      pci_dev->bus = bus;
> -    dma_mr = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
> +    dma_as = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
>      memory_region_init_alias(&pci_dev->bus_master_enable_region, "bus master",
> -                             dma_mr, 0, memory_region_size(dma_mr));
> +                             dma_as->root, 0, memory_region_size(dma_as->root));
>      memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
>      address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_enable_region,
>                         name);
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index eb1d9e7..762db62 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -506,11 +506,11 @@ static const MemoryRegionOps spapr_msi_ops = {
>  /*
>   * PHB PCI device
>   */
> -static MemoryRegion *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> +static AddressSpace *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>  {
>      sPAPRPHBState *phb = opaque;
>  
> -    return spapr_tce_get_iommu(phb->tcet);
> +    return &phb->iommu_as;
>  }
>  
>  static int spapr_phb_init(SysBusDevice *s)
> @@ -650,6 +650,8 @@ static int spapr_phb_init(SysBusDevice *s)
>          fprintf(stderr, "Unable to create TCE table for %s\n", sphb->dtbusname);
>          return -1;
>      }
> +    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
> +                       sphb->dtbusname);
>      pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb);
>  
>      QLIST_INSERT_HEAD(&spapr->phbs, sphb, list);
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 653dd40..1e23dbf 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -50,6 +50,7 @@ typedef struct sPAPRPHBState {
>      uint64_t dma_window_start;
>      uint64_t dma_window_size;
>      sPAPRTCETable *tcet;
> +    AddressSpace iommu_as;
>  
>      struct {
>          uint32_t irq;
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 61fe51e..6ef1f97 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -400,7 +400,7 @@ int pci_read_devaddr(Monitor *mon, const char *addr, int *domp, int *busp,
>  
>  void pci_device_deassert_intx(PCIDevice *dev);
>  
> -typedef MemoryRegion *(*PCIIOMMUFunc)(PCIBus *, void *, int);
> +typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
>  
>  void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>  
> 

* Re: [Qemu-devel] [PATCH 6/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace
  2013-05-13 12:07       ` Peter Maydell
@ 2013-05-13 12:19         ` Paolo Bonzini
  2013-05-13 12:57           ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2013-05-13 12:19 UTC (permalink / raw)
  To: Peter Maydell; +Cc: mst, aik, agraf, qemu-devel, alex.williamson, David Gibson

Il 13/05/2013 14:07, Peter Maydell ha scritto:
> On 13 May 2013 12:48, David Gibson <david@gibson.dropbear.id.au> wrote:
>> On Mon, May 13, 2013 at 12:10:10PM +0100, Peter Maydell wrote:
>>> Hmm, is this the ideal semantics? Typically the owner of the
>>> MemoryListener isn't the owner of the AddressSpace so it isn't
>>> necessarily in a position to guarantee that it can unregister
>>> the listener before the address space is destroyed. In fact
>>> as the listener API is currently documented, the filter
>>> argument is just an optimisation to save the callbacks having
>>> to filter out irrelevant information themselves.
>>
>> If so, then it's broken by design.  There's no guarantee that after an
>> AddressSpace is destroyed another one won't be created at the same
>> address (in fact, depending on your malloc() implementation, it could
>> be very likely).  So references by pointer to an object *must* be
>> removed before the object itself is freed.
> 
> Mmm. Looking through the code it turns out we don't actually
> make use of the ability to pass NULL as a filter (except in
> target-arm/kvm.c which was just me being lazy and not passing
> in the system address space). Perhaps we should just drop that
> capability, at which point you have a clearer "you are listening
> on one AS and you must make sure you arrange to unregister before
> that AS goes away" API definition?

Yes, that could be an idea.

Paolo

* Re: [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited
  2013-05-13 10:54 [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited David Gibson
                   ` (7 preceding siblings ...)
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 8/8] vfio: Create VFIOAddressSpace objects as needed David Gibson
@ 2013-05-13 12:23 ` Paolo Bonzini
  2013-05-13 13:13   ` David Gibson
  8 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2013-05-13 12:23 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, alex.williamson, mst, qemu-devel, agraf

Il 13/05/2013 12:54, David Gibson ha scritto:
> Specifically the way the iommu is
> determined from a callback in the PCIBus means that it won't be
> assigned for devices under a PCI-PCI bridge.

Right.  I saw the report from Alexey, but I am a bit wary of touching it
because it's not a regression.  In fact there is even a FIXME for it:

        /* FIXME: inherit memory region from bus creator */

Perhaps we can make pci_iommu_as a Bus method, where the default
implementation looks up along the chain, and the end of the recursion is
in SysBus or in PCI buses that have set the callback.

Paolo

* Re: [Qemu-devel] [PATCH 6/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace
  2013-05-13 12:19         ` Paolo Bonzini
@ 2013-05-13 12:57           ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2013-05-13 12:57 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Peter Maydell, mst, aik, agraf, qemu-devel, alex.williamson

On Mon, May 13, 2013 at 02:19:50PM +0200, Paolo Bonzini wrote:
> Il 13/05/2013 14:07, Peter Maydell ha scritto:
> > On 13 May 2013 12:48, David Gibson <david@gibson.dropbear.id.au> wrote:
> >> On Mon, May 13, 2013 at 12:10:10PM +0100, Peter Maydell wrote:
> >>> Hmm, is this the ideal semantics? Typically the owner of the
> >>> MemoryListener isn't the owner of the AddressSpace so it isn't
> >>> necessarily in a position to guarantee that it can unregister
> >>> the listener before the address space is destroyed. In fact
> >>> as the listener API is currently documented, the filter
> >>> argument is just an optimisation to save the callbacks having
> >>> to filter out irrelevant information themselves.
> >>
> >> If so, then it's broken by design.  There's no guarantee that after an
> >> AddressSpace is destroyed another one won't be created at the same
> >> address (in fact, depending on your malloc() implementation, it could
> >> be very likely).  So references by pointer to an object *must* be
> >> removed before the object itself is freed.
> > 
> > Mmm. Looking through the code it turns out we don't actually
> > make use of the ability to pass NULL as a filter (except in
> > target-arm/kvm.c which was just me being lazy and not passing
> > in the system address space). Perhaps we should just drop that
> > capability, at which point you have a clearer "you are listening
> > on one AS and you must make sure you arrange to unregister before
> > that AS goes away" API definition?
> 
> Yes, that could be an idea.

Fine by me, that would also naturally lead to making the listener list
per-AS.  Either way, it doesn't alter the fact that pointer references
to the AS need to be removed before the AS is destroyed.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited
  2013-05-13 12:23 ` [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited Paolo Bonzini
@ 2013-05-13 13:13   ` David Gibson
  2013-05-13 13:30     ` Paolo Bonzini
  0 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2013-05-13 13:13 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aik, alex.williamson, mst, qemu-devel, agraf

On Mon, May 13, 2013 at 02:23:30PM +0200, Paolo Bonzini wrote:
> Il 13/05/2013 12:54, David Gibson ha scritto:
> > Specifically the way the iommu is
> > determined from a callback in the PCIBus means that it won't be
> > assigned for devices under a PCI-PCI bridge.
> 
> Right.  I saw the report from Alexey, but I am a bit wary of touching it
> because it's not a regression.  In fact there is even a FIXME for it:
> 
>         /* FIXME: inherit memory region from bus creator */

Uh.. sort of.

> Perhaps we can make pci_iommu_as a Bus method, where the default
> implementation looks up along the chain, and the end of the recursion is
> in SysBus or in PCI buses that have set the callback.

So, this is complicated by the fact that there are two cases, and they
can both be found in existing hardware.

1) One is where devices behind the bridge are not visible /
differentiable to the IOMMU, and so effectively all their DMAs
originate from the bridge device itself.  In this case the correct
thing is to give all devices under the bridge the same DMA
AddressSpace as the bridge device, as suggested by the FIXME.  This
will be typical behaviour for PCI-E to PCI bridges.

2) The other case is where the bridge passes through RIDs, so that the
IOMMU can still differentiate devices behind it.  For this case, we
really want the hook to be in the host bridge / root bus, and it can
make a decision based on the full bus/dev/fn information.  This will
be typical for PCI-E to PCI-E bridges (or switches or nexuses or
whatever they're usually called for PCI-E).  This case will be very
important as we start to model newer PCI-E based machines by default,
where typically *all* devices are behind a logical p2p bridge inside
the root complex (but are still differentiable by the Intel IOMMU
amongst others).


I'm not sure at this stage how to properly handle both cases.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited
  2013-05-13 13:13   ` David Gibson
@ 2013-05-13 13:30     ` Paolo Bonzini
  2013-05-14  2:39       ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2013-05-13 13:30 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, alex.williamson, mst, qemu-devel, agraf

Il 13/05/2013 15:13, David Gibson ha scritto:
> On Mon, May 13, 2013 at 02:23:30PM +0200, Paolo Bonzini wrote:
>> Il 13/05/2013 12:54, David Gibson ha scritto:
>>> Specifically the way the iommu is determined from a callback
>>> in the PCIBus means that it won't be assigned for devices under
>>> a PCI-PCI bridge.
>> 
>> Right.  I saw the report from Alexey, but I am a bit wary of 
>> touching it because it's not a regression.  In fact there is
>> even a FIXME for it:
>> 
>> /* FIXME: inherit memory region from bus creator */
> 
> Uh.. sort of.
> 
>> Perhaps we can make pci_iommu_as a Bus method, where the default
>>  implementation looks up along the chain, and the end of the 
>> recursion is in SysBus or in PCI buses that have set the 
>> callback.
> 
> So, this is complicated by the fact that there are two cases, and 
> they can both be found in existing hardware.
> 
> 1) One is where devices behind the bridge are not visible / 
> differentiable to the IOMMU, and so effectively all their DMAs 
> originate from the bridge device itself.  In this case the correct
>  thing is to give all devices under the bridge the same DMA 
> AddressSpace as the bridge device, as suggested by the FIXME. This
> will be typical behaviour for PCI-E to PCI bridges.
> 
> 2) The other case is where the bridge passes through RIDs, so that 
> the IOMMU can still differentiate devices behind it.  For this 
> case, we really want the hook to be in the host bridge / root bus, 
> and it can make a decision based on the full bus/dev/fn 
> information.  This will be typical for PCI-E to PCI-E bridges (or 
> switches or nexuses or whatever they're usually called for PCI-E). 
> This case will be very important as we start to model newer PCI-E 
> based machines by default, where typically *all* devices are
> behind a logical p2p bridge inside the root complex (but are still 
> differentiable by the Intel IOMMU amongst others).
> 
> I'm not sure at this stage how to properly handle both cases.

Suppose you have a host bridge pci_bus0 and a PCIE->PCIE bridge
pci_bus1.  pci_bus1 does not define an IOMMU callback, pci_bus0 does.

Would it work to use the PCIBus callback provided by pci_bus0, but
invoke it as

    pci_bus0->iommu_fn(pci_bus1, pci_bus0->iommu_opaque, devfn)

?
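Concretely, that dispatch could be sketched as follows.  This is only an illustration of the idea: the structures and field names below are simplified stand-ins, not QEMU's real PCIBus/AddressSpace definitions.

```c
#include <stddef.h>

/* Simplified stand-ins for QEMU's types, for illustration only. */
typedef struct AddressSpace { int dummy; } AddressSpace;

typedef struct PCIBus PCIBus;
typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *bus, void *opaque, int devfn);

struct PCIBus {
    PCIBus *parent;         /* bus above the p2p bridge; NULL at the root */
    PCIIOMMUFunc iommu_fn;  /* set only on buses that define an IOMMU */
    void *iommu_opaque;
};

/*
 * Walk up from the device's bus to the first bus that defines an IOMMU
 * callback (pci_bus0), then invoke that callback with the *original*
 * bus (pci_bus1), i.e.
 *     pci_bus0->iommu_fn(pci_bus1, pci_bus0->iommu_opaque, devfn)
 */
static AddressSpace *pci_iommu_as(PCIBus *bus, int devfn)
{
    PCIBus *iter = bus;

    while (iter && !iter->iommu_fn) {
        iter = iter->parent;
    }
    return iter ? iter->iommu_fn(bus, iter->iommu_opaque, devfn) : NULL;
}
```

The point of the sketch is that the hook sees the leaf bus the device actually sits on, even though the hook itself was registered on an ancestor bus.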

Paolo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH 7/8] vfio: Introduce VFIO address spaces
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 7/8] vfio: Introduce VFIO address spaces David Gibson
@ 2013-05-13 21:32   ` Alex Williamson
  2013-05-14  1:00     ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2013-05-13 21:32 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, pbonzini, qemu-devel, agraf, mst

On Mon, 2013-05-13 at 20:54 +1000, David Gibson wrote:
> The only model so far supported for VFIO passthrough devices is the model
> usually used on x86, where all of the guest's RAM is mapped into the
> (host) IOMMU and there is no IOMMU visible in the guest.
> 
> This patch begins to relax this model, introducing the notion of a
> VFIOAddressSpace.  This represents a logical DMA address space which will
> be visible to one or more VFIO devices by appropriate mapping in the (host)
> IOMMU.  Thus the currently global list of containers becomes local to
> a VFIOAddressSpace, and we verify that we don't attempt to add a VFIO
> group to multiple address spaces.
> 
> For now, only one VFIOAddressSpace is created and used, corresponding to
> main system memory, that will change in future patches.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/misc/vfio.c |   67 ++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 50 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index c4a8853..b1e9220 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -113,9 +113,17 @@ enum {
>      VFIO_INT_MSIX = 3,
>  };
>  
> +typedef struct VFIOAddressSpace {
> +    AddressSpace *as;
> +    QLIST_HEAD(, VFIOContainer) containers;
> +} VFIOAddressSpace;
> +
> +static VFIOAddressSpace vfio_address_space_memory;
> +
>  struct VFIOGroup;
>  
>  typedef struct VFIOContainer {
> +    VFIOAddressSpace *vas;

"space" maybe?

>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      struct {
>          /* enable abstraction to support various iommu backends */
> @@ -178,9 +186,6 @@ typedef struct VFIOGroup {
>  
>  #define MSIX_CAP_LENGTH 12
>  
> -static QLIST_HEAD(, VFIOContainer)
> -    container_list = QLIST_HEAD_INITIALIZER(container_list);
> -
>  static QLIST_HEAD(, VFIOGroup)
>      group_list = QLIST_HEAD_INITIALIZER(group_list);
>  
> @@ -2624,16 +2629,28 @@ static int vfio_load_rom(VFIODevice *vdev)
>      return 0;
>  }
>  
> -static int vfio_connect_container(VFIOGroup *group)
> +static void vfio_address_space_init(VFIOAddressSpace *vas, AddressSpace *as)
> +{
> +    vas->as = as;
> +    QLIST_INIT(&vas->containers);
> +}
> +
> +static int vfio_connect(VFIOGroup *group, VFIOAddressSpace *vas)

Connect what?  Verb, no object.  This is still trying to attach a group
to an existing container or create a new container, the address space is
just a parameter in that matching.  So I'm not sure why this isn't still
vfio_connect_container.

>  {
>      VFIOContainer *container;
>      int ret, fd;
>  
>      if (group->container) {
> -        return 0;
> +        if (group->container->vas == vas) {
> +            return 0;
> +        } else {
> +            error_report("vfio: group %d used in multiple address spaces",
> +                         group->groupid);
> +            return -EBUSY;
> +        }
>      }
>  
> -    QLIST_FOREACH(container, &container_list, next) {
> +    QLIST_FOREACH(container, &vas->containers, next) {
>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>              group->container = container;
>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> @@ -2656,6 +2673,7 @@ static int vfio_connect_container(VFIOGroup *group)
>      }
>  
>      container = g_malloc0(sizeof(*container));
> +    container->vas = vas;
>      container->fd = fd;
>  
>      if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
> @@ -2678,7 +2696,8 @@ static int vfio_connect_container(VFIOGroup *group)
>          container->iommu_data.listener = vfio_memory_listener;
>          container->iommu_data.release = vfio_listener_release;
>  
> -        memory_listener_register(&container->iommu_data.listener, &address_space_memory);
> +        memory_listener_register(&container->iommu_data.listener,
> +                                 container->vas->as);
>      } else {
>          error_report("vfio: No available IOMMU models");
>          g_free(container);
> @@ -2687,7 +2706,7 @@ static int vfio_connect_container(VFIOGroup *group)
>      }
>  
>      QLIST_INIT(&container->group_list);
> -    QLIST_INSERT_HEAD(&container_list, container, next);
> +    QLIST_INSERT_HEAD(&vas->containers, container, next);
>  
>      group->container = container;
>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> @@ -2695,12 +2714,12 @@ static int vfio_connect_container(VFIOGroup *group)
>      return 0;
>  }
>  
> -static void vfio_disconnect_container(VFIOGroup *group)
> +static void vfio_disconnect(VFIOGroup *group)

Same as above.  Thanks,

Alex

>  {
>      VFIOContainer *container = group->container;
>  
>      if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> -        error_report("vfio: error disconnecting group %d from container",
> +        error_report("vfio: error disconnecting group %d from context",
>                       group->groupid);
>      }
>  
> @@ -2712,13 +2731,13 @@ static void vfio_disconnect_container(VFIOGroup *group)
>              container->iommu_data.release(container);
>          }
>          QLIST_REMOVE(container, next);
> -        DPRINTF("vfio_disconnect_container: close container->fd\n");
> +        DPRINTF("vfio_disconnect: close container->fd\n");
>          close(container->fd);
>          g_free(container);
>      }
>  }
>  
> -static VFIOGroup *vfio_get_group(int groupid)
> +static VFIOGroup *vfio_get_group(int groupid, VFIOAddressSpace *vas)
>  {
>      VFIOGroup *group;
>      char path[32];
> @@ -2726,7 +2745,15 @@ static VFIOGroup *vfio_get_group(int groupid)
>  
>      QLIST_FOREACH(group, &group_list, next) {
>          if (group->groupid == groupid) {
> -            return group;
> +            /* Found it.  Now is it already in the right context? */
> +            assert(group->container);
> +            if (group->container->vas == vas) {
> +                return group;
> +            } else {
> +                error_report("vfio: group %d used in multiple address spaces",
> +                             group->groupid);
> +                return NULL;
> +            }
>          }
>      }
>  
> @@ -2759,8 +2786,8 @@ static VFIOGroup *vfio_get_group(int groupid)
>      group->groupid = groupid;
>      QLIST_INIT(&group->device_list);
>  
> -    if (vfio_connect_container(group)) {
> -        error_report("vfio: failed to setup container for group %d", groupid);
> +    if (vfio_connect(group, vas)) {
> +        error_report("vfio: failed to setup context for group %d", groupid);
>          close(group->fd);
>          g_free(group);
>          return NULL;
> @@ -2777,7 +2804,7 @@ static void vfio_put_group(VFIOGroup *group)
>          return;
>      }
>  
> -    vfio_disconnect_container(group);
> +    vfio_disconnect(group);
>      QLIST_REMOVE(group, next);
>      DPRINTF("vfio_put_group: close group->fd\n");
>      close(group->fd);
> @@ -2992,7 +3019,12 @@ static int vfio_initfn(PCIDevice *pdev)
>      DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
>              vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
>  
> -    group = vfio_get_group(groupid);
> +    if (pci_iommu_as(pdev) != &address_space_memory) {
> +        error_report("vfio: DMA address space must be system memory");
> +        return -ENXIO;
> +    }
> +
> +    group = vfio_get_group(groupid, &vfio_address_space_memory);
>      if (!group) {
>          error_report("vfio: failed to get group %d", groupid);
>          return -ENOENT;
> @@ -3212,6 +3244,7 @@ static const TypeInfo vfio_pci_dev_info = {
>  
>  static void register_vfio_pci_dev_type(void)
>  {
> +    vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
>      type_register_static(&vfio_pci_dev_info);
>  }
>  

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH 8/8] vfio: Create VFIOAddressSpace objects as needed
  2013-05-13 10:54 ` [Qemu-devel] [PATCH 8/8] vfio: Create VFIOAddressSpace objects as needed David Gibson
@ 2013-05-13 21:33   ` Alex Williamson
  2013-05-14  1:58     ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2013-05-13 21:33 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, pbonzini, qemu-devel, agraf, mst

On Mon, 2013-05-13 at 20:54 +1000, David Gibson wrote:
> So far, VFIO has a notion of different logical DMA address spaces, but
> only ever uses one (system memory).  This patch extends this, creating
> new VFIOAddressSpace objects as necessary, according to the AddressSpace
> reported by the PCI subsystem for this device's DMAs.
> 
> This isn't enough yet to support guest side IOMMUs with VFIO, but it does
> mean we could now support VFIO devices on, for example, a guest side PCI
> host bridge which maps system memory at somewhere other than 0 in PCI
> space.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/misc/vfio.c |   36 +++++++++++++++++++++++++++++-------
>  1 file changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index b1e9220..3850d39 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -116,9 +116,10 @@ enum {
>  typedef struct VFIOAddressSpace {
>      AddressSpace *as;
>      QLIST_HEAD(, VFIOContainer) containers;
> +    QLIST_ENTRY(VFIOAddressSpace) list;
>  } VFIOAddressSpace;
>  
> -static VFIOAddressSpace vfio_address_space_memory;
> +QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
>  
>  struct VFIOGroup;
>  
> @@ -2635,6 +2636,23 @@ static void vfio_address_space_init(VFIOAddressSpace *vas, AddressSpace *as)
>      QLIST_INIT(&vas->containers);
>  }
>  
> +static VFIOAddressSpace *vfio_address_space_get(AddressSpace *as)

vfio_get_address_space is a better match for the rest of the code.

> +{
> +    VFIOAddressSpace *vas;
> +
> +    QLIST_FOREACH(vas, &vfio_address_spaces, list) {
> +        if (vas->as == as)
> +            return vas;
> +    }
> +
> +    /* No suitable VFIOAddressSpace, create a new one */
> +    vas = g_malloc0(sizeof(*vas));
> +    vfio_address_space_init(vas, as);
> +    QLIST_INSERT_HEAD(&vfio_address_spaces, vas, list);

Do we still need vfio_address_space_init?  Seems like it should be
rolled in here.

> +
> +    return vas;
> +}
> +
>  static int vfio_connect(VFIOGroup *group, VFIOAddressSpace *vas)
>  {
>      VFIOContainer *container;
> @@ -2727,6 +2745,8 @@ static void vfio_disconnect(VFIOGroup *group)
>      group->container = NULL;
>  
>      if (QLIST_EMPTY(&container->group_list)) {
> +        VFIOAddressSpace *vas = container->vas;
> +
>          if (container->iommu_data.release) {
>              container->iommu_data.release(container);
>          }
> @@ -2734,6 +2754,11 @@ static void vfio_disconnect(VFIOGroup *group)
>          DPRINTF("vfio_disconnect: close container->fd\n");
>          close(container->fd);
>          g_free(container);
> +
> +        if (QLIST_EMPTY(&vas->containers)) {
> +            QLIST_REMOVE(vas, list);
> +            g_free(vas);
> +        }

vfio_put_address_space?  Where there's a get...

>      }
>  }
>  
> @@ -2984,6 +3009,7 @@ static int vfio_initfn(PCIDevice *pdev)
>  {
>      VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
>      VFIOGroup *group;
> +    VFIOAddressSpace *vas;
>      char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
>      ssize_t len;
>      struct stat st;
> @@ -3019,12 +3045,9 @@ static int vfio_initfn(PCIDevice *pdev)
>      DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
>              vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
>  
> -    if (pci_iommu_as(pdev) != &address_space_memory) {
> -        error_report("vfio: DMA address space must be system memory");
> -        return -ENXIO;
> -    }
> +    vas = vfio_address_space_get(pci_iommu_as(pdev));

I don't think the structure malloc'd here will get cleaned up in all
cases on error.  Thanks,

Alex

>  
> -    group = vfio_get_group(groupid, &vfio_address_space_memory);
> +    group = vfio_get_group(groupid, vas);
>      if (!group) {
>          error_report("vfio: failed to get group %d", groupid);
>          return -ENOENT;
> @@ -3244,7 +3267,6 @@ static const TypeInfo vfio_pci_dev_info = {
>  
>  static void register_vfio_pci_dev_type(void)
>  {
> -    vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
>      type_register_static(&vfio_pci_dev_info);
>  }
>  

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH 7/8] vfio: Introduce VFIO address spaces
  2013-05-13 21:32   ` Alex Williamson
@ 2013-05-14  1:00     ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2013-05-14  1:00 UTC (permalink / raw)
  To: Alex Williamson; +Cc: aik, pbonzini, qemu-devel, agraf, mst

On Mon, May 13, 2013 at 03:32:59PM -0600, Alex Williamson wrote:
> On Mon, 2013-05-13 at 20:54 +1000, David Gibson wrote:
> > The only model so far supported for VFIO passthrough devices is the model
> > usually used on x86, where all of the guest's RAM is mapped into the
> > (host) IOMMU and there is no IOMMU visible in the guest.
> > 
> > This patch begins to relax this model, introducing the notion of a
> > VFIOAddressSpace.  This represents a logical DMA address space which will
> > be visible to one or more VFIO devices by appropriate mapping in the (host)
> > IOMMU.  Thus the currently global list of containers becomes local to
> > a VFIOAddressSpace, and we verify that we don't attempt to add a VFIO
> > group to multiple address spaces.
> > 
> > For now, only one VFIOAddressSpace is created and used, corresponding to
> > main system memory, that will change in future patches.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/misc/vfio.c |   67 ++++++++++++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 50 insertions(+), 17 deletions(-)
> > 
> > diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> > index c4a8853..b1e9220 100644
> > --- a/hw/misc/vfio.c
> > +++ b/hw/misc/vfio.c
> > @@ -113,9 +113,17 @@ enum {
> >      VFIO_INT_MSIX = 3,
> >  };
> >  
> > +typedef struct VFIOAddressSpace {
> > +    AddressSpace *as;
> > +    QLIST_HEAD(, VFIOContainer) containers;
> > +} VFIOAddressSpace;
> > +
> > +static VFIOAddressSpace vfio_address_space_memory;
> > +
> >  struct VFIOGroup;
> >  
> >  typedef struct VFIOContainer {
> > +    VFIOAddressSpace *vas;
> 
> "space" maybe?

Ok, "space" it is.

> >      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> >      struct {
> >          /* enable abstraction to support various iommu backends */
> > @@ -178,9 +186,6 @@ typedef struct VFIOGroup {
> >  
> >  #define MSIX_CAP_LENGTH 12
> >  
> > -static QLIST_HEAD(, VFIOContainer)
> > -    container_list = QLIST_HEAD_INITIALIZER(container_list);
> > -
> >  static QLIST_HEAD(, VFIOGroup)
> >      group_list = QLIST_HEAD_INITIALIZER(group_list);
> >  
> > @@ -2624,16 +2629,28 @@ static int vfio_load_rom(VFIODevice *vdev)
> >      return 0;
> >  }
> >  
> > -static int vfio_connect_container(VFIOGroup *group)
> > +static void vfio_address_space_init(VFIOAddressSpace *vas, AddressSpace *as)
> > +{
> > +    vas->as = as;
> > +    QLIST_INIT(&vas->containers);
> > +}
> > +
> > +static int vfio_connect(VFIOGroup *group, VFIOAddressSpace *vas)
> 
> Connect what?  Verb, no object.  This is still trying to attach a group
> to an existing container or create a new container, the address space is
> just a parameter in that matching.  So I'm not sure why this isn't still
> vfio_connect_container.

Uh, yeah, fair enough.  I've reverted to connect_container in the
next version.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH 8/8] vfio: Create VFIOAddressSpace objects as needed
  2013-05-13 21:33   ` Alex Williamson
@ 2013-05-14  1:58     ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2013-05-14  1:58 UTC (permalink / raw)
  To: Alex Williamson; +Cc: aik, pbonzini, qemu-devel, agraf, mst

On Mon, May 13, 2013 at 03:33:06PM -0600, Alex Williamson wrote:
> On Mon, 2013-05-13 at 20:54 +1000, David Gibson wrote:
> > So far, VFIO has a notion of different logical DMA address spaces, but
> > only ever uses one (system memory).  This patch extends this, creating
> > new VFIOAddressSpace objects as necessary, according to the AddressSpace
> > reported by the PCI subsystem for this device's DMAs.
> > 
> > This isn't enough yet to support guest side IOMMUs with VFIO, but it does
> > mean we could now support VFIO devices on, for example, a guest side PCI
> > host bridge which maps system memory at somewhere other than 0 in PCI
> > space.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/misc/vfio.c |   36 +++++++++++++++++++++++++++++-------
> >  1 file changed, 29 insertions(+), 7 deletions(-)
> > 
> > diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> > index b1e9220..3850d39 100644
> > --- a/hw/misc/vfio.c
> > +++ b/hw/misc/vfio.c
> > @@ -116,9 +116,10 @@ enum {
> >  typedef struct VFIOAddressSpace {
> >      AddressSpace *as;
> >      QLIST_HEAD(, VFIOContainer) containers;
> > +    QLIST_ENTRY(VFIOAddressSpace) list;
> >  } VFIOAddressSpace;
> >  
> > -static VFIOAddressSpace vfio_address_space_memory;
> > +QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
> >  
> >  struct VFIOGroup;
> >  
> > @@ -2635,6 +2636,23 @@ static void vfio_address_space_init(VFIOAddressSpace *vas, AddressSpace *as)
> >      QLIST_INIT(&vas->containers);
> >  }
> >  
> > +static VFIOAddressSpace *vfio_address_space_get(AddressSpace *as)
> 
> vfio_get_address_space is a better match for the rest of the code.

Ok.

> > +{
> > +    VFIOAddressSpace *vas;
> > +
> > +    QLIST_FOREACH(vas, &vfio_address_spaces, list) {
> > +        if (vas->as == as)
> > +            return vas;
> > +    }
> > +
> > +    /* No suitable VFIOAddressSpace, create a new one */
> > +    vas = g_malloc0(sizeof(*vas));
> > +    vfio_address_space_init(vas, as);
> > +    QLIST_INSERT_HEAD(&vfio_address_spaces, vas, list);
> 
> Do we still need vfio_address_space_init?  Seems like it should be
> rolled in here.

Ah, true.  I had some notions of allowing host bridges to statically
allocate a vfio address space as part of their own structure, but this
current code assumes they are malloc()ed by get_address_space, so yes,
I'll fold that in.
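Folded together with the matching release Alex asks for below, the pair might look roughly like this.  This is a self-contained sketch, not the real patch: a plain singly-linked list stands in for QEMU's QLIST macros, and an integer count stands in for the containers list.

```c
#include <stdlib.h>

/* Simplified stand-ins for QEMU's types, for illustration only. */
typedef struct AddressSpace { int dummy; } AddressSpace;

typedef struct VFIOAddressSpace {
    AddressSpace *as;
    int ncontainers;                 /* stand-in for the containers QLIST */
    struct VFIOAddressSpace *next;
} VFIOAddressSpace;

static VFIOAddressSpace *vfio_address_spaces;

/* Former vfio_address_space_init() folded into the lookup-or-create path. */
static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
{
    VFIOAddressSpace *space;

    for (space = vfio_address_spaces; space; space = space->next) {
        if (space->as == as) {
            return space;
        }
    }

    /* No suitable VFIOAddressSpace, create a new one */
    space = calloc(1, sizeof(*space));
    space->as = as;
    space->next = vfio_address_spaces;
    vfio_address_spaces = space;
    return space;
}

/* Matching release: drop the space once its last container is gone. */
static void vfio_put_address_space(VFIOAddressSpace *space)
{
    VFIOAddressSpace **p;

    if (space->ncontainers) {
        return;
    }
    for (p = &vfio_address_spaces; *p; p = &(*p)->next) {
        if (*p == space) {
            *p = space->next;
            break;
        }
    }
    free(space);
}
```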

> > +
> > +    return vas;
> > +}
> > +
> >  static int vfio_connect(VFIOGroup *group, VFIOAddressSpace *vas)
> >  {
> >      VFIOContainer *container;
> > @@ -2727,6 +2745,8 @@ static void vfio_disconnect(VFIOGroup *group)
> >      group->container = NULL;
> >  
> >      if (QLIST_EMPTY(&container->group_list)) {
> > +        VFIOAddressSpace *vas = container->vas;
> > +
> >          if (container->iommu_data.release) {
> >              container->iommu_data.release(container);
> >          }
> > @@ -2734,6 +2754,11 @@ static void vfio_disconnect(VFIOGroup *group)
> >          DPRINTF("vfio_disconnect: close container->fd\n");
> >          close(container->fd);
> >          g_free(container);
> > +
> > +        if (QLIST_EMPTY(&vas->containers)) {
> > +            QLIST_REMOVE(vas, list);
> > +            g_free(vas);
> > +        }
> 
> vfio_put_address_space?  Where there's a get...

Fair enough, will revise.


> 
> >      }
> >  }
> >  
> > @@ -2984,6 +3009,7 @@ static int vfio_initfn(PCIDevice *pdev)
> >  {
> >      VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> >      VFIOGroup *group;
> > +    VFIOAddressSpace *vas;
> >      char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
> >      ssize_t len;
> >      struct stat st;
> > @@ -3019,12 +3045,9 @@ static int vfio_initfn(PCIDevice *pdev)
> >      DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
> >              vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
> >  
> > -    if (pci_iommu_as(pdev) != &address_space_memory) {
> > -        error_report("vfio: DMA address space must be system memory");
> > -        return -ENXIO;
> > -    }
> > +    vas = vfio_address_space_get(pci_iommu_as(pdev));
> 
> I don't think the structure malloc'd here will get cleaned up in all
> cases on error.  Thanks,

Good point, auditing now..

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited
  2013-05-13 13:30     ` Paolo Bonzini
@ 2013-05-14  2:39       ` David Gibson
  2013-05-14  9:58         ` Paolo Bonzini
  0 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2013-05-14  2:39 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aik, alex.williamson, mst, qemu-devel, agraf

On Mon, May 13, 2013 at 03:30:26PM +0200, Paolo Bonzini wrote:
> 
> Il 13/05/2013 15:13, David Gibson ha scritto:
> > On Mon, May 13, 2013 at 02:23:30PM +0200, Paolo Bonzini wrote:
> >> Il 13/05/2013 12:54, David Gibson ha scritto:
> >>> Specifically the way the iommu is determined from a callback
> >>> in the PCIBus means that it won't be assigned for devices under
> >>> a PCI-PCI bridge.
> >> 
> >> Right.  I saw the report from Alexey, but I am a bit wary of 
> >> touching it because it's not a regression.  In fact there is
> >> even a FIXME for it:
> >> 
> >> /* FIXME: inherit memory region from bus creator */
> > 
> > Uh.. sort of.
> > 
> >> Perhaps we can make pci_iommu_as a Bus method, where the default
> >>  implementation looks up along the chain, and the end of the 
> >> recursion is in SysBus or in PCI buses that have set the 
> >> callback.
> > 
> > So, this is complicated by the fact that there are two cases, and 
> > they can both be found in existing hardware.
> > 
> > 1) One is where devices behind the bridge are not visible / 
> > differentiable to the IOMMU, and so effectively all their DMAs 
> > originate from the bridge device itself.  In this case the correct
> >  thing is to give all devices under the bridge the same DMA 
> > AddressSpace as the bridge device, as suggested by the FIXME. This
> > will be typical behaviour for PCI-E to PCI bridges.
> > 
> > 2) The other case is where the bridge passes through RIDs, so that 
> > the IOMMU can still differentiate devices behind it.  For this 
> > case, we really want the hook to be in the host bridge / root bus, 
> > and it can make a decision based on the full bus/dev/fn 
> > information.  This will be typical for PCI-E to PCI-E bridges (or 
> > switches or nexuses or whatever they're usually called for PCI-E). 
> > This case will be very important as we start to model newer PCI-E 
> > based machines by default, where typically *all* devices are
> > behind a logical p2p bridge inside the root complex (but are still 
> > differentiable by the Intel IOMMU amongst others).
> > 
> > I'm not sure at this stage how to properly handle both cases.
> 
> Suppose you have a host bridge pci_bus0 and a PCIE->PCIE bridge
> pci_bus1.  pci_bus1 does not define a IOMMU callback, pci_bus0 does.
> 
> Would it work to use the PCIBus callback provided by pci_bus0, but
> invoke it as
> 
>     pci_bus0->iommu_fn(pci_bus1, pci_bus0->iommu_opaque, devfn)

Hrm.  I'm a bit nervous about that, because I think that when writing
an iommu_fn it would be very easy to forget that it could be called
with a bus other than the one the hook is attached to - and, for
example, to assume that bus->qbus.parent_dev can be used to get to the
host bridge.
Tbh, my first inclination would just be to move the hook to the host
bridge class (that will make more sense on top of my other pending PCI
patches), passing the bus through.  The host bridge hook would need to
handle any intervening p2p bridges.  That's a bit nasty in theory, but
for the cases I know about it shouldn't be hard because either 1) the
IOMMU exists on PCI-E based systems, so the RIDs are passed through,
or 2) there's only one IOMMU address space per host bridge anyway.

The other approach would be to have both a bus and a host bridge
hook.  The bus hook would be the one used directly.  The default bus
hook would find the host bridge and invoke its hook.  Bridges that
lose RID information, though, can invoke the parent bus hook with the
bridge's details.  The tricky bit is to work out which is the correct
bridge type for each platform/configuration.
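To illustrate what a host-bridge hook would have to do in either approach, here is a sketch of resolving the requester ID across intervening bridges.  All types and fields here are hypothetical, purely to show the walk: a case-1 bridge collapses the requester onto the bridge device itself, a case-2 bridge leaves it untouched.

```c
#include <stddef.h>

/* Hypothetical types, for illustrating the bridge walk only. */
typedef struct PCIBus PCIBus;
struct PCIBus {
    PCIBus *parent;       /* bus above the bridge; NULL at the root bus */
    int bridge_devfn;     /* devfn of the bridge device on the parent bus */
    int passes_rid;       /* case 2 (PCI-E to PCI-E): 1; case 1: 0 */
};

/*
 * Compute the (bus, devfn) the IOMMU at the host bridge actually sees
 * for DMAs from a device at 'devfn' on 'bus'.  A bridge that loses the
 * RID (case 1) makes all DMAs below it appear to come from the bridge
 * itself; RID-passing bridges (case 2) are transparent.
 */
static void pci_requester_id(PCIBus *bus, int devfn,
                             PCIBus **rid_bus, int *rid_devfn)
{
    PCIBus *cur = bus;

    *rid_bus = bus;
    *rid_devfn = devfn;
    while (cur->parent) {
        if (!cur->passes_rid) {
            *rid_bus = cur->parent;
            *rid_devfn = cur->bridge_devfn;
        }
        cur = cur->parent;
    }
}
```

With something like this in the host bridge, the hook can hand out one AddressSpace per surviving requester ID, which covers both the per-bridge aliasing case and the fully differentiated PCI-E case.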

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited
  2013-05-14  2:39       ` David Gibson
@ 2013-05-14  9:58         ` Paolo Bonzini
  2013-05-15  3:55           ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2013-05-14  9:58 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, alex.williamson, mst, qemu-devel, agraf


Il 14/05/2013 04:39, David Gibson ha scritto:
> On Mon, May 13, 2013 at 03:30:26PM +0200, Paolo Bonzini wrote:
>> 
>> Il 13/05/2013 15:13, David Gibson ha scritto:
>>> On Mon, May 13, 2013 at 02:23:30PM +0200, Paolo Bonzini wrote:
>>>> Il 13/05/2013 12:54, David Gibson ha scritto:
>>>>> Specifically the way the iommu is determined from a
>>>>> callback in the PCIBus means that it won't be assigned for
>>>>> devices under a PCI-PCI bridge.
>>>> 
>>>> Right.  I saw the report from Alexey, but I am a bit wary of
>>>>  touching it because it's not a regression.  In fact there
>>>> is even a FIXME for it:
>>>> 
>>>> /* FIXME: inherit memory region from bus creator */
>>> 
>>> Uh.. sort of.
>>> 
>>>> Perhaps we can make pci_iommu_as a Bus method, where the
>>>> default implementation looks up along the chain, and the end
>>>> of the recursion is in SysBus or in PCI buses that have set
>>>> the callback.
>>> 
>>> So, this is complicated by the fact that there are two cases,
>>> and they can both be found in existing hardware.
>>> 
>>> 1) One is where devices behind the bridge are not visible / 
>>> differentiable to the IOMMU, and so effectively all their DMAs
>>>  originate from the bridge device itself.  In this case the
>>> correct thing is to give all devices under the bridge the same
>>> DMA AddressSpace as the bridge device, as suggested by the
>>> FIXME. This will be typical behaviour for PCI-E to PCI
>>> bridges.
>>> 
>>> 2) The other case is where the bridge passes through RIDs, so
>>> that the IOMMU can still differentiate devices behind it.  For
>>> this case, we really want the hook to be in the host bridge /
>>> root bus, and it can make a decision based on the full
>>> bus/dev/fn information.  This will be typical for PCI-E to
>>> PCI-E bridges (or switches or nexuses or whatever they're
>>> usually called for PCI-E). This case will be very important as
>>> we start to model newer PCI-E based machines by default, where
>>> typically *all* devices are behind a logical p2p bridge inside
>>> the root complex (but are still differentiable by the Intel
>>> IOMMU amongst others).
>>> 
>>> I'm not sure at this stage how to properly handle both cases.
>> 
>> Suppose you have a host bridge pci_bus0 and a PCIE->PCIE bridge 
>> pci_bus1.  pci_bus1 does not define a IOMMU callback, pci_bus0
>> does.
>> 
>> Would it work to use the PCIBus callback provided by pci_bus0,
>> but invoke it as
>> 
>> pci_bus0->iommu_fn(pci_bus1, pci_bus0->iommu_opaque, devfn)
> 
> Hrm.  I'm a bit nervous about that, because I think when writing
> an iommu_fn it would be very easy to forget that it could be called
> with a bus other than the one the hook is attached to - and e.g.
> assuming they can use bus->qbus.parent_dev to get to the host
> bridge.

I think we can fix that by removing the opaque, and just passing in
the PCIBus.

Then it's more obvious

   pci_bus0->iommu_fn(pci_bus0, pci_bus1, devfn)

and almost the same, since the host bridge is just a container_of away
from pci_bus0.
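A self-contained sketch of that shape, with the container_of step made explicit.  The container_of macro is the idiom QEMU uses (the real one lives in QEMU's headers); the rest of the types here are simplified stand-ins for illustration.

```c
#include <stddef.h>

/* container_of idiom as used in QEMU; everything else below is a
 * simplified stand-in for illustration. */
#define container_of(ptr, type, member) \
    ((type *)(void *)((char *)(ptr) - offsetof(type, member)))

typedef struct AddressSpace { int dummy; } AddressSpace;

typedef struct PCIBus PCIBus;
typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *root, PCIBus *bus, int devfn);
struct PCIBus {
    PCIBus *parent;
    PCIIOMMUFunc iommu_fn;
};

typedef struct PCIHostBridge {
    AddressSpace iommu_as;
    PCIBus bus;              /* root bus embedded in the host bridge */
} PCIHostBridge;

/* No opaque needed: the root-bus argument identifies the host bridge,
 * while 'bus' still names the bus the device actually sits on. */
static AddressSpace *host_bridge_iommu_fn(PCIBus *root, PCIBus *bus, int devfn)
{
    PCIHostBridge *hb = container_of(root, PCIHostBridge, bus);

    (void)bus; (void)devfn;  /* a real hook would use these to pick an AS */
    return &hb->iommu_as;
}

static AddressSpace *pci_iommu_as(PCIBus *bus, int devfn)
{
    PCIBus *root = bus;

    while (root->parent && !root->iommu_fn) {
        root = root->parent;
    }
    return root->iommu_fn ? root->iommu_fn(root, bus, devfn) : NULL;
}
```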

Paolo


* Re: [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited
  2013-05-14  9:58         ` Paolo Bonzini
@ 2013-05-15  3:55           ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2013-05-15  3:55 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aik, alex.williamson, mst, qemu-devel, agraf


On Tue, May 14, 2013 at 11:58:16AM +0200, Paolo Bonzini wrote:
> 
> Il 14/05/2013 04:39, David Gibson ha scritto:
> > On Mon, May 13, 2013 at 03:30:26PM +0200, Paolo Bonzini wrote:
> >> 
> >> Il 13/05/2013 15:13, David Gibson ha scritto:
> >>> On Mon, May 13, 2013 at 02:23:30PM +0200, Paolo Bonzini wrote:
> >>>> Il 13/05/2013 12:54, David Gibson ha scritto:
> >>>>> Specifically the way the iommu is determined from a
> >>>>> callback in the PCIBus means that it won't be assigned for
> >>>>> devices under a PCI-PCI bridge.
> >>>> 
> >>>> Right.  I saw the report from Alexey, but I am a bit wary of
> >>>>  touching it because it's not a regression.  In fact there
> >>>> is even a FIXME for it:
> >>>> 
> >>>> /* FIXME: inherit memory region from bus creator */
> >>> 
> >>> Uh.. sort of.
> >>> 
> >>>> Perhaps we can make pci_iommu_as a Bus method, where the
> >>>> default implementation looks up along the chain, and the end
> >>>> of the recursion is in SysBus or in PCI buses that have set
> >>>> the callback.
> >>> 
> >>> So, this is complicated by the fact that there are two cases,
> >>> and they can both be found in existing hardware.
> >>> 
> >>> 1) One is where devices behind the bridge are not visible / 
> >>> differentiable to the IOMMU, and so effectively all their DMAs
> >>>  originate from the bridge device itself.  In this case the
> >>> correct thing is to give all devices under the bridge the same
> >>> DMA AddressSpace as the bridge device, as suggested by the
> >>> FIXME. This will be typical behaviour for PCI-E to PCI
> >>> bridges.
> >>> 
> >>> 2) The other case is where the bridge passes through RIDs, so
> >>> that the IOMMU can still differentiate devices behind it.  For
> >>> this case, we really want the hook to be in the host bridge /
> >>> root bus, and it can make a decision based on the full
> >>> bus/dev/fn information.  This will be typical for PCI-E to
> >>> PCI-E bridges (or switches or nexuses or whatever they're
> >>> usually called for PCI-E). This case will be very important as
> >>> we start to model newer PCI-E based machines by default, where
> >>> typically *all* devices are behind a logical p2p bridge inside
> >>> the root complex (but are still differentiable by the Intel
> >>> IOMMU amongst others).
> >>> 
> >>> I'm not sure at this stage how to properly handle both cases.
> >> 
> >> Suppose you have a host bridge pci_bus0 and a PCIE->PCIE bridge 
> >> pci_bus1.  pci_bus1 does not define an IOMMU callback, pci_bus0
> >> does.
> >> 
> >> Would it work to use the PCIBus callback provided by pci_bus0,
> >> but invoke it as
> >> 
> >> pci_bus0->iommu_fn(pci_bus1, pci_bus0->iommu_opaque, devfn)
> > 
> > Hrm.  I'm a bit nervous about that, because I think when writing
> > an iommu_fn it would be very easy to forget that it could be called
> > with a bus other than the one the hook is attached to - and e.g.
> > assuming they can use bus->qbus.parent_dev to get to the host
> > bridge.
> 
> I think we can fix that by removing the opaque, and just passing in
> the PCIBus.
> 
> Then it's more obvious
> 
>    pci_bus0->iommu_fn(pci_bus0, pci_bus1, devfn)

Yeah, that's probably ok, especially if we can think of good names for
the two bus parameters to make the distinction clear.

> and almost the same, since the host bridge is just a container_of away
> from pci_bus0.

Well, bus->qbus.parent_dev and then one of the suitable class wrappers
on container_of().

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



end of thread, other threads:[~2013-05-15  3:55 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-13 10:54 [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited David Gibson
2013-05-13 10:54 ` [Qemu-devel] [PATCH 1/8] iommu: Fix compile error in ioapic.c David Gibson
2013-05-13 12:14   ` Paolo Bonzini
2013-05-13 10:54 ` [Qemu-devel] [PATCH 2/8] pci: Don't del_subgregion on a non subregion David Gibson
2013-05-13 12:14   ` Paolo Bonzini
2013-05-13 10:54 ` [Qemu-devel] [PATCH 3/8] pci: Rework PCI iommu lifetime assumptions David Gibson
2013-05-13 12:15   ` Paolo Bonzini
2013-05-13 10:54 ` [Qemu-devel] [PATCH 4/8] pci: Use AddressSpace rather than MemoryRegion to represent PCI DMA space David Gibson
2013-05-13 12:16   ` Paolo Bonzini
2013-05-13 10:54 ` [Qemu-devel] [PATCH 5/8] pci: Introduce helper to retrieve a PCI device's DMA address space David Gibson
2013-05-13 10:54 ` [Qemu-devel] [PATCH 6/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace David Gibson
2013-05-13 11:10   ` Peter Maydell
2013-05-13 11:48     ` David Gibson
2013-05-13 12:07       ` Peter Maydell
2013-05-13 12:19         ` Paolo Bonzini
2013-05-13 12:57           ` David Gibson
2013-05-13 10:54 ` [Qemu-devel] [PATCH 7/8] vfio: Introduce VFIO address spaces David Gibson
2013-05-13 21:32   ` Alex Williamson
2013-05-14  1:00     ` David Gibson
2013-05-13 10:54 ` [Qemu-devel] [PATCH 8/8] vfio: Create VFIOAddressSpace objects as needed David Gibson
2013-05-13 21:33   ` Alex Williamson
2013-05-14  1:58     ` David Gibson
2013-05-13 12:23 ` [Qemu-devel] [0/8] RFC: VFIO and guest side IOMMUs, revisited Paolo Bonzini
2013-05-13 13:13   ` David Gibson
2013-05-13 13:30     ` Paolo Bonzini
2013-05-14  2:39       ` David Gibson
2013-05-14  9:58         ` Paolo Bonzini
2013-05-15  3:55           ` David Gibson
