* [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
@ 2016-12-19 14:41 Peter Xu
  2016-12-19 16:56 ` Alex Williamson
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Peter Xu @ 2016-12-19 14:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv, david

This is preparation work to finally enable dynamic switching of VT-d
protection ON/OFF. The old VT-d code uses a static IOMMU region, which
won't satisfy the vfio-pci device listeners.

Let me explain.

vfio-pci devices depend on the memory region listener and IOMMU replay
mechanism to make sure the device mapping is coherent with the guest
even if there are domain switches. And there are two kinds of domain
switches:

  (1) switch from domain A -> B
  (2) switch from domain A -> no domain (e.g., turn DMAR off)

Case (1) is handled by the context entry invalidation handling in the
VT-d replay logic. What the replay function does here is replay the
existing page mappings of domain B.

However, for case (2) we don't want to replay any domain mappings - we
just need the default GPA->HPA mappings (the address_space_memory
mapping). This patch handles case (2) by building up that mapping
automatically, leveraging the vfio-pci memory listeners.

Another important thing this patch does is separate IR (Interrupt
Remapping) from DMAR (DMA Remapping). The IR region should not depend
on the DMAR region (as it did before this patch). It should be a
standalone region that can be activated without DMAR (which is common
behavior for the Linux kernel - by default it enables IR while leaving
DMAR disabled).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c         | 75 ++++++++++++++++++++++++++++++++++++++++---
 hw/i386/trace-events          |  3 ++
 include/hw/i386/intel_iommu.h |  2 ++
 3 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5f3e351..75a3f4e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1179,9 +1179,42 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
 }
 
+static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled)
+{
+    GHashTableIter iter;
+    VTDBus *vtd_bus;
+    VTDAddressSpace *as;
+    int i;
+
+    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
+    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
+        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
+            as = vtd_bus->dev_as[i];
+            if (as == NULL) {
+                continue;
+            }
+            trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
+                                           VTD_PCI_SLOT(i), VTD_PCI_FUNC(i),
+                                           enabled);
+            if (enabled) {
+                memory_region_add_subregion_overlap(&as->root, 0,
+                                                    &as->iommu, 2);
+            } else {
+                memory_region_del_subregion(&as->root, &as->iommu);
+            }
+        }
+    }
+}
+
 /* Handle Translation Enable/Disable */
 static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 {
+    bool old = s->dmar_enabled;
+
+    if (old == en) {
+        return;
+    }
+
     VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
 
     if (en) {
@@ -1196,6 +1229,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
         /* Ok - report back to driver */
         vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
     }
+
+    vtd_switch_address_space(s, en);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -2343,15 +2378,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
         vtd_dev_as->devfn = (uint8_t)devfn;
         vtd_dev_as->iommu_state = s;
         vtd_dev_as->context_cache_entry.context_cache_gen = 0;
+
+        /*
+         * When DMAR is disabled, the memory region relationships look
+         * like:
+         *
+         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
+         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
+         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
+         *
+         * When DMAR is enabled, it becomes:
+         *
+         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
+         *  0000000000000000-ffffffffffffffff (prio 2, RW): intel_iommu
+         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
+         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
+         *
+         * The intel_iommu region is dynamically added/removed.
+         */
         memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
                                  &s->iommu_ops, "intel_iommu", UINT64_MAX);
+        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
+                                 "vtd_sys_alias", get_system_memory(),
+                                 0, memory_region_size(get_system_memory()));
         memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
                               &vtd_mem_ir_ops, s, "intel_iommu_ir",
                               VTD_INTERRUPT_ADDR_SIZE);
-        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
-                                    &vtd_dev_as->iommu_ir);
-        address_space_init(&vtd_dev_as->as,
-                           &vtd_dev_as->iommu, "intel_iommu");
+        memory_region_init(&vtd_dev_as->root, OBJECT(s),
+                           "vtd_root", UINT64_MAX);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root,
+                                            VTD_INTERRUPT_ADDR_FIRST,
+                                            &vtd_dev_as->iommu_ir, 64);
+        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                            &vtd_dev_as->sys_alias, 1);
+        if (s->dmar_enabled) {
+            memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                                &vtd_dev_as->iommu, 2);
+        }
+        trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
+                                       VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn),
+                                       s->dmar_enabled);
     }
     return vtd_dev_as;
 }
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index d2b4973..aee93bb 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -10,6 +10,9 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad
 # hw/i386/x86-iommu.c
 x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
 
+# hw/i386/intel_iommu.c
+vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
+
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
 amdvi_cache_update(uint16_t domid, uint8_t bus, uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 405c9d1..85c1b9b 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -83,6 +83,8 @@ struct VTDAddressSpace {
     uint8_t devfn;
     AddressSpace as;
     MemoryRegion iommu;
+    MemoryRegion root;
+    MemoryRegion sys_alias;
     MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-19 14:41 [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
@ 2016-12-19 16:56 ` Alex Williamson
  2016-12-20  3:44   ` Peter Xu
  2016-12-19 23:30 ` David Gibson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2016-12-19 16:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	bd.aviv, david

On Mon, 19 Dec 2016 22:41:26 +0800
Peter Xu <peterx@redhat.com> wrote:

> This is preparation work to finally enable dynamic switching of VT-d
> protection ON/OFF. The old VT-d code uses a static IOMMU region, which
> won't satisfy the vfio-pci device listeners.
> 
> Let me explain.
> 
> vfio-pci devices depend on the memory region listener and IOMMU replay
> mechanism to make sure the device mapping is coherent with the guest
> even if there are domain switches. And there are two kinds of domain
> switches:
> 
>   (1) switch from domain A -> B
>   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> 
> Case (1) is handled by the context entry invalidation handling by the
> VT-d replay logic. What the replay function should do here is to replay
> the existing page mappings in domain B.
> 
> However for case (2), we don't want to replay any domain mappings - we
> just need the default GPA->HPA mappings (the address_space_memory
> mapping). And this patch helps on case (2) to build up the mapping
> automatically by leveraging the vfio-pci memory listeners.
> 
> Another important thing this patch does is separate IR (Interrupt
> Remapping) from DMAR (DMA Remapping). The IR region should not depend
> on the DMAR region (as it did before this patch). It should be a
> standalone region that can be activated without DMAR (which is common
> behavior for the Linux kernel - by default it enables IR while leaving
> DMAR disabled).


This seems like an improvement, but I will note that there are existing
locked memory accounting issues inherent with VT-d and vfio.  With
VT-d, each device has a unique AddressSpace.  This requires that each
is managed via a separate vfio container.  Each container is accounted
for separately for locked pages.  libvirt currently only knows that if
any vfio devices are attached that the locked memory limit for the
process needs to be set sufficient for the VM memory.  When VT-d is
involved, we either need to figure out how to associate otherwise
independent vfio containers to share locked page accounting or teach
libvirt that the locked memory requirement needs to be multiplied by
the number of attached vfio devices.  The latter seems far less
complicated but reduces the containment of QEMU a bit since the
process has the ability to lock potentially many multiples of the VM
address size.  Thanks,

Alex

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c         | 75 ++++++++++++++++++++++++++++++++++++++++---
>  hw/i386/trace-events          |  3 ++
>  include/hw/i386/intel_iommu.h |  2 ++
>  3 files changed, 76 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 5f3e351..75a3f4e 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1179,9 +1179,42 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
>      vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
>  }
>  
> +static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled)
> +{
> +    GHashTableIter iter;
> +    VTDBus *vtd_bus;
> +    VTDAddressSpace *as;
> +    int i;
> +
> +    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> +    while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {
> +        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
> +            as = vtd_bus->dev_as[i];
> +            if (as == NULL) {
> +                continue;
> +            }
> +            trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
> +                                           VTD_PCI_SLOT(i), VTD_PCI_FUNC(i),
> +                                           enabled);
> +            if (enabled) {
> +                memory_region_add_subregion_overlap(&as->root, 0,
> +                                                    &as->iommu, 2);
> +            } else {
> +                memory_region_del_subregion(&as->root, &as->iommu);
> +            }
> +        }
> +    }
> +}
> +
>  /* Handle Translation Enable/Disable */
>  static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>  {
> +    bool old = s->dmar_enabled;
> +
> +    if (old == en) {
> +        return;
> +    }
> +
>      VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
>  
>      if (en) {
> @@ -1196,6 +1229,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>          /* Ok - report back to driver */
>          vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
>      }
> +
> +    vtd_switch_address_space(s, en);
>  }
>  
>  /* Handle Interrupt Remap Enable/Disable */
> @@ -2343,15 +2378,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>          vtd_dev_as->devfn = (uint8_t)devfn;
>          vtd_dev_as->iommu_state = s;
>          vtd_dev_as->context_cache_entry.context_cache_gen = 0;
> +
> +        /*
> +         * When DMAR is disabled, memory region relationships looks
> +         * like:
> +         *
> +         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
> +         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
> +         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
> +         *
> +         * When DMAR is enabled, it becomes:
> +         *
> +         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
> +         *  0000000000000000-ffffffffffffffff (prio 2, RW): intel_iommu
> +         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
> +         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
> +         *
> +         * The intel_iommu region is dynamically added/removed.
> +         */
>          memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
>                                   &s->iommu_ops, "intel_iommu", UINT64_MAX);
> +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> +                                 "vtd_sys_alias", get_system_memory(),
> +                                 0, memory_region_size(get_system_memory()));
>          memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
>                                &vtd_mem_ir_ops, s, "intel_iommu_ir",
>                                VTD_INTERRUPT_ADDR_SIZE);
> -        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
> -                                    &vtd_dev_as->iommu_ir);
> -        address_space_init(&vtd_dev_as->as,
> -                           &vtd_dev_as->iommu, "intel_iommu");
> +        memory_region_init(&vtd_dev_as->root, OBJECT(s),
> +                           "vtd_root", UINT64_MAX);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root,
> +                                            VTD_INTERRUPT_ADDR_FIRST,
> +                                            &vtd_dev_as->iommu_ir, 64);
> +        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                            &vtd_dev_as->sys_alias, 1);
> +        if (s->dmar_enabled) {
> +            memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                                &vtd_dev_as->iommu, 2);
> +        }
> +        trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
> +                                       VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn),
> +                                       s->dmar_enabled);
>      }
>      return vtd_dev_as;
>  }
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index d2b4973..aee93bb 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -10,6 +10,9 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad
>  # hw/i386/x86-iommu.c
>  x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
>  
> +# hw/i386/intel_iommu.c
> +vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
> +
>  # hw/i386/amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
>  amdvi_cache_update(uint16_t domid, uint8_t bus, uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 405c9d1..85c1b9b 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -83,6 +83,8 @@ struct VTDAddressSpace {
>      uint8_t devfn;
>      AddressSpace as;
>      MemoryRegion iommu;
> +    MemoryRegion root;
> +    MemoryRegion sys_alias;
>      MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
>      IntelIOMMUState *iommu_state;
>      VTDContextCacheEntry context_cache_entry;


* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-19 14:41 [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
  2016-12-19 16:56 ` Alex Williamson
@ 2016-12-19 23:30 ` David Gibson
  2016-12-20  4:16   ` Peter Xu
  2016-12-20 23:02 ` no-reply
  2016-12-20 23:57 ` no-reply
  3 siblings, 1 reply; 18+ messages in thread
From: David Gibson @ 2016-12-19 23:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv


On Mon, Dec 19, 2016 at 10:41:26PM +0800, Peter Xu wrote:
> This is preparation work to finally enable dynamic switching of VT-d
> protection ON/OFF. The old VT-d code uses a static IOMMU region, which
> won't satisfy the vfio-pci device listeners.
> 
> Let me explain.
> 
> vfio-pci devices depend on the memory region listener and IOMMU replay
> mechanism to make sure the device mapping is coherent with the guest
> even if there are domain switches. And there are two kinds of domain
> switches:
> 
>   (1) switch from domain A -> B
>   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> 
> Case (1) is handled by the context entry invalidation handling by the
> VT-d replay logic. What the replay function should do here is to replay
> the existing page mappings in domain B.
> 
> However for case (2), we don't want to replay any domain mappings - we
> just need the default GPA->HPA mappings (the address_space_memory
> mapping). And this patch helps on case (2) to build up the mapping
> automatically by leveraging the vfio-pci memory listeners.
> 
> Another important thing this patch does is separate IR (Interrupt
> Remapping) from DMAR (DMA Remapping). The IR region should not depend
> on the DMAR region (as it did before this patch). It should be a
> standalone region that can be activated without DMAR (which is common
> behavior for the Linux kernel - by default it enables IR while leaving
> DMAR disabled).
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c         | 75 ++++++++++++++++++++++++++++++++++++++++---
>  hw/i386/trace-events          |  3 ++
>  include/hw/i386/intel_iommu.h |  2 ++
>  3 files changed, 76 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 5f3e351..75a3f4e 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1179,9 +1179,42 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
>      vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
>  }
>  
> +static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled)
> +{
> +    GHashTableIter iter;
> +    VTDBus *vtd_bus;
> +    VTDAddressSpace *as;
> +    int i;
> +
> +    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> +    while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {
> +        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
> +            as = vtd_bus->dev_as[i];
> +            if (as == NULL) {
> +                continue;
> +            }
> +            trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
> +                                           VTD_PCI_SLOT(i), VTD_PCI_FUNC(i),
> +                                           enabled);
> +            if (enabled) {
> +                memory_region_add_subregion_overlap(&as->root, 0,
> +                                                    &as->iommu, 2);
> +            } else {
> +                memory_region_del_subregion(&as->root, &as->iommu);

Why not use memory_region_set_enabled() rather than actually
adding/deleting the subregion?

> +            }
> +        }
> +    }
> +}
> +
>  /* Handle Translation Enable/Disable */
>  static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>  {
> +    bool old = s->dmar_enabled;
> +
> +    if (old == en) {
> +        return;
> +    }
> +
>      VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
>  
>      if (en) {
> @@ -1196,6 +1229,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>          /* Ok - report back to driver */
>          vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
>      }
> +
> +    vtd_switch_address_space(s, en);
>  }
>  
>  /* Handle Interrupt Remap Enable/Disable */
> @@ -2343,15 +2378,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>          vtd_dev_as->devfn = (uint8_t)devfn;
>          vtd_dev_as->iommu_state = s;
>          vtd_dev_as->context_cache_entry.context_cache_gen = 0;
> +
> +        /*
> +         * When DMAR is disabled, memory region relationships looks
> +         * like:
> +         *
> +         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
> +         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
> +         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
> +         *
> +         * When DMAR is enabled, it becomes:
> +         *
> +         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
> +         *  0000000000000000-ffffffffffffffff (prio 2, RW): intel_iommu
> +         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
> +         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
> +         *
> +         * The intel_iommu region is dynamically added/removed.
> +         */
>          memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
>                                   &s->iommu_ops, "intel_iommu", UINT64_MAX);

I'm almost certain UINT64_MAX is wrong here.  For one thing it would
collide with PCI BARs.  For another, I can't imagine that the IOMMU
page tables can really span an entire 2^64 space.

> +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> +                                 "vtd_sys_alias", get_system_memory(),
> +                                 0, memory_region_size(get_system_memory()));

I strongly suspect using memory_region_size(get_system_memory()) is
also incorrect here.  System memory has size UINT64_MAX, but I'll bet
you you can't actually access all of that via PCI space (again, it
would collide with actual PCI BARs).  I also suspect you can't reach
CPU MMIO regions via the PCI DMA space.

So, I think you should find out what this limit actually is and
restrict the alias to that window.

>          memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
>                                &vtd_mem_ir_ops, s, "intel_iommu_ir",
>                                VTD_INTERRUPT_ADDR_SIZE);
> -        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
> -                                    &vtd_dev_as->iommu_ir);
> -        address_space_init(&vtd_dev_as->as,
> -                           &vtd_dev_as->iommu, "intel_iommu");
> +        memory_region_init(&vtd_dev_as->root, OBJECT(s),
> +                           "vtd_root", UINT64_MAX);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root,
> +                                            VTD_INTERRUPT_ADDR_FIRST,
> +                                            &vtd_dev_as->iommu_ir, 64);
> +        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                            &vtd_dev_as->sys_alias, 1);
> +        if (s->dmar_enabled) {
> +            memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                                &vtd_dev_as->iommu, 2);
> +        }

Hmm.  You have the IOMMU translated region overlaying the
direct-mapped alias.  You enable and disable the IOMMU subregion, but
you always leave the direct mapping enabled.  You might get away with
this because the size of the IOMMU region is UINT64_MAX, which will
overlay everything - but as above, I think that's wrong.  If that's
changed then guest devices may be able to access portions of the raw
address space outside the IOMMU mapped region, which could break the
guest's expectations of device isolation.

I think it would be much safer to disable the system memory alias when
the IOMMU is enabled.

> +        trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
> +                                       VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn),
> +                                       s->dmar_enabled);
>      }
>      return vtd_dev_as;
>  }
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index d2b4973..aee93bb 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -10,6 +10,9 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad
>  # hw/i386/x86-iommu.c
>  x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
>  
> +# hw/i386/intel_iommu.c
> +vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
> +
>  # hw/i386/amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
>  amdvi_cache_update(uint16_t domid, uint8_t bus, uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 405c9d1..85c1b9b 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -83,6 +83,8 @@ struct VTDAddressSpace {
>      uint8_t devfn;
>      AddressSpace as;
>      MemoryRegion iommu;
> +    MemoryRegion root;
> +    MemoryRegion sys_alias;
>      MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
>      IntelIOMMUState *iommu_state;
>      VTDContextCacheEntry context_cache_entry;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-19 16:56 ` Alex Williamson
@ 2016-12-20  3:44   ` Peter Xu
  2016-12-20  4:52     ` Alex Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2016-12-20  3:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	bd.aviv, david

On Mon, Dec 19, 2016 at 09:56:50AM -0700, Alex Williamson wrote:
> On Mon, 19 Dec 2016 22:41:26 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > This is preparation work to finally enable dynamic switching of VT-d
> > protection ON/OFF. The old VT-d code uses a static IOMMU region, which
> > won't satisfy the vfio-pci device listeners.
> > 
> > Let me explain.
> > 
> > vfio-pci devices depend on the memory region listener and IOMMU replay
> > mechanism to make sure the device mapping is coherent with the guest
> > even if there are domain switches. And there are two kinds of domain
> > switches:
> > 
> >   (1) switch from domain A -> B
> >   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> > 
> > Case (1) is handled by the context entry invalidation handling by the
> > VT-d replay logic. What the replay function should do here is to replay
> > the existing page mappings in domain B.
> > 
> > However for case (2), we don't want to replay any domain mappings - we
> > just need the default GPA->HPA mappings (the address_space_memory
> > mapping). And this patch helps on case (2) to build up the mapping
> > automatically by leveraging the vfio-pci memory listeners.
> > 
> > Another important thing this patch does is separate IR (Interrupt
> > Remapping) from DMAR (DMA Remapping). The IR region should not depend
> > on the DMAR region (as it did before this patch). It should be a
> > standalone region that can be activated without DMAR (which is common
> > behavior for the Linux kernel - by default it enables IR while leaving
> > DMAR disabled).
> 
> 
> This seems like an improvement, but I will note that there are existing
> locked memory accounting issues inherent with VT-d and vfio.  With
> VT-d, each device has a unique AddressSpace.  This requires that each
> is managed via a separate vfio container.  Each container is accounted
> for separately for locked pages.  libvirt currently only knows that if
> any vfio devices are attached that the locked memory limit for the
> process needs to be set sufficient for the VM memory.  When VT-d is
> involved, we either need to figure out how to associate otherwise
> independent vfio containers to share locked page accounting or teach
> libvirt that the locked memory requirement needs to be multiplied by
> the number of attached vfio devices.  The latter seems far less
> complicated but reduces the containment of QEMU a bit since the
> process has the ability to lock potentially many multiples of the VM
> address size.  Thanks,

Yes, this patch just tries to move VT-d forward a bit rather than do
it once and for all. I think we can do better than this in the future,
for example, one address space per guest IOMMU domain (as you have
mentioned before). However, I suppose that will need more work (whose
amount I still can't estimate). So I am considering enabling device
assignment functionally first; then we can further improve based on a
workable version. The same thoughts apply to the IOMMU replay RFC
series.

Regarding the locked memory accounting issue: do we have an existing
way to do the accounting? If so, would you (or anyone) please
elaborate a bit? If not, is that ongoing/planned work?

Thanks,

-- peterx


* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-19 23:30 ` David Gibson
@ 2016-12-20  4:16   ` Peter Xu
  2016-12-21  2:53     ` David Gibson
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2016-12-20  4:16 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Tue, Dec 20, 2016 at 10:30:12AM +1100, David Gibson wrote:

[...]

> > +static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled)
> > +{
> > +    GHashTableIter iter;
> > +    VTDBus *vtd_bus;
> > +    VTDAddressSpace *as;
> > +    int i;
> > +
> > +    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> > +    while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {
> > +        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
> > +            as = vtd_bus->dev_as[i];
> > +            if (as == NULL) {
> > +                continue;
> > +            }
> > +            trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
> > +                                           VTD_PCI_SLOT(i), VTD_PCI_FUNC(i),
> > +                                           enabled);
> > +            if (enabled) {
> > +                memory_region_add_subregion_overlap(&as->root, 0,
> > +                                                    &as->iommu, 2);
> > +            } else {
> > +                memory_region_del_subregion(&as->root, &as->iommu);
> 
> Why not use memory_region_set_enabled() rather than actually
> adding/deleting the subregion?

Good idea, thanks. :-)

[...]

> > @@ -2343,15 +2378,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> >          vtd_dev_as->devfn = (uint8_t)devfn;
> >          vtd_dev_as->iommu_state = s;
> >          vtd_dev_as->context_cache_entry.context_cache_gen = 0;
> > +
> > +        /*
> > +         * When DMAR is disabled, memory region relationships looks
> > +         * like:
> > +         *
> > +         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
> > +         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
> > +         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
> > +         *
> > +         * When DMAR is disabled, it becomes:
> > +         *
> > +         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
> > +         *  0000000000000000-ffffffffffffffff (prio 2, RW): intel_iommu
> > +         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
> > +         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
> > +         *
> > +         * The intel_iommu region is dynamically added/removed.
> > +         */
> >          memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
> >                                   &s->iommu_ops, "intel_iommu", UINT64_MAX);
> 
> I'm almost certain UINT64_MAX is wrong here.  For one thing it would
> collide with PCI BARs.  For another, I can't imagine that the IOMMU
> page tables can really span an entire 2^64 space.

Could you explain why the device address space here has anything to do
with PCI BARs? I thought BARs live in the CPU address space only (so
that the CPU can access PCI registers via MMIO), am I wrong?

I think we should have a big enough IOMMU region size here. If a device
writes to an invalid address, IMHO we should trap it and report to the
guest. If the size is smaller than UINT64_MAX, how can we trap this
behavior and report it for the whole address space (it should cover
[0, 2^64-1])?

> 
> > +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> > +                                 "vtd_sys_alias", get_system_memory(),
> > +                                 0, memory_region_size(get_system_memory()));
> 
> I strongly suspect using memory_region_size(get_system_memory()) is
> also incorrect here.  System memory has size UINT64_MAX, but I'll bet
> you you can't actually access all of that via PCI space (again, it
> would collide with actual PCI BARs).  I also suspect you can't reach
> CPU MMIO regions via the PCI DMA space.

Hmm, sounds correct.

However, if so, won't we have the same problem without an IOMMU? See
pci_device_iommu_address_space() - address_space_memory will be the
default if we have no IOMMU protection, and that covers e.g. CPU
MMIO regions as well.

> 
> So, I think you should find out what this limit actually is and
> restrict the alias to that window.

/me needs some more reading to figure this out. Still not quite
familiar with the whole VM memory region model. Hints are welcome...

> 
> >          memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
> >                                &vtd_mem_ir_ops, s, "intel_iommu_ir",
> >                                VTD_INTERRUPT_ADDR_SIZE);
> > -        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
> > -                                    &vtd_dev_as->iommu_ir);
> > -        address_space_init(&vtd_dev_as->as,
> > -                           &vtd_dev_as->iommu, "intel_iommu");
> > +        memory_region_init(&vtd_dev_as->root, OBJECT(s),
> > +                           "vtd_root", UINT64_MAX);
> > +        memory_region_add_subregion_overlap(&vtd_dev_as->root,
> > +                                            VTD_INTERRUPT_ADDR_FIRST,
> > +                                            &vtd_dev_as->iommu_ir, 64);
> > +        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
> > +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> > +                                            &vtd_dev_as->sys_alias, 1);
> > +        if (s->dmar_enabled) {
> > +            memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> > +                                                &vtd_dev_as->iommu, 2);
> > +        }
> 
> Hmm.  You have the IOMMU translated region overlaying the
> direct-mapped alias.  You enable and disable the IOMMU subregion, but
> you always leave the direct mapping enabled.  You might get away with
> this because the size of the IOMMU region is UINT64_MAX, which will
> overlay everything - but as above, I think that's wrong.  If that's
> changed then guest devices may be able to access portions of the raw
> address space outside the IOMMU mapped region, which could break the
> guest's expectations of device isolation.
> 
> I think it would be much safer to disable the system memory alias when
> the IOMMU is enabled.

Reasonable. Will adopt.

Thanks!

-- peterx


* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-20  3:44   ` Peter Xu
@ 2016-12-20  4:52     ` Alex Williamson
  2016-12-20  6:38       ` Peter Xu
  2016-12-21  3:30       ` David Gibson
  0 siblings, 2 replies; 18+ messages in thread
From: Alex Williamson @ 2016-12-20  4:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	bd.aviv, david

On Tue, 20 Dec 2016 11:44:41 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Dec 19, 2016 at 09:56:50AM -0700, Alex Williamson wrote:
> > On Mon, 19 Dec 2016 22:41:26 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > This is preparation work to finally enabled dynamic switching ON/OFF for
> > > VT-d protection. The old VT-d codes is using static IOMMU region, and
> > > that won't satisfy vfio-pci device listeners.
> > > 
> > > Let me explain.
> > > 
> > > vfio-pci devices depend on the memory region listener and IOMMU replay
> > > mechanism to make sure the device mapping is coherent with the guest
> > > even if there are domain switches. And there are two kinds of domain
> > > switches:
> > > 
> > >   (1) switch from domain A -> B
> > >   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> > > 
> > > Case (1) is handled by the context entry invalidation handling by the
> > > VT-d replay logic. What the replay function should do here is to replay
> > > the existing page mappings in domain B.
> > > 
> > > However for case (2), we don't want to replay any domain mappings - we
> > > just need the default GPA->HPA mappings (the address_space_memory
> > > mapping). And this patch helps on case (2) to build up the mapping
> > > automatically by leveraging the vfio-pci memory listeners.
> > > 
> > > Another important thing that this patch does is to seperate
> > > IR (Interrupt Remapping) from DMAR (DMA Remapping). IR region should not
> > > depend on the DMAR region (like before this patch). It should be a
> > > standalone region, and it should be able to be activated without
> > > DMAR (which is a common behavior of Linux kernel - by default it enables
> > > IR while disabled DMAR).  
> > 
> > 
> > This seems like an improvement, but I will note that there are existing
> > locked memory accounting issues inherent with VT-d and vfio.  With
> > VT-d, each device has a unique AddressSpace.  This requires that each
> > is managed via a separate vfio container.  Each container is accounted
> > for separately for locked pages.  libvirt currently only knows that if
> > any vfio devices are attached that the locked memory limit for the
> > process needs to be set sufficient for the VM memory.  When VT-d is
> > involved, we either need to figure out how to associate otherwise
> > independent vfio containers to share locked page accounting or teach
> > libvirt that the locked memory requirement needs to be multiplied by
> > the number of attached vfio devices.  The latter seems far less
> > complicated but reduces the containment of QEMU a bit since the
> > process has the ability to lock potentially many multiples of the VM
> > address size.  Thanks,  
> 
> Yes, this patch just tried to move VT-d forward a bit, rather than do
> it once and for all. I think we can do better than this in the future,
> for example, one address space per guest IOMMU domain (as you have
> mentioned before). However I suppose that will need more work (which I
> still can't estimate on the amount of work). So I am considering to
> enable the device assignments functionally first, then we can further
> improve based on a workable version. Same thoughts apply to the IOMMU
> replay RFC series.

I'm not arguing against it, I'm just trying to set expectations for
where this gets us.  An AddressSpace per guest iommu domain seems like
the right model for QEMU, but it has some fundamental issues with
vfio.  We currently tie a QEMU AddressSpace to a vfio container, which
represents the host IOMMU context.  The AddressSpace of a device is
currently assumed to be fixed in QEMU, guest IOMMU domains clearly
are not.  vfio only lets us have access to a device while it's
protected within a container.  Therefore in order to move a device to a
different AddressSpace based on the guest domain configuration, we'd
need to tear down the vfio configuration, including releasing the
device.
 
> Regarding to the locked memory accounting issue: do we have existing
> way to do the accounting? If so, would you (or anyone) please
> elaborate a bit? If not, is that an ongoing/planned work?

As I describe above, there's a vfio container per AddressSpace, each
container is an IOMMU domain in the host.  In the guest, an IOMMU
domain can include multiple AddressSpaces, one for each context entry
that's part of the domain.  When the guest programs a translation for
an IOMMU domain, that maps a guest IOVA to a guest physical address,
for each AddressSpace.  Each AddressSpace is backed by a vfio
container, which needs to pin the pages of that translation in order to
get a host physical address, which then gets programmed into the host
IOMMU domain with the guest-IOVA and host physical address.  The
pinning process is where page accounting is done.  It's done per vfio
context.  The worst case scenario for accounting is thus when VT-d is
present but disabled (or in passthrough mode) as each AddressSpace
duplicates address_space_memory and every page of guest memory is
pinned and accounted for each vfio container.

That's the existing way we do accounting.  There is no current
development that I'm aware of to change this.  As above, the simplest
stop-gap solution is that libvirt would need to be aware when VT-d is
present for a VM and use a different algorithm to set QEMU locked
memory limit, but it's not without its downsides.  Alternatively, a new
IOMMU model would need to be developed for vfio.  The type1 model was
only ever intended to be used for relatively static user mappings and I
expect it to have horrendous performance when backing a dynamic guest
IOMMU domain.  Really the only guest IOMMU usage model that makes any
sort of sense with type1 is to run the guest with passthrough (iommu=pt)
and only pull devices out of passthrough for relatively static mapping
cases within the guest userspace (nested assigned devices or dpdk).  If
the expectation is that we just need this one little bit more code to
make vfio usable in the guest, that may be true, but it really is just
barely usable.  It's not going to be fast for any sort of dynamic
mapping and it's going to have accounting issues that are not
compatible with how libvirt sets locked memory limits for QEMU as soon
as you go beyond a single device.  Thanks,

Alex


* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-20  4:52     ` Alex Williamson
@ 2016-12-20  6:38       ` Peter Xu
  2016-12-21  0:04         ` Alex Williamson
  2016-12-21  3:30       ` David Gibson
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Xu @ 2016-12-20  6:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	bd.aviv, david

On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote:

[...]

> > Yes, this patch just tried to move VT-d forward a bit, rather than do
> > it once and for all. I think we can do better than this in the future,
> > for example, one address space per guest IOMMU domain (as you have
> > mentioned before). However I suppose that will need more work (which I
> > still can't estimate on the amount of work). So I am considering to
> > enable the device assignments functionally first, then we can further
> > improve based on a workable version. Same thoughts apply to the IOMMU
> > replay RFC series.
> 
> I'm not arguing against it, I'm just trying to set expectations for
> where this gets us.  An AddressSpace per guest iommu domain seems like
> the right model for QEMU, but it has some fundamental issues with
> vfio.  We currently tie a QEMU AddressSpace to a vfio container, which
> represents the host IOMMU context.  The AddressSpace of a device is
> currently assumed to be fixed in QEMU, guest IOMMU domains clearly
> are not.  vfio only let's us have access to a device while it's
> protected within a container.  Therefore in order to move a device to a
> different AddressSpace based on the guest domain configuration, we'd
> need to tear down the vfio configuration, including releasing the
> device.

I assume this is a VT-d specific issue, right? It looks like ppc uses a
totally different way to manage the mapping, and devices can share the
same address space.

>  
> > Regarding to the locked memory accounting issue: do we have existing
> > way to do the accounting? If so, would you (or anyone) please
> > elaborate a bit? If not, is that an ongoing/planned work?
> 
> As I describe above, there's a vfio container per AddressSpace, each
> container is an IOMMU domain in the host.  In the guest, an IOMMU
> domain can include multiple AddressSpaces, one for each context entry
> that's part of the domain.  When the guest programs a translation for
> an IOMMU domain, that maps a guest IOVA to a guest physical address,
> for each AddressSpace.  Each AddressSpace is backed by a vfio
> container, which needs to pin the pages of that translation in order to
> get a host physical address, which then gets programmed into the host
> IOMMU domain with the guest-IOVA and host physical address.  The
> pinning process is where page accounting is done.  It's done per vfio
> context.  The worst case scenario for accounting is thus when VT-d is
> present but disabled (or in passthrough mode) as each AddressSpace
> duplicates address_space_memory and every page of guest memory is
> pinned and accounted for each vfio container.

IIUC this accounting issue will solve itself if we can solve the
previous issue. But we don't have that yet, so ...

> 
> That's the existing way we do accounting.  There is no current
> development that I'm aware of to change this.  As above, the simplest
> stop-gap solution is that libvirt would need to be aware when VT-d is
> present for a VM and use a different algorithm to set QEMU locked
> memory limit, but it's not without its downsides.

... here I think it's sensible to consider a specific algorithm for the
VT-d use case. I am just curious how we should define this algorithm.

First of all, when the devices are not sharing a domain (or say, one
guest IOMMU domain per assigned device), everything should be fine. No
special algorithm is needed. IMHO the problem happens only if there are
assigned devices that share the same address space (either system
memory or a specific IOMMU domain). In that case, the accounted value
(or say, current->mm->locked_vm IIUC) will be bigger than the real
locked memory size.

However, whether devices are put into the same address space depends on
guest behavior - the guest can either use iommu=pt, or manually put
devices into the same guest IOMMU region to achieve that. But from the
hypervisor's POV, how should we estimate this? Can we really?

> Alternatively, a new
> IOMMU model would need to be developed for vfio.  The type1 model was
> only ever intended to be used for relatively static user mappings and I
> expect it to have horrendous performance when backing a dynamic guest
> IOMMU domain.  Really the only guest IOMMU usage model that makes any
> sort of sense with type1 is to run the guest with passthrough (iommu=pt)
> and only pull devices out of passthrough for relatively static mapping
> cases within the guest userspace (nested assigned devices or dpdk).  If
> the expectation is that we just need this one little bit more code to
> make vfio usable in the guest, that may be true, but it really is just
> barely usable.  It's not going to be fast for any sort of dynamic
> mapping and it's going to have accounting issues that are not
> compatible with how libvirt sets locked memory limits for QEMU as soon
> as you go beyond a single device.  Thanks,

I can totally understand that the performance will suck if dynamic
mapping is used. AFAIU this work will only be used with static DMA
mappings, like running DPDK in the guest (besides other trivial goals,
like development).

Regarding "the other" IOMMU model you mentioned besides type1: are
there any existing discussions out there? Any further learning
material/links would be greatly welcome.

Thanks!

-- peterx


* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-19 14:41 [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
  2016-12-19 16:56 ` Alex Williamson
  2016-12-19 23:30 ` David Gibson
@ 2016-12-20 23:02 ` no-reply
  2016-12-21  3:33   ` Peter Xu
  2016-12-20 23:57 ` no-reply
  3 siblings, 1 reply; 18+ messages in thread
From: no-reply @ 2016-12-20 23:02 UTC (permalink / raw)
  To: peterx
  Cc: famz, qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	jasowang, alex.williamson, bd.aviv, david

Hi,

Your series seems to have some coding style problems. See output below for
more information:

Subject: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
Message-id: 1482158486-18597-1-git-send-email-peterx@redhat.com
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
ae738a3 intel_iommu: allow dynamic switch of IOMMU region

=== OUTPUT BEGIN ===
Checking PATCH 1/1: intel_iommu: allow dynamic switch of IOMMU region...
ERROR: "(foo**)" should be "(foo **)"
#55: FILE: hw/i386/intel_iommu.c:1190:
+    while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {

ERROR: space prohibited between function name and open parenthesis '('
#55: FILE: hw/i386/intel_iommu.c:1190:
+    while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {

total: 2 errors, 0 warnings, 118 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1
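Both checkpatch errors point at the same line of the patch; for reference, the style-clean form of that line (fragment only, not compilable on its own) would be:

```c
    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
```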


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org


* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-19 14:41 [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
                   ` (2 preceding siblings ...)
  2016-12-20 23:02 ` no-reply
@ 2016-12-20 23:57 ` no-reply
  2016-12-21  3:39   ` Peter Xu
  3 siblings, 1 reply; 18+ messages in thread
From: no-reply @ 2016-12-20 23:57 UTC (permalink / raw)
  To: peterx
  Cc: famz, qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	jasowang, alex.williamson, bd.aviv, david

Hi,

Your series failed automatic build test. Please find the testing commands and
their output below. If you have docker installed, you can probably reproduce it
locally.

Type: series
Message-id: 1482158486-18597-1-git-send-email-peterx@redhat.com
Subject: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region

=== TEST SCRIPT BEGIN ===
#!/bin/bash
set -e
git submodule update --init dtc
# Let docker tests dump environment info
export SHOW_ENV=1
export J=16
make docker-test-quick@centos6
make docker-test-mingw@fedora
make docker-test-build@min-glib
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
ae738a3 intel_iommu: allow dynamic switch of IOMMU region

=== OUTPUT BEGIN ===
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into 'dtc'...
Submodule path 'dtc': checked out '65cc4d2748a2c2e6f27f1cf39e07a5dbabd80ebf'
  BUILD   centos6
make[1]: Entering directory `/var/tmp/patchew-tester-tmp-0ljnwqpe/src'
  ARCHIVE qemu.tgz
  ARCHIVE dtc.tgz
  COPY    RUNNER
    RUN test-quick in qemu:centos6 
Packages installed:
SDL-devel-1.2.14-7.el6_7.1.x86_64
ccache-3.1.6-2.el6.x86_64
epel-release-6-8.noarch
gcc-4.4.7-17.el6.x86_64
git-1.7.1-4.el6_7.1.x86_64
glib2-devel-2.28.8-5.el6.x86_64
libfdt-devel-1.4.0-1.el6.x86_64
make-3.81-23.el6.x86_64
package g++ is not installed
pixman-devel-0.32.8-1.el6.x86_64
tar-1.23-15.el6_8.x86_64
zlib-devel-1.2.3-29.el6.x86_64

Environment variables:
PACKAGES=libfdt-devel ccache     tar git make gcc g++     zlib-devel glib2-devel SDL-devel pixman-devel     epel-release
HOSTNAME=191db77c270c
TERM=xterm
MAKEFLAGS= -j16
HISTSIZE=1000
J=16
USER=root
CCACHE_DIR=/var/tmp/ccache
EXTRA_CONFIGURE_OPTS=
V=
SHOW_ENV=1
MAIL=/var/spool/mail/root
PATH=/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
LANG=en_US.UTF-8
TARGET_LIST=
HISTCONTROL=ignoredups
SHLVL=1
HOME=/root
TEST_DIR=/tmp/qemu-test
LOGNAME=root
LESSOPEN=||/usr/bin/lesspipe.sh %s
FEATURES= dtc
DEBUG=
G_BROKEN_FILENAMES=1
CCACHE_HASHDIR=
_=/usr/bin/env

Configure options:
--enable-werror --target-list=x86_64-softmmu,aarch64-softmmu --prefix=/var/tmp/qemu-build/install
No C++ compiler available; disabling C++ specific optional code
Install prefix    /var/tmp/qemu-build/install
BIOS directory    /var/tmp/qemu-build/install/share/qemu
binary directory  /var/tmp/qemu-build/install/bin
library directory /var/tmp/qemu-build/install/lib
module directory  /var/tmp/qemu-build/install/lib/qemu
libexec directory /var/tmp/qemu-build/install/libexec
include directory /var/tmp/qemu-build/install/include
config directory  /var/tmp/qemu-build/install/etc
local state directory   /var/tmp/qemu-build/install/var
Manual directory  /var/tmp/qemu-build/install/share/man
ELF interp prefix /usr/gnemul/qemu-%M
Source path       /tmp/qemu-test/src
C compiler        cc
Host C compiler   cc
C++ compiler      
Objective-C compiler cc
ARFLAGS           rv
CFLAGS            -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -g 
QEMU_CFLAGS       -I/usr/include/pixman-1    -pthread -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include   -fPIE -DPIE -m64 -mcx16 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv  -Wendif-labels -Wmissing-include-dirs -Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-declaration -Wold-style-definition -Wtype-limits -fstack-protector-all
LDFLAGS           -Wl,--warn-common -Wl,-z,relro -Wl,-z,now -pie -m64 -g 
make              make
install           install
python            python -B
smbd              /usr/sbin/smbd
module support    no
host CPU          x86_64
host big endian   no
target list       x86_64-softmmu aarch64-softmmu
tcg debug enabled no
gprof enabled     no
sparse enabled    no
strip binaries    yes
profiler          no
static build      no
pixman            system
SDL support       yes (1.2.14)
GTK support       no 
GTK GL support    no
VTE support       no 
TLS priority      NORMAL
GNUTLS support    no
GNUTLS rnd        no
libgcrypt         no
libgcrypt kdf     no
nettle            no 
nettle kdf        no
libtasn1          no
curses support    no
virgl support     no
curl support      no
mingw32 support   no
Audio drivers     oss
Block whitelist (rw) 
Block whitelist (ro) 
VirtFS support    no
VNC support       yes
VNC SASL support  no
VNC JPEG support  no
VNC PNG support   no
xen support       no
brlapi support    no
bluez  support    no
Documentation     no
PIE               yes
vde support       no
netmap support    no
Linux AIO support no
ATTR/XATTR support yes
Install blobs     yes
KVM support       yes
COLO support      yes
RDMA support      no
TCG interpreter   no
fdt support       yes
preadv support    yes
fdatasync         yes
madvise           yes
posix_madvise     yes
libcap-ng support no
vhost-net support yes
vhost-scsi support yes
vhost-vsock support yes
Trace backends    log
spice support     no 
rbd support       no
xfsctl support    no
smartcard support no
libusb            no
usb net redir     no
OpenGL support    no
OpenGL dmabufs    no
libiscsi support  no
libnfs support    no
build guest agent yes
QGA VSS support   no
QGA w32 disk info no
QGA MSI support   no
seccomp support   no
coroutine backend ucontext
coroutine pool    yes
debug stack usage no
GlusterFS support no
Archipelago support no
gcov              gcov
gcov enabled      no
TPM support       yes
libssh2 support   no
TPM passthrough   yes
QOM debugging     yes
lzo support       no
snappy support    no
bzip2 support     no
NUMA host support no
tcmalloc support  no
jemalloc support  no
avx2 optimization no
replication support yes
  GEN     x86_64-softmmu/config-devices.mak.tmp
  GEN     aarch64-softmmu/config-devices.mak.tmp
  GEN     config-host.h
  GEN     qemu-options.def
  GEN     qmp-commands.h
  GEN     qapi-types.h
  GEN     qapi-visit.h
  GEN     qapi-event.h
  GEN     qmp-introspect.h
  GEN     x86_64-softmmu/config-devices.mak
  GEN     aarch64-softmmu/config-devices.mak
  GEN     module_block.h
  GEN     tests/test-qapi-types.h
  GEN     tests/test-qapi-visit.h
  GEN     tests/test-qmp-commands.h
  GEN     tests/test-qapi-event.h
  GEN     tests/test-qmp-introspect.h
  GEN     config-all-devices.mak
  GEN     trace/generated-tracers.h
  GEN     trace/generated-tcg-tracers.h
  GEN     trace/generated-helpers-wrappers.h
  GEN     trace/generated-helpers.h
  CC      tests/qemu-iotests/socket_scm_helper.o
  GEN     qga/qapi-generated/qga-qmp-commands.h
  GEN     qga/qapi-generated/qga-qapi-types.h
  GEN     qga/qapi-generated/qga-qapi-visit.h
  GEN     qga/qapi-generated/qga-qapi-types.c
  GEN     qga/qapi-generated/qga-qapi-visit.c
  GEN     qga/qapi-generated/qga-qmp-marshal.c
  GEN     qmp-introspect.c
  GEN     qapi-types.c
  GEN     qapi-visit.c
  GEN     qapi-event.c
  CC      qapi/qapi-visit-core.o
  CC      qapi/qapi-dealloc-visitor.o
  CC      qapi/qobject-input-visitor.o
  CC      qapi/qobject-output-visitor.o
  CC      qapi/qmp-registry.o
  CC      qapi/qmp-dispatch.o
  CC      qapi/string-input-visitor.o
  CC      qapi/string-output-visitor.o
  CC      qapi/opts-visitor.o
  CC      qapi/qapi-clone-visitor.o
  CC      qapi/qmp-event.o
  CC      qapi/qapi-util.o
  CC      qobject/qnull.o
  CC      qobject/qint.o
  CC      qobject/qstring.o
  CC      qobject/qdict.o
  CC      qobject/qlist.o
  CC      qobject/qfloat.o
  CC      qobject/qbool.o
  CC      qobject/qjson.o
  CC      qobject/qobject.o
  CC      qobject/json-lexer.o
  CC      qobject/json-streamer.o
  CC      qobject/json-parser.o
  GEN     trace/generated-tracers.c
  CC      trace/control.o
  CC      trace/qmp.o
  CC      util/osdep.o
  CC      util/cutils.o
  CC      util/unicode.o
  CC      util/qemu-timer-common.o
  CC      util/bufferiszero.o
  CC      util/compatfd.o
  CC      util/event_notifier-posix.o
  CC      util/mmap-alloc.o
  CC      util/oslib-posix.o
  CC      util/qemu-openpty.o
  CC      util/qemu-thread-posix.o
  CC      util/memfd.o
  CC      util/envlist.o
  CC      util/path.o
  CC      util/module.o
  CC      util/bitmap.o
  CC      util/hbitmap.o
  CC      util/bitops.o
  CC      util/fifo8.o
  CC      util/acl.o
  CC      util/error.o
  CC      util/qemu-error.o
  CC      util/id.o
  CC      util/iov.o
  CC      util/qemu-config.o
  CC      util/qemu-sockets.o
  CC      util/uri.o
  CC      util/notify.o
  CC      util/qemu-option.o
  CC      util/qemu-progress.o
  CC      util/hexdump.o
  CC      util/crc32c.o
  CC      util/uuid.o
  CC      util/throttle.o
  CC      util/getauxval.o
  CC      util/readline.o
  CC      util/rcu.o
  CC      util/qemu-coroutine.o
  CC      util/qemu-coroutine-lock.o
  CC      util/qemu-coroutine-io.o
  CC      util/qemu-coroutine-sleep.o
  CC      util/coroutine-ucontext.o
  CC      util/buffer.o
  CC      util/timed-average.o
  CC      util/base64.o
  CC      util/log.o
  CC      util/qdist.o
  CC      util/qht.o
  CC      util/range.o
  CC      crypto/pbkdf-stub.o
  CC      stubs/arch-query-cpu-def.o
  CC      stubs/arch-query-cpu-model-expansion.o
  CC      stubs/arch-query-cpu-model-comparison.o
  CC      stubs/arch-query-cpu-model-baseline.o
  CC      stubs/bdrv-next-monitor-owned.o
  CC      stubs/blk-commit-all.o
  CC      stubs/blockdev-close-all-bdrv-states.o
  CC      stubs/clock-warp.o
  CC      stubs/cpu-get-clock.o
  CC      stubs/cpu-get-icount.o
  CC      stubs/dump.o
  CC      stubs/error-printf.o
  CC      stubs/fdset-add-fd.o
  CC      stubs/fdset-find-fd.o
  CC      stubs/fdset-get-fd.o
  CC      stubs/fdset-remove-fd.o
  CC      stubs/gdbstub.o
  CC      stubs/get-fd.o
  CC      stubs/get-next-serial.o
  CC      stubs/get-vm-name.o
  CC      stubs/iothread.o
  CC      stubs/iothread-lock.o
  CC      stubs/is-daemonized.o
  CC      stubs/machine-init-done.o
  CC      stubs/migr-blocker.o
  CC      stubs/mon-is-qmp.o
  CC      stubs/monitor-init.o
  CC      stubs/notify-event.o
  CC      stubs/qtest.o
  CC      stubs/replay.o
  CC      stubs/set-fd-handler.o
  CC      stubs/replay-user.o
  CC      stubs/reset.o
  CC      stubs/runstate-check.o
  CC      stubs/sysbus.o
  CC      stubs/slirp.o
  CC      stubs/trace-control.o
  CC      stubs/uuid.o
  CC      stubs/vm-stop.o
  CC      stubs/vmstate.o
  CC      stubs/cpus.o
  CC      stubs/qmp_pc_dimm_device_list.o
  CC      stubs/kvm.o
  CC      stubs/target-get-monitor-def.o
  CC      stubs/target-monitor-defs.o
  CC      stubs/vhost.o
  CC      stubs/iohandler.o
  CC      stubs/smbios_type_38.o
  CC      stubs/ipmi.o
  CC      stubs/pc_madt_cpu_entry.o
  CC      stubs/migration-colo.o
  CC      contrib/ivshmem-client/ivshmem-client.o
  CC      contrib/ivshmem-client/main.o
  CC      contrib/ivshmem-server/ivshmem-server.o
  CC      contrib/ivshmem-server/main.o
  CC      qemu-nbd.o
  CC      async.o
  CC      thread-pool.o
  CC      block.o
  CC      blockjob.o
  CC      iohandler.o
  CC      qemu-timer.o
  CC      main-loop.o
  CC      aio-posix.o
  CC      qemu-io-cmds.o
  CC      replication.o
  CC      block/raw_bsd.o
  CC      block/qcow.o
  CC      block/vdi.o
  CC      block/vmdk.o
  CC      block/cloop.o
  CC      block/bochs.o
  CC      block/vvfat.o
  CC      block/vpc.o
  CC      block/dmg.o
  CC      block/qcow2.o
  CC      block/qcow2-refcount.o
  CC      block/qcow2-cluster.o
  CC      block/qcow2-snapshot.o
  CC      block/qcow2-cache.o
  CC      block/qed.o
  CC      block/qed-l2-cache.o
  CC      block/qed-gencb.o
  CC      block/qed-table.o
  CC      block/qed-cluster.o
  CC      block/qed-check.o
  CC      block/vhdx.o
  CC      block/vhdx-endian.o
  CC      block/vhdx-log.o
  CC      block/quorum.o
  CC      block/parallels.o
  CC      block/blkdebug.o
  CC      block/blkverify.o
  CC      block/blkreplay.o
  CC      block/block-backend.o
  CC      block/snapshot.o
  CC      block/qapi.o
  CC      block/raw-posix.o
  CC      block/null.o
  CC      block/mirror.o
  CC      block/commit.o
  CC      block/io.o
  CC      block/throttle-groups.o
  CC      block/nbd.o
  CC      block/nbd-client.o
  CC      block/sheepdog.o
  CC      block/accounting.o
  CC      block/dirty-bitmap.o
  CC      block/write-threshold.o
  CC      block/backup.o
  CC      block/replication.o
  CC      block/crypto.o
  CC      nbd/server.o
  CC      nbd/client.o
  CC      nbd/common.o
  CC      crypto/hash.o
  CC      crypto/init.o
  CC      crypto/hash-glib.o
  CC      crypto/aes.o
  CC      crypto/desrfb.o
  CC      crypto/cipher.o
  CC      crypto/tlscreds.o
  CC      crypto/tlscredsanon.o
  CC      crypto/tlscredsx509.o
  CC      crypto/tlssession.o
  CC      crypto/secret.o
  CC      crypto/random-platform.o
  CC      crypto/pbkdf.o
  CC      crypto/ivgen.o
  CC      crypto/ivgen-essiv.o
  CC      crypto/ivgen-plain.o
  CC      crypto/ivgen-plain64.o
  CC      crypto/afsplit.o
  CC      crypto/xts.o
  CC      crypto/block.o
  CC      crypto/block-qcow.o
  CC      io/channel.o
  CC      crypto/block-luks.o
  CC      io/channel-buffer.o
  CC      io/channel-command.o
  CC      io/channel-file.o
  CC      io/channel-socket.o
  CC      io/channel-tls.o
  CC      io/channel-watch.o
  CC      io/channel-websock.o
  CC      io/channel-util.o
  CC      io/task.o
  CC      qom/object.o
  CC      qom/container.o
  CC      qom/qom-qobject.o
  CC      qom/object_interfaces.o
  GEN     qemu-img-cmds.h
  CC      qemu-io.o
  CC      qemu-bridge-helper.o
  CC      blockdev.o
  CC      blockdev-nbd.o
  CC      iothread.o
  CC      qdev-monitor.o
  CC      device-hotplug.o
  CC      os-posix.o
  CC      qemu-char.o
  CC      page_cache.o
  CC      accel.o
  CC      bt-host.o
  CC      bt-vhci.o
  CC      dma-helpers.o
  CC      vl.o
  CC      tpm.o
  CC      device_tree.o
  GEN     qmp-marshal.c
  CC      qmp.o
  CC      hmp.o
  CC      cpus-common.o
  CC      audio/audio.o
  CC      audio/noaudio.o
  CC      audio/wavaudio.o
  CC      audio/mixeng.o
  CC      audio/sdlaudio.o
  CC      audio/ossaudio.o
  CC      audio/wavcapture.o
  CC      backends/rng.o
  CC      backends/rng-egd.o
  CC      backends/rng-random.o
  CC      backends/msmouse.o
  CC      backends/testdev.o
  CC      backends/tpm.o
  CC      backends/hostmem.o
  CC      backends/hostmem-ram.o
  CC      backends/hostmem-file.o
  CC      backends/cryptodev.o
  CC      backends/cryptodev-builtin.o
  CC      block/stream.o
  CC      disas/arm.o
  CC      disas/i386.o
  CC      fsdev/qemu-fsdev-dummy.o
  CC      fsdev/qemu-fsdev-opts.o
  CC      hw/acpi/core.o
  CC      hw/acpi/piix4.o
  CC      hw/acpi/pcihp.o
  CC      hw/acpi/ich9.o
  CC      hw/acpi/tco.o
  CC      hw/acpi/cpu_hotplug.o
  CC      hw/acpi/memory_hotplug.o
  CC      hw/acpi/memory_hotplug_acpi_table.o
  CC      hw/acpi/cpu.o
  CC      hw/acpi/nvdimm.o
  CC      hw/acpi/acpi_interface.o
  CC      hw/acpi/bios-linker-loader.o
  CC      hw/acpi/aml-build.o
  CC      hw/acpi/ipmi.o
  CC      hw/audio/sb16.o
  CC      hw/audio/es1370.o
  CC      hw/audio/ac97.o
  CC      hw/audio/fmopl.o
  CC      hw/audio/adlib.o
  CC      hw/audio/gus.o
  CC      hw/audio/gusemu_hal.o
  CC      hw/audio/gusemu_mixer.o
  CC      hw/audio/cs4231a.o
  CC      hw/audio/intel-hda.o
  CC      hw/audio/hda-codec.o
  CC      hw/audio/pcspk.o
  CC      hw/audio/wm8750.o
  CC      hw/audio/pl041.o
  CC      hw/audio/lm4549.o
  CC      hw/audio/marvell_88w8618.o
  CC      hw/block/block.o
  CC      hw/block/cdrom.o
  CC      hw/block/hd-geometry.o
  CC      hw/block/fdc.o
  CC      hw/block/m25p80.o
  CC      hw/block/nand.o
  CC      hw/block/pflash_cfi01.o
  CC      hw/block/pflash_cfi02.o
  CC      hw/block/ecc.o
  CC      hw/block/nvme.o
  CC      hw/block/onenand.o
  CC      hw/bt/core.o
  CC      hw/bt/l2cap.o
  CC      hw/bt/sdp.o
  CC      hw/bt/hci.o
  CC      hw/bt/hid.o
  CC      hw/bt/hci-csr.o
  CC      hw/char/ipoctal232.o
  CC      hw/char/parallel.o
  CC      hw/char/pl011.o
  CC      hw/char/serial.o
  CC      hw/char/serial-isa.o
  CC      hw/char/serial-pci.o
  CC      hw/char/virtio-console.o
  CC      hw/char/cadence_uart.o
  CC      hw/char/debugcon.o
  CC      hw/char/imx_serial.o
  CC      hw/core/qdev-properties.o
  CC      hw/core/qdev.o
  CC      hw/core/bus.o
  CC      hw/core/fw-path-provider.o
  CC      hw/core/irq.o
  CC      hw/core/hotplug.o
  CC      hw/core/ptimer.o
  CC      hw/core/sysbus.o
  CC      hw/core/machine.o
  CC      hw/core/null-machine.o
  CC      hw/core/loader.o
  CC      hw/core/qdev-properties-system.o
  CC      hw/core/register.o
  CC      hw/core/or-irq.o
  CC      hw/core/platform-bus.o
  CC      hw/display/ads7846.o
  CC      hw/display/cirrus_vga.o
  CC      hw/display/pl110.o
  CC      hw/display/ssd0303.o
  CC      hw/display/ssd0323.o
  CC      hw/display/vga-pci.o
  CC      hw/display/vga-isa.o
  CC      hw/display/vmware_vga.o
  CC      hw/display/blizzard.o
  CC      hw/display/exynos4210_fimd.o
  CC      hw/display/framebuffer.o
  CC      hw/display/tc6393xb.o
  CC      hw/dma/pl080.o
  CC      hw/dma/pl330.o
  CC      hw/dma/i8257.o
  CC      hw/dma/xlnx-zynq-devcfg.o
  CC      hw/gpio/max7310.o
  CC      hw/gpio/pl061.o
  CC      hw/gpio/zaurus.o
  CC      hw/gpio/gpio_key.o
  CC      hw/i2c/core.o
  CC      hw/i2c/smbus.o
  CC      hw/i2c/smbus_eeprom.o
  CC      hw/i2c/i2c-ddc.o
  CC      hw/i2c/versatile_i2c.o
  CC      hw/i2c/smbus_ich9.o
  CC      hw/i2c/pm_smbus.o
  CC      hw/i2c/bitbang_i2c.o
  CC      hw/i2c/exynos4210_i2c.o
  CC      hw/i2c/imx_i2c.o
  CC      hw/i2c/aspeed_i2c.o
  CC      hw/ide/core.o
  CC      hw/ide/atapi.o
  CC      hw/ide/qdev.o
  CC      hw/ide/pci.o
  CC      hw/ide/isa.o
  CC      hw/ide/piix.o
  CC      hw/ide/microdrive.o
  CC      hw/ide/ahci.o
  CC      hw/ide/ich.o
  CC      hw/input/hid.o
  CC      hw/input/lm832x.o
  CC      hw/input/pckbd.o
  CC      hw/input/pl050.o
  CC      hw/input/ps2.o
  CC      hw/input/stellaris_input.o
  CC      hw/input/tsc2005.o
  CC      hw/input/vmmouse.o
  CC      hw/input/virtio-input.o
  CC      hw/input/virtio-input-hid.o
  CC      hw/input/virtio-input-host.o
  CC      hw/intc/i8259_common.o
  CC      hw/intc/i8259.o
  CC      hw/intc/pl190.o
  CC      hw/intc/imx_avic.o
  CC      hw/intc/realview_gic.o
  CC      hw/intc/ioapic_common.o
  CC      hw/intc/arm_gic_common.o
  CC      hw/intc/arm_gic.o
  CC      hw/intc/arm_gicv2m.o
  CC      hw/intc/arm_gicv3_common.o
  CC      hw/intc/arm_gicv3.o
  CC      hw/intc/arm_gicv3_dist.o
  CC      hw/intc/arm_gicv3_redist.o
  CC      hw/intc/arm_gicv3_its_common.o
  CC      hw/intc/intc.o
  CC      hw/ipack/ipack.o
  CC      hw/ipack/tpci200.o
  CC      hw/ipmi/ipmi.o
  CC      hw/ipmi/ipmi_bmc_sim.o
  CC      hw/ipmi/ipmi_bmc_extern.o
  CC      hw/ipmi/isa_ipmi_kcs.o
  CC      hw/isa/isa-bus.o
  CC      hw/ipmi/isa_ipmi_bt.o
  CC      hw/isa/apm.o
  CC      hw/mem/pc-dimm.o
  CC      hw/mem/nvdimm.o
  CC      hw/misc/applesmc.o
  CC      hw/misc/max111x.o
  CC      hw/misc/tmp105.o
  CC      hw/misc/debugexit.o
  CC      hw/misc/sga.o
  CC      hw/misc/pc-testdev.o
  CC      hw/misc/pci-testdev.o
  CC      hw/misc/arm_l2x0.o
  CC      hw/misc/arm_integrator_debug.o
  CC      hw/misc/arm11scu.o
  CC      hw/misc/a9scu.o
  CC      hw/net/ne2000.o
  CC      hw/net/eepro100.o
  CC      hw/net/pcnet-pci.o
  CC      hw/net/pcnet.o
  CC      hw/net/e1000.o
  CC      hw/net/e1000x_common.o
  CC      hw/net/net_tx_pkt.o
  CC      hw/net/net_rx_pkt.o
  CC      hw/net/e1000e.o
  CC      hw/net/e1000e_core.o
  CC      hw/net/rtl8139.o
  CC      hw/net/vmxnet3.o
  CC      hw/net/smc91c111.o
  CC      hw/net/lan9118.o
  CC      hw/net/ne2000-isa.o
  CC      hw/net/xgmac.o
  CC      hw/net/allwinner_emac.o
  CC      hw/net/imx_fec.o
  CC      hw/net/cadence_gem.o
  CC      hw/net/stellaris_enet.o
  CC      hw/net/rocker/rocker.o
  CC      hw/net/rocker/rocker_fp.o
  CC      hw/net/rocker/rocker_desc.o
  CC      hw/net/rocker/rocker_world.o
  CC      hw/net/rocker/rocker_of_dpa.o
  CC      hw/nvram/eeprom93xx.o
  CC      hw/nvram/fw_cfg.o
  CC      hw/nvram/chrp_nvram.o
  CC      hw/pci-bridge/pci_bridge_dev.o
  CC      hw/pci-bridge/pci_expander_bridge.o
  CC      hw/pci-bridge/xio3130_upstream.o
  CC      hw/pci-bridge/xio3130_downstream.o
  CC      hw/pci-bridge/ioh3420.o
  CC      hw/pci-bridge/i82801b11.o
  CC      hw/pci-host/pam.o
  CC      hw/pci-host/versatile.o
  CC      hw/pci-host/piix.o
  CC      hw/pci-host/q35.o
  CC      hw/pci-host/gpex.o
  CC      hw/pci/pci.o
  CC      hw/pci/pci_bridge.o
  CC      hw/pci/msix.o
  CC      hw/pci/msi.o
  CC      hw/pci/shpc.o
  CC      hw/pci/slotid_cap.o
  CC      hw/pci/pci_host.o
  CC      hw/pci/pcie_host.o
  CC      hw/pci/pcie.o
  CC      hw/pci/pcie_aer.o
  CC      hw/pci/pcie_port.o
  CC      hw/pci/pci-stub.o
  CC      hw/pcmcia/pcmcia.o
  CC      hw/scsi/scsi-disk.o
  CC      hw/scsi/scsi-generic.o
  CC      hw/scsi/lsi53c895a.o
  CC      hw/scsi/scsi-bus.o
  CC      hw/scsi/mptsas.o
  CC      hw/scsi/mptconfig.o
  CC      hw/scsi/mptendian.o
  CC      hw/scsi/megasas.o
  CC      hw/scsi/vmw_pvscsi.o
  CC      hw/scsi/esp.o
  CC      hw/scsi/esp-pci.o
  CC      hw/sd/pl181.o
  CC      hw/sd/ssi-sd.o
  CC      hw/sd/sd.o
  CC      hw/sd/core.o
  CC      hw/sd/sdhci.o
  CC      hw/smbios/smbios.o
  CC      hw/smbios/smbios_type_38.o
  CC      hw/ssi/pl022.o
  CC      hw/ssi/ssi.o
  CC      hw/ssi/xilinx_spips.o
  CC      hw/ssi/aspeed_smc.o
  CC      hw/ssi/stm32f2xx_spi.o
  CC      hw/timer/arm_mptimer.o
  CC      hw/timer/arm_timer.o
  CC      hw/timer/a9gtimer.o
  CC      hw/timer/cadence_ttc.o
  CC      hw/timer/ds1338.o
  CC      hw/timer/hpet.o
  CC      hw/timer/i8254_common.o
  CC      hw/timer/i8254.o
  CC      hw/timer/pl031.o
  CC      hw/timer/twl92230.o
  CC      hw/timer/imx_epit.o
  CC      hw/timer/imx_gpt.o
  CC      hw/timer/stm32f2xx_timer.o
  CC      hw/timer/aspeed_timer.o
  CC      hw/tpm/tpm_tis.o
  CC      hw/tpm/tpm_passthrough.o
  CC      hw/tpm/tpm_util.o
  CC      hw/usb/core.o
  CC      hw/usb/combined-packet.o
  CC      hw/usb/bus.o
/tmp/qemu-test/src/hw/nvram/fw_cfg.c: In function ‘fw_cfg_dma_transfer’:
/tmp/qemu-test/src/hw/nvram/fw_cfg.c:329: warning: ‘read’ may be used uninitialized in this function
  CC      hw/usb/libhw.o
  CC      hw/usb/desc.o
  CC      hw/usb/desc-msos.o
  CC      hw/usb/hcd-uhci.o
  CC      hw/usb/hcd-ohci.o
  CC      hw/usb/hcd-ehci.o
  CC      hw/usb/hcd-ehci-pci.o
  CC      hw/usb/hcd-ehci-sysbus.o
  CC      hw/usb/hcd-xhci.o
  CC      hw/usb/hcd-musb.o
  CC      hw/usb/dev-hub.o
  CC      hw/usb/dev-hid.o
  CC      hw/usb/dev-wacom.o
  CC      hw/usb/dev-storage.o
  CC      hw/usb/dev-uas.o
  CC      hw/usb/dev-audio.o
  CC      hw/usb/dev-serial.o
  CC      hw/usb/dev-network.o
  CC      hw/usb/dev-bluetooth.o
  CC      hw/usb/dev-smartcard-reader.o
  CC      hw/usb/dev-mtp.o
  CC      hw/usb/host-stub.o
  CC      hw/virtio/virtio-rng.o
  CC      hw/virtio/virtio-pci.o
  CC      hw/virtio/virtio-bus.o
  CC      hw/virtio/virtio-mmio.o
  CC      hw/watchdog/watchdog.o
  CC      hw/watchdog/wdt_i6300esb.o
  CC      hw/watchdog/wdt_ib700.o
  CC      migration/migration.o
  CC      migration/socket.o
  CC      migration/fd.o
  CC      migration/exec.o
  CC      migration/tls.o
  CC      migration/colo-comm.o
  CC      migration/colo.o
  CC      migration/colo-failover.o
  CC      migration/vmstate.o
  CC      migration/qemu-file.o
  CC      migration/qemu-file-channel.o
  CC      migration/xbzrle.o
  CC      migration/postcopy-ram.o
  CC      migration/qjson.o
  CC      migration/block.o
  CC      net/net.o
  CC      net/queue.o
  CC      net/checksum.o
  CC      net/util.o
  CC      net/socket.o
  CC      net/hub.o
  CC      net/dump.o
  CC      net/eth.o
  CC      net/l2tpv3.o
  CC      net/vhost-user.o
  CC      net/tap-linux.o
  CC      net/tap.o
  CC      net/slirp.o
  CC      net/filter.o
  CC      net/filter-buffer.o
  CC      net/filter-mirror.o
  CC      net/colo-compare.o
  CC      net/colo.o
  CC      net/filter-rewriter.o
  CC      replay/replay.o
  CC      qom/cpu.o
  CC      replay/replay-internal.o
  CC      replay/replay-events.o
/tmp/qemu-test/src/replay/replay-internal.c: In function ‘replay_put_array’:
/tmp/qemu-test/src/replay/replay-internal.c:65: warning: ignoring return value of ‘fwrite’, declared with attribute warn_unused_result
  CC      replay/replay-time.o
  CC      replay/replay-input.o
  CC      replay/replay-char.o
  CC      replay/replay-snapshot.o
  CC      slirp/cksum.o
  CC      slirp/if.o
  CC      slirp/ip_icmp.o
  CC      slirp/ip6_icmp.o
  CC      slirp/ip6_input.o
  CC      slirp/ip6_output.o
  CC      slirp/ip_input.o
  CC      slirp/ip_output.o
  CC      slirp/dnssearch.o
  CC      slirp/dhcpv6.o
  CC      slirp/slirp.o
  CC      slirp/mbuf.o
  CC      slirp/misc.o
  CC      slirp/sbuf.o
  CC      slirp/socket.o
  CC      slirp/tcp_input.o
  CC      slirp/tcp_subr.o
  CC      slirp/tcp_output.o
  CC      slirp/tcp_timer.o
  CC      slirp/udp.o
/tmp/qemu-test/src/slirp/tcp_input.c: In function ‘tcp_input’:
/tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_p’ may be used uninitialized in this function
/tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_len’ may be used uninitialized in this function
/tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_tos’ may be used uninitialized in this function
/tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_id’ may be used uninitialized in this function
/tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_off’ may be used uninitialized in this function
/tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_ttl’ may be used uninitialized in this function
/tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_sum’ may be used uninitialized in this function
/tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_src.s_addr’ may be used uninitialized in this function
/tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_dst.s_addr’ may be used uninitialized in this function
/tmp/qemu-test/src/slirp/tcp_input.c:220: warning: ‘save_ip6.ip_nh’ may be used uninitialized in this function
  CC      slirp/udp6.o
  CC      slirp/bootp.o
  CC      slirp/tftp.o
  CC      slirp/arp_table.o
  CC      slirp/ndp_table.o
  CC      ui/keymaps.o
  CC      ui/console.o
  CC      ui/cursor.o
  CC      ui/qemu-pixman.o
  CC      ui/input.o
  CC      ui/input-keymap.o
  CC      ui/input-legacy.o
  CC      ui/input-linux.o
  CC      ui/sdl.o
  CC      ui/sdl_zoom.o
  CC      ui/x_keymap.o
  CC      ui/vnc.o
  CC      ui/vnc-enc-zlib.o
  CC      ui/vnc-enc-hextile.o
  CC      ui/vnc-enc-tight.o
  CC      ui/vnc-palette.o
  CC      ui/vnc-enc-zrle.o
  CC      ui/vnc-auth-vencrypt.o
  CC      ui/vnc-ws.o
  CC      ui/vnc-jobs.o
  LINK    tests/qemu-iotests/socket_scm_helper
  CC      qga/commands.o
  CC      qga/guest-agent-command-state.o
  CC      qga/main.o
  CC      qga/commands-posix.o
  CC      qga/channel-posix.o
  CC      qga/qapi-generated/qga-qapi-types.o
  CC      qga/qapi-generated/qga-qapi-visit.o
  CC      qga/qapi-generated/qga-qmp-marshal.o
  CC      qmp-introspect.o
  CC      qapi-types.o
  CC      qapi-visit.o
  CC      qapi-event.o
  AS      optionrom/multiboot.o
  AS      optionrom/linuxboot.o
  CC      optionrom/linuxboot_dma.o
  AS      optionrom/kvmvapic.o
cc: unrecognized option '-no-integrated-as'
cc: unrecognized option '-no-integrated-as'
  BUILD   optionrom/multiboot.img
  AR      libqemustub.a
  CC      qemu-img.o
  BUILD   optionrom/linuxboot.img
  CC      qmp-marshal.o
  BUILD   optionrom/linuxboot_dma.img
  CC      trace/generated-tracers.o
  BUILD   optionrom/multiboot.raw
  BUILD   optionrom/linuxboot_dma.raw
  BUILD   optionrom/linuxboot.raw
  BUILD   optionrom/kvmvapic.img
  SIGN    optionrom/multiboot.bin
  SIGN    optionrom/linuxboot.bin
  SIGN    optionrom/linuxboot_dma.bin
  BUILD   optionrom/kvmvapic.raw
  SIGN    optionrom/kvmvapic.bin
  AR      libqemuutil.a
  LINK    qemu-ga
  LINK    ivshmem-client
  LINK    ivshmem-server
  LINK    qemu-nbd
  LINK    qemu-img
  LINK    qemu-io
  LINK    qemu-bridge-helper
  GEN     x86_64-softmmu/hmp-commands.h
  GEN     x86_64-softmmu/hmp-commands-info.h
  GEN     x86_64-softmmu/config-target.h
  GEN     aarch64-softmmu/config-target.h
  GEN     aarch64-softmmu/hmp-commands.h
  GEN     aarch64-softmmu/hmp-commands-info.h
  CC      x86_64-softmmu/cpu-exec.o
  CC      x86_64-softmmu/exec.o
  CC      x86_64-softmmu/translate-all.o
  CC      x86_64-softmmu/translate-common.o
  CC      x86_64-softmmu/cpu-exec-common.o
  CC      x86_64-softmmu/tcg/tcg.o
  CC      x86_64-softmmu/tcg/tcg-op.o
  CC      x86_64-softmmu/tcg/optimize.o
  CC      x86_64-softmmu/tcg/tcg-common.o
  CC      x86_64-softmmu/fpu/softfloat.o
  CC      x86_64-softmmu/disas.o
  CC      x86_64-softmmu/tcg-runtime.o
  CC      x86_64-softmmu/arch_init.o
  CC      x86_64-softmmu/cpus.o
  CC      x86_64-softmmu/monitor.o
  CC      x86_64-softmmu/gdbstub.o
  CC      x86_64-softmmu/balloon.o
  CC      x86_64-softmmu/ioport.o
  CC      x86_64-softmmu/numa.o
  CC      x86_64-softmmu/qtest.o
  CC      x86_64-softmmu/bootdevice.o
  CC      aarch64-softmmu/translate-all.o
  CC      aarch64-softmmu/exec.o
  CC      x86_64-softmmu/kvm-all.o
  CC      x86_64-softmmu/memory.o
  CC      x86_64-softmmu/cputlb.o
  CC      aarch64-softmmu/cpu-exec.o
  CC      aarch64-softmmu/translate-common.o
  CC      x86_64-softmmu/memory_mapping.o
  CC      x86_64-softmmu/dump.o
  CC      x86_64-softmmu/migration/ram.o
  CC      aarch64-softmmu/cpu-exec-common.o
  CC      aarch64-softmmu/tcg/tcg.o
  CC      aarch64-softmmu/tcg/tcg-op.o
  CC      aarch64-softmmu/tcg/optimize.o
  CC      x86_64-softmmu/migration/savevm.o
  CC      aarch64-softmmu/tcg/tcg-common.o
  CC      aarch64-softmmu/fpu/softfloat.o
  CC      x86_64-softmmu/xen-common-stub.o
  CC      aarch64-softmmu/disas.o
  CC      aarch64-softmmu/tcg-runtime.o
  CC      x86_64-softmmu/xen-hvm-stub.o
  GEN     aarch64-softmmu/gdbstub-xml.c
  CC      aarch64-softmmu/kvm-stub.o
  CC      x86_64-softmmu/hw/block/virtio-blk.o
  CC      aarch64-softmmu/arch_init.o
  CC      aarch64-softmmu/cpus.o
  CC      aarch64-softmmu/monitor.o
  CC      aarch64-softmmu/gdbstub.o
  CC      aarch64-softmmu/balloon.o
  CC      aarch64-softmmu/ioport.o
  CC      x86_64-softmmu/hw/block/dataplane/virtio-blk.o
  CC      aarch64-softmmu/numa.o
  CC      aarch64-softmmu/qtest.o
  CC      aarch64-softmmu/bootdevice.o
  CC      aarch64-softmmu/memory.o
  CC      aarch64-softmmu/cputlb.o
  CC      x86_64-softmmu/hw/char/virtio-serial-bus.o
  CC      x86_64-softmmu/hw/core/nmi.o
  CC      x86_64-softmmu/hw/core/generic-loader.o
  CC      x86_64-softmmu/hw/cpu/core.o
  CC      x86_64-softmmu/hw/display/vga.o
  CC      aarch64-softmmu/memory_mapping.o
  CC      aarch64-softmmu/dump.o
  CC      x86_64-softmmu/hw/display/virtio-gpu.o
  CC      aarch64-softmmu/migration/ram.o
  CC      aarch64-softmmu/migration/savevm.o
  CC      aarch64-softmmu/xen-common-stub.o
  CC      x86_64-softmmu/hw/display/virtio-gpu-3d.o
  CC      x86_64-softmmu/hw/display/virtio-gpu-pci.o
  CC      aarch64-softmmu/xen-hvm-stub.o
  CC      x86_64-softmmu/hw/display/virtio-vga.o
  CC      aarch64-softmmu/hw/adc/stm32f2xx_adc.o
  CC      aarch64-softmmu/hw/block/virtio-blk.o
  CC      aarch64-softmmu/hw/block/dataplane/virtio-blk.o
  CC      x86_64-softmmu/hw/intc/apic.o
  CC      aarch64-softmmu/hw/char/exynos4210_uart.o
  CC      aarch64-softmmu/hw/char/omap_uart.o
  CC      aarch64-softmmu/hw/char/digic-uart.o
  CC      aarch64-softmmu/hw/char/stm32f2xx_usart.o
  CC      aarch64-softmmu/hw/char/bcm2835_aux.o
  CC      aarch64-softmmu/hw/char/virtio-serial-bus.o
  CC      x86_64-softmmu/hw/intc/apic_common.o
  CC      x86_64-softmmu/hw/intc/ioapic.o
  CC      x86_64-softmmu/hw/isa/lpc_ich9.o
  CC      aarch64-softmmu/hw/core/nmi.o
  CC      aarch64-softmmu/hw/core/generic-loader.o
  CC      aarch64-softmmu/hw/cpu/arm11mpcore.o
  CC      aarch64-softmmu/hw/cpu/realview_mpcore.o
  CC      x86_64-softmmu/hw/misc/vmport.o
  CC      aarch64-softmmu/hw/cpu/a9mpcore.o
  CC      x86_64-softmmu/hw/misc/ivshmem.o
  CC      aarch64-softmmu/hw/cpu/a15mpcore.o
  CC      x86_64-softmmu/hw/misc/pvpanic.o
  CC      x86_64-softmmu/hw/misc/edu.o
  CC      x86_64-softmmu/hw/misc/hyperv_testdev.o
  CC      aarch64-softmmu/hw/cpu/core.o
  CC      x86_64-softmmu/hw/net/virtio-net.o
  CC      x86_64-softmmu/hw/net/vhost_net.o
  CC      aarch64-softmmu/hw/display/omap_dss.o
  CC      x86_64-softmmu/hw/scsi/virtio-scsi.o
  CC      x86_64-softmmu/hw/scsi/virtio-scsi-dataplane.o
  CC      x86_64-softmmu/hw/scsi/vhost-scsi.o
  CC      x86_64-softmmu/hw/timer/mc146818rtc.o
  CC      aarch64-softmmu/hw/display/omap_lcdc.o
  CC      x86_64-softmmu/hw/vfio/common.o
  CC      x86_64-softmmu/hw/vfio/pci.o
  CC      x86_64-softmmu/hw/vfio/pci-quirks.o
  CC      aarch64-softmmu/hw/display/pxa2xx_lcd.o
  CC      x86_64-softmmu/hw/vfio/platform.o
  CC      aarch64-softmmu/hw/display/bcm2835_fb.o
  CC      aarch64-softmmu/hw/display/vga.o
  CC      aarch64-softmmu/hw/display/virtio-gpu.o
  CC      aarch64-softmmu/hw/display/virtio-gpu-3d.o
  CC      aarch64-softmmu/hw/display/virtio-gpu-pci.o
  CC      aarch64-softmmu/hw/display/dpcd.o
  CC      aarch64-softmmu/hw/display/xlnx_dp.o
  CC      aarch64-softmmu/hw/dma/xlnx_dpdma.o
  CC      aarch64-softmmu/hw/dma/omap_dma.o
  CC      aarch64-softmmu/hw/dma/soc_dma.o
  CC      aarch64-softmmu/hw/dma/pxa2xx_dma.o
  CC      aarch64-softmmu/hw/dma/bcm2835_dma.o
  CC      x86_64-softmmu/hw/vfio/calxeda-xgmac.o
  CC      x86_64-softmmu/hw/vfio/amd-xgbe.o
  CC      aarch64-softmmu/hw/gpio/omap_gpio.o
  CC      aarch64-softmmu/hw/gpio/imx_gpio.o
  CC      aarch64-softmmu/hw/i2c/omap_i2c.o
  CC      aarch64-softmmu/hw/input/pxa2xx_keypad.o
  CC      x86_64-softmmu/hw/vfio/spapr.o
  CC      x86_64-softmmu/hw/virtio/virtio.o
  CC      x86_64-softmmu/hw/virtio/virtio-balloon.o
  CC      x86_64-softmmu/hw/virtio/vhost.o
  CC      aarch64-softmmu/hw/input/tsc210x.o
  CC      x86_64-softmmu/hw/virtio/vhost-backend.o
  CC      aarch64-softmmu/hw/intc/armv7m_nvic.o
  CC      x86_64-softmmu/hw/virtio/vhost-user.o
  CC      aarch64-softmmu/hw/intc/exynos4210_gic.o
  CC      aarch64-softmmu/hw/intc/exynos4210_combiner.o
  CC      x86_64-softmmu/hw/virtio/vhost-vsock.o
  CC      aarch64-softmmu/hw/intc/omap_intc.o
  CC      x86_64-softmmu/hw/virtio/virtio-crypto.o
  CC      aarch64-softmmu/hw/intc/bcm2835_ic.o
  CC      aarch64-softmmu/hw/intc/bcm2836_control.o
  CC      x86_64-softmmu/hw/virtio/virtio-crypto-pci.o
  CC      aarch64-softmmu/hw/intc/allwinner-a10-pic.o
  CC      x86_64-softmmu/hw/i386/multiboot.o
  CC      x86_64-softmmu/hw/i386/pc_piix.o
  CC      x86_64-softmmu/hw/i386/pc.o
  CC      aarch64-softmmu/hw/intc/aspeed_vic.o
  CC      x86_64-softmmu/hw/i386/pc_q35.o
  CC      x86_64-softmmu/hw/i386/pc_sysfw.o
/tmp/qemu-test/src/hw/i386/pc_piix.c: In function ‘igd_passthrough_isa_bridge_create’:
/tmp/qemu-test/src/hw/i386/pc_piix.c:1046: warning: ‘pch_rev_id’ may be used uninitialized in this function
  CC      aarch64-softmmu/hw/intc/arm_gicv3_cpuif.o
  CC      aarch64-softmmu/hw/misc/ivshmem.o
  CC      aarch64-softmmu/hw/misc/arm_sysctl.o
  CC      x86_64-softmmu/hw/i386/x86-iommu.o
  CC      aarch64-softmmu/hw/misc/cbus.o
  CC      x86_64-softmmu/hw/i386/intel_iommu.o
  CC      x86_64-softmmu/hw/i386/amd_iommu.o
  CC      aarch64-softmmu/hw/misc/exynos4210_pmu.o
  CC      x86_64-softmmu/hw/i386/kvmvapic.o
  CC      aarch64-softmmu/hw/misc/imx_ccm.o
  CC      aarch64-softmmu/hw/misc/imx31_ccm.o
  CC      aarch64-softmmu/hw/misc/imx25_ccm.o
  CC      x86_64-softmmu/hw/i386/acpi-build.o
  CC      x86_64-softmmu/hw/i386/pci-assign-load-rom.o
  CC      x86_64-softmmu/hw/i386/kvm/clock.o
  CC      aarch64-softmmu/hw/misc/imx6_ccm.o
  CC      x86_64-softmmu/hw/i386/kvm/apic.o
  CC      aarch64-softmmu/hw/misc/imx6_src.o
  CC      aarch64-softmmu/hw/misc/mst_fpga.o
  CC      x86_64-softmmu/hw/i386/kvm/i8259.o
  CC      aarch64-softmmu/hw/misc/omap_gpmc.o
  CC      aarch64-softmmu/hw/misc/omap_clk.o
  CC      x86_64-softmmu/hw/i386/kvm/ioapic.o
  CC      x86_64-softmmu/hw/i386/kvm/i8254.o
  CC      aarch64-softmmu/hw/misc/omap_l4.o
  CC      aarch64-softmmu/hw/misc/omap_sdrc.o
  CC      aarch64-softmmu/hw/misc/omap_tap.o
  CC      x86_64-softmmu/hw/i386/kvm/pci-assign.o
  CC      x86_64-softmmu/target-i386/translate.o
  CC      aarch64-softmmu/hw/misc/bcm2835_mbox.o
  CC      x86_64-softmmu/target-i386/helper.o
  CC      x86_64-softmmu/target-i386/cpu.o
  CC      aarch64-softmmu/hw/misc/bcm2835_property.o
  CC      x86_64-softmmu/target-i386/bpt_helper.o
  CC      aarch64-softmmu/hw/misc/zynq_slcr.o
  CC      aarch64-softmmu/hw/misc/zynq-xadc.o
  CC      x86_64-softmmu/target-i386/excp_helper.o
  CC      x86_64-softmmu/target-i386/fpu_helper.o
  CC      x86_64-softmmu/target-i386/cc_helper.o
  CC      x86_64-softmmu/target-i386/int_helper.o
  CC      x86_64-softmmu/target-i386/svm_helper.o
  CC      x86_64-softmmu/target-i386/smm_helper.o
  CC      aarch64-softmmu/hw/misc/stm32f2xx_syscfg.o
  CC      x86_64-softmmu/target-i386/misc_helper.o
  CC      x86_64-softmmu/target-i386/mem_helper.o
  CC      aarch64-softmmu/hw/misc/edu.o
  CC      x86_64-softmmu/target-i386/seg_helper.o
  CC      x86_64-softmmu/target-i386/mpx_helper.o
  CC      aarch64-softmmu/hw/misc/auxbus.o
  CC      x86_64-softmmu/target-i386/gdbstub.o
  CC      aarch64-softmmu/hw/misc/aspeed_scu.o
  CC      x86_64-softmmu/target-i386/machine.o
/tmp/qemu-test/src/hw/i386/intel_iommu.c: In function ‘vtd_switch_address_space’:
/tmp/qemu-test/src/hw/i386/intel_iommu.c:1196: warning: implicit declaration of function ‘trace_vtd_switch_address_space’
/tmp/qemu-test/src/hw/i386/intel_iommu.c:1196: warning: nested extern declaration of ‘trace_vtd_switch_address_space’
/tmp/qemu-test/src/hw/i386/intel_iommu.c: In function ‘vtd_find_add_as’:
/tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: ‘name’ undeclared (first use in this function)
/tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: (Each undeclared identifier is reported only once
/tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: for each function it appears in.)
make[1]: *** [hw/i386/intel_iommu.o] Error 1
make[1]: *** Waiting for unfinished jobs....
  CC      x86_64-softmmu/target-i386/arch_memory_mapping.o
  CC      aarch64-softmmu/hw/misc/aspeed_sdmc.o
  CC      aarch64-softmmu/hw/net/virtio-net.o
  CC      aarch64-softmmu/hw/net/vhost_net.o
  CC      aarch64-softmmu/hw/pcmcia/pxa2xx.o
  CC      aarch64-softmmu/hw/scsi/virtio-scsi.o
  CC      aarch64-softmmu/hw/scsi/virtio-scsi-dataplane.o
  CC      aarch64-softmmu/hw/sd/omap_mmc.o
  CC      aarch64-softmmu/hw/scsi/vhost-scsi.o
  CC      aarch64-softmmu/hw/sd/pxa2xx_mmci.o
  CC      aarch64-softmmu/hw/ssi/omap_spi.o
  CC      aarch64-softmmu/hw/ssi/imx_spi.o
  CC      aarch64-softmmu/hw/timer/exynos4210_mct.o
  CC      aarch64-softmmu/hw/timer/exynos4210_pwm.o
  CC      aarch64-softmmu/hw/timer/exynos4210_rtc.o
  CC      aarch64-softmmu/hw/timer/omap_gptimer.o
  CC      aarch64-softmmu/hw/timer/omap_synctimer.o
  CC      aarch64-softmmu/hw/timer/pxa2xx_timer.o
  CC      aarch64-softmmu/hw/timer/digic-timer.o
  CC      aarch64-softmmu/hw/timer/allwinner-a10-pit.o
  CC      aarch64-softmmu/hw/usb/tusb6010.o
  CC      aarch64-softmmu/hw/vfio/common.o
  CC      aarch64-softmmu/hw/vfio/pci.o
  CC      aarch64-softmmu/hw/vfio/pci-quirks.o
  CC      aarch64-softmmu/hw/vfio/platform.o
  CC      aarch64-softmmu/hw/vfio/calxeda-xgmac.o
  CC      aarch64-softmmu/hw/vfio/amd-xgbe.o
  CC      aarch64-softmmu/hw/vfio/spapr.o
  CC      aarch64-softmmu/hw/virtio/virtio.o
  CC      aarch64-softmmu/hw/virtio/virtio-balloon.o
  CC      aarch64-softmmu/hw/virtio/vhost.o
  CC      aarch64-softmmu/hw/virtio/vhost-backend.o
  CC      aarch64-softmmu/hw/virtio/vhost-user.o
  CC      aarch64-softmmu/hw/virtio/vhost-vsock.o
  CC      aarch64-softmmu/hw/virtio/virtio-crypto.o
  CC      aarch64-softmmu/hw/virtio/virtio-crypto-pci.o
  CC      aarch64-softmmu/hw/arm/boot.o
  CC      aarch64-softmmu/hw/arm/collie.o
  CC      aarch64-softmmu/hw/arm/exynos4_boards.o
  CC      aarch64-softmmu/hw/arm/gumstix.o
  CC      aarch64-softmmu/hw/arm/highbank.o
  CC      aarch64-softmmu/hw/arm/digic_boards.o
  CC      aarch64-softmmu/hw/arm/integratorcp.o
  CC      aarch64-softmmu/hw/arm/mainstone.o
  CC      aarch64-softmmu/hw/arm/musicpal.o
  CC      aarch64-softmmu/hw/arm/nseries.o
  CC      aarch64-softmmu/hw/arm/omap_sx1.o
  CC      aarch64-softmmu/hw/arm/palm.o
  CC      aarch64-softmmu/hw/arm/realview.o
  CC      aarch64-softmmu/hw/arm/spitz.o
  CC      aarch64-softmmu/hw/arm/stellaris.o
  CC      aarch64-softmmu/hw/arm/tosa.o
  CC      aarch64-softmmu/hw/arm/versatilepb.o
  CC      aarch64-softmmu/hw/arm/vexpress.o
  CC      aarch64-softmmu/hw/arm/virt.o
/tmp/qemu-test/src/hw/i386/acpi-build.c: In function ‘build_append_pci_bus_devices’:
/tmp/qemu-test/src/hw/i386/acpi-build.c:501: warning: ‘notify_method’ may be used uninitialized in this function
  CC      aarch64-softmmu/hw/arm/xilinx_zynq.o
  CC      aarch64-softmmu/hw/arm/z2.o
  CC      aarch64-softmmu/hw/arm/virt-acpi-build.o
  CC      aarch64-softmmu/hw/arm/netduino2.o
  CC      aarch64-softmmu/hw/arm/sysbus-fdt.o
  CC      aarch64-softmmu/hw/arm/exynos4210.o
  CC      aarch64-softmmu/hw/arm/armv7m.o
  CC      aarch64-softmmu/hw/arm/pxa2xx.o
  CC      aarch64-softmmu/hw/arm/pxa2xx_gpio.o
  CC      aarch64-softmmu/hw/arm/pxa2xx_pic.o
  CC      aarch64-softmmu/hw/arm/digic.o
  CC      aarch64-softmmu/hw/arm/omap1.o
  CC      aarch64-softmmu/hw/arm/omap2.o
  CC      aarch64-softmmu/hw/arm/strongarm.o
  CC      aarch64-softmmu/hw/arm/allwinner-a10.o
  CC      aarch64-softmmu/hw/arm/cubieboard.o
  CC      aarch64-softmmu/hw/arm/bcm2835_peripherals.o
  CC      aarch64-softmmu/hw/arm/bcm2836.o
  CC      aarch64-softmmu/hw/arm/raspi.o
  CC      aarch64-softmmu/hw/arm/stm32f205_soc.o
  CC      aarch64-softmmu/hw/arm/xlnx-zynqmp.o
  CC      aarch64-softmmu/hw/arm/xlnx-ep108.o
  CC      aarch64-softmmu/hw/arm/fsl-imx25.o
  CC      aarch64-softmmu/hw/arm/imx25_pdk.o
  CC      aarch64-softmmu/hw/arm/fsl-imx31.o
  CC      aarch64-softmmu/hw/arm/kzm.o
  CC      aarch64-softmmu/hw/arm/fsl-imx6.o
  CC      aarch64-softmmu/hw/arm/sabrelite.o
  CC      aarch64-softmmu/hw/arm/aspeed_soc.o
  CC      aarch64-softmmu/hw/arm/aspeed.o
  CC      aarch64-softmmu/target-arm/arm-semi.o
  CC      aarch64-softmmu/target-arm/machine.o
  CC      aarch64-softmmu/target-arm/psci.o
  CC      aarch64-softmmu/target-arm/arch_dump.o
  CC      aarch64-softmmu/target-arm/monitor.o
  CC      aarch64-softmmu/target-arm/kvm-stub.o
  CC      aarch64-softmmu/target-arm/translate.o
  CC      aarch64-softmmu/target-arm/helper.o
  CC      aarch64-softmmu/target-arm/op_helper.o
  CC      aarch64-softmmu/target-arm/cpu.o
  CC      aarch64-softmmu/target-arm/neon_helper.o
  CC      aarch64-softmmu/target-arm/iwmmxt_helper.o
  CC      aarch64-softmmu/target-arm/gdbstub.o
  CC      aarch64-softmmu/target-arm/cpu64.o
  CC      aarch64-softmmu/target-arm/translate-a64.o
  CC      aarch64-softmmu/target-arm/helper-a64.o
  CC      aarch64-softmmu/target-arm/gdbstub64.o
  CC      aarch64-softmmu/target-arm/crypto_helper.o
  CC      aarch64-softmmu/target-arm/arm-powerctl.o
  GEN     trace/generated-helpers.c
  CC      aarch64-softmmu/trace/control-target.o
  CC      aarch64-softmmu/gdbstub-xml.o
  CC      aarch64-softmmu/trace/generated-helpers.o
make: *** [subdir-x86_64-softmmu] Error 2
make: *** Waiting for unfinished jobs....
/tmp/qemu-test/src/target-arm/translate-a64.c: In function ‘handle_shri_with_rndacc’:
/tmp/qemu-test/src/target-arm/translate-a64.c:6391: warning: ‘tcg_src_hi’ may be used uninitialized in this function
/tmp/qemu-test/src/target-arm/translate-a64.c: In function ‘disas_simd_scalar_two_reg_misc’:
/tmp/qemu-test/src/target-arm/translate-a64.c:8118: warning: ‘rmode’ may be used uninitialized in this function
  LINK    aarch64-softmmu/qemu-system-aarch64
make[1]: *** [docker-run] Error 2
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-0ljnwqpe/src'
make: *** [docker-run-test-quick@centos6] Error 2
=== OUTPUT END ===

Test command exited with code: 2


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-20  6:38       ` Peter Xu
@ 2016-12-21  0:04         ` Alex Williamson
  2016-12-21  3:19           ` Peter Xu
  2016-12-21  3:49           ` David Gibson
  0 siblings, 2 replies; 18+ messages in thread
From: Alex Williamson @ 2016-12-21  0:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	bd.aviv, david

On Tue, 20 Dec 2016 14:38:01 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote:
> 
> [...]
> 
> > > Yes, this patch just tried to move VT-d forward a bit, rather than do
> > > it once and for all. I think we can do better than this in the future,
> > > for example, one address space per guest IOMMU domain (as you have
> > > mentioned before). However I suppose that will need more work (which I
> > > still can't estimate the amount of work). So I am considering
> > > enabling device assignment functionality first, then we can further
> > > improve based on a workable version. Same thoughts apply to the IOMMU
> > > replay RFC series.  
> > 
> > I'm not arguing against it, I'm just trying to set expectations for
> > where this gets us.  An AddressSpace per guest iommu domain seems like
> > the right model for QEMU, but it has some fundamental issues with
> > vfio.  We currently tie a QEMU AddressSpace to a vfio container, which
> > represents the host IOMMU context.  The AddressSpace of a device is
> > currently assumed to be fixed in QEMU, guest IOMMU domains clearly
> are not.  vfio only lets us have access to a device while it's
> > protected within a container.  Therefore in order to move a device to a
> > different AddressSpace based on the guest domain configuration, we'd
> > need to tear down the vfio configuration, including releasing the
> > device.  
> 
> I assume this is a VT-d specific issue, right? Looks like ppc is using a
> totally different way to manage the mapping, and devices can share the
> same address space.

It's only VT-d specific in that VT-d is the only vIOMMU we have for
x86.  ppc has a much different host IOMMU architecture and their VM
architecture requires an IOMMU.  The ppc model has a notion of
preregistration to help with this, among other things.

> > > Regarding the locked memory accounting issue: do we have an existing
> > > way to do the accounting? If so, would you (or anyone) please
> > > elaborate a bit? If not, is that an ongoing/planned work?  
> > 
> > As I describe above, there's a vfio container per AddressSpace, each
> > container is an IOMMU domain in the host.  In the guest, an IOMMU
> > domain can include multiple AddressSpaces, one for each context entry
> > that's part of the domain.  When the guest programs a translation for
> > an IOMMU domain, that maps a guest IOVA to a guest physical address,
> > for each AddressSpace.  Each AddressSpace is backed by a vfio
> > container, which needs to pin the pages of that translation in order to
> > get a host physical address, which then gets programmed into the host
> > IOMMU domain with the guest-IOVA and host physical address.  The
> > pinning process is where page accounting is done.  It's done per vfio
> > context.  The worst case scenario for accounting is thus when VT-d is
> > present but disabled (or in passthrough mode) as each AddressSpace
> > duplicates address_space_memory and every page of guest memory is
> > pinned and accounted for each vfio container.  
> 
> IIUC this accounting issue will solve itself if we can solve the
> previous issue. While we don't have it now, so ...

Not sure what "previous issue" is referring to here.

> > That's the existing way we do accounting.  There is no current
> > development that I'm aware of to change this.  As above, the simplest
> > stop-gap solution is that libvirt would need to be aware when VT-d is
> > present for a VM and use a different algorithm to set QEMU locked
> > memory limit, but it's not without its downsides.  
> 
> ... here I think it's sensible to consider a specific algorithm for
> the VT-d use case. I am just curious about how we should define this
> algorithm.
> 
> First of all, when the devices are not sharing a domain (or say, one
> guest iommu domain per assigned device), everything should be fine.

No, each domain could map the entire guest address space.  If we're
talking about a domain per device for use with the Linux DMA API, then
it's unlikely that the sum of mapped pages across all the domains will
exceed the current libvirt set locked memory limit.  However, that's
exactly the configuration where we expect to have abysmal performance.
As soon as we recommend the guest boot with iommu=pt, then each
container will be mapping and pinning the entire VM address space.

> No
> special algorithm needed. IMHO the problem will happen only if there
> are assigned devices that share the same address space (either system,
> or specific iommu domain). In that case, the accounted value (or say,
> current->mm->locked_vm iiuc) will be bigger than the real locked
> memory size.
> 
> However, I think the problem is that whether devices are put into the same
> address space depends on guest behavior - the guest can either use
> iommu=pt, or manually put devices into the same guest iommu region
> to achieve that. But from hypervisor POV, how should we estimate this?
> Can we really?

The simple answer is that each device needs to be able to map the
entire VM address space and therefore when a VM is configured with
VT-d, libvirt needs to multiply the current locked memory settings for
assigned devices by the number of devices (groups actually) assigned.
There are (at least) two problems with this though.  The first is that
we expect QEMU to use this increased locked memory limits for duplicate
accounting of the same pages, but an exploited user process could take
advantage of it and cause problems.  Not optimal.  The second problem
relates to the usage of the IOVA address space and the assumption that
a given container will map no more than the VM address space.  When no
vIOMMU is exposed to the VM, QEMU manages the container IOVA space and
we know that QEMU is only mapping VM RAM and therefore mappings are
bound by the size of the VM.  With a vIOMMU, the guest is in control of
the IOVA space and can map up to the limits of the vIOMMU.  The guest
can map a single 4KB page to every IOVA up to that limit and we'll
account that page each time.  So even valid (though perhaps not useful)
cases within the guest can hit that locking limit.

This suggests that we not only need a vfio IOMMU model that tracks pfns
per domain to avoid duplicate accounting, but we need some way to share
that tracking between domains.  Then we can go back to allowing a
locked memory limit up to the VM RAM size as the correct and complete
solution (plus some sort of shadow page table based mapping for any
hope of bearable performance for dynamic usage).
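
The shared-tracking idea described above can be sketched with a toy
refcounted pfn table. Everything here is a hypothetical illustration of
the accounting scheme only (names, the fixed-size table, the lack of
locking), not vfio code: a pinned page is charged against the locked
memory total only on its first pin, no matter how many containers map
it, and is uncharged only when the last mapping goes away.

```c
/* Toy model of shared pfn accounting across IOMMU domains.
 * Hypothetical sketch, not vfio code: no locking, no collision
 * handling, fixed-size table keyed by pfn. */
#include <assert.h>

#define TABLE_SIZE 4096

static unsigned refcount[TABLE_SIZE];   /* per-pfn pin count */
static unsigned long locked_pages;      /* what would hit locked_vm */

/* A container pins a pfn: only the 0 -> 1 transition is accounted. */
static void pin_pfn(unsigned long pfn)
{
    unsigned *rc = &refcount[pfn % TABLE_SIZE];
    if ((*rc)++ == 0) {
        locked_pages++;
    }
}

/* A container unpins a pfn: only the 1 -> 0 transition is unaccounted. */
static void unpin_pfn(unsigned long pfn)
{
    unsigned *rc = &refcount[pfn % TABLE_SIZE];
    assert(*rc > 0);
    if (--(*rc) == 0) {
        locked_pages--;
    }
}
```

With this scheme, two containers that each duplicate
address_space_memory (the worst case described above) charge each guest
page once instead of twice, so a locked memory limit of VM RAM size
would again be sufficient.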
 
> > Alternatively, a new
> > IOMMU model would need to be developed for vfio.  The type1 model was
> > only ever intended to be used for relatively static user mappings and I
> > expect it to have horrendous performance when backing a dynamic guest
> > IOMMU domain.  Really the only guest IOMMU usage model that makes any
> > sort of sense with type1 is to run the guest with passthrough (iommu=pt)
> > and only pull devices out of passthrough for relatively static mapping
> > cases within the guest userspace (nested assigned devices or dpdk).  If
> > the expectation is that we just need this one little bit more code to
> > make vfio usable in the guest, that may be true, but it really is just
> > barely usable.  It's not going to be fast for any sort of dynamic
> > mapping and it's going to have accounting issues that are not
> > compatible with how libvirt sets locked memory limits for QEMU as soon
> > as you go beyond a single device.  Thanks,  
> 
> I can totally understand that the performance will suck if dynamic
> mapping is used. AFAIU this work will only be used with static dma
> mapping like running DPDK in guest (besides other trivial goals, like,
> development purpose).

We can't control how a feature is used, which is why I'm trying to make
sure this doesn't come as a surprise to anyone.
 
> Regarding "the other" iommu model you mentioned besides type1, are
> there any existing discussions out there? Any further learning
> material/links would be greatly welcome.

Nope.  You and Aviv are the only ones doing work that suggests a need
for a new IOMMU model.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-20  4:16   ` Peter Xu
@ 2016-12-21  2:53     ` David Gibson
  2016-12-21 10:05       ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: David Gibson @ 2016-12-21  2:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

[-- Attachment #1: Type: text/plain, Size: 9999 bytes --]

On Tue, Dec 20, 2016 at 12:16:50PM +0800, Peter Xu wrote:
> On Tue, Dec 20, 2016 at 10:30:12AM +1100, David Gibson wrote:
> 
> [...]
> 
> > > +static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled)
> > > +{
> > > +    GHashTableIter iter;
> > > +    VTDBus *vtd_bus;
> > > +    VTDAddressSpace *as;
> > > +    int i;
> > > +
> > > +    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> > > +    while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {
> > > +        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
> > > +            as = vtd_bus->dev_as[i];
> > > +            if (as == NULL) {
> > > +                continue;
> > > +            }
> > > +            trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
> > > +                                           VTD_PCI_SLOT(i), VTD_PCI_FUNC(i),
> > > +                                           enabled);
> > > +            if (enabled) {
> > > +                memory_region_add_subregion_overlap(&as->root, 0,
> > > +                                                    &as->iommu, 2);
> > > +            } else {
> > > +                memory_region_del_subregion(&as->root, &as->iommu);
> > 
> > Why not use memory_region_set_enabled() rather than actually
> > adding/deleting the subregion?
> 
> Good idea, thanks. :-)
> 
> [...]
> 
> > > @@ -2343,15 +2378,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> > >          vtd_dev_as->devfn = (uint8_t)devfn;
> > >          vtd_dev_as->iommu_state = s;
> > >          vtd_dev_as->context_cache_entry.context_cache_gen = 0;
> > > +
> > > +        /*
> > > +         * When DMAR is disabled, memory region relationships looks
> > > +         * like:
> > > +         *
> > > +         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
> > > +         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
> > > +         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
> > > +         *
> > > +         * When DMAR is disabled, it becomes:
> > > +         *
> > > +         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
> > > +         *  0000000000000000-ffffffffffffffff (prio 2, RW): intel_iommu
> > > +         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
> > > +         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
> > > +         *
> > > +         * The intel_iommu region is dynamically added/removed.
> > > +         */
> > >          memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
> > >                                   &s->iommu_ops, "intel_iommu", UINT64_MAX);
> > 
> > I'm almost certain UINT64_MAX is wrong here.  For one thing it would
> > collide with PCI BARs.  For another, I can't imagine that the IOMMU
> > page tables can really span an entire 2^64 space.
> 
> Could you explain why here device address space has things to do with
> PCI BARs? I thought BARs are for CPU address space only (so that CPU
> can access PCI registers via MMIO manner), am I wrong?

In short, yes.  So, first think about vanilla PCI - most things are
PCI-E these days, but the PCI addressing model which was designed for
the old hardware is still mostly the same.

With plain PCI, you have a physical bus over which address and data
cycles pass.  Those cycles don't distinguish between transfers from
host to device or device to host.  Each address cycle just gives which
address space: configuration, IO or memory, and an address.

Devices respond to addresses within their BARs, typically such cycles
will come from the host, but they don't have to - a device is able to
send cycles to another device (peer to peer DMA).  Meanwhile the host
bridge will respond to addresses within certain DMA windows,
propagating those access onwards to system memory.  How many DMA
windows there are, their size, location and whether they're mapped
directly or via an IOMMU depends on the model of host bridge.

On x86, traditionally, PCI addresses 0..<somewhere> were simply mapped
directly to memory addresses 0..<somewhere>, identity mapping RAM into
PCI space.  BARs would be assigned above <somewhere>, so they don't
collide.  I suspect old enough machines will have <somewhere> == 2G,
leaving 2G..4G for the BARs of 32-bit devices.  More modern x86
bridges must have provisions for accessing memory above 4G, but I'm
not entirely certain how that works.

PAPR traditionally also had a DMA window from 0..2G, however instead
of being direct mapped to RAM, it is always translated via an IOMMU.
More modern PAPR systems have that window by default, but allow the
OS to remove it and configure up to 2 DMA windows of variable length
and page size.  Various other platforms have various other DMA window
arrangements.

With PCI-E, of course, upstream and downstream cycles are distinct,
and peer to peer DMA isn't usually possible (unless a switch is
configured specially to allow it by forwarding cycles from one
downstream port to another).  But the address model remains logically
the same: there is just one PCI memory space and both device BARs and
host DMA windows live within it.  Firmware and/or the OS need to know
the details of the platform's host bridge, and configure both the BARs
and the DMA windows so that they don't collide.

> I think we should have a big enough IOMMU region size here. If a device
> writes to invalid addresses, IMHO we should trap it and report to the
> guest. If we have a smaller size than UINT64_MAX, how can we trap this
> behavior and report it for the whole address space (it should cover [0,
> 2^64-1])?

That's not how the IOMMU works.  How it traps is dependent on the
specific IOMMU model, but generally they'll only even look at cycles
which lie within the IOMMU's DMA window.  On x86 I'm pretty sure that
window will be large, but it won't be 2^64.  It's also likely to have
a gap between 2..4GiB to allow room for the BARs of 32-bit devices.
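
The window-with-a-hole behavior described above can be modeled with a
small predicate. The 39-bit limit and the 2..4 GiB hole are illustrative
assumptions taken from this discussion, not values from real VT-d or
chipset documentation:

```c
/* Toy model: the host bridge only forwards DMA cycles that fall inside
 * its DMA window and outside the range reserved for 32-bit BARs.
 * Window size and hole boundaries are assumptions for illustration. */
#include <stdbool.h>
#include <stdint.h>

#define DMA_WINDOW_END  (1ULL << 39)   /* assumed guest IOVA limit */
#define BAR_HOLE_START  (2ULL << 30)   /* 2 GiB */
#define BAR_HOLE_END    (4ULL << 30)   /* 4 GiB */

static bool iova_targets_dma_window(uint64_t iova)
{
    if (iova >= DMA_WINDOW_END) {
        return false;                  /* beyond the IOMMU's window */
    }
    if (iova >= BAR_HOLE_START && iova < BAR_HOLE_END) {
        return false;                  /* cycle claimed by a device BAR */
    }
    return true;
}
```

A cycle outside the window simply isn't looked at by the IOMMU, which is
why sizing the translated region at UINT64_MAX doesn't buy the trapping
behavior discussed above.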

> > 
> > > +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> > > +                                 "vtd_sys_alias", get_system_memory(),
> > > +                                 0, memory_region_size(get_system_memory()));
> > 
> > I strongly suspect using memory_region_size(get_system_memory()) is
> > also incorrect here.  System memory has size UINT64_MAX, but I'll bet
> > you you can't actually access all of that via PCI space (again, it
> > would collide with actual PCI BARs).  I also suspect you can't reach
> > CPU MMIO regions via the PCI DMA space.
> 
> Hmm, sounds correct.
> 
> However if so, won't we have the same problem even without an IOMMU? See
> pci_device_iommu_address_space() - address_space_memory will be the
> default if we have no IOMMU protection, and that will cover e.g. CPU
> MMIO regions as well.

True.  That default is basically assuming that both the host bridge's
DMA windows, and its outbound IO and memory windows are identity
mapped between the system bus and the PCI address space.  I suspect
that's rarely 100% true, but it's close enough to work on a fair few
platforms.

But since you're building a more accurate model of the x86 host
bridge's behaviour here, you might as well try to get it as accurate
as possible.

> > So, I think you should find out what this limit actually is and
> > restrict the alias to that window.
> 
> /me needs some more reading to figure this out. Still not quite
> familiar with the whole VM memory regions. Hints are welcomed...

It's not really a question of VM memory regions.  Rather it's about
how the host bridge translates between cpu side and PCI side
addresses.

> > >          memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
> > >                                &vtd_mem_ir_ops, s, "intel_iommu_ir",
> > >                                VTD_INTERRUPT_ADDR_SIZE);
> > > -        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
> > > -                                    &vtd_dev_as->iommu_ir);
> > > -        address_space_init(&vtd_dev_as->as,
> > > -                           &vtd_dev_as->iommu, "intel_iommu");
> > > +        memory_region_init(&vtd_dev_as->root, OBJECT(s),
> > > +                           "vtd_root", UINT64_MAX);
> > > +        memory_region_add_subregion_overlap(&vtd_dev_as->root,
> > > +                                            VTD_INTERRUPT_ADDR_FIRST,
> > > +                                            &vtd_dev_as->iommu_ir, 64);
> > > +        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
> > > +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> > > +                                            &vtd_dev_as->sys_alias, 1);
> > > +        if (s->dmar_enabled) {
> > > +            memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> > > +                                                &vtd_dev_as->iommu, 2);
> > > +        }
> > 
> > Hmm.  You have the IOMMU translated region overlaying the
> > direct-mapped alias.  You enable and disable the IOMMU subregion, but
> > you always leave the direct mapping enabled.  You might get away with
> > this because the size of the IOMMU region is UINT64_MAX, which will
> > overlay everything - but as above, I think that's wrong.  If that's
> > changed then guest devices may be able to access portions of the raw
> > address space outside the IOMMU mapped region, which could break the
> > guest's expectations of device isolation.
> > 
> > I think it would be much safer to disable the system memory alias when
> > the IOMMU is enabled.
> 
> Reasonable. Will adopt.
> 
> Thanks!
> 
> -- peterx
> 
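
The isolation concern quoted above (an always-enabled direct-map alias
underneath the IOMMU region) can be shown with a toy priority resolver
in the spirit of QEMU's overlap/priority rules. This is a hypothetical
model, not QEMU's actual flatview code, and the region sizes are made up
to expose the hole:

```c
/* Toy model of overlapping memory regions with priorities, mimicking
 * memory_region_add_subregion_overlap()/memory_region_set_enabled()
 * semantics.  Illustration only, not QEMU code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint64_t start, size;
    int prio;
    bool enabled;
} ToyRegion;

/* Highest-priority enabled region containing addr wins (NULL if none). */
static const ToyRegion *resolve(const ToyRegion *rs, size_t n, uint64_t addr)
{
    const ToyRegion *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!rs[i].enabled || addr < rs[i].start ||
            addr - rs[i].start >= rs[i].size) {
            continue;
        }
        if (!best || rs[i].prio > best->prio) {
            best = &rs[i];
        }
    }
    return best;
}
```

If the IOMMU region were ever smaller than the alias and the alias
stayed enabled, accesses beyond the IOMMU span would resolve to the
direct map; disabling the alias while the IOMMU is enabled closes that
hole, which is the rationale for the suggestion above.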

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-21  0:04         ` Alex Williamson
@ 2016-12-21  3:19           ` Peter Xu
  2016-12-21  3:49           ` David Gibson
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Xu @ 2016-12-21  3:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	bd.aviv, david

On Tue, Dec 20, 2016 at 05:04:33PM -0700, Alex Williamson wrote:
> On Tue, 20 Dec 2016 14:38:01 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote:
> > 
> > [...]
> > 
> > > > Yes, this patch just tried to move VT-d forward a bit, rather than do
> > > > it once and for all. I think we can do better than this in the future,
> > > > for example, one address space per guest IOMMU domain (as you have
> > > > mentioned before). However I suppose that will need more work (which I
> > > > still can't estimate the amount of work). So I am considering
> > > > enabling device assignment functionality first, then we can further
> > > > improve based on a workable version. Same thoughts apply to the IOMMU
> > > > replay RFC series.  
> > > 
> > > I'm not arguing against it, I'm just trying to set expectations for
> > > where this gets us.  An AddressSpace per guest iommu domain seems like
> > > the right model for QEMU, but it has some fundamental issues with
> > > vfio.  We currently tie a QEMU AddressSpace to a vfio container, which
> > > represents the host IOMMU context.  The AddressSpace of a device is
> > > currently assumed to be fixed in QEMU, guest IOMMU domains clearly
> > > are not.  vfio only lets us have access to a device while it's
> > > protected within a container.  Therefore in order to move a device to a
> > > different AddressSpace based on the guest domain configuration, we'd
> > > need to tear down the vfio configuration, including releasing the
> > > device.  
> > 
> > I assume this is a VT-d specific issue, right? Looks like ppc is using a
> > totally different way to manage the mapping, and devices can share the
> > same address space.
> 
> It's only VT-d specific in that VT-d is the only vIOMMU we have for
> x86.  ppc has a much different host IOMMU architecture and their VM
> architecture requires an IOMMU.  The ppc model has a notion of
> preregistration to help with this, among other things.
> 
> > > > Regarding the locked memory accounting issue: do we have an existing
> > > > way to do the accounting? If so, would you (or anyone) please
> > > > elaborate a bit? If not, is that an ongoing/planned work?  
> > > 
> > > As I describe above, there's a vfio container per AddressSpace, each
> > > container is an IOMMU domain in the host.  In the guest, an IOMMU
> > > domain can include multiple AddressSpaces, one for each context entry
> > > that's part of the domain.  When the guest programs a translation for
> > > an IOMMU domain, that maps a guest IOVA to a guest physical address,
> > > for each AddressSpace.  Each AddressSpace is backed by a vfio
> > > container, which needs to pin the pages of that translation in order to
> > > get a host physical address, which then gets programmed into the host
> > > IOMMU domain with the guest-IOVA and host physical address.  The
> > > pinning process is where page accounting is done.  It's done per vfio
> > > context.  The worst case scenario for accounting is thus when VT-d is
> > > present but disabled (or in passthrough mode) as each AddressSpace
> > > duplicates address_space_memory and every page of guest memory is
> > > pinned and accounted for each vfio container.  
> > 
> > IIUC this accounting issue will solve itself if we can solve the
> > previous issue. While we don't have it now, so ...
> 
> Not sure what "previous issue" is referring to here.

Here I meant if we can let devices share the same VFIOAddressSpace in
the future for VT-d emulation (just like ppc), then we won't need to
worry about the duplicated accounting issue again. In that case, when
N devices are put into the same guest iommu domain, they'll share a
single VFIOAddressSpace in QEMU, and the mapping will be counted only
once.

But I think this does not solve the problem you mentioned below - yes,
it looks like a guest user space driver can map the whole 39/48-bit
address space. That's something I failed to realize before...

> 
> > > That's the existing way we do accounting.  There is no current
> > > development that I'm aware of to change this.  As above, the simplest
> > > stop-gap solution is that libvirt would need to be aware when VT-d is
> > > present for a VM and use a different algorithm to set QEMU locked
> > > memory limit, but it's not without its downsides.  
> > 
> > ... here I think it's sensible to consider a specific algorithm for
> > the VT-d use case. I am just curious about how we should define this
> > algorithm.
> > 
> > First of all, when the devices are not sharing a domain (or say, one
> > guest iommu domain per assigned device), everything should be fine.
> 
> No, each domain could map the entire guest address space.  If we're
> talking about a domain per device for use with the Linux DMA API, then
> it's unlikely that the sum of mapped pages across all the domains will
> exceed the current libvirt set locked memory limit.  However, that's
> exactly the configuration where we expect to have abysmal performance.
> As soon as we recommend the guest boot with iommu=pt, then each
> container will be mapping and pinning the entire VM address space.
> 
> > No
> > special algorithm needed. IMHO the problem will happen only if there
> > are assigned devices that share the same address space (either system,
> > or specific iommu domain). In that case, the accounted value (or say,
> > current->mm->locked_vm iiuc) will be bigger than the real locked
> > memory size.
> > 
> > However, I think the problem is that whether devices are put into the same
> > address space depends on guest behavior - the guest can either use
> > iommu=pt, or manually put devices into the same guest iommu region
> > to achieve that. But from hypervisor POV, how should we estimate this?
> > Can we really?
> 
> The simple answer is that each device needs to be able to map the
> entire VM address space and therefore when a VM is configured with
> VT-d, libvirt needs to multiply the current locked memory settings for
> assigned devices by the number of devices (groups actually) assigned.
> There are (at least) two problems with this though.  The first is that
> we expect QEMU to use this increased locked memory limits for duplicate
> accounting of the same pages, but an exploited user process could take
> advantage of it and cause problems.  Not optimal.  The second problem
> relates to the usage of the IOVA address space and the assumption that
> a given container will map no more than the VM address space.  When no
> vIOMMU is exposed to the VM, QEMU manages the container IOVA space and
> we know that QEMU is only mapping VM RAM and therefore mappings are
> bound by the size of the VM.  With a vIOMMU, the guest is in control of
> the IOVA space and can map up to the limits of the vIOMMU.  The guest
> can map a single 4KB page to every IOVA up to that limit and we'll
> account that page each time.  So even valid (though perhaps not useful)
> cases within the guest can hit that locking limit.

So what will happen if a VM exceeds the locked memory limit? Are we
only using cgroups so that the DMA_MAP ioctl() will just fail? Or is
there anything more for libvirt to do when a VM exceeds this limitation?

Another question to vfio-pci module: I see that
vfio_listener_region_add() will crash the VM if it fails to do dma map
and we'll get this:

  hw_error("vfio: DMA mapping failed, unable to continue");

However in vfio_iommu_map_notify(), we don't have such a hard
requirement - when the DMA map fails, we just error_report() without
quitting the VM. Could I ask why we have different behavior here? Any
special concerns?

IMHO if the guest program abuses the vIOMMU mappings and reaches the
locked memory limit (I think for now N*VM_RAM_SIZE is big enough) that
libvirt assigned to this VM, we should just crash the VM to make sure
it won't affect others (I assume mapping more than N*VM_RAM_SIZE is a
strong clue that the guest is doing something dangerous and illegal).
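
The N*VM_RAM_SIZE policy suggested here reduces to simple accounting
arithmetic. The sketch below is only an illustration of the proposed
policy (the limit formula and names are assumptions from this thread,
not libvirt's or vfio's actual behavior):

```c
/* Hypothetical sketch of a per-VM pin-accounting limit of
 * N_groups * VM_RAM_SIZE, as suggested in the discussion. */
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

static uint64_t limit_bytes(uint64_t vm_ram_bytes, unsigned n_groups)
{
    return (uint64_t)n_groups * vm_ram_bytes;   /* N * VM_RAM_SIZE */
}

/* Returns false once accounted pinned memory would exceed the limit -
 * in the policy above, that's where the VM would be stopped. */
static bool try_account_pin(uint64_t *pinned, uint64_t limit, uint64_t len)
{
    if (*pinned + len > limit) {
        return false;
    }
    *pinned += len;
    return true;
}
```

So with two assigned groups each duplicating the full guest RAM (e.g.
iommu=pt), the limit is exactly reached, and any further mapping - such
as the repeated single-page mappings Alex described - trips the check.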

[...]

> > I can totally understand that the performance will suck if dynamic
> > mapping is used. AFAIU this work will only be used with static dma
> > mapping like running DPDK in guest (besides other trivial goals, like,
> > development purpose).
> 
> We can't control how a feature is used, which is why I'm trying to make
> sure this doesn't come as a surprise to anyone.

Yes. Agree that we'd better try our best to let people know this
before they start to use it.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-20  4:52     ` Alex Williamson
  2016-12-20  6:38       ` Peter Xu
@ 2016-12-21  3:30       ` David Gibson
  1 sibling, 0 replies; 18+ messages in thread
From: David Gibson @ 2016-12-21  3:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Xu, qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	jasowang, bd.aviv

[-- Attachment #1: Type: text/plain, Size: 8077 bytes --]

On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote:
> On Tue, 20 Dec 2016 11:44:41 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Dec 19, 2016 at 09:56:50AM -0700, Alex Williamson wrote:
> > > On Mon, 19 Dec 2016 22:41:26 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > >   
> > > > This is preparation work to finally enable dynamic switching ON/OFF for
> > > > VT-d protection. The old VT-d code uses a static IOMMU region, and
> > > > that won't satisfy vfio-pci device listeners.
> > > > 
> > > > Let me explain.
> > > > 
> > > > vfio-pci devices depend on the memory region listener and IOMMU replay
> > > > mechanism to make sure the device mapping is coherent with the guest
> > > > even if there are domain switches. And there are two kinds of domain
> > > > switches:
> > > > 
> > > >   (1) switch from domain A -> B
> > > >   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> > > > 
> > > > Case (1) is handled by the context entry invalidation handling by the
> > > > VT-d replay logic. What the replay function should do here is to replay
> > > > the existing page mappings in domain B.
> > > > 
> > > > However for case (2), we don't want to replay any domain mappings - we
> > > > just need the default GPA->HPA mappings (the address_space_memory
> > > > mapping). And this patch helps on case (2) to build up the mapping
> > > > automatically by leveraging the vfio-pci memory listeners.
> > > > 
> > > > Another important thing that this patch does is to separate
> > > > IR (Interrupt Remapping) from DMAR (DMA Remapping). The IR region
> > > > should not depend on the DMAR region (as it did before this patch).
> > > > It should be a standalone region that can be activated without
> > > > DMAR (which is common behavior for the Linux kernel - by default it
> > > > enables IR while leaving DMAR disabled).  
> > > 
> > > 
> > > This seems like an improvement, but I will note that there are existing
> > > locked memory accounting issues inherent with VT-d and vfio.  With
> > > VT-d, each device has a unique AddressSpace.  This requires that each
> > > is managed via a separate vfio container.  Each container is accounted
> > > for separately for locked pages.  libvirt currently only knows that if
> > > any vfio devices are attached that the locked memory limit for the
> > > process needs to be set sufficient for the VM memory.  When VT-d is
> > > involved, we either need to figure out how to associate otherwise
> > > independent vfio containers to share locked page accounting or teach
> > > libvirt that the locked memory requirement needs to be multiplied by
> > > the number of attached vfio devices.  The latter seems far less
> > > complicated but reduces the containment of QEMU a bit since the
> > > process has the ability to lock potentially many multiples of the VM
> > > address size.  Thanks,  
> > 
> > Yes, this patch just tried to move VT-d forward a bit, rather than do
> > it once and for all. I think we can do better than this in the future,
> > for example, one address space per guest IOMMU domain (as you have
> > mentioned before). However I suppose that will need more work (how
> > much, I still can't estimate). So I am considering enabling
> > device assignment functionally first, then we can further
> > improve based on a workable version. Same thoughts apply to the IOMMU
> > replay RFC series.
> 
> I'm not arguing against it, I'm just trying to set expectations for
> where this gets us.  An AddressSpace per guest iommu domain seems like
> the right model for QEMU, but it has some fundamental issues with
> vfio.  We currently tie a QEMU AddressSpace to a vfio container, which
> represents the host IOMMU context.  The AddressSpace of a device is
> currently assumed to be fixed in QEMU,

Actually, I think we can work around this: you could set up a separate
AddressSpace for each device which consists of nothing but a big alias
into an AddressSpace associated with the current IOMMU domain.  As the
device is moved between domains you remove/replace the alias region -
or even replace it with an alias direct into system memory when the
IOMMU is disabled.

> guest IOMMU domains clearly
> are not.  vfio only lets us have access to a device while it's
> protected within a container.  Therefore in order to move a device to a
> different AddressSpace based on the guest domain configuration, we'd
> need to tear down the vfio configuration, including releasing the
> device.
>  
> > Regarding to the locked memory accounting issue: do we have existing
> > way to do the accounting? If so, would you (or anyone) please
> > elaborate a bit? If not, is that an ongoing/planned work?
> 
> As I describe above, there's a vfio container per AddressSpace, each
> container is an IOMMU domain in the host.  In the guest, an IOMMU
> domain can include multiple AddressSpaces, one for each context entry
> that's part of the domain.  When the guest programs a translation for
> an IOMMU domain, that maps a guest IOVA to a guest physical address,
> for each AddressSpace.  Each AddressSpace is backed by a vfio
> container, which needs to pin the pages of that translation in order to
> get a host physical address, which then gets programmed into the host
> IOMMU domain with the guest-IOVA and host physical address.  The
> pinning process is where page accounting is done.

Ah.. and I take it the accounting isn't smart enough to tell that the
same page is already pinned elsewhere.  I guess that would take rather
a lot of extra bookkeeping.

> It's done per vfio
> context.  The worst case scenario for accounting is thus when VT-d is
> present but disabled (or in passthrough mode) as each AddressSpace
> duplicates address_space_memory and every page of guest memory is
> pinned and accounted for each vfio container.

Hmm.  I imagine you'll need a copy of the current translation tables
for a guest domain regardless of VFIO involvement.  So, when a domain
is unused - i.e. has no devices in it, won't the container have all
the groups detached and so give up all the memory.  Obviously when a
device is assigned to the domain you'll need to replay the current
mappings into VFIO.

> That's the existing way we do accounting.  There is no current
> development that I'm aware of to change this.  As above, the simplest
> stop-gap solution is that libvirt would need to be aware when VT-d is
> present for a VM and use a different algorithm to set QEMU locked
> memory limit, but it's not without its downsides.  Alternatively, a new
> IOMMU model would need to be developed for vfio.  The type1 model was
> only ever intended to be used for relatively static user mappings and I
> expect it to have horrendous performance when backing a dynamic guest
> IOMMU domain.  Really the only guest IOMMU usage model that makes any
> sort of sense with type1 is to run the guest with passthrough (iommu=pt)
> and only pull devices out of passthrough for relatively static mapping
> cases within the guest userspace (nested assigned devices or dpdk).  If
> the expectation is that we just need this one little bit more code to
> make vfio usable in the guest, that may be true, but it really is just
> barely usable.  It's not going to be fast for any sort of dynamic
> mapping and it's going to have accounting issues that are not
> compatible with how libvirt sets locked memory limits for QEMU as soon
> as you go beyond a single device.  Thanks,

Maybe we should revisit the idea of a "type2" IOMMU which could handle
both guest VT-d and guest PAPR TCEs.  I'm not excessively fond of the
pre-registration model that PAPR uses at the moment, but it might be
the best available way to deal with the accounting issue.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-20 23:02 ` no-reply
@ 2016-12-21  3:33   ` Peter Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2016-12-21  3:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: famz, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv, david

On Tue, Dec 20, 2016 at 03:02:54PM -0800, no-reply@patchew.org wrote:

[...]

> === OUTPUT BEGIN ===
> Checking PATCH 1/1: intel_iommu: allow dynamic switch of IOMMU region...
> ERROR: "(foo**)" should be "(foo **)"
> #55: FILE: hw/i386/intel_iommu.c:1190:
> +    while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {
> 
> ERROR: space prohibited between function name and open parenthesis '('
> #55: FILE: hw/i386/intel_iommu.c:1190:
> +    while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {
> 
> total: 2 errors, 0 warnings, 118 lines checked

Hmm, let me fix.

-- peterx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-20 23:57 ` no-reply
@ 2016-12-21  3:39   ` Peter Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2016-12-21  3:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: famz, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv, david

On Tue, Dec 20, 2016 at 03:57:59PM -0800, no-reply@patchew.org wrote:

[...]

> /tmp/qemu-test/src/hw/i386/intel_iommu.c: In function ‘vtd_switch_address_space’:
> /tmp/qemu-test/src/hw/i386/intel_iommu.c:1196: warning: implicit declaration of function ‘trace_vtd_switch_address_space’
> /tmp/qemu-test/src/hw/i386/intel_iommu.c:1196: warning: nested extern declaration of ‘trace_vtd_switch_address_space’
> /tmp/qemu-test/src/hw/i386/intel_iommu.c: In function ‘vtd_find_add_as’:
> /tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: ‘name’ undeclared (first use in this function)
> /tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: (Each undeclared identifier is reported only once
> /tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: for each function it appears in.)

I believe I missed something in the rebase on the
trace-events file... Will repost a cleaner v2 addressing the existing comments.

-- peterx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-21  0:04         ` Alex Williamson
  2016-12-21  3:19           ` Peter Xu
@ 2016-12-21  3:49           ` David Gibson
  1 sibling, 0 replies; 18+ messages in thread
From: David Gibson @ 2016-12-21  3:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Xu, qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	jasowang, bd.aviv

On Tue, Dec 20, 2016 at 05:04:33PM -0700, Alex Williamson wrote:
> On Tue, 20 Dec 2016 14:38:01 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote:
> > 
> > [...]
> > 
> > > > Yes, this patch just tried to move VT-d forward a bit, rather than do
> > > > it once and for all. I think we can do better than this in the future,
> > > > for example, one address space per guest IOMMU domain (as you have
> > > > mentioned before). However I suppose that will need more work (how
> > > > much, I still can't estimate). So I am considering enabling
> > > > device assignment functionally first, then we can further
> > > > improve based on a workable version. Same thoughts apply to the IOMMU
> > > > replay RFC series.  
> > > 
> > > I'm not arguing against it, I'm just trying to set expectations for
> > > where this gets us.  An AddressSpace per guest iommu domain seems like
> > > the right model for QEMU, but it has some fundamental issues with
> > > vfio.  We currently tie a QEMU AddressSpace to a vfio container, which
> > > represents the host IOMMU context.  The AddressSpace of a device is
> > > currently assumed to be fixed in QEMU, guest IOMMU domains clearly
> > > are not.  vfio only lets us have access to a device while it's
> > > protected within a container.  Therefore in order to move a device to a
> > > different AddressSpace based on the guest domain configuration, we'd
> > > need to tear down the vfio configuration, including releasing the
> > > device.  
> > 
> > I assume this is a VT-d-specific issue, right? Looks like ppc is using a
> > totally different way to manage the mapping, and devices can share the
> > same address space.
> 
> It's only VT-d specific in that VT-d is the only vIOMMU we have for
> x86.  ppc has a much different host IOMMU architecture and their VM
> architecture requires an IOMMU.  The ppc model has a notion of
> preregistration to help with this, among other things.

Preregistration addresses two different problems:

1) Preregistered pages are accounted as locked only once per host mm,
even if registered against multiple containers, so it avoids the
problem of double counting locked memory.

2) By moving the accounting to preregistration, it removes one barrier
to having decent performance for dynamic mappings.  (e.g. for PAPR
accounting at map time made it basically impossible to write a fast
KVM acceleration for this).

> > > > Regarding to the locked memory accounting issue: do we have existing
> > > > way to do the accounting? If so, would you (or anyone) please
> > > > elaborate a bit? If not, is that an ongoing/planned work?  
> > > 
> > > As I describe above, there's a vfio container per AddressSpace, each
> > > container is an IOMMU domain in the host.  In the guest, an IOMMU
> > > domain can include multiple AddressSpaces, one for each context entry
> > > that's part of the domain.  When the guest programs a translation for
> > > an IOMMU domain, that maps a guest IOVA to a guest physical address,
> > > for each AddressSpace.  Each AddressSpace is backed by a vfio
> > > container, which needs to pin the pages of that translation in order to
> > > get a host physical address, which then gets programmed into the host
> > > IOMMU domain with the guest-IOVA and host physical address.  The
> > > pinning process is where page accounting is done.  It's done per vfio
> > > context.  The worst case scenario for accounting is thus when VT-d is
> > > present but disabled (or in passthrough mode) as each AddressSpace
> > > duplicates address_space_memory and every page of guest memory is
> > > pinned and accounted for each vfio container.  
> > 
> > IIUC this accounting issue will solve itself if we can solve the
> > previous issue. While we don't have it now, so ...
> 
> Not sure what "previous issue" is referring to here.
> 
> > > That's the existing way we do accounting.  There is no current
> > > development that I'm aware of to change this.  As above, the simplest
> > > stop-gap solution is that libvirt would need to be aware when VT-d is
> > > present for a VM and use a different algorithm to set QEMU locked
> > > memory limit, but it's not without its downsides.  
> > 
> > ... here I think it's sensible to consider a specific algorithm for
> > vt-d use case. I am just curious about how should we define this
> > algorithm.
> > 
> > First of all, when the devices are not sharing domain (or say, one
> > guest iommu domain per assigned device), everything should be fine.
> 
> No, each domain could map the entire guest address space.  If we're
> talking about a domain per device for use with the Linux DMA API, then
> it's unlikely that the sum of mapped pages across all the domains will
> exceed the current libvirt set locked memory limit.  However, that's
> exactly the configuration where we expect to have abysmal performance.
> As soon as we recommend the guest boot with iommu=pt, then each
> container will be mapping and pinning the entire VM address space.
> 
> > No
> > special algorithm needed. IMHO the problem will happen only if there
> > are assigned devices that share a same address space (either system,
> > or specific iommu domain). In that case, the accounted value (or say,
> > current->mm->locked_vm iiuc) will be bigger than the real locked
> > memory size.
> > 
> > However, I think the problem is whether devices will be put into same
> > address space depends on guest behavior - the guest can either use
> > iommu=pt, or manually putting devices into the same guest iommu region
> > to achieve that. But from hypervisor POV, how should we estimate this?
> > Can we really?
> 
> The simple answer is that each device needs to be able to map the
> entire VM address space and therefore when a VM is configured with
> VT-d, libvirt needs to multiply the current locked memory settings for
> assigned devices by the number of devices (groups actually) assigned.
> There are (at least) two problems with this though.  The first is that
> we expect QEMU to use this increased locked memory limits for duplicate
> accounting of the same pages, but an exploited user process could take
> advantage of it and cause problems.  Not optimal.  The second problem
> relates to the usage of the IOVA address space and the assumption that
> a given container will map no more than the VM address space.  When no
> vIOMMU is exposed to the VM, QEMU manages the container IOVA space and
> we know that QEMU is only mapping VM RAM and therefore mappings are
> bound by the size of the VM.  With a vIOMMU, the guest is in control of
> the IOVA space and can map up to the limits of the vIOMMU.  The guest
> can map a single 4KB page to every IOVA up to that limit and we'll
> account that page each time.  So even valid (though perhaps not useful)
> cases within the guest can hit that locking limit.
> 
> This suggests that we not only need a vfio IOMMU model that tracks pfns
> per domain to avoid duplicate accounting, but we need some way to share
> that tracking between domains.

Sounds pretty much identical to PAPR's preregistration.

> Then we can go back to allowing a
> locked memory limit up to the VM RAM size as the correct and complete
> solution (plus some sort of shadow page table based mapping for any
> hope of bearable performance for dynamic usage).
>  
> > > Alternatively, a new
> > > IOMMU model would need to be developed for vfio.  The type1 model was
> > > only ever intended to be used for relatively static user mappings and I
> > > expect it to have horrendous performance when backing a dynamic guest
> > > IOMMU domain.  Really the only guest IOMMU usage model that makes any
> > > sort of sense with type1 is to run the guest with passthrough (iommu=pt)
> > > and only pull devices out of passthrough for relatively static mapping
> > > cases within the guest userspace (nested assigned devices or dpdk).  If
> > > the expectation is that we just need this one little bit more code to
> > > make vfio usable in the guest, that may be true, but it really is just
> > > barely usable.  It's not going to be fast for any sort of dynamic
> > > mapping and it's going to have accounting issues that are not
> > > compatible with how libvirt sets locked memory limits for QEMU as soon
> > > as you go beyond a single device.  Thanks,  
> > 
> > I can totally understand that the performance will suck if dynamic
> > mapping is used. AFAIU this work will only be used with static dma
> > mapping like running DPDK in guest (besides other trivial goals, like,
> > development purpose).
> 
> We can't control how a feature is used, which is why I'm trying to make
> sure this doesn't come as a surprise to anyone.
>  
> > Regarding to "the other" iommu model you mentioned besides type1, is
> > there any existing discussions out there? Any further learning
> > material/links would be greatly welcomed.
> 
> Nope.  You and Aviv are the only ones doing work that suggests a need
> for a new IOMMU model.  Thanks,

I've also thought about it idly in the past.  Currently we have some
gratuitous differences in the spapr IOMMU mode.  We do need some
things different than Type1, but less than we thought when the
interfaces were designed.  It would be nice to converge these
together.  Preregistration, plus some more powerful introspection
about what IOVA windows the IOMMU model can support should get us most
of the way to being able to support both PAPR and VT-d backends with
the same interface.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-21  2:53     ` David Gibson
@ 2016-12-21 10:05       ` Peter Xu
  2016-12-21 22:56         ` David Gibson
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2016-12-21 10:05 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Wed, Dec 21, 2016 at 01:53:37PM +1100, David Gibson wrote:

[...]

> > Could you explain why the device address space has anything to do with
> > PCI BARs? I thought BARs are for the CPU address space only (so that the
> > CPU can access PCI registers via MMIO), am I wrong?
> 
> In short, yes.  So, first think about vanilla PCI - most things are
> PCI-E these days, but the PCI addressing model which was designed for
> the old hardware is still mostly the same.
> 
> With plain PCI, you have a physical bus over which address and data
> cycles pass.  Those cycles don't distinguish between transfers from
> host to device or device to host.  Each address cycle just gives which
> address space: configuration, IO or memory, and an address.
> 
> Devices respond to addresses within their BARs, typically such cycles
> will come from the host, but they don't have to - a device is able to
> send cycles to another device (peer to peer DMA).  Meanwhile the host
> bridge will respond to addresses within certain DMA windows,
> propagating those access onwards to system memory.  How many DMA
> windows there are, their size, location and whether they're mapped
> directly or via an IOMMU depends on the model of host bridge.
> 
> On x86, traditionally, PCI addresses 0..<somewhere> were simply mapped
> directly to memory addresses 0..<somewhere>, identity mapping RAM into
> PCI space.  BARs would be assigned above <somewhere>, so they don't
> collide.  I suspect old enough machines will have <somewhere> == 2G,
> leaving 2G..4G for the BARs of 32-bit devices.  More modern x86
> bridges must have provisions for accessing memory above 4G, but I'm
> not entirely certain how that works.
> 
> PAPR traditionally also had a DMA window from 0..2G, however instead
> of being direct mapped to RAM, it is always translated via an IOMMU.
> More modern PAPR systems have that window by default, but allow the
> OS to remove it and configure up to 2 DMA windows of variable length
> and page size.  Various other platforms have various other DMA window
> arrangements.
> 
> With PCI-E, of course, upstream and downstream cycles are distinct,
> and peer to peer DMA isn't usually possible (unless a switch is
> configured specially to allow it by forwarding cycles from one
> downstream port to another).  But the address model remains logically
> the same: there is just one PCI memory space and both device BARs and
> host DMA windows live within it.  Firmware and/or the OS need to know
> the details of the platform's host bridge, and configure both the BARs
> and the DMA windows so that they don't collide.

Thanks for the thorough explanation. :)

So we should mask out all the MMIO regions (including BAR address
ranges) for PCI device address space, right? Since they should not be
able to access such addresses, but system ram?

> 
> > I think we should have a big enough IOMMU region size here. If device
> > writes to invalid addresses, IMHO we should trap it and report to
> > guest. If we have a smaller size than UINT64_MAX, how we can trap this
> > behavior and report for the whole address space (it should cover [0,
> > 2^64-1])?
> 
> That's not how the IOMMU works.  How it traps is dependent on the
> specific IOMMU model, but generally they'll only even look at cycles
> which lie within the IOMMU's DMA window.  On x86 I'm pretty sure that
> window will be large, but it won't be 2^64.  It's also likely to have
> a gap between 2..4GiB to allow room for the BARs of 32-bit devices.

But for the x86 IOMMU region, I'm not aware of anything like a "DMA
window" - each device has its own context entry, which points to a whole
page table. In that sense I think at least all addresses in [0, 2^39-1]
should be legal? And that range should depend on how many address space
bits the specific Intel IOMMU supports; currently the emulated VT-d
supports 39 bits.

An example would be: one with VT-d should be able to map the address
3G (0xc0000000, here it is an IOVA address) to any physical address
he/she wants, as long as he/she setup the page table correctly.

Hope I didn't miss anything important..

> 
> > > 
> > > > +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> > > > +                                 "vtd_sys_alias", get_system_memory(),
> > > > +                                 0, memory_region_size(get_system_memory()));
> > > 
> > > I strongly suspect using memory_region_size(get_system_memory()) is
> > > also incorrect here.  System memory has size UINT64_MAX, but I'll bet
> > > you you can't actually access all of that via PCI space (again, it
> > > would collide with actual PCI BARs).  I also suspect you can't reach
> > > CPU MMIO regions via the PCI DMA space.
> > 
> > Hmm, sounds correct.
> > 
> > However if so we will have the same problem if without IOMMU? See
> > pci_device_iommu_address_space() - address_space_memory will be the
> > default if we have no IOMMU protection, and that will cover e.g. CPU
> > MMIO regions as well.
> 
> True.  That default is basically assuming that both the host bridge's
> DMA windows, and its outbound IO and memory windows are identity
> mapped between the system bus and the PCI address space.  I suspect
> that's rarely 100% true, but it's close enough to work on a fair few
> platforms.
> 
> But since you're building a more accurate model of the x86 host
> bridge's behaviour here, you might as well try to get it as accurate
> as possible.

Yes, but even if we can fix this problem, it should be for the
no-iommu case as well? If so, I think it might be more suitable for
another standalone patch.

Anyway, I noted this down. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
  2016-12-21 10:05       ` Peter Xu
@ 2016-12-21 22:56         ` David Gibson
  0 siblings, 0 replies; 18+ messages in thread
From: David Gibson @ 2016-12-21 22:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Wed, Dec 21, 2016 at 06:05:49PM +0800, Peter Xu wrote:
> On Wed, Dec 21, 2016 at 01:53:37PM +1100, David Gibson wrote:
> 
> [...]
> 
> > > Could you explain why the device address space has anything to do with
> > > PCI BARs? I thought BARs are for the CPU address space only (so that the
> > > CPU can access PCI registers via MMIO), am I wrong?
> > 
> > In short, yes.  So, first think about vanilla PCI - most things are
> > PCI-E these days, but the PCI addressing model which was designed for
> > the old hardware is still mostly the same.
> > 
> > With plain PCI, you have a physical bus over which address and data
> > cycles pass.  Those cycles don't distinguish between transfers from
> > host to device or device to host.  Each address cycle just gives which
> > address space: configuration, IO or memory, and an address.
> > 
> > Devices respond to addresses within their BARs, typically such cycles
> > will come from the host, but they don't have to - a device is able to
> > send cycles to another device (peer to peer DMA).  Meanwhile the host
> > bridge will respond to addresses within certain DMA windows,
> > propagating those access onwards to system memory.  How many DMA
> > windows there are, their size, location and whether they're mapped
> > directly or via an IOMMU depends on the model of host bridge.
> > 
> > On x86, traditionally, PCI addresses 0..<somewhere> were simply mapped
> > directly to memory addresses 0..<somewhere>, identity mapping RAM into
> > PCI space.  BARs would be assigned above <somewhere>, so they don't
> > collide.  I suspect old enough machines will have <somewhere> == 2G,
> > leaving 2G..4G for the BARs of 32-bit devices.  More modern x86
> > bridges must have provisions for accessing memory above 4G, but I'm
> > not entirely certain how that works.
> > 
> > PAPR traditionally also had a DMA window from 0..2G, however instead
> > of being direct mapped to RAM, it is always translated via an IOMMU.
> > More modern PAPR systems have that window by default, but allow the
> > OS to remove it and configure up to 2 DMA windows of variable length
> > and page size.  Various other platforms have various other DMA window
> > arrangements.
> > 
> > With PCI-E, of course, upstream and downstream cycles are distinct,
> > and peer to peer DMA isn't usually possible (unless a switch is
> > configured specially to allow it by forwarding cycles from one
> > downstream port to another).  But the address model remains logically
> > the same: there is just one PCI memory space and both device BARs and
> > host DMA windows live within it.  Firmware and/or the OS need to know
> > the details of the platform's host bridge, and configure both the BARs
> > and the DMA windows so that they don't collide.
> 
> Thanks for the thorough explanation. :)
> 
> So we should mask out all the MMIO regions (including BAR address
> ranges) for PCI device address space, right?

Uhh.. I think "masking out" is treating the problem backwards.
I think you should allow only a window covering RAM, not take
everything then try to remove MMIO.

> Since they should not be
> able to access such addresses, but system ram?

What is "they"?

> > > I think we should have a big enough IOMMU region size here. If device
> > > writes to invalid addresses, IMHO we should trap it and report to
> > > guest. If we have a smaller size than UINT64_MAX, how we can trap this
> > > behavior and report for the whole address space (it should cover [0,
> > > 2^64-1])?
> > 
> > That's not how the IOMMU works.  How it traps is dependent on the
> > specific IOMMU model, but generally they'll only even look at cycles
> > which lie within the IOMMU's DMA window.  On x86 I'm pretty sure that
> > window will be large, but it won't be 2^64.  It's also likely to have
> > a gap between 2..4GiB to allow room for the BARs of 32-bit devices.
> 
> > But for the x86 IOMMU region, I'm not aware of anything like a "DMA
> > window" - each device has its own context entry, which points to a
> > whole page table. In that sense I think at least all addresses in
> > [0, 2^39-1] should be legal? And that range should depend on how many
> > address space bits the specific Intel IOMMU supports; currently the
> > emulated VT-d supports 39 bits.

Well, sounds like the DMA window is 0..2^39-1 then.  If there's a
maximum number of bits in the specification - even if those aren't
implemented on any current model - that would also be a reasonable
choice.

Again, I strongly suspect there's a gap in the range 2..4GiB, to allow
for 32-bit BARs.  Maybe not the whole range, but some chunk of it.  I
believe that's what's called the "io hole" on x86.

This more or less has to be there.  If all of 0..4GiB was mapped to
RAM, and you had both a DMA capable device and a 32-bit device hanging
off a PCI-E to PCI bridge, the 32-bit device's BARs could pick up
cycles from the DMA device that were intended to go to RAM.

> An example would be: one with VT-d should be able to map the address
> 3G (0xc0000000, here it is an IOVA address) to any physical address
> he/she wants, as long as he/she setup the page table correctly.
> 
> Hope I didn't miss anything important..
> 
> > 
> > > > 
> > > > > +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> > > > > +                                 "vtd_sys_alias", get_system_memory(),
> > > > > +                                 0, memory_region_size(get_system_memory()));
> > > > 
> > > > I strongly suspect using memory_region_size(get_system_memory()) is
> > > > also incorrect here.  System memory has size UINT64_MAX, but I'll bet
> > > > you you can't actually access all of that via PCI space (again, it
> > > > would collide with actual PCI BARs).  I also suspect you can't reach
> > > > CPU MMIO regions via the PCI DMA space.
> > > 
> > > Hmm, sounds correct.
> > > 
> > > However, if so, wouldn't we have the same problem without an IOMMU?
> > > See pci_device_iommu_address_space() - address_space_memory is the
> > > default when there is no IOMMU protection, and that covers e.g. CPU
> > > MMIO regions as well.
> > 
> > True.  That default is basically assuming that both the host bridge's
> > DMA windows, and its outbound IO and memory windows are identity
> > mapped between the system bus and the PCI address space.  I suspect
> > that's rarely 100% true, but it's close enough to work on a fair few
> > platforms.
> > 
> > But since you're building a more accurate model of the x86 host
> > bridge's behaviour here, you might as well try to get it as accurate
> > as possible.
> 
> Yes, but even if we fix this problem, shouldn't the fix cover the
> no-IOMMU case as well? If so, I think it would be more suitable as a
> standalone patch.

Hm, perhaps.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


Thread overview: 18+ messages
2016-12-19 14:41 [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
2016-12-19 16:56 ` Alex Williamson
2016-12-20  3:44   ` Peter Xu
2016-12-20  4:52     ` Alex Williamson
2016-12-20  6:38       ` Peter Xu
2016-12-21  0:04         ` Alex Williamson
2016-12-21  3:19           ` Peter Xu
2016-12-21  3:49           ` David Gibson
2016-12-21  3:30       ` David Gibson
2016-12-19 23:30 ` David Gibson
2016-12-20  4:16   ` Peter Xu
2016-12-21  2:53     ` David Gibson
2016-12-21 10:05       ` Peter Xu
2016-12-21 22:56         ` David Gibson
2016-12-20 23:02 ` no-reply
2016-12-21  3:33   ` Peter Xu
2016-12-20 23:57 ` no-reply
2016-12-21  3:39   ` Peter Xu
