* [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-11-13 12:29 [Qemu-devel] [RFC 0/5] arm: Exclude reserved memory regions of iommu to avoid zhuyijun
@ 2017-11-13 12:29 ` zhuyijun
2017-11-13 12:29 ` [Qemu-devel] [RFC 3/5] hw/arm: add scattered RAM memory region support zhuyijun
2017-11-13 20:01 ` [Qemu-devel] [RFC 0/5] arm: Exclude reserved memory regions of iommu to avoid Auger Eric
2 siblings, 0 replies; 16+ messages in thread
From: zhuyijun @ 2017-11-13 12:29 UTC (permalink / raw)
To: qemu-arm, qemu-devel, eric.auger
Cc: peter.maydell, shameerali.kolothum.thodi, zhaoshenglong
From: Zhu Yijun <zhuyijun@huawei.com>
With kernel 4.11, iommu/smmu will populate the MSI IOVA reserved window and
PCI reserved window which has to be excluded from Guest iova allocations.
However, if these reserved regions fall within the QEMU default virtual memory
address space, they may get allocated as a guest VF DMA IOVA, and the DMA will
fail.
So fetch those reserved regions in this patch; the following patches in this
series carve matching holes in the QEMU RAM address space.
Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
---
hw/vfio/common.c | 67 +++++++++++++++++++++++++++++++++++++++++++
hw/vfio/pci.c | 2 ++
hw/vfio/platform.c | 2 ++
include/exec/memory.h | 7 +++++
include/hw/vfio/vfio-common.h | 3 ++
5 files changed, 81 insertions(+)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7b2924c..01bdbbd 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,8 @@ struct vfio_group_head vfio_group_list =
QLIST_HEAD_INITIALIZER(vfio_group_list);
struct vfio_as_head vfio_address_spaces =
QLIST_HEAD_INITIALIZER(vfio_address_spaces);
+struct reserved_ram_head reserved_ram_regions =
+ QLIST_HEAD_INITIALIZER(reserved_ram_regions);
#ifdef CONFIG_KVM
/*
@@ -52,6 +54,71 @@ struct vfio_as_head vfio_address_spaces =
static int vfio_kvm_device_fd = -1;
#endif
+void update_reserved_regions(hwaddr addr, hwaddr size)
+{
+ RAMRegion *reg, *new;
+
+ new = g_new(RAMRegion, 1);
+ new->base = addr;
+ new->size = size;
+
+ if (QLIST_EMPTY(&reserved_ram_regions)) {
+ QLIST_INSERT_HEAD(&reserved_ram_regions, new, next);
+ return;
+ }
+
+ /* keep reserved_ram_regions sorted by increasing base address */
+ QLIST_FOREACH(reg, &reserved_ram_regions, next) {
+ if (addr > (reg->base + reg->size - 1)) {
+ if (!QLIST_NEXT(reg, next)) {
+ QLIST_INSERT_AFTER(reg, new, next);
+ break;
+ }
+ continue;
+ } else if (addr >= reg->base && addr <= (reg->base + reg->size - 1)) {
+ /* discard the duplicate entry */
+ if (addr == reg->base && size == reg->size) {
+ g_free(new);
+ break;
+ } else {
+ QLIST_INSERT_AFTER(reg, new, next);
+ break;
+ }
+ } else {
+ QLIST_INSERT_BEFORE(reg, new, next);
+ break;
+ }
+ }
+}
+
+void vfio_get_iommu_group_reserved_region(char *group_path)
+{
+ char *filename;
+ FILE *fp;
+ hwaddr start, end;
+ char str[10];
+ struct stat st;
+
+ filename = g_strdup_printf("%s/iommu_group/reserved_regions", group_path);
+ if (stat(filename, &st) < 0) {
+ g_free(filename);
+ return;
+ }
+
+ fp = fopen(filename, "r");
+ if (!fp) {
+ g_free(filename);
+ return;
+ }
+
+ while (fscanf(fp, "%"SCNx64" %"SCNx64" %9s", &start, &end, str) == 3) {
+ update_reserved_regions(start, (end - start + 1));
+ }
+
+ g_free(filename);
+ fclose(fp);
+}
+
/*
* Common VFIO interrupt disable
*/
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c977ee3..9bffb38 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2674,6 +2674,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
vdev->vbasedev.dev = &vdev->pdev.qdev;
+ vfio_get_iommu_group_reserved_region(vdev->vbasedev.sysfsdev);
+
tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
len = readlink(tmp, group_path, sizeof(group_path));
g_free(tmp);
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index da84abf..d5bbc3a 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -578,6 +578,8 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
return -errno;
}
+ vfio_get_iommu_group_reserved_region(vbasedev->sysfsdev);
+
tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
len = readlink(tmp, group_path, sizeof(group_path));
g_free(tmp);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 5ed4042..2bcc83b 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -46,6 +46,13 @@
OBJECT_GET_CLASS(IOMMUMemoryRegionClass, (obj), \
TYPE_IOMMU_MEMORY_REGION)
+/* Scattered RAM memory region struct */
+typedef struct RAMRegion {
+ hwaddr base;
+ hwaddr size;
+ QLIST_ENTRY(RAMRegion) next;
+} RAMRegion;
+
typedef struct MemoryRegionOps MemoryRegionOps;
typedef struct MemoryRegionMmio MemoryRegionMmio;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f3a2ac9..3483ca6 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -161,10 +161,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
void vfio_put_group(VFIOGroup *group);
int vfio_get_device(VFIOGroup *group, const char *name,
VFIODevice *vbasedev, Error **errp);
+void update_reserved_regions(hwaddr addr, hwaddr size);
+void vfio_get_iommu_group_reserved_region(char *group_path);
extern const MemoryRegionOps vfio_region_ops;
extern QLIST_HEAD(vfio_group_head, VFIOGroup) vfio_group_list;
extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
+extern QLIST_HEAD(reserved_ram_head, RAMRegion) reserved_ram_regions;
#ifdef CONFIG_LINUX
int vfio_get_region_info(VFIODevice *vbasedev, int index,
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [Qemu-devel] [RFC 3/5] hw/arm: add scattered RAM memory region support
2017-11-13 12:29 [Qemu-devel] [RFC 0/5] arm: Exclude reserved memory regions of iommu to avoid zhuyijun
2017-11-13 12:29 ` [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group zhuyijun
@ 2017-11-13 12:29 ` zhuyijun
2017-11-13 20:01 ` [Qemu-devel] [RFC 0/5] arm: Exclude reserved memory regions of iommu to avoid Auger Eric
2 siblings, 0 replies; 16+ messages in thread
From: zhuyijun @ 2017-11-13 12:29 UTC (permalink / raw)
To: qemu-arm, qemu-devel, eric.auger
Cc: peter.maydell, shameerali.kolothum.thodi, zhaoshenglong
From: Zhu Yijun <zhuyijun@huawei.com>
Dig out reserved memory holes and collect scattered RAM memory
regions by adding mem_list member in arm_boot_info struct.
Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
---
hw/arm/boot.c | 8 ++++
hw/arm/virt.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++-
include/hw/arm/arm.h | 1 +
3 files changed, 108 insertions(+), 2 deletions(-)
diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index c2720c8..30438f4 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -836,6 +836,14 @@ static void arm_load_kernel_notify(Notifier *notifier, void *data)
*/
assert(!(info->secure_board_setup && kvm_enabled()));
+ /* If the machine is not virt, mem_list will be empty. */
+ if (QLIST_EMPTY(&vms->bootinfo.mem_list)) {
+ RAMRegion *new = g_new(RAMRegion, 1);
+ new->base = info->loader_start;
+ new->size = info->ram_size;
+ QLIST_INSERT_HEAD(&vms->bootinfo.mem_list, new, next);
+ }
+
info->dtb_filename = qemu_opt_get(qemu_get_machine_opts(), "dtb");
/* Load the kernel. */
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ddde5e1..ff334c1 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -56,6 +56,7 @@
#include "hw/smbios/smbios.h"
#include "qapi/visitor.h"
#include "standard-headers/linux/input.h"
+#include "hw/vfio/vfio-common.h"
#define DEFINE_VIRT_MACHINE_LATEST(major, minor, latest) \
static void virt_##major##_##minor##_class_init(ObjectClass *oc, \
@@ -1225,6 +1226,98 @@ void virt_machine_done(Notifier *notifier, void *data)
virt_build_smbios(vms);
}
+static void handle_reserved_ram_region_overlap(void)
+{
+ hwaddr cur_end, next_end;
+ RAMRegion *reg, *next_reg, *tmp_reg;
+
+ QLIST_FOREACH(reg, &reserved_ram_regions, next) {
+ next_reg = QLIST_NEXT(reg, next);
+
+ while (next_reg && next_reg->base <= (reg->base + reg->size)) {
+ next_end = next_reg->base + next_reg->size;
+ cur_end = reg->base + reg->size;
+ if (next_end > cur_end) {
+ reg->size += (next_end - cur_end);
+ }
+
+ tmp_reg = QLIST_NEXT(next_reg, next);
+ QLIST_REMOVE(next_reg, next);
+ g_free(next_reg);
+ next_reg = tmp_reg;
+ }
+ }
+}
+
+static void update_memory_regions(VirtMachineState *vms, hwaddr ram_size)
+{
+
+ RAMRegion *new, *reg, *last = NULL;
+ hwaddr virt_start, virt_end;
+ virt_start = vms->memmap[VIRT_MEM].base;
+ virt_end = virt_start + ram_size - 1;
+
+ handle_reserved_ram_region_overlap();
+
+ QLIST_FOREACH(reg, &reserved_ram_regions, next) {
+ if (reg->base >= virt_start && reg->base < virt_end) {
+ if (reg->base == virt_start) {
+ virt_start += reg->size;
+ virt_end += reg->size;
+ continue;
+ } else {
+ new = g_new(RAMRegion, 1);
+ new->base = virt_start;
+ new->size = reg->base - virt_start;
+ virt_start = reg->base + reg->size;
+ }
+
+ if (QLIST_EMPTY(&vms->bootinfo.mem_list)) {
+ QLIST_INSERT_HEAD(&vms->bootinfo.mem_list, new, next);
+ } else {
+ QLIST_INSERT_AFTER(last, new, next);
+ }
+
+ last = new;
+ ram_size -= new->size;
+ virt_end += reg->size;
+ }
+ }
+
+ if (ram_size > 0) {
+ new = g_new(RAMRegion, 1);
+ new->base = virt_start;
+ new->size = ram_size;
+
+ if (QLIST_EMPTY(&vms->bootinfo.mem_list)) {
+ QLIST_INSERT_HEAD(&vms->bootinfo.mem_list, new, next);
+ } else {
+ QLIST_INSERT_AFTER(last, new, next);
+ }
+ }
+}
+
+static void create_ram_alias(VirtMachineState *vms,
+ MemoryRegion *sysmem,
+ MemoryRegion *ram)
+{
+ RAMRegion *reg;
+ MemoryRegion *ram_memory;
+ char *nodename;
+ hwaddr sz = 0;
+
+ QLIST_FOREACH(reg, &vms->bootinfo.mem_list, next) {
+ nodename = g_strdup_printf("ram@%" PRIx64, reg->base);
+ ram_memory = g_new(MemoryRegion, 1);
+ memory_region_init_alias(ram_memory, NULL, nodename, ram, sz,
+ reg->size);
+ memory_region_add_subregion(sysmem, reg->base, ram_memory);
+ sz += reg->size;
+
+ g_free(nodename);
+ }
+}
+
static void virt_ram_memory_region_init(Notifier *notifier, void *data)
{
MachineState *machine = MACHINE(qdev_get_machine());
@@ -1232,10 +1325,15 @@ static void virt_ram_memory_region_init(Notifier *notifier, void *data)
MemoryRegion *ram = g_new(MemoryRegion, 1);
VirtMachineState *vms = container_of(notifier, VirtMachineState,
ram_memory_region_init);
+ RAMRegion *first_mem_reg;
memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
machine->ram_size);
- memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
+ update_memory_regions(vms, machine->ram_size);
+ create_ram_alias(vms, sysmem, ram);
+
+ first_mem_reg = QLIST_FIRST(&vms->bootinfo.mem_list);
+ vms->bootinfo.loader_start = first_mem_reg->base;
}
static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
@@ -1458,7 +1556,6 @@ static void machvirt_init(MachineState *machine)
vms->bootinfo.initrd_filename = machine->initrd_filename;
vms->bootinfo.nb_cpus = smp_cpus;
vms->bootinfo.board_id = -1;
- vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
vms->bootinfo.get_dtb = machvirt_dtb;
vms->bootinfo.firmware_loaded = firmware_loaded;
arm_load_kernel(ARM_CPU(first_cpu), &vms->bootinfo);
diff --git a/include/hw/arm/arm.h b/include/hw/arm/arm.h
index ce769bd..d953026 100644
--- a/include/hw/arm/arm.h
+++ b/include/hw/arm/arm.h
@@ -124,6 +124,7 @@ struct arm_boot_info {
bool secure_board_setup;
arm_endianness endianness;
+ QLIST_HEAD(, RAMRegion) mem_list;
};
/**
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [Qemu-devel] [RFC 0/5] arm: Exclude reserved memory regions of iommu to avoid
2017-11-13 12:29 [Qemu-devel] [RFC 0/5] arm: Exclude reserved memory regions of iommu to avoid zhuyijun
2017-11-13 12:29 ` [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group zhuyijun
2017-11-13 12:29 ` [Qemu-devel] [RFC 3/5] hw/arm: add scattered RAM memory region support zhuyijun
@ 2017-11-13 20:01 ` Auger Eric
2 siblings, 0 replies; 16+ messages in thread
From: Auger Eric @ 2017-11-13 20:01 UTC (permalink / raw)
To: zhuyijun, qemu-arm, qemu-devel
Cc: peter.maydell, shameerali.kolothum.thodi, zhaoshenglong
Hello Zhu,
On 13/11/2017 13:29, zhuyijun@huawei.com wrote:
> From: Zhu Yijun <zhuyijun@huawei.com>
>
> With kernel 4.11, iommu/smmu will populate the MSI IOVA reserved window and
> PCI reserved window which has to be excluded from Guest iova allocations.
>
> And on certain HiSilicon platforms (hip06/hip07), the GIC ITS and PCIe RC
> deviates from the standard implementation will reserve the hw msi regions in
> the smmu-v3 driver which means these address regions will not be translated.
>
> https://www.spinics.net/lists/linux-pci/msg65125.html
>
> On such platforms, reserved memory regions will export like this:
>
> root:~# cat /sys/kernel/iommu_groups/7/reserved_regions
> 0x00000000a8800000 0x00000000affeffff reserved
> 0x00000000c6000000 0x00000000c601ffff msi
>
> However, it falls within the Qemu default virtual memory address space.
>
> Take a look at hw/arm/virt.c:
>
> [VIRT_MEM] = { 0x40000000, RAMLIMIT_BYTES },
> . . .
>
> memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> machine->ram_size);
> memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
>
> So suppose we allocate 4GB mem to a VM, there is a chance that the reserved
> regions will get allocated for a Guest VF DMA iova and it will fail.
>
> This patchset create holes in the Qemu RAM address space which exclude
> the sysfs exported regions.
>
>
> Zhu Yijun (5):
> hw/vfio: Add function for getting reserved_region of device iommu
> group
> hw/arm/virt: Enable dynamic generation of guest RAM memory regions
> hw/arm: add scattered RAM memory region support
> hw/arm/boot: set fdt size cell of memory node from mem_list
> hw/arm/virt-acpi-build: Build srat table according to mem_list
I have not received all the patches of this series. I miss 2, 4, 5.
Looking at
https://lists.nongnu.org/archive/html/qemu-devel/2017-11/threads.html#02198,
I don't see them either.
Thanks
Eric
>
> hw/arm/boot.c | 155 +++++++++++++++++++++++++++++++-----------
> hw/arm/virt-acpi-build.c | 40 +++++++++--
> hw/arm/virt.c | 120 ++++++++++++++++++++++++++++++--
> hw/vfio/common.c | 67 ++++++++++++++++++
> hw/vfio/pci.c | 2 +
> hw/vfio/platform.c | 2 +
> include/exec/memory.h | 7 ++
> include/hw/arm/arm.h | 1 +
> include/hw/arm/virt.h | 1 +
> include/hw/vfio/vfio-common.h | 3 +
> 10 files changed, 347 insertions(+), 51 deletions(-)
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-11-14 1:15 zhuyijun
@ 2017-11-14 1:15 ` zhuyijun
2017-11-14 15:47 ` Alex Williamson
0 siblings, 1 reply; 16+ messages in thread
From: zhuyijun @ 2017-11-14 1:15 UTC (permalink / raw)
To: qemu-arm, qemu-devel, eric.auger
Cc: peter.maydell, shameerali.kolothum.thodi, zhaoshenglong
From: Zhu Yijun <zhuyijun@huawei.com>
With kernel 4.11, iommu/smmu will populate the MSI IOVA reserved window and
PCI reserved window which has to be excluded from Guest iova allocations.
However, if these reserved regions fall within the QEMU default virtual memory
address space, they may get allocated as a guest VF DMA IOVA, and the DMA will
fail.
So fetch those reserved regions in this patch; the following patches in this
series carve matching holes in the QEMU RAM address space.
Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
---
hw/vfio/common.c | 67 +++++++++++++++++++++++++++++++++++++++++++
hw/vfio/pci.c | 2 ++
hw/vfio/platform.c | 2 ++
include/exec/memory.h | 7 +++++
include/hw/vfio/vfio-common.h | 3 ++
5 files changed, 81 insertions(+)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7b2924c..01bdbbd 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,8 @@ struct vfio_group_head vfio_group_list =
QLIST_HEAD_INITIALIZER(vfio_group_list);
struct vfio_as_head vfio_address_spaces =
QLIST_HEAD_INITIALIZER(vfio_address_spaces);
+struct reserved_ram_head reserved_ram_regions =
+ QLIST_HEAD_INITIALIZER(reserved_ram_regions);
#ifdef CONFIG_KVM
/*
@@ -52,6 +54,71 @@ struct vfio_as_head vfio_address_spaces =
static int vfio_kvm_device_fd = -1;
#endif
+void update_reserved_regions(hwaddr addr, hwaddr size)
+{
+ RAMRegion *reg, *new;
+
+ new = g_new(RAMRegion, 1);
+ new->base = addr;
+ new->size = size;
+
+ if (QLIST_EMPTY(&reserved_ram_regions)) {
+ QLIST_INSERT_HEAD(&reserved_ram_regions, new, next);
+ return;
+ }
+
+ /* keep reserved_ram_regions sorted by increasing base address */
+ QLIST_FOREACH(reg, &reserved_ram_regions, next) {
+ if (addr > (reg->base + reg->size - 1)) {
+ if (!QLIST_NEXT(reg, next)) {
+ QLIST_INSERT_AFTER(reg, new, next);
+ break;
+ }
+ continue;
+ } else if (addr >= reg->base && addr <= (reg->base + reg->size - 1)) {
+ /* discard the duplicate entry */
+ if (addr == reg->base && size == reg->size) {
+ g_free(new);
+ break;
+ } else {
+ QLIST_INSERT_AFTER(reg, new, next);
+ break;
+ }
+ } else {
+ QLIST_INSERT_BEFORE(reg, new, next);
+ break;
+ }
+ }
+}
+
+void vfio_get_iommu_group_reserved_region(char *group_path)
+{
+ char *filename;
+ FILE *fp;
+ hwaddr start, end;
+ char str[10];
+ struct stat st;
+
+ filename = g_strdup_printf("%s/iommu_group/reserved_regions", group_path);
+ if (stat(filename, &st) < 0) {
+ g_free(filename);
+ return;
+ }
+
+ fp = fopen(filename, "r");
+ if (!fp) {
+ g_free(filename);
+ return;
+ }
+
+ while (fscanf(fp, "%"SCNx64" %"SCNx64" %9s", &start, &end, str) == 3) {
+ update_reserved_regions(start, (end - start + 1));
+ }
+
+ g_free(filename);
+ fclose(fp);
+}
+
/*
* Common VFIO interrupt disable
*/
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c977ee3..9bffb38 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2674,6 +2674,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
vdev->vbasedev.dev = &vdev->pdev.qdev;
+ vfio_get_iommu_group_reserved_region(vdev->vbasedev.sysfsdev);
+
tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
len = readlink(tmp, group_path, sizeof(group_path));
g_free(tmp);
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index da84abf..d5bbc3a 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -578,6 +578,8 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
return -errno;
}
+ vfio_get_iommu_group_reserved_region(vbasedev->sysfsdev);
+
tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
len = readlink(tmp, group_path, sizeof(group_path));
g_free(tmp);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 5ed4042..2bcc83b 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -46,6 +46,13 @@
OBJECT_GET_CLASS(IOMMUMemoryRegionClass, (obj), \
TYPE_IOMMU_MEMORY_REGION)
+/* Scattered RAM memory region struct */
+typedef struct RAMRegion {
+ hwaddr base;
+ hwaddr size;
+ QLIST_ENTRY(RAMRegion) next;
+} RAMRegion;
+
typedef struct MemoryRegionOps MemoryRegionOps;
typedef struct MemoryRegionMmio MemoryRegionMmio;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f3a2ac9..3483ca6 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -161,10 +161,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
void vfio_put_group(VFIOGroup *group);
int vfio_get_device(VFIOGroup *group, const char *name,
VFIODevice *vbasedev, Error **errp);
+void update_reserved_regions(hwaddr addr, hwaddr size);
+void vfio_get_iommu_group_reserved_region(char *group_path);
extern const MemoryRegionOps vfio_region_ops;
extern QLIST_HEAD(vfio_group_head, VFIOGroup) vfio_group_list;
extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
+extern QLIST_HEAD(reserved_ram_head, RAMRegion) reserved_ram_regions;
#ifdef CONFIG_LINUX
int vfio_get_region_info(VFIODevice *vbasedev, int index,
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-11-14 1:15 ` [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group zhuyijun
@ 2017-11-14 15:47 ` Alex Williamson
2017-11-15 9:49 ` Shameerali Kolothum Thodi
0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2017-11-14 15:47 UTC (permalink / raw)
To: zhuyijun
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell,
shameerali.kolothum.thodi, zhaoshenglong
On Tue, 14 Nov 2017 09:15:50 +0800
<zhuyijun@huawei.com> wrote:
> From: Zhu Yijun <zhuyijun@huawei.com>
>
> With kernel 4.11, iommu/smmu will populate the MSI IOVA reserved window and
> PCI reserved window which has to be excluded from Guest iova allocations.
>
> However, If it falls within the Qemu default virtual memory address space,
> then reserved regions may get allocated for a Guest VF DMA iova and it will
> fail.
>
> So get those reserved regions in this patch and create some holes in the Qemu
> ram address in next patchset.
>
> Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
> ---
> hw/vfio/common.c | 67 +++++++++++++++++++++++++++++++++++++++++++
> hw/vfio/pci.c | 2 ++
> hw/vfio/platform.c | 2 ++
> include/exec/memory.h | 7 +++++
> include/hw/vfio/vfio-common.h | 3 ++
> 5 files changed, 81 insertions(+)
I generally prefer the vfio interface to be more self sufficient, if
there are regions the IOMMU cannot map, we should be describing those
via capabilities on the container through the vfio interface. If we're
just scraping together things from sysfs, the user can just as easily
do that and provide an explicit memory map for the VM taking the
devices into account. Of course having a device associated with a
restricted memory map that conflicts with the default memory map for
the VM implies that you can never support hot-add of such devices.
Please cc me on vfio related patches. Thanks,
Alex
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-11-14 15:47 ` Alex Williamson
@ 2017-11-15 9:49 ` Shameerali Kolothum Thodi
2017-11-15 18:25 ` Alex Williamson
0 siblings, 1 reply; 16+ messages in thread
From: Shameerali Kolothum Thodi @ 2017-11-15 9:49 UTC (permalink / raw)
To: Alex Williamson, Zhuyijun
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, Zhaoshenglong
Hi Alex,
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Tuesday, November 14, 2017 3:48 PM
> To: Zhuyijun <zhuyijun@huawei.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Shameerali Kolothum
> Thodi <shameerali.kolothum.thodi@huawei.com>; Zhaoshenglong
> <zhaoshenglong@huawei.com>
> Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> reserved_region of device iommu group
>
> On Tue, 14 Nov 2017 09:15:50 +0800
> <zhuyijun@huawei.com> wrote:
>
> > From: Zhu Yijun <zhuyijun@huawei.com>
> >
> > With kernel 4.11, iommu/smmu will populate the MSI IOVA reserved
> > window and PCI reserved window which has to be excluded from Guest iova
> allocations.
> >
> > However, If it falls within the Qemu default virtual memory address
> > space, then reserved regions may get allocated for a Guest VF DMA iova
> > and it will fail.
> >
> > So get those reserved regions in this patch and create some holes in
> > the Qemu ram address in next patchset.
> >
> > Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
> > ---
> > hw/vfio/common.c | 67
> +++++++++++++++++++++++++++++++++++++++++++
> > hw/vfio/pci.c | 2 ++
> > hw/vfio/platform.c | 2 ++
> > include/exec/memory.h | 7 +++++
> > include/hw/vfio/vfio-common.h | 3 ++
> > 5 files changed, 81 insertions(+)
>
> I generally prefer the vfio interface to be more self sufficient, if there are
> regions the IOMMU cannot map, we should be describing those via capabilities
> on the container through the vfio interface. If we're just scraping together
> things from sysfs, the user can just as easily do that and provide an explicit
> memory map for the VM taking the devices into account.
Ok. I was under the impression that the purpose of introducing the
/sys/kernel/iommu_groups/reserved_regions was to get the IOVA regions
that are reserved(MSI or non-mappable) for Qemu or other apps to
make use of. I think this was introduced as part of the "KVM/MSI passthrough
support on ARM" patch series. And if I remember correctly, Eric had
an approach where the user space can retrieve all the reserved regions through
the VFIO_IOMMU_GET_INFO ioctl and later this idea was replaced with the
sysfs interface.
Maybe I am missing something here.
> Of course having a
> device associated with a restricted memory map that conflicts with the default
> memory map for the VM implies that you can never support hot-add of such
> devices.
True. Hot-add and migration will have issues on these platforms.
Thanks,
Shameer
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-11-15 9:49 ` Shameerali Kolothum Thodi
@ 2017-11-15 18:25 ` Alex Williamson
2017-11-20 11:58 ` Shameerali Kolothum Thodi
0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2017-11-15 18:25 UTC (permalink / raw)
To: Shameerali Kolothum Thodi
Cc: Zhuyijun, qemu-arm, qemu-devel, eric.auger, peter.maydell, Zhaoshenglong
On Wed, 15 Nov 2017 09:49:41 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
> Hi Alex,
>
> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, November 14, 2017 3:48 PM
> > To: Zhuyijun <zhuyijun@huawei.com>
> > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> > eric.auger@redhat.com; peter.maydell@linaro.org; Shameerali Kolothum
> > Thodi <shameerali.kolothum.thodi@huawei.com>; Zhaoshenglong
> > <zhaoshenglong@huawei.com>
> > Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> > reserved_region of device iommu group
> >
> > On Tue, 14 Nov 2017 09:15:50 +0800
> > <zhuyijun@huawei.com> wrote:
> >
> > > From: Zhu Yijun <zhuyijun@huawei.com>
> > >
> > > With kernel 4.11, iommu/smmu will populate the MSI IOVA reserved
> > > window and PCI reserved window which has to be excluded from Guest iova
> > allocations.
> > >
> > > However, If it falls within the Qemu default virtual memory address
> > > space, then reserved regions may get allocated for a Guest VF DMA iova
> > > and it will fail.
> > >
> > > So get those reserved regions in this patch and create some holes in
> > > the Qemu ram address in next patchset.
> > >
> > > Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
> > > ---
> > > hw/vfio/common.c | 67
> > +++++++++++++++++++++++++++++++++++++++++++
> > > hw/vfio/pci.c | 2 ++
> > > hw/vfio/platform.c | 2 ++
> > > include/exec/memory.h | 7 +++++
> > > include/hw/vfio/vfio-common.h | 3 ++
> > > 5 files changed, 81 insertions(+)
> >
> > I generally prefer the vfio interface to be more self sufficient, if there are
> > regions the IOMMU cannot map, we should be describing those via capabilities
> > on the container through the vfio interface. If we're just scraping together
> > things from sysfs, the user can just as easily do that and provide an explicit
> > memory map for the VM taking the devices into account.
>
> Ok. I was under the impression that the purpose of introducing the
> /sys/kernel/iommu_groups/reserved_regions was to get the IOVA regions
> that are reserved(MSI or non-mappable) for Qemu or other apps to
> make use of. I think this was introduced as part of the "KVM/MSI passthrough
> support on ARM" patch series. And if I remember correctly, Eric had
> an approach where the user space can retrieve all the reserved regions through
> the VFIO_IOMMU_GET_INFO ioctl and later this idea was replaced with the
> sysfs interface.
>
> May be I am missing something here.
And sysfs is a good interface if the user wants to use it to configure
the VM in a way that's compatible with a device. For instance, in your
case, a user could evaluate these reserved regions across all devices in
a system, or even across an entire cluster, and instantiate the VM with
a memory map compatible with hotplugging any of those evaluated
devices (QEMU implementation of allowing the user to do this TBD).
Having the vfio device evaluate these reserved regions only helps in
the cold-plug case. So the proposed solution is limited in scope and
doesn't address similar needs on other platforms. There is value to
verifying that a device's IOVA space is compatible with a VM memory map
and modifying the memory map on cold-plug or rejecting the device on
hot-plug, but isn't that why we have an ioctl within vfio to expose
information about the IOMMU? Why take the path of allowing QEMU to
rummage through sysfs files outside of vfio, implying additional
security and access concerns, rather than filling the gap within the
vifo API? Thanks,
Alex
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-11-15 18:25 ` Alex Williamson
@ 2017-11-20 11:58 ` Shameerali Kolothum Thodi
2017-11-20 15:57 ` Alex Williamson
0 siblings, 1 reply; 16+ messages in thread
From: Shameerali Kolothum Thodi @ 2017-11-20 11:58 UTC (permalink / raw)
To: Alex Williamson, eric.auger
Cc: Zhuyijun, qemu-arm, qemu-devel, peter.maydell, Zhaoshenglong, Linuxarm
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, November 15, 2017 6:25 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Zhuyijun <zhuyijun@huawei.com>; qemu-arm@nongnu.org; qemu-
> devel@nongnu.org; eric.auger@redhat.com; peter.maydell@linaro.org;
> Zhaoshenglong <zhaoshenglong@huawei.com>
> Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> reserved_region of device iommu group
>
> On Wed, 15 Nov 2017 09:49:41 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
>
> > Hi Alex,
> >
> > > -----Original Message-----
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Tuesday, November 14, 2017 3:48 PM
> > > To: Zhuyijun <zhuyijun@huawei.com>
> > > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> > > eric.auger@redhat.com; peter.maydell@linaro.org; Shameerali Kolothum
> > > Thodi <shameerali.kolothum.thodi@huawei.com>; Zhaoshenglong
> > > <zhaoshenglong@huawei.com>
> > > Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> > > reserved_region of device iommu group
> > >
> > > On Tue, 14 Nov 2017 09:15:50 +0800
> > > <zhuyijun@huawei.com> wrote:
> > >
> > > > From: Zhu Yijun <zhuyijun@huawei.com>
> > > >
> > > > With kernel 4.11, iommu/smmu will populate the MSI IOVA reserved
> > > > window and PCI reserved window which has to be excluded from Guest
> iova
> > > allocations.
> > > >
> > > > However, If it falls within the Qemu default virtual memory address
> > > > space, then reserved regions may get allocated for a Guest VF DMA iova
> > > > and it will fail.
> > > >
> > > > So get those reserved regions in this patch and create some holes in
> > > > the Qemu ram address in next patchset.
> > > >
> > > > Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
> > > > ---
> > > > hw/vfio/common.c | 67
> > > +++++++++++++++++++++++++++++++++++++++++++
> > > > hw/vfio/pci.c | 2 ++
> > > > hw/vfio/platform.c | 2 ++
> > > > include/exec/memory.h | 7 +++++
> > > > include/hw/vfio/vfio-common.h | 3 ++
> > > > 5 files changed, 81 insertions(+)
> > >
> > > I generally prefer the vfio interface to be more self sufficient, if there are
> > > regions the IOMMU cannot map, we should be describing those via
> capabilities
> > > on the container through the vfio interface. If we're just scraping together
> > > things from sysfs, the user can just as easily do that and provide an explicit
> > > memory map for the VM taking the devices into account.
> >
> > Ok. I was under the impression that the purpose of introducing the
> > /sys/kernel/iommu_groups/reserved_regions was to get the IOVA regions
> > that are reserved (MSI or non-mappable) for Qemu or other apps to
> > make use of. I think this was introduced as part of the "KVM/MSI passthrough
> > support on ARM" patch series. And if I remember correctly, Eric had
> > an approach where the user space can retrieve all the reserved regions
> through
> > the VFIO_IOMMU_GET_INFO ioctl and later this idea was replaced with the
> > sysfs interface.
> >
> > Maybe I am missing something here.
>
> And sysfs is a good interface if the user wants to use it to configure
> the VM in a way that's compatible with a device. For instance, in your
> case, a user could evaluate these reserved regions across all devices in
> a system, or even across an entire cluster, and instantiate the VM with
> a memory map compatible with hotplugging any of those evaluated
> devices (QEMU implementation of allowing the user to do this TBD).
> Having the vfio device evaluate these reserved regions only helps in
> the cold-plug case. So the proposed solution is limited in scope and
> doesn't address similar needs on other platforms. There is value to
> verifying that a device's IOVA space is compatible with a VM memory map
> and modifying the memory map on cold-plug or rejecting the device on
> hot-plug, but isn't that why we have an ioctl within vfio to expose
> information about the IOMMU? Why take the path of allowing QEMU to
> rummage through sysfs files outside of vfio, implying additional
> security and access concerns, rather than filling the gap within the
> vfio API?
Thanks Alex for the explanation.
I came across this patch[1] from Eric where he introduced the IOCTL interface to
retrieve the reserved regions. It looks like this can be reworked to accommodate
the above requirement.
Hi Eric,
Please let us know if you have any plans to respin this patch or else we can take a
look at this and do the rework if it is Ok.
Thanks,
Shameer
1. https://lists.linuxfoundation.org/pipermail/iommu/2016-November/019002.html
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-11-20 11:58 ` Shameerali Kolothum Thodi
@ 2017-11-20 15:57 ` Alex Williamson
2017-11-20 16:31 ` Shameerali Kolothum Thodi
2017-12-06 10:30 ` Shameerali Kolothum Thodi
0 siblings, 2 replies; 16+ messages in thread
From: Alex Williamson @ 2017-11-20 15:57 UTC (permalink / raw)
To: Shameerali Kolothum Thodi
Cc: eric.auger, Zhuyijun, qemu-arm, qemu-devel, peter.maydell,
Zhaoshenglong, Linuxarm
On Mon, 20 Nov 2017 11:58:43 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, November 15, 2017 6:25 PM
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> > Cc: Zhuyijun <zhuyijun@huawei.com>; qemu-arm@nongnu.org; qemu-
> > devel@nongnu.org; eric.auger@redhat.com; peter.maydell@linaro.org;
> > Zhaoshenglong <zhaoshenglong@huawei.com>
> > Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> > reserved_region of device iommu group
> >
> > On Wed, 15 Nov 2017 09:49:41 +0000
> > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
> >
> > > Hi Alex,
> > >
> > > > -----Original Message-----
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Tuesday, November 14, 2017 3:48 PM
> > > > To: Zhuyijun <zhuyijun@huawei.com>
> > > > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> > > > eric.auger@redhat.com; peter.maydell@linaro.org; Shameerali Kolothum
> > > > Thodi <shameerali.kolothum.thodi@huawei.com>; Zhaoshenglong
> > > > <zhaoshenglong@huawei.com>
> > > > Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> > > > reserved_region of device iommu group
> > > >
> > > > On Tue, 14 Nov 2017 09:15:50 +0800
> > > > <zhuyijun@huawei.com> wrote:
> > > >
> > > > > From: Zhu Yijun <zhuyijun@huawei.com>
> > > > >
> > > > > With kernel 4.11, iommu/smmu will populate the MSI IOVA reserved
> > > > > window and PCI reserved window which has to be excluded from Guest
> > iova
> > > > allocations.
> > > > >
> > > > > However, if it falls within the Qemu default virtual memory address
> > > > > space, then reserved regions may get allocated for a Guest VF DMA iova
> > > > > and it will fail.
> > > > >
> > > > > So get those reserved regions in this patch and create some holes in
> > > > > the Qemu ram address in next patchset.
> > > > >
> > > > > Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
> > > > > ---
> > > > > hw/vfio/common.c | 67
> > > > +++++++++++++++++++++++++++++++++++++++++++
> > > > > hw/vfio/pci.c | 2 ++
> > > > > hw/vfio/platform.c | 2 ++
> > > > > include/exec/memory.h | 7 +++++
> > > > > include/hw/vfio/vfio-common.h | 3 ++
> > > > > 5 files changed, 81 insertions(+)
> > > >
> > > > I generally prefer the vfio interface to be more self sufficient, if there are
> > > > regions the IOMMU cannot map, we should be describing those via
> > capabilities
> > > > on the container through the vfio interface. If we're just scraping together
> > > > things from sysfs, the user can just as easily do that and provide an explicit
> > > > memory map for the VM taking the devices into account.
> > >
> > > Ok. I was under the impression that the purpose of introducing the
> > > /sys/kernel/iommu_groups/reserved_regions was to get the IOVA regions
> > > that are reserved (MSI or non-mappable) for Qemu or other apps to
> > > make use of. I think this was introduced as part of the "KVM/MSI passthrough
> > > support on ARM" patch series. And if I remember correctly, Eric had
> > > an approach where the user space can retrieve all the reserved regions
> > through
> > > the VFIO_IOMMU_GET_INFO ioctl and later this idea was replaced with the
> > > sysfs interface.
> > >
> > > Maybe I am missing something here.
> >
> > And sysfs is a good interface if the user wants to use it to configure
> > the VM in a way that's compatible with a device. For instance, in your
> > case, a user could evaluate these reserved regions across all devices in
> > a system, or even across an entire cluster, and instantiate the VM with
> > a memory map compatible with hotplugging any of those evaluated
> > devices (QEMU implementation of allowing the user to do this TBD).
> > Having the vfio device evaluate these reserved regions only helps in
> > the cold-plug case. So the proposed solution is limited in scope and
> > doesn't address similar needs on other platforms. There is value to
> > verifying that a device's IOVA space is compatible with a VM memory map
> > and modifying the memory map on cold-plug or rejecting the device on
> > hot-plug, but isn't that why we have an ioctl within vfio to expose
> > information about the IOMMU? Why take the path of allowing QEMU to
> > rummage through sysfs files outside of vfio, implying additional
> > security and access concerns, rather than filling the gap within the
> > vfio API?
>
> Thanks Alex for the explanation.
>
> I came across this patch[1] from Eric where he introduced the IOCTL interface to
> retrieve the reserved regions. It looks like this can be reworked to accommodate
> the above requirement.
I don't think we need a new ioctl for this, nor do I think that
describing the holes is the correct approach. The existing
VFIO_IOMMU_GET_INFO ioctl can be extended to support capability chains,
as we've done for VFIO_DEVICE_GET_REGION_INFO. IMO, we should try to
describe the available fixed IOVA regions which are available for
mapping rather than describing various holes within the address space
which are unavailable. The latter method always fails to describe the
end of the mappable IOVA space and gets bogged down in trying to
classify the types of holes that might exist. Thanks,
Alex
* Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-11-20 15:57 ` Alex Williamson
@ 2017-11-20 16:31 ` Shameerali Kolothum Thodi
2017-12-06 10:30 ` Shameerali Kolothum Thodi
1 sibling, 0 replies; 16+ messages in thread
From: Shameerali Kolothum Thodi @ 2017-11-20 16:31 UTC (permalink / raw)
To: Alex Williamson
Cc: eric.auger, Zhuyijun, qemu-arm, qemu-devel, peter.maydell,
Zhaoshenglong, Linuxarm
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Monday, November 20, 2017 3:58 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: eric.auger@redhat.com; Zhuyijun <zhuyijun@huawei.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
> Zhaoshenglong <zhaoshenglong@huawei.com>; Linuxarm
> <linuxarm@huawei.com>
> Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> reserved_region of device iommu group
>
> On Mon, 20 Nov 2017 11:58:43 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
>
> > > -----Original Message-----
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Wednesday, November 15, 2017 6:25 PM
> > > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> > > Cc: Zhuyijun <zhuyijun@huawei.com>; qemu-arm@nongnu.org; qemu-
> > > devel@nongnu.org; eric.auger@redhat.com; peter.maydell@linaro.org;
> > > Zhaoshenglong <zhaoshenglong@huawei.com>
> > > Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> > > reserved_region of device iommu group
> > >
> > > On Wed, 15 Nov 2017 09:49:41 +0000
> > > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> wrote:
> > >
> > > > Hi Alex,
> > > >
> > > > > -----Original Message-----
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Tuesday, November 14, 2017 3:48 PM
> > > > > To: Zhuyijun <zhuyijun@huawei.com>
> > > > > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> > > > > eric.auger@redhat.com; peter.maydell@linaro.org; Shameerali
> Kolothum
> > > > > Thodi <shameerali.kolothum.thodi@huawei.com>; Zhaoshenglong
> > > > > <zhaoshenglong@huawei.com>
> > > > > Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> > > > > reserved_region of device iommu group
> > > > >
> > > > > On Tue, 14 Nov 2017 09:15:50 +0800
> > > > > <zhuyijun@huawei.com> wrote:
> > > > >
> > > > > > From: Zhu Yijun <zhuyijun@huawei.com>
> > > > > >
> > > > > > With kernel 4.11, iommu/smmu will populate the MSI IOVA reserved
> > > > > > window and PCI reserved window which has to be excluded from
> Guest
> > > iova
> > > > > allocations.
> > > > > >
> > > > > > However, if it falls within the Qemu default virtual memory address
> > > > > > space, then reserved regions may get allocated for a Guest VF DMA
> iova
> > > > > > and it will fail.
> > > > > >
> > > > > > So get those reserved regions in this patch and create some holes in
> > > > > > the Qemu ram address in next patchset.
> > > > > >
> > > > > > Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
> > > > > > ---
> > > > > > hw/vfio/common.c | 67
> > > > > +++++++++++++++++++++++++++++++++++++++++++
> > > > > > hw/vfio/pci.c | 2 ++
> > > > > > hw/vfio/platform.c | 2 ++
> > > > > > include/exec/memory.h | 7 +++++
> > > > > > include/hw/vfio/vfio-common.h | 3 ++
> > > > > > 5 files changed, 81 insertions(+)
> > > > >
> > > > > I generally prefer the vfio interface to be more self sufficient, if there
> are
> > > > > regions the IOMMU cannot map, we should be describing those via
> > > capabilities
> > > > > on the container through the vfio interface. If we're just scraping
> together
> > > > > things from sysfs, the user can just as easily do that and provide an
> explicit
> > > > > memory map for the VM taking the devices into account.
> > > >
> > > > Ok. I was under the impression that the purpose of introducing the
> > > > /sys/kernel/iommu_groups/reserved_regions was to get the IOVA regions
> > > > that are reserved (MSI or non-mappable) for Qemu or other apps to
> > > > make use of. I think this was introduced as part of the "KVM/MSI
> passthrough
> > > > support on ARM" patch series. And if I remember correctly, Eric had
> > > > an approach where the user space can retrieve all the reserved regions
> > > through
> > > > the VFIO_IOMMU_GET_INFO ioctl and later this idea was replaced with
> the
> > > > sysfs interface.
> > > >
> > > > Maybe I am missing something here.
> > >
> > > And sysfs is a good interface if the user wants to use it to configure
> > > the VM in a way that's compatible with a device. For instance, in your
> > > case, a user could evaluate these reserved regions across all devices in
> > > a system, or even across an entire cluster, and instantiate the VM with
> > > a memory map compatible with hotplugging any of those evaluated
> > > devices (QEMU implementation of allowing the user to do this TBD).
> > > Having the vfio device evaluate these reserved regions only helps in
> > > the cold-plug case. So the proposed solution is limited in scope and
> > > doesn't address similar needs on other platforms. There is value to
> > > verifying that a device's IOVA space is compatible with a VM memory map
> > > and modifying the memory map on cold-plug or rejecting the device on
> > > hot-plug, but isn't that why we have an ioctl within vfio to expose
> > > information about the IOMMU? Why take the path of allowing QEMU to
> > > rummage through sysfs files outside of vfio, implying additional
> > > security and access concerns, rather than filling the gap within the
> > > vfio API?
> >
> > Thanks Alex for the explanation.
> >
> > I came across this patch[1] from Eric where he introduced the IOCTL
> interface to
> > retrieve the reserved regions. It looks like this can be reworked to
> accommodate
> > the above requirement.
>
> I don't think we need a new ioctl for this, nor do I think that
> describing the holes is the correct approach. The existing
> VFIO_IOMMU_GET_INFO ioctl can be extended to support capability chains,
> as we've done for VFIO_DEVICE_GET_REGION_INFO.
Right, as far as I can see the above-mentioned patch is doing exactly the same:
extending the VFIO_IOMMU_GET_INFO ioctl with a capability chain.
> IMO, we should try to
> describe the available fixed IOVA regions which are available for
> mapping rather than describing various holes within the address space
> which are unavailable. The latter method always fails to describe the
> end of the mappable IOVA space and gets bogged down in trying to
> classify the types of holes that might exist. Thanks,
Ok. I guess that is to take care of the iommu max address width case.
Thanks,
Shameer
* Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-11-20 15:57 ` Alex Williamson
2017-11-20 16:31 ` Shameerali Kolothum Thodi
@ 2017-12-06 10:30 ` Shameerali Kolothum Thodi
2017-12-06 14:01 ` Auger Eric
1 sibling, 1 reply; 16+ messages in thread
From: Shameerali Kolothum Thodi @ 2017-12-06 10:30 UTC (permalink / raw)
To: Alex Williamson
Cc: eric.auger, Zhuyijun, qemu-arm, qemu-devel, peter.maydell,
Zhaoshenglong, Linuxarm
Hi Alex,
> -----Original Message-----
> From: Shameerali Kolothum Thodi
> Sent: Monday, November 20, 2017 4:31 PM
> To: 'Alex Williamson' <alex.williamson@redhat.com>
> Cc: eric.auger@redhat.com; Zhuyijun <zhuyijun@huawei.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
> Zhaoshenglong <zhaoshenglong@huawei.com>; Linuxarm
> <linuxarm@huawei.com>
> Subject: RE: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> reserved_region of device iommu group
[...]
> > > > And sysfs is a good interface if the user wants to use it to
> > > > configure the VM in a way that's compatible with a device. For
> > > > instance, in your case, a user could evaluate these reserved
> > > > regions across all devices in a system, or even across an entire
> > > > cluster, and instantiate the VM with a memory map compatible with
> > > > hotplugging any of those evaluated devices (QEMU implementation of
> allowing the user to do this TBD).
> > > > Having the vfio device evaluate these reserved regions only helps
> > > > in the cold-plug case. So the proposed solution is limited in
> > > > scope and doesn't address similar needs on other platforms. There
> > > > is value to verifying that a device's IOVA space is compatible
> > > > with a VM memory map and modifying the memory map on cold-plug or
> > > > rejecting the device on hot-plug, but isn't that why we have an
> > > > ioctl within vfio to expose information about the IOMMU? Why take
> > > > the path of allowing QEMU to rummage through sysfs files outside
> > > > of vfio, implying additional security and access concerns, rather
> > > > than filling the gap within the vfio API?
> > >
> > > Thanks Alex for the explanation.
> > >
> > > I came across this patch[1] from Eric where he introduced the IOCTL
> > interface to
> > > retrieve the reserved regions. It looks like this can be reworked to
> > accommodate
> > > the above requirement.
> >
> > I don't think we need a new ioctl for this, nor do I think that
> > describing the holes is the correct approach. The existing
> > VFIO_IOMMU_GET_INFO ioctl can be extended to support capability
> > chains, as we've done for VFIO_DEVICE_GET_REGION_INFO.
>
> Right, as far as I can see the above mentioned patch is doing exactly the same,
> extending the VFIO_IOMMU_GET_INFO ioctl with capability chain.
>
> > IMO, we should try to
> > describe the available fixed IOVA regions which are available for
> > mapping rather than describing various holes within the address space
> > which are unavailable. The latter method always fails to describe the
> > end of the mappable IOVA space and gets bogged down in trying to
> > classify the types of holes that might exist. Thanks,
>
I was going through this and noticed that it is possible to have multiple
iommu domains associated with a container. If that's true, is it safe to
make the assumption that all the domains will have the same iova
geometry or not?
Thanks,
Shameer
* Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-12-06 10:30 ` Shameerali Kolothum Thodi
@ 2017-12-06 14:01 ` Auger Eric
2017-12-06 14:38 ` Shameerali Kolothum Thodi
0 siblings, 1 reply; 16+ messages in thread
From: Auger Eric @ 2017-12-06 14:01 UTC (permalink / raw)
To: Shameerali Kolothum Thodi, Alex Williamson
Cc: peter.maydell, qemu-devel, Linuxarm, qemu-arm, Zhaoshenglong, Zhuyijun
Hi Shameer,
On 06/12/17 11:30, Shameerali Kolothum Thodi wrote:
> Hi Alex,
>
>> -----Original Message-----
>> From: Shameerali Kolothum Thodi
>> Sent: Monday, November 20, 2017 4:31 PM
>> To: 'Alex Williamson' <alex.williamson@redhat.com>
>> Cc: eric.auger@redhat.com; Zhuyijun <zhuyijun@huawei.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
>> Zhaoshenglong <zhaoshenglong@huawei.com>; Linuxarm
>> <linuxarm@huawei.com>
>> Subject: RE: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
>> reserved_region of device iommu group
> [...]
>>>>> And sysfs is a good interface if the user wants to use it to
>>>>> configure the VM in a way that's compatible with a device. For
>>>>> instance, in your case, a user could evaluate these reserved
>>>>> regions across all devices in a system, or even across an entire
>>>>> cluster, and instantiate the VM with a memory map compatible with
>>>>> hotplugging any of those evaluated devices (QEMU implementation of
>> allowing the user to do this TBD).
>>>>> Having the vfio device evaluate these reserved regions only helps
>>>>> in the cold-plug case. So the proposed solution is limited in
>>>>> scope and doesn't address similar needs on other platforms. There
>>>>> is value to verifying that a device's IOVA space is compatible
>>>>> with a VM memory map and modifying the memory map on cold-plug or
>>>>> rejecting the device on hot-plug, but isn't that why we have an
>>>>> ioctl within vfio to expose information about the IOMMU? Why take
>>>>> the path of allowing QEMU to rummage through sysfs files outside
>>>>> of vfio, implying additional security and access concerns, rather
>>>>> than filling the gap within the vfio API?
>>>>
>>>> Thanks Alex for the explanation.
>>>>
>>>> I came across this patch[1] from Eric where he introduced the IOCTL
>>> interface to
>>>> retrieve the reserved regions. It looks like this can be reworked to
>>> accommodate
>>>> the above requirement.
>>>
>>> I don't think we need a new ioctl for this, nor do I think that
>>> describing the holes is the correct approach. The existing
>>> VFIO_IOMMU_GET_INFO ioctl can be extended to support capability
>>> chains, as we've done for VFIO_DEVICE_GET_REGION_INFO.
>>
>> Right, as far as I can see the above mentioned patch is doing exactly the same,
>> extending the VFIO_IOMMU_GET_INFO ioctl with capability chain.
>>
>>> IMO, we should try to
>>> describe the available fixed IOVA regions which are available for
>>> mapping rather than describing various holes within the address space
>>> which are unavailable. The latter method always fails to describe the
>>> end of the mappable IOVA space and gets bogged down in trying to
>>> classify the types of holes that might exist. Thanks,
>>
>
> I was going through this and noticed that it is possible to have multiple
> iommu domains associated with a container. If that's true, is it safe to
> make the assumption that all the domains will have the same iova
> geometry or not?
To me the answer is no.
There are several iommu domains "underneath" a single container. You
attach vfio groups to a container. Each of them is associated with an
iommu group and an iommu domain. See vfio_iommu_type1_attach_group().
Besides, the reserved regions are per iommu group.
Thanks
Eric
>
> Thanks,
> Shameer
>
* Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-12-06 14:01 ` Auger Eric
@ 2017-12-06 14:38 ` Shameerali Kolothum Thodi
2017-12-06 14:59 ` Auger Eric
0 siblings, 1 reply; 16+ messages in thread
From: Shameerali Kolothum Thodi @ 2017-12-06 14:38 UTC (permalink / raw)
To: Auger Eric, Alex Williamson
Cc: peter.maydell, qemu-devel, Linuxarm, qemu-arm, Zhaoshenglong, Zhuyijun
Hi Eric,
> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Wednesday, December 06, 2017 2:01 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> Alex Williamson <alex.williamson@redhat.com>
> Cc: peter.maydell@linaro.org; qemu-devel@nongnu.org; Linuxarm
> <linuxarm@huawei.com>; qemu-arm@nongnu.org; Zhaoshenglong
> <zhaoshenglong@huawei.com>; Zhuyijun <zhuyijun@huawei.com>
> Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> reserved_region of device iommu group
>
> Hi Shameer,
>
> On 06/12/17 11:30, Shameerali Kolothum Thodi wrote:
> > Hi Alex,
> >
> >> -----Original Message-----
> >> From: Shameerali Kolothum Thodi
> >> Sent: Monday, November 20, 2017 4:31 PM
> >> To: 'Alex Williamson' <alex.williamson@redhat.com>
> >> Cc: eric.auger@redhat.com; Zhuyijun <zhuyijun@huawei.com>; qemu-
> >> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
> >> Zhaoshenglong <zhaoshenglong@huawei.com>; Linuxarm
> >> <linuxarm@huawei.com>
> >> Subject: RE: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> >> reserved_region of device iommu group
> > [...]
> >>>>> And sysfs is a good interface if the user wants to use it to
> >>>>> configure the VM in a way that's compatible with a device. For
> >>>>> instance, in your case, a user could evaluate these reserved
> >>>>> regions across all devices in a system, or even across an entire
> >>>>> cluster, and instantiate the VM with a memory map compatible with
> >>>>> hotplugging any of those evaluated devices (QEMU implementation of
> >> allowing the user to do this TBD).
> >>>>> Having the vfio device evaluate these reserved regions only helps
> >>>>> in the cold-plug case. So the proposed solution is limited in
> >>>>> scope and doesn't address similar needs on other platforms. There
> >>>>> is value to verifying that a device's IOVA space is compatible
> >>>>> with a VM memory map and modifying the memory map on cold-plug or
> >>>>> rejecting the device on hot-plug, but isn't that why we have an
> >>>>> ioctl within vfio to expose information about the IOMMU? Why take
> >>>>> the path of allowing QEMU to rummage through sysfs files outside
> >>>>> of vfio, implying additional security and access concerns, rather
> >>>>> than filling the gap within the vfio API?
> >>>>
> >>>> Thanks Alex for the explanation.
> >>>>
> >>>> I came across this patch[1] from Eric where he introduced the IOCTL
> >>> interface to
> >>>> retrieve the reserved regions. It looks like this can be reworked to
> >>> accommodate
> >>>> the above requirement.
> >>>
> >>> I don't think we need a new ioctl for this, nor do I think that
> >>> describing the holes is the correct approach. The existing
> >>> VFIO_IOMMU_GET_INFO ioctl can be extended to support capability
> >>> chains, as we've done for VFIO_DEVICE_GET_REGION_INFO.
> >>
> >> Right, as far as I can see the above mentioned patch is doing exactly the
> same,
> >> extending the VFIO_IOMMU_GET_INFO ioctl with capability chain.
> >>
> >>> IMO, we should try to
> >>> describe the available fixed IOVA regions which are available for
> >>> mapping rather than describing various holes within the address space
> >>> which are unavailable. The latter method always fails to describe the
> >>> end of the mappable IOVA space and gets bogged down in trying to
> >>> classify the types of holes that might exist. Thanks,
> >>
> >
> > I was going through this and noticed that it is possible to have multiple
> > iommu domains associated with a container. If that's true, is it safe to
> > make the assumption that all the domains will have the same iova
> > geometry or not?
> To me the answer is no.
>
> There are several iommu domains "underneath" a single container. You
> attach vfio groups to a container. Each of them is associated to an
> iommu group and an iommu domain. See vfio_iommu_type1_attach_group().
>
> Besides, the reserved regions are per iommu group.
>
Thanks for your reply. Yes, a container can have multiple groups (hence multiple
iommu domains) and reserved regions are per group. Hence, while deciding
the default supported iova geometry, we have to go through all the domains
in the container and select the smallest aperture as the supported default iova
range.
Please find below a snippet from a patch I am working on. Please
let me know your thoughts on this.
Thanks,
Shameer
-- >8 --
+static int vfio_build_iommu_iova_caps(struct vfio_iommu *iommu,
+                                      struct vfio_info_cap *caps)
+{
+        struct iommu_resv_region *resv, *resv_next;
+        struct vfio_iommu_iova *iova, *iova_next;
+        struct list_head group_resv_regions, vfio_iova_regions;
+        struct vfio_domain *domain;
+        struct vfio_group *g;
+        phys_addr_t start, end;
+        int ret = 0;
+
+        domain = list_first_entry(&iommu->domain_list,
+                                  struct vfio_domain, next);
+        /* Get the default iova range supported */
+        start = domain->domain->geometry.aperture_start;
+        end = domain->domain->geometry.aperture_end;
This is where I am confused. I think instead I should go over
the domain_list and select the smallest aperture as the default
iova range.
+        INIT_LIST_HEAD(&vfio_iova_regions);
+        vfio_insert_iova(start, end, &vfio_iova_regions);
+
+        /* Get reserved regions if any */
+        INIT_LIST_HEAD(&group_resv_regions);
+        list_for_each_entry(g, &domain->group_list, next)
+                iommu_get_group_resv_regions(g->iommu_group,
+                                             &group_resv_regions);
+        list_sort(NULL, &group_resv_regions, vfio_resv_cmp);
+
+        /* Update iova range excluding reserved regions */
...
-- >8 --
* Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-12-06 14:38 ` Shameerali Kolothum Thodi
@ 2017-12-06 14:59 ` Auger Eric
2017-12-06 15:19 ` Shameerali Kolothum Thodi
0 siblings, 1 reply; 16+ messages in thread
From: Auger Eric @ 2017-12-06 14:59 UTC (permalink / raw)
To: Shameerali Kolothum Thodi, Alex Williamson
Cc: peter.maydell, qemu-devel, Linuxarm, qemu-arm, Zhaoshenglong, Zhuyijun
Hi Shameer,
On 06/12/17 15:38, Shameerali Kolothum Thodi wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Auger Eric [mailto:eric.auger@redhat.com]
>> Sent: Wednesday, December 06, 2017 2:01 PM
>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
>> Alex Williamson <alex.williamson@redhat.com>
>> Cc: peter.maydell@linaro.org; qemu-devel@nongnu.org; Linuxarm
>> <linuxarm@huawei.com>; qemu-arm@nongnu.org; Zhaoshenglong
>> <zhaoshenglong@huawei.com>; Zhuyijun <zhuyijun@huawei.com>
>> Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
>> reserved_region of device iommu group
>>
>> Hi Shameer,
>>
>> On 06/12/17 11:30, Shameerali Kolothum Thodi wrote:
>>> Hi Alex,
>>>
>>>> -----Original Message-----
>>>> From: Shameerali Kolothum Thodi
>>>> Sent: Monday, November 20, 2017 4:31 PM
>>>> To: 'Alex Williamson' <alex.williamson@redhat.com>
>>>> Cc: eric.auger@redhat.com; Zhuyijun <zhuyijun@huawei.com>; qemu-
>>>> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
>>>> Zhaoshenglong <zhaoshenglong@huawei.com>; Linuxarm
>>>> <linuxarm@huawei.com>
>>>> Subject: RE: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
>>>> reserved_region of device iommu group
>>> [...]
>>>>>>> And sysfs is a good interface if the user wants to use it to
>>>>>>> configure the VM in a way that's compatible with a device. For
>>>>>>> instance, in your case, a user could evaluate these reserved
>>>>>>> regions across all devices in a system, or even across an entire
>>>>>>> cluster, and instantiate the VM with a memory map compatible with
>>>>>>> hotplugging any of those evaluated devices (QEMU implementation of
>>>> allowing the user to do this TBD).
>>>>>>> Having the vfio device evaluate these reserved regions only helps
>>>>>>> in the cold-plug case. So the proposed solution is limited in
>>>>>>> scope and doesn't address similar needs on other platforms. There
>>>>>>> is value to verifying that a device's IOVA space is compatible
>>>>>>> with a VM memory map and modifying the memory map on cold-plug or
>>>>>>> rejecting the device on hot-plug, but isn't that why we have an
>>>>>>> ioctl within vfio to expose information about the IOMMU? Why take
>>>>>>> the path of allowing QEMU to rummage through sysfs files outside
>>>>>>> of vfio, implying additional security and access concerns, rather
>>>>>>> than filling the gap within the vfio API?
>>>>>>
>>>>>> Thanks Alex for the explanation.
>>>>>>
>>>>>> I came across this patch[1] from Eric where he introduced the IOCTL
>>>>> interface to
>>>>>> retrieve the reserved regions. It looks like this can be reworked to
>>>>> accommodate
>>>>>> the above requirement.
>>>>>
>>>>> I don't think we need a new ioctl for this, nor do I think that
>>>>> describing the holes is the correct approach. The existing
>>>>> VFIO_IOMMU_GET_INFO ioctl can be extended to support capability
>>>>> chains, as we've done for VFIO_DEVICE_GET_REGION_INFO.
>>>>
>>>> Right, as far as I can see the above mentioned patch is doing exactly the
>> same,
>>>> extending the VFIO_IOMMU_GET_INFO ioctl with capability chain.
>>>>
>>>>> IMO, we should try to
>>>>> describe the available fixed IOVA regions which are available for
>>>>> mapping rather than describing various holes within the address space
>>>>> which are unavailable. The latter method always fails to describe the
>>>>> end of the mappable IOVA space and gets bogged down in trying to
>>>>> classify the types of holes that might exist. Thanks,
>>>>
>>>
>>> I was going through this and noticed that it is possible to have multiple
>>> iommu domains associated with a container. If that's true, is it safe to
>>> make the assumption that all the domains will have the same iova
>>> geometry or not?
>> To me the answer is no.
>>
>> There are several iommu domains "underneath" a single container. You
>> attach vfio groups to a container. Each of them is associated to an
>> iommu group and an iommu domain. See vfio_iommu_type1_attach_group().
>>
>> Besides, the reserved regions are per iommu group.
>>
>
> Thanks for your reply. Yes, a container can have multiple groups (and
> hence multiple iommu domains), and reserved regions are per group. Hence,
> while deciding the default supported iova geometry, we have to go through
> all the domains in the container and select the smallest aperture as the
> supported default iova range.
>
> Please find below snippet from a patch I am working on. Please
> let me know your thoughts on this.
>
> Thanks,
> Shameer
>
> -- >8 --
> +static int vfio_build_iommu_iova_caps(struct vfio_iommu *iommu,
> +                                      struct vfio_info_cap *caps)
> +{
> +        struct iommu_resv_region *resv, *resv_next;
> +        struct vfio_iommu_iova *iova, *iova_next;
> +        struct list_head group_resv_regions, vfio_iova_regions;
> +        struct vfio_domain *domain;
> +        struct vfio_group *g;
> +        phys_addr_t start, end;
> +        int ret = 0;
> +
> +        domain = list_first_entry(&iommu->domain_list,
> +                                  struct vfio_domain, next);
> +        /* Get the default iova range supported */
> +        start = domain->domain->geometry.aperture_start;
> +        end = domain->domain->geometry.aperture_end;
>
> This is where I am confused. I think instead I should go over
> the domain_list and select the smallest aperture as default
> iova range.
Yes, that's correct. I just want to warn you that Pierre is working on the
same topic. It may be worth syncing up together.
[PATCH] vfio/iommu_type1: report the IOMMU aperture info
(https://patchwork.kernel.org/patch/10084655/)
I think he plans to rework his series with capability chain too.
Thanks
Eric
>
> +        INIT_LIST_HEAD(&vfio_iova_regions);
> +        vfio_insert_iova(start, end, &vfio_iova_regions);
> +
> +        /* Get reserved regions if any */
> +        INIT_LIST_HEAD(&group_resv_regions);
> +        list_for_each_entry(g, &domain->group_list, next)
> +                iommu_get_group_resv_regions(g->iommu_group,
> +                                             &group_resv_regions);
> +        list_sort(NULL, &group_resv_regions, vfio_resv_cmp);
> +
> +        /* Update iova range excluding reserved regions */
> ...
> -- >8 --
>
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group
2017-12-06 14:59 ` Auger Eric
@ 2017-12-06 15:19 ` Shameerali Kolothum Thodi
0 siblings, 0 replies; 16+ messages in thread
From: Shameerali Kolothum Thodi @ 2017-12-06 15:19 UTC (permalink / raw)
To: Auger Eric, Alex Williamson
Cc: peter.maydell, qemu-devel, Linuxarm, qemu-arm, Zhaoshenglong, Zhuyijun
> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Wednesday, December 06, 2017 3:00 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> Alex Williamson <alex.williamson@redhat.com>
> Cc: peter.maydell@linaro.org; qemu-devel@nongnu.org; Linuxarm
> <linuxarm@huawei.com>; qemu-arm@nongnu.org; Zhaoshenglong
> <zhaoshenglong@huawei.com>; Zhuyijun <zhuyijun@huawei.com>
> Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> reserved_region of device iommu group
>
> Hi Shameer,
>
> On 06/12/17 15:38, Shameerali Kolothum Thodi wrote:
> > Hi Eric,
> >
> >> -----Original Message-----
> >> From: Auger Eric [mailto:eric.auger@redhat.com]
> >> Sent: Wednesday, December 06, 2017 2:01 PM
> >> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> >> Alex Williamson <alex.williamson@redhat.com>
> >> Cc: peter.maydell@linaro.org; qemu-devel@nongnu.org; Linuxarm
> >> <linuxarm@huawei.com>; qemu-arm@nongnu.org; Zhaoshenglong
> >> <zhaoshenglong@huawei.com>; Zhuyijun <zhuyijun@huawei.com>
> >> Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> >> reserved_region of device iommu group
> >>
> >> Hi Shameer,
> >>
> >> On 06/12/17 11:30, Shameerali Kolothum Thodi wrote:
> >>> Hi Alex,
> >>>
> >>>> -----Original Message-----
> >>>> From: Shameerali Kolothum Thodi
> >>>> Sent: Monday, November 20, 2017 4:31 PM
> >>>> To: 'Alex Williamson' <alex.williamson@redhat.com>
> >>>> Cc: eric.auger@redhat.com; Zhuyijun <zhuyijun@huawei.com>; qemu-
> >>>> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
> >>>> Zhaoshenglong <zhaoshenglong@huawei.com>; Linuxarm
> >>>> <linuxarm@huawei.com>
> >>>> Subject: RE: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> >>>> reserved_region of device iommu group
> >>> [...]
> >>>>>>> And sysfs is a good interface if the user wants to use it to
> >>>>>>> configure the VM in a way that's compatible with a device. For
> >>>>>>> instance, in your case, a user could evaluate these reserved
> >>>>>>> regions across all devices in a system, or even across an entire
> >>>>>>> cluster, and instantiate the VM with a memory map compatible with
> >>>>>>> hotplugging any of those evaluated devices (QEMU implementation of
> >>>> allowing the user to do this TBD).
> >>>>>>> Having the vfio device evaluate these reserved regions only helps
> >>>>>>> in the cold-plug case. So the proposed solution is limited in
> >>>>>>> scope and doesn't address similar needs on other platforms. There
> >>>>>>> is value to verifying that a device's IOVA space is compatible
> >>>>>>> with a VM memory map and modifying the memory map on cold-plug or
> >>>>>>> rejecting the device on hot-plug, but isn't that why we have an
> >>>>>>> ioctl within vfio to expose information about the IOMMU? Why take
> >>>>>>> the path of allowing QEMU to rummage through sysfs files outside
> >>>>>>> of vfio, implying additional security and access concerns, rather
> >>>>>>> than filling the gap within the vfio API?
> >>>>>>
> >>>>>> Thanks Alex for the explanation.
> >>>>>>
> >>>>>> I came across this patch[1] from Eric where he introduced the IOCTL
> >>>>>> interface to retrieve the reserved regions. It looks like this can
> >>>>>> be reworked to accommodate the above requirement.
> >>>>>
> >>>>> I don't think we need a new ioctl for this, nor do I think that
> >>>>> describing the holes is the correct approach. The existing
> >>>>> VFIO_IOMMU_GET_INFO ioctl can be extended to support capability
> >>>>> chains, as we've done for VFIO_DEVICE_GET_REGION_INFO.
> >>>>
> >>>> Right, as far as I can see the above mentioned patch is doing exactly
> >>>> the same,
> >>>> extending the VFIO_IOMMU_GET_INFO ioctl with capability chain.
> >>>>
> >>>>> IMO, we should try to
> >>>>> describe the available fixed IOVA regions which are available for
> >>>>> mapping rather than describing various holes within the address space
> >>>>> which are unavailable. The latter method always fails to describe the
> >>>>> end of the mappable IOVA space and gets bogged down in trying to
> >>>>> classify the types of holes that might exist. Thanks,
> >>>>
> >>>
> >>> I was going through this and noticed that it is possible to have multiple
> >>> iommu domains associated with a container. If that's true, is it safe to
> >>> make the assumption that all the domains will have the same iova
> >>> geometry or not?
> >> To me the answer is no.
> >>
> >> There are several iommu domains "underneath" a single container. You
> >> attach vfio groups to a container. Each of them is associated to an
> >> iommu group and an iommu domain. See vfio_iommu_type1_attach_group().
> >>
> >> Besides, the reserved regions are per iommu group.
> >>
> >
> > Thanks for your reply. Yes, a container can have multiple groups (and
> > hence multiple iommu domains), and reserved regions are per group. Hence,
> > while deciding the default supported iova geometry, we have to go through
> > all the domains in the container and select the smallest aperture as the
> > supported default iova range.
> >
> > Please find below snippet from a patch I am working on. Please
> > let me know your thoughts on this.
> >
> > Thanks,
> > Shameer
> >
> > -- >8 --
> > +static int vfio_build_iommu_iova_caps(struct vfio_iommu *iommu,
> > +                                      struct vfio_info_cap *caps)
> > +{
> > +        struct iommu_resv_region *resv, *resv_next;
> > +        struct vfio_iommu_iova *iova, *iova_next;
> > +        struct list_head group_resv_regions, vfio_iova_regions;
> > +        struct vfio_domain *domain;
> > +        struct vfio_group *g;
> > +        phys_addr_t start, end;
> > +        int ret = 0;
> > +
> > +        domain = list_first_entry(&iommu->domain_list,
> > +                                  struct vfio_domain, next);
> > +        /* Get the default iova range supported */
> > +        start = domain->domain->geometry.aperture_start;
> > +        end = domain->domain->geometry.aperture_end;
> >
> > This is where I am confused. I think instead I should go over
> > the domain_list and select the smallest aperture as default
> > iova range.
> Yes, that's correct. I just want to warn you that Pierre is working on the
> same topic. It may be worth syncing up together.
>
> [PATCH] vfio/iommu_type1: report the IOMMU aperture info
> (https://patchwork.kernel.org/patch/10084655/)
>
> I think he plans to rework his series with capability chain too.
Thanks Eric for the pointer. I will send out my patch as an RFC then and take
it from there (without the default iova aperture changes, as I can see there
are some discussions in Pierre's patch to solve that).
Thanks,
Shameer