* [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions
@ 2018-05-16 15:20 Shameer Kolothum
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 1/6] hw/vfio: Retrieve valid iova ranges from kernel Shameer Kolothum
                   ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: Shameer Kolothum @ 2018-05-16 15:20 UTC (permalink / raw)
  To: qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, eric.auger,
	zhaoshenglong, jonathan.cameron, linuxarm, Shameer Kolothum

When the kernel reports the valid iova ranges as non-contiguous,
memory should be allocated to the guest in such a way that the
reserved regions (holes) are not visible to the guest.

This series retrieves the valid iova ranges using the newly
proposed VFIO kernel interface [1]. If the valid iova ranges are
non-contiguous, it then populates the first 1GB of RAM as a
non-pluggable dimm and the rest as a pc-dimm.

Patch #3 of this series is loosely based on an earlier attempt
by Kwangwoo Lee to add hotplug/pc-dimm support to arm64 [2].

RFC v1 [3] --> RFC v2
 - Based on the new VFIO kernel interface
 - Part of the memory modelled as a pc-dimm
 - Rebased to qemu 2.12.0

[1] https://lkml.org/lkml/2018/4/18/293
[2] https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04600.html
[3] https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg02412.html

Shameer Kolothum (6):
  hw/vfio: Retrieve valid iova ranges from kernel
  hw/arm/virt: Enable dynamic generation of guest RAM memory regions
  hw/arm/virt: Add pc-dimm mem hotplug framework
  hw/arm: Changes required to accommodate non-contiguous DT mem nodes
  hw/arm: ACPI SRAT changes to accommodate non-contiguous mem
  hw/arm: Populate non-contiguous memory regions

 default-configs/aarch64-softmmu.mak |   1 +
 hw/arm/boot.c                       |  91 ++++++---
 hw/arm/virt-acpi-build.c            |  24 ++-
 hw/arm/virt.c                       | 367 +++++++++++++++++++++++++++++++++++-
 hw/vfio/common.c                    | 108 ++++++++++-
 include/hw/arm/arm.h                |  12 ++
 include/hw/arm/virt.h               |   3 +
 include/hw/vfio/vfio-common.h       |   7 +
 linux-headers/linux/vfio.h          |  23 +++
 9 files changed, 589 insertions(+), 47 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Qemu-devel] [RFC v2 1/6] hw/vfio: Retrieve valid iova ranges from kernel
  2018-05-16 15:20 [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Shameer Kolothum
@ 2018-05-16 15:20 ` Shameer Kolothum
  2018-05-28 14:21   ` Auger Eric
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 2/6] hw/arm/virt: Enable dynamic generation of guest RAM memory regions Shameer Kolothum
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Shameer Kolothum @ 2018-05-16 15:20 UTC (permalink / raw)
  To: qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, eric.auger,
	zhaoshenglong, jonathan.cameron, linuxarm, Shameer Kolothum

This makes use of the newly introduced iova cap chains added
to the type1 VFIO_IOMMU_GET_INFO ioctl.

The retrieved iova info is stored in a list for later use.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 hw/vfio/common.c              | 108 +++++++++++++++++++++++++++++++++++++++---
 include/hw/vfio/vfio-common.h |   7 +++
 linux-headers/linux/vfio.h    |  23 +++++++++
 3 files changed, 132 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 07ffa0b..94d7b24 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,8 @@ struct vfio_group_head vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
 struct vfio_as_head vfio_address_spaces =
     QLIST_HEAD_INITIALIZER(vfio_address_spaces);
+struct vfio_iova_head vfio_iova_regions =
+    QLIST_HEAD_INITIALIZER(vfio_iova_regions);
 
 #ifdef CONFIG_KVM
 /*
@@ -1030,6 +1032,85 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
     }
 }
 
+static void vfio_iommu_get_iova_ranges(struct vfio_iommu_type1_info *info)
+{
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_cap_iova_range *cap_iova;
+    VFIOIovaRange *iova, *tmp, *prev = NULL;
+    void *ptr = info;
+    bool found = false;
+    int i;
+
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return;
+    }
+
+    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
+            found = true;
+            break;
+        }
+    }
+
+    if (!found) {
+        return;
+    }
+
+    /* purge the current iova list, if any */
+    QLIST_FOREACH_SAFE(iova, &vfio_iova_regions, next, tmp) {
+        QLIST_REMOVE(iova, next);
+        g_free(iova);
+    }
+
+    cap_iova = container_of(hdr, struct vfio_iommu_type1_info_cap_iova_range,
+                            header);
+
+    /* populate the list */
+    for (i = 0; i < cap_iova->nr_iovas; i++) {
+        iova = g_malloc0(sizeof(*iova));
+        iova->start = cap_iova->iova_ranges[i].start;
+        iova->end = cap_iova->iova_ranges[i].end;
+
+        if (prev) {
+            QLIST_INSERT_AFTER(prev, iova, next);
+        } else {
+            QLIST_INSERT_HEAD(&vfio_iova_regions, iova, next);
+        }
+        prev = iova;
+    }
+}
+
+static int vfio_get_iommu_info(VFIOContainer *container,
+                         struct vfio_iommu_type1_info **info)
+{
+    size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+    *info = g_malloc0(argsz);
+
+retry:
+    (*info)->argsz = argsz;
+
+    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if ((*info)->argsz > argsz) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+        goto retry;
+    }
+
+    vfio_iommu_get_iova_ranges(*info);
+
+    return 0;
+}
+
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
                                   Error **errp)
 {
@@ -1044,6 +1125,15 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
             group->container = container;
             QLIST_INSERT_HEAD(&container->group_list, group, container_next);
             vfio_kvm_device_add_group(group);
+
+            /* New group might change the valid iovas. Get the updated list */
+            if ((container->iommu_type == VFIO_TYPE1_IOMMU) ||
+                (container->iommu_type == VFIO_TYPE1v2_IOMMU)) {
+                struct vfio_iommu_type1_info *info;
+
+                vfio_get_iommu_info(container, &info);
+                g_free(info);
+            }
             return 0;
         }
     }
@@ -1071,7 +1161,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU) ||
         ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU)) {
         bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU);
-        struct vfio_iommu_type1_info info;
+        struct vfio_iommu_type1_info *info;
 
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
@@ -1095,14 +1185,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
          * existing Type1 IOMMUs generally support any IOVA we're
          * going to actually try in practice.
          */
-        info.argsz = sizeof(info);
-        ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
+        ret = vfio_get_iommu_info(container, &info);
         /* Ignore errors */
-        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
-            /* Assume 4k IOVA page size */
-            info.iova_pgsizes = 4096;
+        if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
+            /* Assume 4k IOVA page size; info may be NULL on error */
+            vfio_host_win_add(container, 0, (hwaddr)-1, 4096);
+        } else {
+            vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
         }
-        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
+        g_free(info);
     } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
                ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
         struct vfio_iommu_spapr_tce_info info;
@@ -1256,6 +1346,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
     if (QLIST_EMPTY(&container->group_list)) {
         VFIOAddressSpace *space = container->space;
         VFIOGuestIOMMU *giommu, *tmp;
+        VFIOIovaRange *iova, *next_iova;
 
         QLIST_REMOVE(container, next);
 
@@ -1266,6 +1357,11 @@ static void vfio_disconnect_container(VFIOGroup *group)
             g_free(giommu);
         }
 
+        QLIST_FOREACH_SAFE(iova, &vfio_iova_regions, next, next_iova) {
+            QLIST_REMOVE(iova, next);
+            g_free(iova);
+        }
+
         trace_vfio_disconnect_container(container->fd);
         close(container->fd);
         g_free(container);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index d936014..874fe2c 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -164,6 +164,12 @@ typedef struct VFIODisplay {
     } dmabuf;
 } VFIODisplay;
 
+typedef struct VFIOIovaRange {
+    uint64_t start;
+    uint64_t end;
+    QLIST_ENTRY(VFIOIovaRange) next;
+} VFIOIovaRange;
+
 void vfio_put_base_device(VFIODevice *vbasedev);
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
@@ -187,6 +193,7 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 extern const MemoryRegionOps vfio_region_ops;
 extern QLIST_HEAD(vfio_group_head, VFIOGroup) vfio_group_list;
 extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
+extern QLIST_HEAD(vfio_iova_head, VFIOIovaRange) vfio_iova_regions;
 
 #ifdef CONFIG_LINUX
 int vfio_get_region_info(VFIODevice *vbasedev, int index,
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 3a0a305..117341d 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -589,7 +589,30 @@ struct vfio_iommu_type1_info {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
+#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
 	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
+	__u32   cap_offset;	/* Offset within info struct of first cap */
+};
+
+/*
+ * The IOVA capability allows to report the valid IOVA range(s)
+ * excluding any reserved regions associated with dev group. Any dma
+ * map attempt outside the valid iova range will return error.
+ *
+ * The structures below define version 1 of this capability.
+ */
+#define VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE  1
+
+struct vfio_iova_range {
+	__u64	start;
+	__u64	end;
+};
+
+struct vfio_iommu_type1_info_cap_iova_range {
+	struct vfio_info_cap_header header;
+	__u32	nr_iovas;
+	__u32	reserved;
+	struct vfio_iova_range iova_ranges[];
 };
 
 #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
-- 
2.7.4

* [Qemu-devel] [RFC v2 2/6] hw/arm/virt: Enable dynamic generation of guest RAM memory regions
  2018-05-16 15:20 [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Shameer Kolothum
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 1/6] hw/vfio: Retrieve valid iova ranges from kernel Shameer Kolothum
@ 2018-05-16 15:20 ` Shameer Kolothum
  2018-05-28 14:21   ` Auger Eric
  2018-05-28 16:47   ` Andrew Jones
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 3/6] hw/arm/virt: Add pc-dimm mem hotplug framework Shameer Kolothum
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 34+ messages in thread
From: Shameer Kolothum @ 2018-05-16 15:20 UTC (permalink / raw)
  To: qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, eric.auger,
	zhaoshenglong, jonathan.cameron, linuxarm, Shameer Kolothum

Register a ram_memory_region_init notifier to allocate the guest
RAM memory region from system memory.

Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 hw/arm/virt.c         | 28 ++++++++++++++++++++++------
 include/hw/arm/virt.h |  1 +
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 94dcb12..05fcb62 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1171,6 +1171,19 @@ void virt_machine_done(Notifier *notifier, void *data)
     virt_build_smbios(vms);
 }
 
+static void virt_ram_memory_region_init(Notifier *notifier, void *data)
+{
+    MemoryRegion *sysmem = get_system_memory();
+    MemoryRegion *ram = g_new(MemoryRegion, 1);
+    VirtMachineState *vms = container_of(notifier, VirtMachineState,
+                                         ram_memory_region_init);
+    MachineState *machine = MACHINE(vms);
+
+    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
+                                         machine->ram_size);
+    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
+}
+
 static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
 {
     uint8_t clustersz = ARM_DEFAULT_CPUS_PER_CLUSTER;
@@ -1204,7 +1217,6 @@ static void machvirt_init(MachineState *machine)
     MemoryRegion *sysmem = get_system_memory();
     MemoryRegion *secure_sysmem = NULL;
     int n, virt_max_cpus;
-    MemoryRegion *ram = g_new(MemoryRegion, 1);
     bool firmware_loaded = bios_name || drive_get(IF_PFLASH, 0, 0);
 
     /* We can probe only here because during property set
@@ -1361,10 +1373,6 @@ static void machvirt_init(MachineState *machine)
     fdt_add_timer_nodes(vms);
     fdt_add_cpu_nodes(vms);
 
-    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
-                                         machine->ram_size);
-    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
-
     create_flash(vms, sysmem, secure_sysmem ? secure_sysmem : sysmem);
 
     create_gic(vms, pic);
@@ -1405,15 +1413,23 @@ static void machvirt_init(MachineState *machine)
     vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
     vms->bootinfo.get_dtb = machvirt_dtb;
     vms->bootinfo.firmware_loaded = firmware_loaded;
+
+    /* Register notifiers. They are executed in registration reverse order */
     arm_load_kernel(ARM_CPU(first_cpu), &vms->bootinfo);
 
     /*
      * arm_load_kernel machine init done notifier registration must
      * happen before the platform_bus_create call. In this latter,
      * another notifier is registered which adds platform bus nodes.
-     * Notifiers are executed in registration reverse order.
      */
     create_platform_bus(vms, pic);
+
+    /*
+     * Register memory region notifier last as this has to be executed
+     * first.
+     */
+    vms->ram_memory_region_init.notify = virt_ram_memory_region_init;
+    qemu_add_machine_init_done_notifier(&vms->ram_memory_region_init);
 }
 
 static bool virt_get_secure(Object *obj, Error **errp)
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index ba0c1a4..fc24f3a 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -91,6 +91,7 @@ typedef struct {
 typedef struct {
     MachineState parent;
     Notifier machine_done;
+    Notifier ram_memory_region_init;
     FWCfgState *fw_cfg;
     bool secure;
     bool highmem;
-- 
2.7.4

* [Qemu-devel] [RFC v2 3/6] hw/arm/virt: Add pc-dimm mem hotplug framework
  2018-05-16 15:20 [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Shameer Kolothum
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 1/6] hw/vfio: Retrieve valid iova ranges from kernel Shameer Kolothum
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 2/6] hw/arm/virt: Enable dynamic generation of guest RAM memory regions Shameer Kolothum
@ 2018-05-16 15:20 ` Shameer Kolothum
  2018-05-28 14:21   ` Auger Eric
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 4/6] hw/arm: Changes required to accommodate non-contiguous DT mem nodes Shameer Kolothum
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Shameer Kolothum @ 2018-05-16 15:20 UTC (permalink / raw)
  To: qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, eric.auger,
	zhaoshenglong, jonathan.cameron, linuxarm, Shameer Kolothum

This will be used in subsequent patches to model a chunk of
memory as a pc-dimm (cold plug) if the valid iova regions are
non-contiguous. This is not yet full hotplug support.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 default-configs/aarch64-softmmu.mak |  1 +
 hw/arm/virt.c                       | 82 +++++++++++++++++++++++++++++++++++++
 include/hw/arm/virt.h               |  2 +
 3 files changed, 85 insertions(+)

diff --git a/default-configs/aarch64-softmmu.mak b/default-configs/aarch64-softmmu.mak
index 9ddccf8..7a82ed8 100644
--- a/default-configs/aarch64-softmmu.mak
+++ b/default-configs/aarch64-softmmu.mak
@@ -8,3 +8,4 @@ CONFIG_DDC=y
 CONFIG_DPCD=y
 CONFIG_XLNX_ZYNQMP=y
 CONFIG_XLNX_ZYNQMP_ARM=y
+CONFIG_MEM_HOTPLUG=y
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 05fcb62..be3ad14 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1552,9 +1552,82 @@ static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
     return ms->possible_cpus;
 }
 
+static void virt_dimm_plug(HotplugHandler *hotplug_dev,
+                         DeviceState *dev, Error **errp)
+{
+    VirtMachineState *vms = VIRT_MACHINE(hotplug_dev);
+    PCDIMMDevice *dimm = PC_DIMM(dev);
+    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
+    MemoryRegion *mr;
+    uint64_t align;
+    Error *local_err = NULL;
+
+    mr = ddc->get_memory_region(dimm, &local_err);
+    if (local_err) {
+        goto out;
+    }
+
+    align = memory_region_get_alignment(mr);
+    pc_dimm_memory_plug(dev, &vms->hotplug_memory, mr, align, &local_err);
+    if (local_err) {
+        goto out;
+    }
+
+out:
+    error_propagate(errp, local_err);
+}
+
+static void virt_dimm_unplug(HotplugHandler *hotplug_dev,
+                           DeviceState *dev, Error **errp)
+{
+    VirtMachineState *vms = VIRT_MACHINE(hotplug_dev);
+    PCDIMMDevice *dimm = PC_DIMM(dev);
+    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
+    MemoryRegion *mr;
+    Error *local_err = NULL;
+
+    mr = ddc->get_memory_region(dimm, &local_err);
+    pc_dimm_memory_unplug(dev, &vms->hotplug_memory, mr);
+    object_unparent(OBJECT(dev));
+
+    error_propagate(errp, local_err);
+}
+
+static void virt_machinedevice_plug_cb(HotplugHandler *hotplug_dev,
+                                      DeviceState *dev, Error **errp)
+{
+    if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
+        virt_dimm_plug(hotplug_dev, dev, errp);
+    } else {
+        error_setg(errp, "device plug request for unsupported device"
+                   " type: %s", object_get_typename(OBJECT(dev)));
+    }
+}
+
+static void virt_machinedevice_unplug_cb(HotplugHandler *hotplug_dev,
+                                        DeviceState *dev, Error **errp)
+{
+    if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
+        virt_dimm_unplug(hotplug_dev, dev, errp);
+    } else {
+        error_setg(errp, "device unplug request for unsupported device"
+                   " type: %s", object_get_typename(OBJECT(dev)));
+    }
+}
+
+static HotplugHandler *virt_get_hotplug_handler(MachineState *machine,
+                                             DeviceState *dev)
+{
+    if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
+        return HOTPLUG_HANDLER(machine);
+    }
+    return NULL;
+}
+
 static void virt_machine_class_init(ObjectClass *oc, void *data)
 {
     MachineClass *mc = MACHINE_CLASS(oc);
+    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(oc);
 
     mc->init = machvirt_init;
     /* Start max_cpus at the maximum QEMU supports. We'll further restrict
@@ -1573,6 +1646,11 @@ static void virt_machine_class_init(ObjectClass *oc, void *data)
     mc->cpu_index_to_instance_props = virt_cpu_index_to_props;
     mc->default_cpu_type = ARM_CPU_TYPE_NAME("cortex-a15");
     mc->get_default_cpu_node_id = virt_get_default_cpu_node_id;
+
+    mc->get_hotplug_handler = virt_get_hotplug_handler;
+    hc->plug = virt_machinedevice_plug_cb;
+    hc->unplug = virt_machinedevice_unplug_cb;
+
 }
 
 static const TypeInfo virt_machine_info = {
@@ -1582,6 +1660,10 @@ static const TypeInfo virt_machine_info = {
     .instance_size = sizeof(VirtMachineState),
     .class_size    = sizeof(VirtMachineClass),
     .class_init    = virt_machine_class_init,
+    .interfaces = (InterfaceInfo[]) {
+         { TYPE_HOTPLUG_HANDLER },
+         { }
+    },
 };
 
 static void machvirt_machine_init(void)
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index fc24f3a..a39f29e 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -35,6 +35,7 @@
 #include "qemu/notify.h"
 #include "hw/boards.h"
 #include "hw/arm/arm.h"
+#include "hw/mem/pc-dimm.h"
 
 #define NUM_GICV2M_SPIS       64
 #define NUM_VIRTIO_TRANSPORTS 32
@@ -108,6 +109,7 @@ typedef struct {
     uint32_t gic_phandle;
     uint32_t msi_phandle;
     int psci_conduit;
+    MemoryHotplugState hotplug_memory;
 } VirtMachineState;
 
 #define TYPE_VIRT_MACHINE   MACHINE_TYPE_NAME("virt")
-- 
2.7.4

* [Qemu-devel] [RFC v2 4/6] hw/arm: Changes required to accommodate non-contiguous DT mem nodes
  2018-05-16 15:20 [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Shameer Kolothum
                   ` (2 preceding siblings ...)
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 3/6] hw/arm/virt: Add pc-dimm mem hotplug framework Shameer Kolothum
@ 2018-05-16 15:20 ` Shameer Kolothum
  2018-05-28 14:21   ` Auger Eric
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to accommodate non-contiguous mem Shameer Kolothum
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Shameer Kolothum @ 2018-05-16 15:20 UTC (permalink / raw)
  To: qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, eric.auger,
	zhaoshenglong, jonathan.cameron, linuxarm, Shameer Kolothum

This changes the DT memory node creation so that it is easier to
later add non-contiguous memory modeled as a non-pluggable region
plus a pc-dimm region.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 hw/arm/boot.c        | 91 ++++++++++++++++++++++++++++++++++++----------------
 include/hw/arm/arm.h | 12 +++++++
 2 files changed, 75 insertions(+), 28 deletions(-)

diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 26184bc..73db0aa 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -486,6 +486,27 @@ static void fdt_add_psci_node(void *fdt)
     qemu_fdt_setprop_cell(fdt, "/psci", "migrate", migrate_fn);
 }
 
+static char *create_memory_fdt(void *fdt, uint32_t acells, hwaddr mem_base,
+                                          uint32_t scells, hwaddr mem_len)
+{
+    char *nodename = NULL;
+    int rc;
+
+    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
+    qemu_fdt_add_subnode(fdt, nodename);
+    qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
+    rc = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,
+                                      scells, mem_len);
+    if (rc < 0) {
+        fprintf(stderr, "couldn't set %s/reg\n", nodename);
+        g_free(nodename);
+        return NULL;
+    }
+
+    return nodename;
+}
+
+
 /**
  * load_dtb() - load a device tree binary image into memory
  * @addr:       the address to load the image at
@@ -567,50 +588,64 @@ static int load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
         goto fail;
     }
 
+    /*
+     * Turn the /memory node created before into a NOP node, then create
+     * /memory@addr nodes for all numa nodes respectively.
+     */
+    qemu_fdt_nop_node(fdt, "/memory");
+
     if (nb_numa_nodes > 0) {
-        /*
-         * Turn the /memory node created before into a NOP node, then create
-         * /memory@addr nodes for all numa nodes respectively.
-         */
-        qemu_fdt_nop_node(fdt, "/memory");
+        hwaddr mem_sz;
+
         mem_base = binfo->loader_start;
+        mem_sz = binfo->ram_size;
         for (i = 0; i < nb_numa_nodes; i++) {
-            mem_len = numa_info[i].node_mem;
-            nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
-            qemu_fdt_add_subnode(fdt, nodename);
-            qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
-            rc = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg",
-                                              acells, mem_base,
+            mem_len = MIN(numa_info[i].node_mem, mem_sz);
+
+            nodename = create_memory_fdt(fdt, acells, mem_base,
                                               scells, mem_len);
-            if (rc < 0) {
-                fprintf(stderr, "couldn't set %s/reg for node %d\n", nodename,
-                        i);
+            if (!nodename) {
                 goto fail;
             }
 
             qemu_fdt_setprop_cell(fdt, nodename, "numa-node-id", i);
-            mem_base += mem_len;
             g_free(nodename);
+            mem_base += mem_len;
+            mem_sz -= mem_len;
+            if (!mem_sz) {
+                break;
+            }
         }
-    } else {
-        Error *err = NULL;
 
-        rc = fdt_path_offset(fdt, "/memory");
-        if (rc < 0) {
-            qemu_fdt_add_subnode(fdt, "/memory");
-        }
+        /* Create the node for initial pc-dimm ram, if any */
+        if (binfo->dimm_mem) {
 
-        if (!qemu_fdt_getprop(fdt, "/memory", "device_type", NULL, &err)) {
-            qemu_fdt_setprop_string(fdt, "/memory", "device_type", "memory");
+            nodename = create_memory_fdt(fdt, acells, binfo->dimm_mem->base,
+                                              scells, binfo->dimm_mem->size);
+            if (!nodename) {
+                goto fail;
+            }
+            qemu_fdt_setprop_cell(fdt, nodename, "numa-node-id",
+                                                 binfo->dimm_mem->node);
+            g_free(nodename);
         }
 
-        rc = qemu_fdt_setprop_sized_cells(fdt, "/memory", "reg",
-                                          acells, binfo->loader_start,
-                                          scells, binfo->ram_size);
-        if (rc < 0) {
-            fprintf(stderr, "couldn't set /memory/reg\n");
+    } else {
+
+        nodename = create_memory_fdt(fdt, acells, binfo->loader_start,
+                                         scells, binfo->ram_size);
+        if (!nodename) {
             goto fail;
         }
+
+        if (binfo->dimm_mem) {
+            nodename = create_memory_fdt(fdt, acells, binfo->dimm_mem->base,
+                                              scells, binfo->dimm_mem->size);
+            if (!nodename) {
+                goto fail;
+            }
+            g_free(nodename);
+        }
     }
 
     rc = fdt_path_offset(fdt, "/chosen");
diff --git a/include/hw/arm/arm.h b/include/hw/arm/arm.h
index ce769bd..0ee3b4e 100644
--- a/include/hw/arm/arm.h
+++ b/include/hw/arm/arm.h
@@ -48,6 +48,12 @@ typedef struct {
     ARMCPU *cpu; /* handle to the first cpu object */
 } ArmLoadKernelNotifier;
 
+struct dimm_mem_info {
+    int node;
+    hwaddr base;
+    hwaddr size;
+};
+
 /* arm_boot.c */
 struct arm_boot_info {
     uint64_t ram_size;
@@ -124,6 +130,12 @@ struct arm_boot_info {
     bool secure_board_setup;
 
     arm_endianness endianness;
+
+    /* This is used to model a pc-dimm based mem if the valid iova region
+     * is non-contiguous.
+     */
+    struct dimm_mem_info *dimm_mem;
+
 };
 
 /**
-- 
2.7.4

* [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to accommodate non-contiguous mem
  2018-05-16 15:20 [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Shameer Kolothum
                   ` (3 preceding siblings ...)
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 4/6] hw/arm: Changes required to accommodate non-contiguous DT mem nodes Shameer Kolothum
@ 2018-05-16 15:20 ` Shameer Kolothum
  2018-05-28 14:21   ` Auger Eric
  2018-05-28 17:02   ` Andrew Jones
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions Shameer Kolothum
  2018-05-28 14:22 ` [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Auger Eric
  6 siblings, 2 replies; 34+ messages in thread
From: Shameer Kolothum @ 2018-05-16 15:20 UTC (permalink / raw)
  To: qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, eric.auger,
	zhaoshenglong, jonathan.cameron, linuxarm, Shameer Kolothum

This is in preparation for the next patch, where the initial RAM is
split into a non-pluggable chunk and a pc-dimm modeled memory if the
valid iova regions are non-contiguous.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 hw/arm/virt-acpi-build.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index c7c6a57..8d17b40 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -488,7 +488,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     AcpiSratProcessorGiccAffinity *core;
     AcpiSratMemoryAffinity *numamem;
     int i, srat_start;
-    uint64_t mem_base;
+    uint64_t mem_base, mem_sz, mem_len;
     MachineClass *mc = MACHINE_GET_CLASS(vms);
     const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(MACHINE(vms));
 
@@ -505,12 +505,28 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
         core->flags = cpu_to_le32(1);
     }
 
-    mem_base = vms->memmap[VIRT_MEM].base;
+    mem_base = vms->bootinfo.loader_start;
+    mem_sz = vms->bootinfo.ram_size;
     for (i = 0; i < nb_numa_nodes; ++i) {
         numamem = acpi_data_push(table_data, sizeof(*numamem));
-        build_srat_memory(numamem, mem_base, numa_info[i].node_mem, i,
+        mem_len = MIN(numa_info[i].node_mem, mem_sz);
+        build_srat_memory(numamem, mem_base, mem_len, i,
                           MEM_AFFINITY_ENABLED);
-        mem_base += numa_info[i].node_mem;
+        mem_base += mem_len;
+        mem_sz -= mem_len;
+        if (!mem_sz) {
+            break;
+        }
+    }
+
+    /* Create table for initial pc-dimm ram, if any */
+    if (vms->bootinfo.dimm_mem) {
+        numamem = acpi_data_push(table_data, sizeof(*numamem));
+        build_srat_memory(numamem, vms->bootinfo.dimm_mem->base,
+                          vms->bootinfo.dimm_mem->size,
+                          vms->bootinfo.dimm_mem->node,
+                          MEM_AFFINITY_ENABLED);
     }
 
     build_header(linker, table_data, (void *)(table_data->data + srat_start),
-- 
2.7.4

* [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
  2018-05-16 15:20 [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Shameer Kolothum
                   ` (4 preceding siblings ...)
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to accommodate non-contiguous mem Shameer Kolothum
@ 2018-05-16 15:20 ` Shameer Kolothum
  2018-05-28 14:21   ` Auger Eric
  2018-05-28 14:22 ` [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Auger Eric
  6 siblings, 1 reply; 34+ messages in thread
From: Shameer Kolothum @ 2018-05-16 15:20 UTC (permalink / raw)
  To: qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, eric.auger,
	zhaoshenglong, jonathan.cameron, linuxarm, Shameer Kolothum

In case the valid iova regions are non-contiguous, split the
RAM into a 1GB non-pluggable dimm and model the remainder
as a single pc-dimm.
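The resulting split can be sketched as follows (illustrative Python, not part of the patch; the helper names and sample iova ranges are assumptions, and the sketch always splits, whereas the patch keeps RAM unsplit when a single valid range fits):

```python
GB = 1 << 30

def round_up(x, align):
    return (x + align - 1) // align * align

def find_chunk(ranges, req_size, align=GB):
    """Return a 1GB-aligned base inside the first valid range that fits
    req_size, trimming the range so it is not reused (simplified version
    of the patch's find_memory_chunk)."""
    for i, (start, end) in enumerate(ranges):
        base = round_up(start, align)
        if base <= end and req_size <= end - base + 1:
            ranges[i] = (base + req_size, end)  # consume the chunk
            return base
    return None

def split_ram(ranges, ram_size):
    """First 1GB as non-pluggable RAM, the remainder as one pc-dimm."""
    boot_size = min(ram_size, GB)
    boot_base = find_chunk(ranges, boot_size)
    dimm_size = ram_size - boot_size
    dimm_base = find_chunk(ranges, dimm_size) if dimm_size else None
    return boot_base, boot_size, dimm_base, dimm_size

# Two valid iova ranges with a reserved hole in between (made-up values)
print(split_ram([(GB, 2 * GB - 1), (8 * GB, 16 * GB - 1)], 4 * GB))
```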

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 hw/arm/virt.c | 261 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 256 insertions(+), 5 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index be3ad14..562e389 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -58,6 +58,12 @@
 #include "hw/smbios/smbios.h"
 #include "qapi/visitor.h"
 #include "standard-headers/linux/input.h"
+#include "hw/vfio/vfio-common.h"
+#include "qemu/config-file.h"
+#include "monitor/qdev.h"
+#include "qom/object_interfaces.h"
+#include "qapi/qmp/qdict.h"
+#include "qemu/option.h"
 
 #define DEFINE_VIRT_MACHINE_LATEST(major, minor, latest) \
     static void virt_##major##_##minor##_class_init(ObjectClass *oc, \
@@ -110,7 +116,10 @@ static ARMPlatformBusSystemParams platform_bus_params;
  * terabyte of physical address space.)
  */
 #define RAMLIMIT_GB 255
-#define RAMLIMIT_BYTES (RAMLIMIT_GB * 1024ULL * 1024 * 1024)
+#define SZ_1G (1024ULL * 1024 * 1024)
+#define RAMLIMIT_BYTES (RAMLIMIT_GB * SZ_1G)
+
+#define ALIGN_1G (1ULL << 30)
 
 /* Addresses and sizes of our components.
  * 0..128MB is space for a flash device so we can run bootrom code such as UEFI.
@@ -1171,6 +1180,236 @@ void virt_machine_done(Notifier *notifier, void *data)
     virt_build_smbios(vms);
 }
 
+static void free_iova_copy(struct vfio_iova_head *iova_copy)
+{
+    VFIOIovaRange *iova, *tmp;
+
+    QLIST_FOREACH_SAFE(iova, iova_copy, next, tmp) {
+        QLIST_REMOVE(iova, next);
+        g_free(iova);
+    }
+}
+
+static void get_iova_copy(struct vfio_iova_head *iova_copy)
+{
+    VFIOIovaRange *iova, *new, *prev_iova = NULL;
+
+    QLIST_FOREACH(iova, &vfio_iova_regions, next) {
+        new = g_malloc0(sizeof(*iova));
+        new->start = iova->start;
+        new->end = iova->end;
+
+        if (prev_iova) {
+            QLIST_INSERT_AFTER(prev_iova, new, next);
+        } else {
+            QLIST_INSERT_HEAD(iova_copy, new, next);
+        }
+        prev_iova = new;
+    }
+}
+
+static hwaddr find_memory_chunk(VirtMachineState *vms,
+                   struct vfio_iova_head *iova_copy,
+                   hwaddr req_size, bool pcdimm)
+{
+    VFIOIovaRange *iova, *tmp;
+    hwaddr new_start, new_size, sz_align;
+    hwaddr virt_start = vms->memmap[VIRT_MEM].base;
+    hwaddr addr_align = ALIGN_1G; /* Set to max ARM64 hugepage size */
+
+    /* Size alignment */
+    sz_align = (pcdimm) ? MAX(TARGET_PAGE_SIZE, QEMU_VMALLOC_ALIGN) :
+                                                      TARGET_PAGE_SIZE;
+
+    QLIST_FOREACH_SAFE(iova, iova_copy, next, tmp) {
+        if (virt_start >= iova->end) {
+            continue;
+        }
+
+        /* Align addr */
+        new_start = ROUND_UP(MAX(virt_start, iova->start), addr_align);
+        if (new_start >= iova->end) {
+            continue;
+        }
+
+        if (req_size > iova->end - new_start + 1) {
+            continue;
+        }
+
+        /*
+         * Check the region can hold any size alignment requirement.
+         */
+        new_size = QEMU_ALIGN_UP(req_size, sz_align);
+
+        if ((new_start + new_size - 1 > iova->end) ||
+                 (new_start + new_size >= virt_start + RAMLIMIT_BYTES)) {
+            continue;
+        }
+
+        /*
+         * Modify the iova list entry for non pc-dimm case so that it
+         * is not used again for pc-dimm allocation.
+         */
+        if (!pcdimm) {
+            if (new_size - req_size) {
+                iova->start = new_start + new_size;
+            } else {
+                QLIST_REMOVE(iova, next);
+            }
+        }
+        return new_start;
+    }
+
+    return -1;
+}
+
+static void update_memory_regions(VirtMachineState *vms)
+{
+    hwaddr addr;
+    VFIOIovaRange *iova;
+    MachineState *machine = MACHINE(vms);
+    hwaddr virt_start = vms->memmap[VIRT_MEM].base;
+    hwaddr req_size, ram_size = machine->ram_size;
+    struct vfio_iova_head iova_copy = QLIST_HEAD_INITIALIZER(iova_copy);
+
+    /* No valid iova regions, use default */
+    if (QLIST_EMPTY(&vfio_iova_regions)) {
+        vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
+        vms->bootinfo.ram_size = ram_size;
+        return;
+    }
+
+    /*
+     * If the valid iova list has only one entry, check that the req size
+     * fits in it and that the loader start can be < 4GB. This makes sure
+     * platforms with no holes in mem keep the same mem model as before.
+     */
+    req_size = ram_size;
+    iova = QLIST_NEXT(QLIST_FIRST(&vfio_iova_regions), next);
+    if (!iova) {
+        iova = QLIST_FIRST(&vfio_iova_regions);
+        addr = ROUND_UP(MAX(virt_start, iova->start), ALIGN_1G);
+        if ((addr < 4 * SZ_1G) && (ram_size <= iova->end - addr + 1) &&
+                  (addr + ram_size < virt_start + RAMLIMIT_BYTES)) {
+            vms->bootinfo.loader_start = addr;
+            vms->bootinfo.ram_size = ram_size;
+            return;
+        }
+    }
+
+    /* Get a copy of valid iovas and work on it */
+    get_iova_copy(&iova_copy);
+
+    /* Split the mem as first 1GB non-pluggable and rest as pc-dimm */
+    req_size = MIN(ram_size, SZ_1G);
+    addr = find_memory_chunk(vms, &iova_copy, req_size, false);
+    if (addr == -1 || addr >= 4 * SZ_1G) {
+        goto out;
+    }
+
+    /*Update non-pluggable mem details */
+    machine->ram_size = req_size;
+    vms->bootinfo.loader_start = addr;
+    vms->bootinfo.ram_size = machine->ram_size;
+
+    req_size = ram_size - req_size;
+    if (!req_size) {
+        goto done;
+    }
+
+    /* Remaining memory is modeled as a pc-dimm. */
+    addr = find_memory_chunk(vms, &iova_copy, req_size, true);
+    if (addr == -1) {
+        goto out;
+    }
+
+    /*Update pc-dimm mem details */
+    vms->bootinfo.dimm_mem = g_new(struct dimm_mem_info, 1);
+    vms->bootinfo.dimm_mem->base = addr;
+    vms->bootinfo.dimm_mem->size = req_size;
+    machine->maxram_size = machine->ram_size + req_size;
+    machine->ram_slots += 1;
+
+done:
+    free_iova_copy(&iova_copy);
+    return;
+
+out:
+    free_iova_copy(&iova_copy);
+    error_report("mach-virt: Not enough contiguous memory to model ram");
+    exit(1);
+}
+
+static void create_pcdimms(VirtMachineState *vms,
+                             MemoryRegion *sysmem,
+                             MemoryRegion *ram)
+{
+    hwaddr addr, size;
+    Error *local_err = NULL;
+    QDict *qdict;
+    QemuOpts *opts;
+    char *tmp;
+
+    if (!vms->bootinfo.dimm_mem) {
+        return;
+    }
+
+    addr = vms->bootinfo.dimm_mem->base;
+    size = vms->bootinfo.dimm_mem->size;
+
+    /*Create hotplug address space */
+    vms->hotplug_memory.base = ROUND_UP(addr, ALIGN_1G);
+    size = ROUND_UP(size, MAX(TARGET_PAGE_SIZE, QEMU_VMALLOC_ALIGN));
+
+    memory_region_init(&vms->hotplug_memory.mr, OBJECT(vms),
+                                      "hotplug-memory", size);
+    memory_region_add_subregion(sysmem, vms->hotplug_memory.base,
+                                        &vms->hotplug_memory.mr);
+    /* Create backend mem object */
+    qdict = qdict_new();
+    qdict_put_str(qdict, "qom-type", "memory-backend-ram");
+    qdict_put_str(qdict, "id", "mem1");
+    tmp = g_strdup_printf("%"PRIu64 "M", size / (1024 * 1024));
+    qdict_put_str(qdict, "size", tmp);
+    g_free(tmp);
+
+    opts = qemu_opts_from_qdict(qemu_find_opts("object"), qdict, &local_err);
+    if (local_err) {
+        goto err;
+    }
+
+    user_creatable_add_opts(opts, &local_err);
+    qemu_opts_del(opts);
+    QDECREF(qdict);
+    if (local_err) {
+        goto err;
+    }
+
+    /* Create pc-dimm dev*/
+    qdict = qdict_new();
+    qdict_put_str(qdict, "driver", "pc-dimm");
+    qdict_put_str(qdict, "id", "dimm1");
+    qdict_put_str(qdict, "memdev", "mem1");
+
+    opts = qemu_opts_from_qdict(qemu_find_opts("device"), qdict, &local_err);
+    if (local_err) {
+        goto err;
+    }
+
+    qdev_device_add(opts, &local_err);
+    qemu_opts_del(opts);
+    QDECREF(qdict);
+    if (local_err) {
+        goto err;
+    }
+
+    return;
+
+err:
+    error_report_err(local_err);
+    exit(1);
+}
+
 static void virt_ram_memory_region_init(Notifier *notifier, void *data)
 {
     MemoryRegion *sysmem = get_system_memory();
@@ -1179,9 +1418,14 @@ static void virt_ram_memory_region_init(Notifier *notifier, void *data)
                                          ram_memory_region_init);
     MachineState *machine = MACHINE(vms);
 
+    update_memory_regions(vms);
     memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
                                          machine->ram_size);
-    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
+    memory_region_add_subregion(sysmem, vms->bootinfo.loader_start, ram);
+
+    if (vms->bootinfo.dimm_mem) {
+        create_pcdimms(vms, sysmem, ram);
+    }
 }
 
 static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
@@ -1404,13 +1648,11 @@ static void machvirt_init(MachineState *machine)
     vms->machine_done.notify = virt_machine_done;
     qemu_add_machine_init_done_notifier(&vms->machine_done);
 
-    vms->bootinfo.ram_size = machine->ram_size;
     vms->bootinfo.kernel_filename = machine->kernel_filename;
     vms->bootinfo.kernel_cmdline = machine->kernel_cmdline;
     vms->bootinfo.initrd_filename = machine->initrd_filename;
     vms->bootinfo.nb_cpus = smp_cpus;
     vms->bootinfo.board_id = -1;
-    vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
     vms->bootinfo.get_dtb = machvirt_dtb;
     vms->bootinfo.firmware_loaded = firmware_loaded;
 
@@ -1559,7 +1801,7 @@ static void virt_dimm_plug(HotplugHandler *hotplug_dev,
     PCDIMMDevice *dimm = PC_DIMM(dev);
     PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
     MemoryRegion *mr;
-    uint64_t align;
+    uint64_t align, addr;
     Error *local_err = NULL;
 
     mr = ddc->get_memory_region(dimm, &local_err);
@@ -1573,6 +1815,15 @@ static void virt_dimm_plug(HotplugHandler *hotplug_dev,
         goto out;
     }
 
+    addr = object_property_get_uint(OBJECT(dimm), PC_DIMM_ADDR_PROP,
+                                                       &error_fatal);
+    /* Assign the node for pc-dimm initial ram */
+    if (vms->bootinfo.dimm_mem && (addr == vms->bootinfo.dimm_mem->base)
+                                                 && (nb_numa_nodes > 0)) {
+        vms->bootinfo.dimm_mem->node = object_property_get_uint(OBJECT(dev),
+                                           PC_DIMM_NODE_PROP, &error_fatal);
+    }
+
 out:
     error_propagate(errp, local_err);
 }
-- 
2.7.4


* Re: [Qemu-devel] [RFC v2 1/6] hw/vfio: Retrieve valid iova ranges from kernel
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 1/6] hw/vfio: Retrieve valid iova ranges from kernel Shameer Kolothum
@ 2018-05-28 14:21   ` Auger Eric
  2018-05-30 14:43     ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 34+ messages in thread
From: Auger Eric @ 2018-05-28 14:21 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-devel, qemu-arm
  Cc: peter.maydell, drjones, jonathan.cameron, linuxarm,
	alex.williamson, zhaoshenglong, imammedo

Hi Shameer,
On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> This makes use of the newly introduced iova cap chains added
> to the  type1 VFIO_IOMMU_GET_INFO ioctl.
> 
> The retrieved iova info is stored in a list for later use.
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  hw/vfio/common.c              | 108 +++++++++++++++++++++++++++++++++++++++---
>  include/hw/vfio/vfio-common.h |   7 +++
>  linux-headers/linux/vfio.h    |  23 +++++++++
>  3 files changed, 132 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 07ffa0b..94d7b24 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -40,6 +40,8 @@ struct vfio_group_head vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
>  struct vfio_as_head vfio_address_spaces =
>      QLIST_HEAD_INITIALIZER(vfio_address_spaces);
> +struct vfio_iova_head vfio_iova_regions =
> +    QLIST_HEAD_INITIALIZER(vfio_iova_regions);
>  
>  #ifdef CONFIG_KVM
>  /*
> @@ -1030,6 +1032,85 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
>      }
>  }
>  
> +static void vfio_iommu_get_iova_ranges(struct vfio_iommu_type1_info *info)
> +{
> +    struct vfio_info_cap_header *hdr;
> +    struct vfio_iommu_type1_info_cap_iova_range *cap_iova;
> +    VFIOIovaRange *iova, *tmp, *prev = NULL;
nit: s/iova/iova_range?
> +    void *ptr = info;
> +    bool found = false;
> +    int i;
> +
> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> +        return;
> +    }
> +
> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> +        if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
> +            found = true;
> +            break;
> +        }
> +    }
> +
> +    if (!found) {
> +        return;
> +    }
> +
> +    /* purge the current iova list, if any */
> +    QLIST_FOREACH_SAFE(iova, &vfio_iova_regions, next, tmp) {
> +        QLIST_REMOVE(iova, next);
> +        g_free(iova);
> +    }
> +
> +    cap_iova = container_of(hdr, struct vfio_iommu_type1_info_cap_iova_range,
> +                            header);
> +
> +    /* populate the list */
> +    for (i = 0; i < cap_iova->nr_iovas; i++) {
> +        iova = g_malloc0(sizeof(*iova));
nit: g_new0 is preferred
> +        iova->start = cap_iova->iova_ranges[i].start;
> +        iova->end = cap_iova->iova_ranges[i].end;
> +
> +        if (prev) {
> +            QLIST_INSERT_AFTER(prev, iova, next);
> +        } else {
> +            QLIST_INSERT_HEAD(&vfio_iova_regions, iova, next);
> +        }
> +        prev = iova;
> +    }
> +
> +    return;
> +}
> +
> +static int vfio_get_iommu_info(VFIOContainer *container,
> +                         struct vfio_iommu_type1_info **info)
> +{
> +
> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> +
> +
> +    *info = g_malloc0(argsz);
> +
> +retry:
> +    (*info)->argsz = argsz;
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> +        g_free(*info);
> +        *info = NULL;
> +        return -errno;
> +    }
> +
> +    if (((*info)->argsz > argsz)) {
> +        argsz = (*info)->argsz;
> +        *info = g_realloc(*info, argsz);
> +        goto retry;
> +    }
> +
> +    vfio_iommu_get_iova_ranges(*info);
> +
> +    return 0;
> +}
> +
>  static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>                                    Error **errp)
>  {
> @@ -1044,6 +1125,15 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>              group->container = container;
>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>              vfio_kvm_device_add_group(group);
> +
> +            /* New group might change the valid iovas. Get the updated list */
> +            if ((container->iommu_type == VFIO_TYPE1_IOMMU) ||
> +                (container->iommu_type == VFIO_TYPE1v2_IOMMU)) {
> +                struct vfio_iommu_type1_info *info;
> +
> +                vfio_get_iommu_info(container, &info);
> +                g_free(info);
> +            }
>              return 0;
>          }
>      }
> @@ -1071,7 +1161,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU) ||
>          ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU)) {
>          bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU);
> -        struct vfio_iommu_type1_info info;
> +        struct vfio_iommu_type1_info *info;
>  
>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>          if (ret) {
> @@ -1095,14 +1185,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>           * existing Type1 IOMMUs generally support any IOVA we're
>           * going to actually try in practice.
>           */
> -        info.argsz = sizeof(info);
> -        ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
> +        ret = vfio_get_iommu_info(container, &info);
>          /* Ignore errors */
> -        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> +        if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
>              /* Assume 4k IOVA page size */
> -            info.iova_pgsizes = 4096;
> +            info->iova_pgsizes = 4096;
>          }
> -        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
> +        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
> +        g_free(info);
>      } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>                 ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> @@ -1256,6 +1346,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>      if (QLIST_EMPTY(&container->group_list)) {
>          VFIOAddressSpace *space = container->space;
>          VFIOGuestIOMMU *giommu, *tmp;
> +        VFIOIovaRange *iova, *next_iova;
nit: I would prefer range naming

>  
>          QLIST_REMOVE(container, next);
>  
> @@ -1266,6 +1357,11 @@ static void vfio_disconnect_container(VFIOGroup *group)
>              g_free(giommu);
>          }
>  
> +        QLIST_FOREACH_SAFE(iova, &vfio_iova_regions, next, next_iova) {
> +            QLIST_REMOVE(iova, next);
> +            g_free(iova);
> +        }
> +
>          trace_vfio_disconnect_container(container->fd);
>          close(container->fd);
>          g_free(container);
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index d936014..874fe2c 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -164,6 +164,12 @@ typedef struct VFIODisplay {
>      } dmabuf;
>  } VFIODisplay;
>  
> +typedef struct VFIOIovaRange {
> +    uint64_t start;
> +    uint64_t end;
> +    QLIST_ENTRY(VFIOIovaRange) next;
> +} VFIOIovaRange;
> +
>  void vfio_put_base_device(VFIODevice *vbasedev);
>  void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>  void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
> @@ -187,6 +193,7 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>  extern const MemoryRegionOps vfio_region_ops;
>  extern QLIST_HEAD(vfio_group_head, VFIOGroup) vfio_group_list;
>  extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
> +extern QLIST_HEAD(vfio_iova_head, VFIOIovaRange) vfio_iova_regions;
>  
>  #ifdef CONFIG_LINUX
>  int vfio_get_region_info(VFIODevice *vbasedev, int index,
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3a0a305..117341d 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -589,7 +589,30 @@ struct vfio_iommu_type1_info {
>  	__u32	argsz;
>  	__u32	flags;
>  #define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
> +#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
>  	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
> +	__u32   cap_offset;	/* Offset within info struct of first cap */
> +};
> +
> +/*
> + * The IOVA capability allows to report the valid IOVA range(s)
> + * excluding any reserved regions associated with dev group. Any dma
> + * map attempt outside the valid iova range will return error.
> + *
> + * The structures below define version 1 of this capability.
> + */
> +#define VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE  1
> +
> +struct vfio_iova_range {
> +	__u64	start;
> +	__u64	end;
> +};
> +
> +struct vfio_iommu_type1_info_cap_iova_range {
> +	struct vfio_info_cap_header header;
> +	__u32	nr_iovas;
> +	__u32	reserved;
> +	struct vfio_iova_range iova_ranges[];
>  };
>  
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> 
You need to update the header in a separate patch using
scripts/update-linux-headers.sh

While the kernel series is not yet fully upstream, you can just pick up
the VFIO-related changes you are interested in (a partial update), but
once this becomes a non-RFC patch, a full header update is generally used.

Thanks

Eric


* Re: [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions Shameer Kolothum
@ 2018-05-28 14:21   ` Auger Eric
  2018-05-30 14:48     ` Shameerali Kolothum Thodi
  2018-06-05  7:49     ` Shameerali Kolothum Thodi
  0 siblings, 2 replies; 34+ messages in thread
From: Auger Eric @ 2018-05-28 14:21 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, zhaoshenglong,
	jonathan.cameron, linuxarm

Hi Shameer,

On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> In case the valid iova regions are non-contiguous, split the
> RAM into a 1GB non-pluggable dimm and model the remainder
> as a single pc-dimm.

Please can you explain where this split comes from? Currently we
have 254 GB of non-pluggable RAM. I read the discussion started with Drew
on RFC v1, where he explained we cannot change the RAM base without
crashing the FW. However, we should at least document this 1GB split.
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  hw/arm/virt.c | 261 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 256 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index be3ad14..562e389 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -58,6 +58,12 @@
>  #include "hw/smbios/smbios.h"
>  #include "qapi/visitor.h"
>  #include "standard-headers/linux/input.h"
> +#include "hw/vfio/vfio-common.h"
> +#include "qemu/config-file.h"
> +#include "monitor/qdev.h"
> +#include "qom/object_interfaces.h"
> +#include "qapi/qmp/qdict.h"
> +#include "qemu/option.h"

The comment at the beginning of virt.c would need to be reworked with
your changes.
>  
>  #define DEFINE_VIRT_MACHINE_LATEST(major, minor, latest) \
>      static void virt_##major##_##minor##_class_init(ObjectClass *oc, \
> @@ -110,7 +116,10 @@ static ARMPlatformBusSystemParams platform_bus_params;
>   * terabyte of physical address space.)
>   */
>  #define RAMLIMIT_GB 255
> -#define RAMLIMIT_BYTES (RAMLIMIT_GB * 1024ULL * 1024 * 1024)
> +#define SZ_1G (1024ULL * 1024 * 1024)
> +#define RAMLIMIT_BYTES (RAMLIMIT_GB * SZ_1G)
> +
> +#define ALIGN_1G (1ULL << 30)
>  
>  /* Addresses and sizes of our components.
>   * 0..128MB is space for a flash device so we can run bootrom code such as UEFI.
> @@ -1171,6 +1180,236 @@ void virt_machine_done(Notifier *notifier, void *data)
>      virt_build_smbios(vms);
>  }
>  
> +static void free_iova_copy(struct vfio_iova_head *iova_copy)
> +{
> +    VFIOIovaRange *iova, *tmp;
> +
> +    QLIST_FOREACH_SAFE(iova, iova_copy, next, tmp) {
> +        QLIST_REMOVE(iova, next);
> +        g_free(iova);
> +    }
> +}
> +
> +static void get_iova_copy(struct vfio_iova_head *iova_copy)
> +{
> +    VFIOIovaRange *iova, *new, *prev_iova = NULL;
> +
> +    QLIST_FOREACH(iova, &vfio_iova_regions, next) {
> +        new = g_malloc0(sizeof(*iova));
> +        new->start = iova->start;
> +        new->end = iova->end;
> +
> +        if (prev_iova) {
> +            QLIST_INSERT_AFTER(prev_iova, new, next);
> +        } else {
> +            QLIST_INSERT_HEAD(iova_copy, new, next);
> +        }
> +        prev_iova = new;
> +    }
> +}
> +
> +static hwaddr find_memory_chunk(VirtMachineState *vms,
> +                   struct vfio_iova_head *iova_copy,
> +                   hwaddr req_size, bool pcdimm)
> +{
> +    VFIOIovaRange *iova, *tmp;
> +    hwaddr new_start, new_size, sz_align;
> +    hwaddr virt_start = vms->memmap[VIRT_MEM].base;
> +    hwaddr addr_align = ALIGN_1G; /* Set to max ARM64 hugepage size */
> +
> +    /* Size alignment */
> +    sz_align = (pcdimm) ? MAX(TARGET_PAGE_SIZE, QEMU_VMALLOC_ALIGN) :
> +                                                      TARGET_PAGE_SIZE;
> +
> +    QLIST_FOREACH_SAFE(iova, iova_copy, next, tmp) {
> +        if (virt_start >= iova->end) {
> +            continue;
> +        }
> +
> +        /* Align addr */
> +        new_start = ROUND_UP(MAX(virt_start, iova->start), addr_align);
> +        if (new_start >= iova->end) {
> +            continue;
> +        }
> +
> +        if (req_size > iova->end - new_start + 1) {
> +            continue;
> +        }
don't you want to check directly with new_size?
> +
> +        /*
> +         * Check the region can hold any size alignment requirement.
> +         */
> +        new_size = QEMU_ALIGN_UP(req_size, sz_align);
> +
> +        if ((new_start + new_size - 1 > iova->end) ||
> +                 (new_start + new_size >= virt_start + RAMLIMIT_BYTES)) {
> +            continue;
> +        }
> +
> +        /*
> +         * Modify the iova list entry for non pc-dimm case so that it
> +         * is not used again for pc-dimm allocation.
> +         */
> +        if (!pcdimm) {
> +            if (new_size - req_size) {
I don't get this check. Don't you want to check whether the node's range
is fully consumed and, if so, remove the node?

(new_size != iova->end - iova->start + 1)
> +                iova->start = new_start + new_size;
> +            } else {
> +                QLIST_REMOVE(iova, next);
> +            }
> +        }
> +        return new_start;
> +    }
> +
> +    return -1;
> +}
> +
> +static void update_memory_regions(VirtMachineState *vms)
> +{
> +    hwaddr addr;
> +    VFIOIovaRange *iova;
> +    MachineState *machine = MACHINE(vms);
> +    hwaddr virt_start = vms->memmap[VIRT_MEM].base;
> +    hwaddr req_size, ram_size = machine->ram_size;
> +    struct vfio_iova_head iova_copy = QLIST_HEAD_INITIALIZER(iova_copy);
> +
> +    /* No valid iova regions, use default */
> +    if (QLIST_EMPTY(&vfio_iova_regions)) {
> +        vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
> +        vms->bootinfo.ram_size = ram_size;
> +        return;
> +    }
> +
> +    /*
> +     * If the valid iova list has only one entry, check that the req size
> +     * fits in it and that the loader start can be < 4GB. This makes sure
> +     * platforms with no holes in mem keep the same mem model as before.
> +     */
> +    req_size = ram_size;
> +    iova = QLIST_NEXT(QLIST_FIRST(&vfio_iova_regions), next);
> +    if (!iova) {
> +        iova = QLIST_FIRST(&vfio_iova_regions);
> +        addr = ROUND_UP(MAX(virt_start, iova->start), ALIGN_1G);
> +        if ((addr < 4 * SZ_1G) && (ram_size <= iova->end - addr + 1) &&
> +                  (addr + ram_size < virt_start + RAMLIMIT_BYTES)) {
> +            vms->bootinfo.loader_start = addr;
> +            vms->bootinfo.ram_size = ram_size;
> +            return;
> +        }
> +    }
> +
> +    /* Get a copy of valid iovas and work on it */
> +    get_iova_copy(&iova_copy);
> +
> +    /* Split the mem as first 1GB non-pluggable and rest as pc-dimm */
> +    req_size = MIN(ram_size, SZ_1G);
> +    addr = find_memory_chunk(vms, &iova_copy, req_size, false);
According to what Drew reported, the base address of the cold-plug RAM
must stay at 1GB, otherwise the FW will be lost. So I think you must
check that addr == 1GB.
> +    if (addr == -1 || addr >= 4 * SZ_1G) {
> +        goto out;
> +    }
> +
> +    /*Update non-pluggable mem details */
> +    machine->ram_size = req_size;
> +    vms->bootinfo.loader_start = addr;
> +    vms->bootinfo.ram_size = machine->ram_size;
> +
> +    req_size = ram_size - req_size;
> +    if (!req_size) {
> +        goto done;
> +    }
> +
> +    /* Remaining memory is modeled as a pc-dimm. */
> +    addr = find_memory_chunk(vms, &iova_copy, req_size, true);
> +    if (addr == -1) {
> +        goto out;
> +    }
> +
> +    /*Update pc-dimm mem details */
> +    vms->bootinfo.dimm_mem = g_new(struct dimm_mem_info, 1);
> +    vms->bootinfo.dimm_mem->base = addr;
> +    vms->bootinfo.dimm_mem->size = req_size;
> +    machine->maxram_size = machine->ram_size + req_size;
> +    machine->ram_slots += 1;
> +
> +done:
> +    free_iova_copy(&iova_copy);
> +    return;
> +
> +out:
> +    free_iova_copy(&iova_copy);
> +    error_report("mach-virt: Not enough contiguous memory to model ram");
Outputting the requested size would help debugging (and maybe the
available IOVA ranges too).

Thanks

Eric
> +    exit(1);
> +}
> +
> +static void create_pcdimms(VirtMachineState *vms,
> +                             MemoryRegion *sysmem,
> +                             MemoryRegion *ram)
> +{
> +    hwaddr addr, size;
> +    Error *local_err = NULL;
> +    QDict *qdict;
> +    QemuOpts *opts;
> +    char *tmp;
> +
> +    if (!vms->bootinfo.dimm_mem) {
> +        return;
> +    }
> +
> +    addr = vms->bootinfo.dimm_mem->base;
> +    size = vms->bootinfo.dimm_mem->size;
> +
> +    /*Create hotplug address space */
> +    vms->hotplug_memory.base = ROUND_UP(addr, ALIGN_1G);
> +    size = ROUND_UP(size, MAX(TARGET_PAGE_SIZE, QEMU_VMALLOC_ALIGN));
> +
> +    memory_region_init(&vms->hotplug_memory.mr, OBJECT(vms),
> +                                      "hotplug-memory", size);
> +    memory_region_add_subregion(sysmem, vms->hotplug_memory.base,
> +                                        &vms->hotplug_memory.mr);
> +    /* Create backend mem object */
> +    qdict = qdict_new();
> +    qdict_put_str(qdict, "qom-type", "memory-backend-ram");
> +    qdict_put_str(qdict, "id", "mem1");
> +    tmp = g_strdup_printf("%"PRIu64 "M", size / (1024 * 1024));
> +    qdict_put_str(qdict, "size", tmp);
> +    g_free(tmp);
> +
> +    opts = qemu_opts_from_qdict(qemu_find_opts("object"), qdict, &local_err);
> +    if (local_err) {
> +        goto err;
> +    }
> +
> +    user_creatable_add_opts(opts, &local_err);
> +    qemu_opts_del(opts);
> +    QDECREF(qdict);
> +    if (local_err) {
> +        goto err;
> +    }
> +
> +    /* Create pc-dimm dev*/
> +    qdict = qdict_new();
> +    qdict_put_str(qdict, "driver", "pc-dimm");
> +    qdict_put_str(qdict, "id", "dimm1");
> +    qdict_put_str(qdict, "memdev", "mem1");
> +
> +    opts = qemu_opts_from_qdict(qemu_find_opts("device"), qdict, &local_err);
> +    if (local_err) {
> +        goto err;
> +    }
> +
> +    qdev_device_add(opts, &local_err);
> +    qemu_opts_del(opts);
> +    QDECREF(qdict);
> +    if (local_err) {
> +        goto err;
> +    }
> +
> +    return;
> +
> +err:
> +    error_report_err(local_err);
> +    exit(1);
> +}
> +
>  static void virt_ram_memory_region_init(Notifier *notifier, void *data)
>  {
>      MemoryRegion *sysmem = get_system_memory();
> @@ -1179,9 +1418,14 @@ static void virt_ram_memory_region_init(Notifier *notifier, void *data)
>                                           ram_memory_region_init);
>      MachineState *machine = MACHINE(vms);
>  
> +    update_memory_regions(vms);
>      memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
>                                           machine->ram_size);
> -    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
> +    memory_region_add_subregion(sysmem, vms->bootinfo.loader_start, ram);
> +
> +    if (vms->bootinfo.dimm_mem) {
> +        create_pcdimms(vms, sysmem, ram);
> +    }
>  }
>  
>  static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
> @@ -1404,13 +1648,11 @@ static void machvirt_init(MachineState *machine)
>      vms->machine_done.notify = virt_machine_done;
>      qemu_add_machine_init_done_notifier(&vms->machine_done);
>  
> -    vms->bootinfo.ram_size = machine->ram_size;
>      vms->bootinfo.kernel_filename = machine->kernel_filename;
>      vms->bootinfo.kernel_cmdline = machine->kernel_cmdline;
>      vms->bootinfo.initrd_filename = machine->initrd_filename;
>      vms->bootinfo.nb_cpus = smp_cpus;
>      vms->bootinfo.board_id = -1;
> -    vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
>      vms->bootinfo.get_dtb = machvirt_dtb;
>      vms->bootinfo.firmware_loaded = firmware_loaded;
>  
> @@ -1559,7 +1801,7 @@ static void virt_dimm_plug(HotplugHandler *hotplug_dev,
>      PCDIMMDevice *dimm = PC_DIMM(dev);
>      PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
>      MemoryRegion *mr;
> -    uint64_t align;
> +    uint64_t align, addr;
>      Error *local_err = NULL;
>  
>      mr = ddc->get_memory_region(dimm, &local_err);
> @@ -1573,6 +1815,15 @@ static void virt_dimm_plug(HotplugHandler *hotplug_dev,
>          goto out;
>      }
>  
> +    addr = object_property_get_uint(OBJECT(dimm), PC_DIMM_ADDR_PROP,
> +                                                       &error_fatal);
> +    /* Assign the node for pc-dimm initial ram */
> +    if (vms->bootinfo.dimm_mem && (addr == vms->bootinfo.dimm_mem->base)
> +                                                 && (nb_numa_nodes > 0)) {
> +        vms->bootinfo.dimm_mem->node = object_property_get_uint(OBJECT(dev),
> +                                           PC_DIMM_NODE_PROP, &error_fatal);
> +    }
> +
>  out:
>      error_propagate(errp, local_err);
>  }
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to accommodate non-contiguous mem
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to accommodate non-contiguous mem Shameer Kolothum
@ 2018-05-28 14:21   ` Auger Eric
  2018-05-28 17:02   ` Andrew Jones
  1 sibling, 0 replies; 34+ messages in thread
From: Auger Eric @ 2018-05-28 14:21 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, zhaoshenglong,
	jonathan.cameron, linuxarm

Hi Shameer,

On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> This is in preparation for the next patch where initial ram is split
> into a non-pluggable chunk and a pc-dimm modeled mem if  the vaild
valid
> iova regions are non-contiguous.
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  hw/arm/virt-acpi-build.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index c7c6a57..8d17b40 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -488,7 +488,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>      AcpiSratProcessorGiccAffinity *core;
>      AcpiSratMemoryAffinity *numamem;
>      int i, srat_start;
> -    uint64_t mem_base;
> +    uint64_t mem_base, mem_sz, mem_len;
>      MachineClass *mc = MACHINE_GET_CLASS(vms);
>      const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(MACHINE(vms));
>  
> @@ -505,12 +505,28 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>          core->flags = cpu_to_le32(1);
>      }
>  
> -    mem_base = vms->memmap[VIRT_MEM].base;
> +    mem_base = vms->bootinfo.loader_start;
> +    mem_sz = vms->bootinfo.loader_start;
>      for (i = 0; i < nb_numa_nodes; ++i) {
>          numamem = acpi_data_push(table_data, sizeof(*numamem));
> -        build_srat_memory(numamem, mem_base, numa_info[i].node_mem, i,
> +        mem_len = MIN(numa_info[i].node_mem, mem_sz);
same question as in previous patch

Thanks

Eric
> +        build_srat_memory(numamem, mem_base, mem_len, i,
>                            MEM_AFFINITY_ENABLED);
> -        mem_base += numa_info[i].node_mem;
> +        mem_base += mem_len;
> +        mem_sz -= mem_len;
> +        if (!mem_sz) {
> +            break;
> +        }
> +    }
> +
> +    /* Create table for initial pc-dimm ram, if any */
> +    if (vms->bootinfo.dimm_mem) {
> +        numamem = acpi_data_push(table_data, sizeof(*numamem));
> +        build_srat_memory(numamem, vms->bootinfo.dimm_mem->base,
> +                          vms->bootinfo.dimm_mem->size,
> +                          vms->bootinfo.dimm_mem->node,
> +                          MEM_AFFINITY_ENABLED);
> +
>      }
>  
>      build_header(linker, table_data, (void *)(table_data->data + srat_start),
> 


* Re: [Qemu-devel] [RFC v2 4/6] hw/arm: Changes required to accommodate non-contiguous DT mem nodes
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 4/6] hw/arm: Changes required to accommodate non-contiguous DT mem nodes Shameer Kolothum
@ 2018-05-28 14:21   ` Auger Eric
  2018-05-30 14:46     ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 34+ messages in thread
From: Auger Eric @ 2018-05-28 14:21 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-devel, qemu-arm
  Cc: peter.maydell, drjones, jonathan.cameron, linuxarm,
	alex.williamson, zhaoshenglong, imammedo

Hi Shameer,

On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> This makes changes to the DT mem node creation such that its easier
> to add non-contiguous mem modeled as non-pluggable and a pc-dimm
> mem later.
See comments below. I think you should augment the description here with
what the patch exactly adds:
- a new helper function
- a new dimm node?

and if possible split functional changes into separate patches?
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  hw/arm/boot.c        | 91 ++++++++++++++++++++++++++++++++++++----------------
>  include/hw/arm/arm.h | 12 +++++++
>  2 files changed, 75 insertions(+), 28 deletions(-)
> 
> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> index 26184bc..73db0aa 100644
> --- a/hw/arm/boot.c
> +++ b/hw/arm/boot.c
> @@ -486,6 +486,27 @@ static void fdt_add_psci_node(void *fdt)
>      qemu_fdt_setprop_cell(fdt, "/psci", "migrate", migrate_fn);
>  }
>  
> +static char *create_memory_fdt(void *fdt, uint32_t acells, hwaddr mem_base,
> +                                          uint32_t scells, hwaddr mem_len)
Other dt node creation functions are named fdt_add_*_node
> +{
> +    char *nodename = NULL;
> +    int rc;
> +
> +    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
> +    qemu_fdt_add_subnode(fdt, nodename);
> +    qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
> +    rc = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,
> +                                      scells, mem_len);
> +    if (rc < 0) {
> +        fprintf(stderr, "couldn't set %s/reg\n", nodename);
> +        g_free(nodename);
> +        return NULL;
> +    }
> +
> +    return nodename;
> +}
> +
> +
>  /**
>   * load_dtb() - load a device tree binary image into memory
>   * @addr:       the address to load the image at
> @@ -567,50 +588,64 @@ static int load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
>          goto fail;
>      }
>  
> +    /*
> +     * Turn the /memory node created before into a NOP node, then create
> +     * /memory@addr nodes for all numa nodes respectively.
> +     */
> +    qemu_fdt_nop_node(fdt, "/memory");
I don't really understand why this gets moved outside of the if
(nb_numa_nodes > 0) { check. Also the comment above mentions numa nodes
whereas we are not necessarily in the numa case anymore.
> +
>      if (nb_numa_nodes > 0) {
> -        /*
> -         * Turn the /memory node created before into a NOP node, then create
> -         * /memory@addr nodes for all numa nodes respectively.
> -         */
> -        qemu_fdt_nop_node(fdt, "/memory");
> +        hwaddr mem_sz;
> +
>          mem_base = binfo->loader_start;
> +        mem_sz = binfo->ram_size;
>          for (i = 0; i < nb_numa_nodes; i++) {
> -            mem_len = numa_info[i].node_mem;
> -            nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
> -            qemu_fdt_add_subnode(fdt, nodename);
> -            qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
> -            rc = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg",
> -                                              acells, mem_base,
> +            mem_len = MIN(numa_info[i].node_mem, mem_sz);
I fail to understand how this change relates to the topic of this patch.
If this adds a consistency check, this may be put in another patch?
> +
> +            nodename = create_memory_fdt(fdt, acells, mem_base,
>                                                scells, mem_len);
You could simplify the review by just introducing the new dt node
creation function in a 1st patch and then introduce the changes to
support non-contiguous DT mem nodes.
> -            if (rc < 0) {
> -                fprintf(stderr, "couldn't set %s/reg for node %d\n", nodename,
> -                        i);
> +            if (!nodename) {
>                  goto fail;
>              }
>  
>              qemu_fdt_setprop_cell(fdt, nodename, "numa-node-id", i);
> -            mem_base += mem_len;
>              g_free(nodename);
> +            mem_base += mem_len;
> +            mem_sz -= mem_len;
> +            if (!mem_sz) {
> +                break;
So what does it mean practically if we break here? Not all the numa nodes
will function? Outputting a message for the end user may be useful.
> +           }
>          }
> -    } else {
> -        Error *err = NULL;
>  
> -        rc = fdt_path_offset(fdt, "/memory");
> -        if (rc < 0) {
> -            qemu_fdt_add_subnode(fdt, "/memory");
> -        }
> +        /* Create the node for initial pc-dimm ram, if any */
> +        if (binfo->dimm_mem) {
>  
> -        if (!qemu_fdt_getprop(fdt, "/memory", "device_type", NULL, &err)) {
> -            qemu_fdt_setprop_string(fdt, "/memory", "device_type", "memory");
> +            nodename = create_memory_fdt(fdt, acells, binfo->dimm_mem->base,
> +                                              scells, binfo->dimm_mem->size);
> +            if (!nodename) {
> +                goto fail;
> +            }
> +            qemu_fdt_setprop_cell(fdt, nodename, "numa-node-id",
> +                                                 binfo->dimm_mem->node);
> +            g_free(nodename);
>          }
>  
> -        rc = qemu_fdt_setprop_sized_cells(fdt, "/memory", "reg",
> -                                          acells, binfo->loader_start,
> -                                          scells, binfo->ram_size);
> -        if (rc < 0) {
> -            fprintf(stderr, "couldn't set /memory/reg\n");
> +    } else {
> +
> +        nodename = create_memory_fdt(fdt, acells, binfo->loader_start,
> +                                         scells, binfo->ram_size);
> +        if (!nodename) {
>              goto fail;
>          }
> +
> +        if (binfo->dimm_mem) {
> +            nodename = create_memory_fdt(fdt, acells, binfo->dimm_mem->base,
> +                                              scells, binfo->dimm_mem->size);
> +            if (!nodename) {
> +                goto fail;
> +            }
> +            g_free(nodename);
> +        }
as this code gets duplicated, a helper function may be relevant?

Thanks

Eric
>      }
>  
>      rc = fdt_path_offset(fdt, "/chosen");
> diff --git a/include/hw/arm/arm.h b/include/hw/arm/arm.h
> index ce769bd..0ee3b4e 100644
> --- a/include/hw/arm/arm.h
> +++ b/include/hw/arm/arm.h
> @@ -48,6 +48,12 @@ typedef struct {
>      ARMCPU *cpu; /* handle to the first cpu object */
>  } ArmLoadKernelNotifier;
>  
> +struct dimm_mem_info {
> +    int node;
> +    hwaddr base;
> +    hwaddr size;
> +};
> +
>  /* arm_boot.c */
>  struct arm_boot_info {
>      uint64_t ram_size;
> @@ -124,6 +130,12 @@ struct arm_boot_info {
>      bool secure_board_setup;
>  
>      arm_endianness endianness;
> +
> +    /* This is used to model a pc-dimm based mem if the valid iova region
> +     * is non-contiguous.
> +     */
> +    struct dimm_mem_info *dimm_mem;
> +
>  };
>  
>  /**
> 


* Re: [Qemu-devel] [RFC v2 3/6] hw/arm/virt: Add pc-dimm mem hotplug framework
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 3/6] hw/arm/virt: Add pc-dimm mem hotplug framework Shameer Kolothum
@ 2018-05-28 14:21   ` Auger Eric
  2018-05-30 14:46     ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 34+ messages in thread
From: Auger Eric @ 2018-05-28 14:21 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, zhaoshenglong,
	jonathan.cameron, linuxarm

Hi Shameer,

On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> This will be used in subsequent patches to model a chunk of
> memory as pc-dimm(cold plug) if the valid iova regions are
> non-contiguous. This is not yet a full hotplug support.
Please can you give more details about this restriction?
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  default-configs/aarch64-softmmu.mak |  1 +
>  hw/arm/virt.c                       | 82 +++++++++++++++++++++++++++++++++++++
>  include/hw/arm/virt.h               |  2 +
>  3 files changed, 85 insertions(+)
> 
> diff --git a/default-configs/aarch64-softmmu.mak b/default-configs/aarch64-softmmu.mak
> index 9ddccf8..7a82ed8 100644
> --- a/default-configs/aarch64-softmmu.mak
> +++ b/default-configs/aarch64-softmmu.mak
> @@ -8,3 +8,4 @@ CONFIG_DDC=y
>  CONFIG_DPCD=y
>  CONFIG_XLNX_ZYNQMP=y
>  CONFIG_XLNX_ZYNQMP_ARM=y
> +CONFIG_MEM_HOTPLUG=y
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 05fcb62..be3ad14 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -1552,9 +1552,82 @@ static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
>      return ms->possible_cpus;
>  }
>  
> +static void virt_dimm_plug(HotplugHandler *hotplug_dev,
> +                         DeviceState *dev, Error **errp)
> +{
> +    VirtMachineState *vms = VIRT_MACHINE(hotplug_dev);
> +    PCDIMMDevice *dimm = PC_DIMM(dev);
> +    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
> +    MemoryRegion *mr;
> +    uint64_t align;
> +    Error *local_err = NULL;
> +
> +    mr = ddc->get_memory_region(dimm, &local_err);
> +    if (local_err) {
> +        goto out;
> +    }
> +
> +    align = memory_region_get_alignment(mr);
I see that on the PC machine class they have an "enforce_aligned_dimm"
option. Also it looks like memory_region_get_alignment(mr) can return 0.

if (memory_region_get_alignment(mr) && pcmc->enforce_aligned_dimm) {


> +    pc_dimm_memory_plug(dev, &vms->hotplug_memory, mr, align, &local_err);
> +    if (local_err) {
> +        goto out;
> +    }
This block is useless: error_propagate() does nothing if local_err is
NULL, so we are fine without the check.
> +
> +out:
> +    error_propagate(errp, local_err);
> +}
> +
> +static void virt_dimm_unplug(HotplugHandler *hotplug_dev,
> +                           DeviceState *dev, Error **errp)
> +{
> +    VirtMachineState *vms = VIRT_MACHINE(hotplug_dev);
> +    PCDIMMDevice *dimm = PC_DIMM(dev);
> +    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
> +    MemoryRegion *mr;
> +    Error *local_err = NULL;
> +
> +    mr = ddc->get_memory_region(dimm, &local_err);
don't you want to avoid executing the following if local_err?
> +    pc_dimm_memory_unplug(dev, &vms->hotplug_memory, mr);
needs a rebase on top of "pc-dimm: no need to pass the memory region";
in master's pc_dimm_memory_unplug() we now have
 MemoryRegion *mr = ddc->get_memory_region(dimm, &error_abort);

Thanks

Eric
> +    object_unparent(OBJECT(dev));
> +
> +    error_propagate(errp, local_err);
> +}
> +
> +static void virt_machinedevice_plug_cb(HotplugHandler *hotplug_dev,
> +                                      DeviceState *dev, Error **errp)
> +{
> +    if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> +        virt_dimm_plug(hotplug_dev, dev, errp);
> +    } else {
> +        error_setg(errp, "device plug request for not supported device"
> +                   " type: %s", object_get_typename(OBJECT(dev)));
> +    }
> +}
> +
> +static void virt_machinedevice_unplug_cb(HotplugHandler *hotplug_dev,
> +                                        DeviceState *dev, Error **errp)
> +{
> +    if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> +        virt_dimm_unplug(hotplug_dev, dev, errp);
> +    } else {
> +        error_setg(errp, "device unplug for not supported device"
> +                   " type: %s", object_get_typename(OBJECT(dev)));
> +    }
> +}
> +
> +static HotplugHandler *virt_get_hotplug_handler(MachineState *machine,
> +                                             DeviceState *dev)
> +{
> +    if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> +        return HOTPLUG_HANDLER(machine);
> +    }
> +    return NULL;
> +}
> +
>  static void virt_machine_class_init(ObjectClass *oc, void *data)
>  {
>      MachineClass *mc = MACHINE_CLASS(oc);
> +    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(oc);
>  
>      mc->init = machvirt_init;
>      /* Start max_cpus at the maximum QEMU supports. We'll further restrict
> @@ -1573,6 +1646,11 @@ static void virt_machine_class_init(ObjectClass *oc, void *data)
>      mc->cpu_index_to_instance_props = virt_cpu_index_to_props;
>      mc->default_cpu_type = ARM_CPU_TYPE_NAME("cortex-a15");
>      mc->get_default_cpu_node_id = virt_get_default_cpu_node_id;
> +
> +    mc->get_hotplug_handler = virt_get_hotplug_handler;
This requires a rebase on top of Igor's and David's series.

Thanks

Eric
> +    hc->plug = virt_machinedevice_plug_cb;
> +    hc->unplug = virt_machinedevice_unplug_cb;
> +
>  }
>  
>  static const TypeInfo virt_machine_info = {
> @@ -1582,6 +1660,10 @@ static const TypeInfo virt_machine_info = {
>      .instance_size = sizeof(VirtMachineState),
>      .class_size    = sizeof(VirtMachineClass),
>      .class_init    = virt_machine_class_init,
> +    .interfaces = (InterfaceInfo[]) {
> +         { TYPE_HOTPLUG_HANDLER },
> +         { }
> +    },
>  };
>  
>  static void machvirt_machine_init(void)
> diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> index fc24f3a..a39f29e 100644
> --- a/include/hw/arm/virt.h
> +++ b/include/hw/arm/virt.h
> @@ -35,6 +35,7 @@
>  #include "qemu/notify.h"
>  #include "hw/boards.h"
>  #include "hw/arm/arm.h"
> +#include "hw/mem/pc-dimm.h"
>  
>  #define NUM_GICV2M_SPIS       64
>  #define NUM_VIRTIO_TRANSPORTS 32
> @@ -108,6 +109,7 @@ typedef struct {
>      uint32_t gic_phandle;
>      uint32_t msi_phandle;
>      int psci_conduit;
> +    MemoryHotplugState hotplug_memory;
>  } VirtMachineState;
>  
>  #define TYPE_VIRT_MACHINE   MACHINE_TYPE_NAME("virt")
> 


* Re: [Qemu-devel] [RFC v2 2/6] hw/arm/virt: Enable dynamic generation of guest RAM memory regions
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 2/6] hw/arm/virt: Enable dynamic generation of guest RAM memory regions Shameer Kolothum
@ 2018-05-28 14:21   ` Auger Eric
  2018-05-30 14:43     ` Shameerali Kolothum Thodi
  2018-05-28 16:47   ` Andrew Jones
  1 sibling, 1 reply; 34+ messages in thread
From: Auger Eric @ 2018-05-28 14:21 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, zhaoshenglong,
	jonathan.cameron, linuxarm

Hi Shameer,
On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> Register ram_memory_region_init notifier to allocate memory region
> from system memory.

At this stage the commit message does not explain why you need a machine
init done notifier. Also the commit title does not summarize the actual
change.
Thanks

Eric
> 
> Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  hw/arm/virt.c         | 28 ++++++++++++++++++++++------
>  include/hw/arm/virt.h |  1 +
>  2 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 94dcb12..05fcb62 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -1171,6 +1171,19 @@ void virt_machine_done(Notifier *notifier, void *data)
>      virt_build_smbios(vms);
>  }
>  
> +static void virt_ram_memory_region_init(Notifier *notifier, void *data)
> +{
> +    MemoryRegion *sysmem = get_system_memory();
> +    MemoryRegion *ram = g_new(MemoryRegion, 1);
> +    VirtMachineState *vms = container_of(notifier, VirtMachineState,
> +                                         ram_memory_region_init);
> +    MachineState *machine = MACHINE(vms);
> +
> +    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> +                                         machine->ram_size);
> +    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
> +}
> +
>  static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
>  {
>      uint8_t clustersz = ARM_DEFAULT_CPUS_PER_CLUSTER;
> @@ -1204,7 +1217,6 @@ static void machvirt_init(MachineState *machine)
>      MemoryRegion *sysmem = get_system_memory();
>      MemoryRegion *secure_sysmem = NULL;
>      int n, virt_max_cpus;
> -    MemoryRegion *ram = g_new(MemoryRegion, 1);
>      bool firmware_loaded = bios_name || drive_get(IF_PFLASH, 0, 0);
>  
>      /* We can probe only here because during property set
> @@ -1361,10 +1373,6 @@ static void machvirt_init(MachineState *machine)
>      fdt_add_timer_nodes(vms);
>      fdt_add_cpu_nodes(vms);
>  
> -    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> -                                         machine->ram_size);
> -    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
> -
>      create_flash(vms, sysmem, secure_sysmem ? secure_sysmem : sysmem);
>  
>      create_gic(vms, pic);
> @@ -1405,15 +1413,23 @@ static void machvirt_init(MachineState *machine)
>      vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
>      vms->bootinfo.get_dtb = machvirt_dtb;
>      vms->bootinfo.firmware_loaded = firmware_loaded;
> +
> +    /* Register notifiers. They are executed in registration reverse order */
>      arm_load_kernel(ARM_CPU(first_cpu), &vms->bootinfo);
>  
>      /*
>       * arm_load_kernel machine init done notifier registration must
>       * happen before the platform_bus_create call. In this latter,
>       * another notifier is registered which adds platform bus nodes.
> -     * Notifiers are executed in registration reverse order.
>       */
>      create_platform_bus(vms, pic);
> +
> +    /*
> +     * Register memory region notifier last as this has to be executed
> +     * first.
> +     */
> +    vms->ram_memory_region_init.notify = virt_ram_memory_region_init;
> +    qemu_add_machine_init_done_notifier(&vms->ram_memory_region_init);
>  }
>  
>  static bool virt_get_secure(Object *obj, Error **errp)
> diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> index ba0c1a4..fc24f3a 100644
> --- a/include/hw/arm/virt.h
> +++ b/include/hw/arm/virt.h
> @@ -91,6 +91,7 @@ typedef struct {
>  typedef struct {
>      MachineState parent;
>      Notifier machine_done;
> +    Notifier ram_memory_region_init;
>      FWCfgState *fw_cfg;
>      bool secure;
>      bool highmem;
> 


* Re: [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions
  2018-05-16 15:20 [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Shameer Kolothum
                   ` (5 preceding siblings ...)
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions Shameer Kolothum
@ 2018-05-28 14:22 ` Auger Eric
  2018-05-30 14:39   ` Shameerali Kolothum Thodi
  6 siblings, 1 reply; 34+ messages in thread
From: Auger Eric @ 2018-05-28 14:22 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, zhaoshenglong,
	jonathan.cameron, linuxarm

Hi Shameer,

On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> When the kernel reports valid iova ranges as non-contiguous,
> memory should be allocated to Guest in such a way that
> reserved regions(holes) are not visible by Guest.
> 
> This series retrieves the valid iova ranges based on the new
> proposed VFIO interface for kernel [1]. It then populates the
> first 1GB ram as a non-pluggable dimm and rest as a pc-dimm if
> the valid iova ranges are non-contiguous.

Some general comments:

- what are your plans with respect to VFIO device hot-plug handling?
IIUC, at the moment, any collision between reserved regions induced by
hot-plugged devices is not detected/handled. I understand mem hotplug
is not supported on aarch64. Would you simply reject the vfio device
hotplug?

- IIUC any reserved window colliding with the [4000000, 1GB] cold-plug RAM
segment is a show-stopper. How was this 1GB size chosen exactly?

- Currently you create a single PC-DIMM node whereas several may be
possible & necessary. Do you plan to support multiple PC-DIMM nodes?

- I have started looking at RAM extension on machvirt. I think adding
support for PC-DIMMs through the qemu cmd line is something that we need
to work on, in parallel.

Thanks

Eric
> 
> Patch #3 of this series is loosely based on an earlier attempt
> by Kwangwoo Lee to add hotplug/pc-dimm support to arm64[2]
> 
> RFC v1[3] --> RFCv2
>  -Based on new VFIO kernel interface
>  -Part of Mem modelled as pc-dimm 
>  -Rebased to qemu 2.12.0
> 
> [1] https://lkml.org/lkml/2018/4/18/293
> [2] https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04600.html
> [3] https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg02412.html
> 
> Shameer Kolothum (6):
>   hw/vfio: Retrieve valid iova ranges from kernel
>   hw/arm/virt: Enable dynamic generation of guest RAM memory regions
>   hw/arm/virt: Add pc-dimm mem hotplug framework
>   hw/arm: Changes required to accommodate non-contiguous DT mem nodes
>   hw/arm: ACPI SRAT changes to accommodate non-contiguous mem
>   hw/arm: Populate non-contiguous memory regions
> 
>  default-configs/aarch64-softmmu.mak |   1 +
>  hw/arm/boot.c                       |  91 ++++++---
>  hw/arm/virt-acpi-build.c            |  24 ++-
>  hw/arm/virt.c                       | 367 +++++++++++++++++++++++++++++++++++-
>  hw/vfio/common.c                    | 108 ++++++++++-
>  include/hw/arm/arm.h                |  12 ++
>  include/hw/arm/virt.h               |   3 +
>  include/hw/vfio/vfio-common.h       |   7 +
>  linux-headers/linux/vfio.h          |  23 +++
>  9 files changed, 589 insertions(+), 47 deletions(-)
> 


* Re: [Qemu-devel] [RFC v2 2/6] hw/arm/virt: Enable dynamic generation of guest RAM memory regions
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 2/6] hw/arm/virt: Enable dynamic generation of guest RAM memory regions Shameer Kolothum
  2018-05-28 14:21   ` Auger Eric
@ 2018-05-28 16:47   ` Andrew Jones
  2018-05-30 14:50     ` Shameerali Kolothum Thodi
  1 sibling, 1 reply; 34+ messages in thread
From: Andrew Jones @ 2018-05-28 16:47 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-devel, qemu-arm, peter.maydell, jonathan.cameron, linuxarm,
	eric.auger, alex.williamson, zhaoshenglong, imammedo

On Wed, May 16, 2018 at 04:20:22PM +0100, Shameer Kolothum wrote:
> Register ram_memory_region_init notifier to allocate memory region
> from system memory.
> 
> Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  hw/arm/virt.c         | 28 ++++++++++++++++++++++------
>  include/hw/arm/virt.h |  1 +
>  2 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 94dcb12..05fcb62 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -1171,6 +1171,19 @@ void virt_machine_done(Notifier *notifier, void *data)
>      virt_build_smbios(vms);
>  }
>  
> +static void virt_ram_memory_region_init(Notifier *notifier, void *data)
> +{
> +    MemoryRegion *sysmem = get_system_memory();
> +    MemoryRegion *ram = g_new(MemoryRegion, 1);
> +    VirtMachineState *vms = container_of(notifier, VirtMachineState,
> +                                         ram_memory_region_init);
> +    MachineState *machine = MACHINE(vms);
> +
> +    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> +                                         machine->ram_size);
> +    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
> +}
> +
>  static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
>  {
>      uint8_t clustersz = ARM_DEFAULT_CPUS_PER_CLUSTER;
> @@ -1204,7 +1217,6 @@ static void machvirt_init(MachineState *machine)
>      MemoryRegion *sysmem = get_system_memory();
>      MemoryRegion *secure_sysmem = NULL;
>      int n, virt_max_cpus;
> -    MemoryRegion *ram = g_new(MemoryRegion, 1);
>      bool firmware_loaded = bios_name || drive_get(IF_PFLASH, 0, 0);
>  
>      /* We can probe only here because during property set
> @@ -1361,10 +1373,6 @@ static void machvirt_init(MachineState *machine)
>      fdt_add_timer_nodes(vms);
>      fdt_add_cpu_nodes(vms);
>  
> -    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> -                                         machine->ram_size);
> -    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
> -
>      create_flash(vms, sysmem, secure_sysmem ? secure_sysmem : sysmem);
>  
>      create_gic(vms, pic);
> @@ -1405,15 +1413,23 @@ static void machvirt_init(MachineState *machine)
>      vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
>      vms->bootinfo.get_dtb = machvirt_dtb;
>      vms->bootinfo.firmware_loaded = firmware_loaded;
> +
> +    /* Register notifiers. They are executed in registration reverse order */
>      arm_load_kernel(ARM_CPU(first_cpu), &vms->bootinfo);
>  
>      /*
>       * arm_load_kernel machine init done notifier registration must
>       * happen before the platform_bus_create call. In this latter,
>       * another notifier is registered which adds platform bus nodes.
> -     * Notifiers are executed in registration reverse order.
>       */
>      create_platform_bus(vms, pic);
> +
> +    /*
> +     * Register memory region notifier last as this has to be executed
> +     * first.
> +     */
> +    vms->ram_memory_region_init.notify = virt_ram_memory_region_init;
> +    qemu_add_machine_init_done_notifier(&vms->ram_memory_region_init);

I realize we've been slow to get to reviewing this, but a lot has
changed here since this patch was written. virt now only has a single
machine done notifier. On rebase we should attempt to integrate with
that one rather than introduce a new one.

Thanks,
drew


>  }
>  
>  static bool virt_get_secure(Object *obj, Error **errp)
> diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> index ba0c1a4..fc24f3a 100644
> --- a/include/hw/arm/virt.h
> +++ b/include/hw/arm/virt.h
> @@ -91,6 +91,7 @@ typedef struct {
>  typedef struct {
>      MachineState parent;
>      Notifier machine_done;
> +    Notifier ram_memory_region_init;
>      FWCfgState *fw_cfg;
>      bool secure;
>      bool highmem;
> -- 
> 2.7.4
> 
> 
> 


* Re: [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to accommodate non-contiguous mem
  2018-05-16 15:20 ` [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to accommodate non-contiguous mem Shameer Kolothum
  2018-05-28 14:21   ` Auger Eric
@ 2018-05-28 17:02   ` Andrew Jones
  2018-05-30 14:51     ` Shameerali Kolothum Thodi
  2018-05-31 20:15     ` Auger Eric
  1 sibling, 2 replies; 34+ messages in thread
From: Andrew Jones @ 2018-05-28 17:02 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-devel, qemu-arm, peter.maydell, jonathan.cameron, linuxarm,
	eric.auger, alex.williamson, zhaoshenglong, imammedo

On Wed, May 16, 2018 at 04:20:25PM +0100, Shameer Kolothum wrote:
> This is in preparation for the next patch where initial ram is split
> into a non-pluggable chunk and a pc-dimm modeled mem if  the vaild
> iova regions are non-contiguous.
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  hw/arm/virt-acpi-build.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index c7c6a57..8d17b40 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -488,7 +488,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>      AcpiSratProcessorGiccAffinity *core;
>      AcpiSratMemoryAffinity *numamem;
>      int i, srat_start;
> -    uint64_t mem_base;
> +    uint64_t mem_base, mem_sz, mem_len;
>      MachineClass *mc = MACHINE_GET_CLASS(vms);
>      const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(MACHINE(vms));
>  
> @@ -505,12 +505,28 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>          core->flags = cpu_to_le32(1);
>      }
>  
> -    mem_base = vms->memmap[VIRT_MEM].base;
> +    mem_base = vms->bootinfo.loader_start;
> +    mem_sz = vms->bootinfo.loader_start;

mem_sz = vms->bootinfo.ram_size;

Assuming the DT generator was correct, meaning bootinfo.ram_size will
be the size of the non-pluggable dimm.


>      for (i = 0; i < nb_numa_nodes; ++i) {
>          numamem = acpi_data_push(table_data, sizeof(*numamem));
> -        build_srat_memory(numamem, mem_base, numa_info[i].node_mem, i,
> +        mem_len = MIN(numa_info[i].node_mem, mem_sz);
> +        build_srat_memory(numamem, mem_base, mem_len, i,
>                            MEM_AFFINITY_ENABLED);
> -        mem_base += numa_info[i].node_mem;
> +        mem_base += mem_len;
> +        mem_sz -= mem_len;
> +        if (!mem_sz) {
> +            break;
> +        }
> +    }
> +
> +    /* Create table for initial pc-dimm ram, if any */
> +    if (vms->bootinfo.dimm_mem) {
> +        numamem = acpi_data_push(table_data, sizeof(*numamem));
> +        build_srat_memory(numamem, vms->bootinfo.dimm_mem->base,
> +                          vms->bootinfo.dimm_mem->size,
> +                          vms->bootinfo.dimm_mem->node,
> +                          MEM_AFFINITY_ENABLED);
> +
>      }
>  
>      build_header(linker, table_data, (void *)(table_data->data + srat_start),
> -- 
> 2.7.4
> 
> 
> 


* Re: [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions
  2018-05-28 14:22 ` [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Auger Eric
@ 2018-05-30 14:39   ` Shameerali Kolothum Thodi
  2018-05-30 15:24     ` Auger Eric
  0 siblings, 1 reply; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2018-05-30 14:39 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, Zhaoshenglong,
	Jonathan Cameron, Linuxarm

Hi Eric,

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Monday, May 28, 2018 3:22 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> qemu-devel@nongnu.org; qemu-arm@nongnu.org
> Cc: drjones@redhat.com; imammedo@redhat.com; peter.maydell@linaro.org;
> alex.williamson@redhat.com; Zhaoshenglong <zhaoshenglong@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; Linuxarm
> <linuxarm@huawei.com>
> Subject: Re: [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions
> 
> Hi Shameer,
> 
> On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> > When the kernel reports valid iova ranges as non-contiguous,
> > memory should be allocated to Guest in such a way that
> > reserved regions(holes) are not visible by Guest.
> >
> > This series retrieves the valid iova ranges based on the new
> > proposed VFIO interface for kernel [1]. It then populates the
> > first 1GB ram as a non-pluggable dimm and rest as a pc-dimm if
> > the valid iova ranges are non-contiguous.
> 
> Some general comments:

Thanks for going through this series.

> - what are your plans with respect to VFIO device hot-plug handling?
> IIUC, at the moment, any collision between reserved regions induced by
> hot-plugged devices is not detected/handled. I understand mem hotplug
> is not supported on aarch64. Would you simply reject the vfio device
> hotplug?

As per the new kernel changes, _attach_group() will fail if the device's
reserved regions conflict with any existing dma mappings of the VM.
So my understanding is that the host kernel will in effect reject any
hot-plugged device whose reserved regions conflict with existing dma mappings.

> - IIUC any reserved window colliding with the [0x40000000, 1GB] cold-plug
> RAM segment is a show-stopper. How was this 1GB size chosen exactly?

Yes, any reserved window in that region might prevent the Guest from booting
with UEFI, as the current solution will move the base start address. I think this
is because EDK2 expects the base to start at 0x40000000 [1]. I am not sure why
UEFI requires this; without the "-bios" option, the Guest can be booted as
long as the base addr is < 4GB.

> - Currently you create a single PC-DIMM node whereas several may be
> possible & necessary. Do you plan to support multiple PC-DIMM nodes?

Yes, that is correct. This is mainly to keep it simple, with the expectation that
the valid host iova regions may not be that fragmented. I can look into extending
this to support multiple pc-dimms if there is an immediate requirement to do so.
 
> - I have started looking at RAM extension on machvirt. I think adding
> support for PC-DIMMs through the qemu cmd line is something that we need
> to work on, in parallel.

Ok. Do you have more info that you can share on this? Please let me know.

Thanks,
Shameer

[1] https://github.com/tianocore/edk2/blob/master/ArmVirtPkg/ArmVirtQemu.dsc#L133

> Thanks
> 
> Eric
> >
> > Patch #3 of this series is loosely based on an earlier attempt
> > by Kwangwoo Lee to add hotplug/pc-dimm support to arm64[2]
> >
> > RFC v1[3] --> RFCv2
> >  -Based on new VFIO kernel interface
> >  -Part of Mem modelled as pc-dimm
> >  -Rebased to qemu 2.12.0
> >
> > [1] https://lkml.org/lkml/2018/4/18/293
> > [2] https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04600.html
> > [3] https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg02412.html
> >
> > Shameer Kolothum (6):
> >   hw/vfio: Retrieve valid iova ranges from kernel
> >   hw/arm/virt: Enable dynamic generation of guest RAM memory regions
> >   hw/arm/virt: Add pc-dimm mem hotplug framework
> >   hw/arm: Changes required to accommodate non-contiguous DT mem nodes
> >   hw/arm: ACPI SRAT changes to accommodate non-contiguous mem
> >   hw/arm: Populate non-contiguous memory regions
> >
> >  default-configs/aarch64-softmmu.mak |   1 +
> >  hw/arm/boot.c                       |  91 ++++++---
> >  hw/arm/virt-acpi-build.c            |  24 ++-
> >  hw/arm/virt.c                       | 367
> +++++++++++++++++++++++++++++++++++-
> >  hw/vfio/common.c                    | 108 ++++++++++-
> >  include/hw/arm/arm.h                |  12 ++
> >  include/hw/arm/virt.h               |   3 +
> >  include/hw/vfio/vfio-common.h       |   7 +
> >  linux-headers/linux/vfio.h          |  23 +++
> >  9 files changed, 589 insertions(+), 47 deletions(-)
> >


* Re: [Qemu-devel] [RFC v2 1/6] hw/vfio: Retrieve valid iova ranges from kernel
  2018-05-28 14:21   ` Auger Eric
@ 2018-05-30 14:43     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2018-05-30 14:43 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, qemu-arm
  Cc: peter.maydell, drjones, Jonathan Cameron, Linuxarm,
	alex.williamson, Zhaoshenglong, imammedo

Hi Eric,

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Monday, May 28, 2018 3:21 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> qemu-devel@nongnu.org; qemu-arm@nongnu.org
> Cc: peter.maydell@linaro.org; drjones@redhat.com; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>;
> alex.williamson@redhat.com; Zhaoshenglong <zhaoshenglong@huawei.com>;
> imammedo@redhat.com
> Subject: Re: [Qemu-devel] [RFC v2 1/6] hw/vfio: Retrieve valid iova ranges
> from kernel
> 
> Hi Shameer,
> On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> > This makes use of the newly introduced iova cap chains added
> > to the  type1 VFIO_IOMMU_GET_INFO ioctl.
> >
> > The retrieved iova info is stored in a list for later use.
> >
> > Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  hw/vfio/common.c              | 108
> +++++++++++++++++++++++++++++++++++++++---
> >  include/hw/vfio/vfio-common.h |   7 +++
> >  linux-headers/linux/vfio.h    |  23 +++++++++
> >  3 files changed, 132 insertions(+), 6 deletions(-)
> >
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 07ffa0b..94d7b24 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -40,6 +40,8 @@ struct vfio_group_head vfio_group_list =
> >      QLIST_HEAD_INITIALIZER(vfio_group_list);
> >  struct vfio_as_head vfio_address_spaces =
> >      QLIST_HEAD_INITIALIZER(vfio_address_spaces);
> > +struct vfio_iova_head vfio_iova_regions =
> > +    QLIST_HEAD_INITIALIZER(vfio_iova_regions);
> >
> >  #ifdef CONFIG_KVM
> >  /*
> > @@ -1030,6 +1032,85 @@ static void
> vfio_put_address_space(VFIOAddressSpace *space)
> >      }
> >  }
> >
> > +static void vfio_iommu_get_iova_ranges(struct vfio_iommu_type1_info
> *info)
> > +{
> > +    struct vfio_info_cap_header *hdr;
> > +    struct vfio_iommu_type1_info_cap_iova_range *cap_iova;
> > +    VFIOIovaRange *iova, *tmp, *prev = NULL;
> nit: s/iova/iova_range?

Ok.

> > +    void *ptr = info;
> > +    bool found = false;
> > +    int i;
> > +
> > +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> > +        return;
> > +    }
> > +
> > +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> > +        if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
> > +            found = true;
> > +            break;
> > +        }
> > +    }
> > +
> > +    if (!found) {
> > +        return;
> > +    }
> > +
> > +    /* purge the current iova list, if any */
> > +    QLIST_FOREACH_SAFE(iova, &vfio_iova_regions, next, tmp) {
> > +        QLIST_REMOVE(iova, next);
> > +        g_free(iova);
> > +    }
> > +
> > +    cap_iova = container_of(hdr, struct
> vfio_iommu_type1_info_cap_iova_range,
> > +                            header);
> > +
> > +    /* populate the list */
> > +    for (i = 0; i < cap_iova->nr_iovas; i++) {
> > +        iova = g_malloc0(sizeof(*iova));
> nit: g_new0 is preferred

Sure.

> > +        iova->start = cap_iova->iova_ranges[i].start;
> > +        iova->end = cap_iova->iova_ranges[i].end;
> > +
> > +        if (prev) {
> > +            QLIST_INSERT_AFTER(prev, iova, next);
> > +        } else {
> > +            QLIST_INSERT_HEAD(&vfio_iova_regions, iova, next);
> > +        }
> > +        prev = iova;
> > +    }
> > +
> > +    return;
> > +}
> > +
> > +static int vfio_get_iommu_info(VFIOContainer *container,
> > +                         struct vfio_iommu_type1_info **info)
> > +{
> > +
> > +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> > +
> > +
> > +    *info = g_malloc0(argsz);
> > +
> > +retry:
> > +    (*info)->argsz = argsz;
> > +
> > +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> > +        g_free(*info);
> > +        *info = NULL;
> > +        return -errno;
> > +    }
> > +
> > +    if (((*info)->argsz > argsz)) {
> > +        argsz = (*info)->argsz;
> > +        *info = g_realloc(*info, argsz);
> > +        goto retry;
> > +    }
> > +
> > +    vfio_iommu_get_iova_ranges(*info);
> > +
> > +    return 0;
> > +}
> > +
> >  static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> >                                    Error **errp)
> >  {
> > @@ -1044,6 +1125,15 @@ static int vfio_connect_container(VFIOGroup
> *group, AddressSpace *as,
> >              group->container = container;
> >              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> >              vfio_kvm_device_add_group(group);
> > +
> > +            /* New group might change the valid iovas. Get the updated list */
> > +            if ((container->iommu_type == VFIO_TYPE1_IOMMU) ||
> > +                (container->iommu_type == VFIO_TYPE1v2_IOMMU)) {
> > +                struct vfio_iommu_type1_info *info;
> > +
> > +                vfio_get_iommu_info(container, &info);
> > +                g_free(info);
> > +            }
> >              return 0;
> >          }
> >      }
> > @@ -1071,7 +1161,7 @@ static int vfio_connect_container(VFIOGroup
> *group, AddressSpace *as,
> >      if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU) ||
> >          ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU)) {
> >          bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU);
> > -        struct vfio_iommu_type1_info info;
> > +        struct vfio_iommu_type1_info *info;
> >
> >          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >          if (ret) {
> > @@ -1095,14 +1185,14 @@ static int vfio_connect_container(VFIOGroup
> *group, AddressSpace *as,
> >           * existing Type1 IOMMUs generally support any IOVA we're
> >           * going to actually try in practice.
> >           */
> > -        info.argsz = sizeof(info);
> > -        ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
> > +        ret = vfio_get_iommu_info(container, &info);
> >          /* Ignore errors */
> > -        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> > +        if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
> >              /* Assume 4k IOVA page size */
> > -            info.iova_pgsizes = 4096;
> > +            info->iova_pgsizes = 4096;
> >          }
> > -        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
> > +        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
> > +        g_free(info);
> >      } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> >                 ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> >          struct vfio_iommu_spapr_tce_info info;
> > @@ -1256,6 +1346,7 @@ static void vfio_disconnect_container(VFIOGroup
> *group)
> >      if (QLIST_EMPTY(&container->group_list)) {
> >          VFIOAddressSpace *space = container->space;
> >          VFIOGuestIOMMU *giommu, *tmp;
> > +        VFIOIovaRange *iova, *next_iova;
> nit: I would prefer range naming

Ok.
 
> >
> >          QLIST_REMOVE(container, next);
> >
> > @@ -1266,6 +1357,11 @@ static void vfio_disconnect_container(VFIOGroup
> *group)
> >              g_free(giommu);
> >          }
> >
> > +        QLIST_FOREACH_SAFE(iova, &vfio_iova_regions, next, next_iova) {
> > +            QLIST_REMOVE(iova, next);
> > +            g_free(iova);
> > +        }
> > +
> >          trace_vfio_disconnect_container(container->fd);
> >          close(container->fd);
> >          g_free(container);
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index d936014..874fe2c 100644
> > --- a/include/hw/vfio/vfio-common.h
> > +++ b/include/hw/vfio/vfio-common.h
> > @@ -164,6 +164,12 @@ typedef struct VFIODisplay {
> >      } dmabuf;
> >  } VFIODisplay;
> >
> > +typedef struct VFIOIovaRange {
> > +    uint64_t start;
> > +    uint64_t end;
> > +    QLIST_ENTRY(VFIOIovaRange) next;
> > +} VFIOIovaRange;
> > +
> >  void vfio_put_base_device(VFIODevice *vbasedev);
> >  void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
> >  void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
> > @@ -187,6 +193,7 @@ int vfio_get_device(VFIOGroup *group, const char
> *name,
> >  extern const MemoryRegionOps vfio_region_ops;
> >  extern QLIST_HEAD(vfio_group_head, VFIOGroup) vfio_group_list;
> >  extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
> > +extern QLIST_HEAD(vfio_iova_head, VFIOIovaRange) vfio_iova_regions;
> >
> >  #ifdef CONFIG_LINUX
> >  int vfio_get_region_info(VFIODevice *vbasedev, int index,
> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > index 3a0a305..117341d 100644
> > --- a/linux-headers/linux/vfio.h
> > +++ b/linux-headers/linux/vfio.h
> > @@ -589,7 +589,30 @@ struct vfio_iommu_type1_info {
> >  	__u32	argsz;
> >  	__u32	flags;
> >  #define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
> > +#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
> >  	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
> > +	__u32   cap_offset;	/* Offset within info struct of first cap */
> > +};
> > +
> > +/*
> > + * The IOVA capability allows to report the valid IOVA range(s)
> > + * excluding any reserved regions associated with dev group. Any dma
> > + * map attempt outside the valid iova range will return error.
> > + *
> > + * The structures below define version 1 of this capability.
> > + */
> > +#define VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE  1
> > +
> > +struct vfio_iova_range {
> > +	__u64	start;
> > +	__u64	end;
> > +};
> > +
> > +struct vfio_iommu_type1_info_cap_iova_range {
> > +	struct vfio_info_cap_header header;
> > +	__u32	nr_iovas;
> > +	__u32	reserved;
> > +	struct vfio_iova_range iova_ranges[];
> >  };
> >
> >  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> >
> You need to update the header in a separate patch using
> scripts/update-linux-headers.sh
> 
> Until the kernel series is fully upstream you can just pick up the
> VFIO related changes you are interested in (partial update), but when
> this series becomes a proper patch, a full header update is generally used.

Ok. I will take care of this in the next revision.

Thanks,
Shameer
 
> Thanks
> 
> Eric


* Re: [Qemu-devel] [RFC v2 2/6] hw/arm/virt: Enable dynamic generation of guest RAM memory regions
  2018-05-28 14:21   ` Auger Eric
@ 2018-05-30 14:43     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2018-05-30 14:43 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, Zhaoshenglong,
	Jonathan Cameron, Linuxarm

Hi Eric.

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Monday, May 28, 2018 3:22 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> qemu-devel@nongnu.org; qemu-arm@nongnu.org
> Cc: drjones@redhat.com; imammedo@redhat.com; peter.maydell@linaro.org;
> alex.williamson@redhat.com; Zhaoshenglong <zhaoshenglong@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; Linuxarm
> <linuxarm@huawei.com>
> Subject: Re: [RFC v2 2/6] hw/arm/virt: Enable dynamic generation of guest
> RAM memory regions
> 
> Hi Shameer,
> On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> > Register ram_memory_region_init notifier to allocate memory region
> > from system memory.
> 
> At this stage the commit message does not explain why you need a machine
> init done notifier. Also the commit title does not summarize the actual
> change.
> Thanks

Right. I will reword/rephrase the text.

Thanks,
Shameer

> Eric
> >
> > Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
> > Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  hw/arm/virt.c         | 28 ++++++++++++++++++++++------
> >  include/hw/arm/virt.h |  1 +
> >  2 files changed, 23 insertions(+), 6 deletions(-)
> >
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 94dcb12..05fcb62 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -1171,6 +1171,19 @@ void virt_machine_done(Notifier *notifier, void
> *data)
> >      virt_build_smbios(vms);
> >  }
> >
> > +static void virt_ram_memory_region_init(Notifier *notifier, void *data)
> > +{
> > +    MemoryRegion *sysmem = get_system_memory();
> > +    MemoryRegion *ram = g_new(MemoryRegion, 1);
> > +    VirtMachineState *vms = container_of(notifier, VirtMachineState,
> > +                                         ram_memory_region_init);
> > +    MachineState *machine = MACHINE(vms);
> > +
> > +    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> > +                                         machine->ram_size);
> > +    memory_region_add_subregion(sysmem, vms-
> >memmap[VIRT_MEM].base, ram);
> > +}
> > +
> >  static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
> >  {
> >      uint8_t clustersz = ARM_DEFAULT_CPUS_PER_CLUSTER;
> > @@ -1204,7 +1217,6 @@ static void machvirt_init(MachineState *machine)
> >      MemoryRegion *sysmem = get_system_memory();
> >      MemoryRegion *secure_sysmem = NULL;
> >      int n, virt_max_cpus;
> > -    MemoryRegion *ram = g_new(MemoryRegion, 1);
> >      bool firmware_loaded = bios_name || drive_get(IF_PFLASH, 0, 0);
> >
> >      /* We can probe only here because during property set
> > @@ -1361,10 +1373,6 @@ static void machvirt_init(MachineState *machine)
> >      fdt_add_timer_nodes(vms);
> >      fdt_add_cpu_nodes(vms);
> >
> > -    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> > -                                         machine->ram_size);
> > -    memory_region_add_subregion(sysmem, vms-
> >memmap[VIRT_MEM].base, ram);
> > -
> >      create_flash(vms, sysmem, secure_sysmem ? secure_sysmem : sysmem);
> >
> >      create_gic(vms, pic);
> > @@ -1405,15 +1413,23 @@ static void machvirt_init(MachineState
> *machine)
> >      vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
> >      vms->bootinfo.get_dtb = machvirt_dtb;
> >      vms->bootinfo.firmware_loaded = firmware_loaded;
> > +
> > +    /* Register notifiers. They are executed in registration reverse order */
> >      arm_load_kernel(ARM_CPU(first_cpu), &vms->bootinfo);
> >
> >      /*
> >       * arm_load_kernel machine init done notifier registration must
> >       * happen before the platform_bus_create call. In this latter,
> >       * another notifier is registered which adds platform bus nodes.
> > -     * Notifiers are executed in registration reverse order.
> >       */
> >      create_platform_bus(vms, pic);
> > +
> > +    /*
> > +     * Register memory region notifier last as this has to be executed
> > +     * first.
> > +     */
> > +    vms->ram_memory_region_init.notify = virt_ram_memory_region_init;
> > +    qemu_add_machine_init_done_notifier(&vms-
> >ram_memory_region_init);
> >  }
> >
> >  static bool virt_get_secure(Object *obj, Error **errp)
> > diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> > index ba0c1a4..fc24f3a 100644
> > --- a/include/hw/arm/virt.h
> > +++ b/include/hw/arm/virt.h
> > @@ -91,6 +91,7 @@ typedef struct {
> >  typedef struct {
> >      MachineState parent;
> >      Notifier machine_done;
> > +    Notifier ram_memory_region_init;
> >      FWCfgState *fw_cfg;
> >      bool secure;
> >      bool highmem;
> >


* Re: [Qemu-devel] [RFC v2 3/6] hw/arm/virt: Add pc-dimm mem hotplug framework
  2018-05-28 14:21   ` Auger Eric
@ 2018-05-30 14:46     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2018-05-30 14:46 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, Zhaoshenglong,
	Jonathan Cameron, Linuxarm

Hi Eric,

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Monday, May 28, 2018 3:22 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> qemu-devel@nongnu.org; qemu-arm@nongnu.org
> Cc: drjones@redhat.com; imammedo@redhat.com; peter.maydell@linaro.org;
> alex.williamson@redhat.com; Zhaoshenglong <zhaoshenglong@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; Linuxarm
> <linuxarm@huawei.com>
> Subject: Re: [RFC v2 3/6] hw/arm/virt: Add pc-dimm mem hotplug framework
> 
> Hi Shameer,
> 
> On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> > This will be used in subsequent patches to model a chunk of
> > memory as pc-dimm(cold plug) if the valid iova regions are
> > non-contiguous. This is not yet a full hotplug support.
> Please can you give more details about this restriction?
> >
> > Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  default-configs/aarch64-softmmu.mak |  1 +
> >  hw/arm/virt.c                       | 82
> +++++++++++++++++++++++++++++++++++++
> >  include/hw/arm/virt.h               |  2 +
> >  3 files changed, 85 insertions(+)
> >
> > diff --git a/default-configs/aarch64-softmmu.mak b/default-configs/aarch64-
> softmmu.mak
> > index 9ddccf8..7a82ed8 100644
> > --- a/default-configs/aarch64-softmmu.mak
> > +++ b/default-configs/aarch64-softmmu.mak
> > @@ -8,3 +8,4 @@ CONFIG_DDC=y
> >  CONFIG_DPCD=y
> >  CONFIG_XLNX_ZYNQMP=y
> >  CONFIG_XLNX_ZYNQMP_ARM=y
> > +CONFIG_MEM_HOTPLUG=y
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 05fcb62..be3ad14 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -1552,9 +1552,82 @@ static const CPUArchIdList
> *virt_possible_cpu_arch_ids(MachineState *ms)
> >      return ms->possible_cpus;
> >  }
> >
> > +static void virt_dimm_plug(HotplugHandler *hotplug_dev,
> > +                         DeviceState *dev, Error **errp)
> > +{
> > +    VirtMachineState *vms = VIRT_MACHINE(hotplug_dev);
> > +    PCDIMMDevice *dimm = PC_DIMM(dev);
> > +    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
> > +    MemoryRegion *mr;
> > +    uint64_t align;
> > +    Error *local_err = NULL;
> > +
> > +    mr = ddc->get_memory_region(dimm, &local_err);
> > +    if (local_err) {
> > +        goto out;
> > +    }
> > +
> > +    align = memory_region_get_alignment(mr);
> I see that the PC machine class has an "enforce_aligned_dimm"
> option. Also it looks like memory_region_get_alignment(mr) can return 0.
> 
> if (memory_region_get_alignment(mr) && pcmc->enforce_aligned_dimm) {

Ok.
 
> 
> > +    pc_dimm_memory_plug(dev, &vms->hotplug_memory, mr, align,
> &local_err);
> > +    if (local_err) {
> > +        goto out;
> > +    }
> useless block and error_propagate does nothing if local_err is NULL so
> we are fine.

Ok. I will remove this from this patch.

> > +
> > +out:
> > +    error_propagate(errp, local_err);
> > +}
> > +
> > +static void virt_dimm_unplug(HotplugHandler *hotplug_dev,
> > +                           DeviceState *dev, Error **errp)
> > +{
> > +    VirtMachineState *vms = VIRT_MACHINE(hotplug_dev);
> > +    PCDIMMDevice *dimm = PC_DIMM(dev);
> > +    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
> > +    MemoryRegion *mr;
> > +    Error *local_err = NULL;
> > +
> > +    mr = ddc->get_memory_region(dimm, &local_err);
> don't you want to avoid executing the following if local_err?

True.

> > +    pc_dimm_memory_unplug(dev, &vms->hotplug_memory, mr);
> needs a rebase on top of "pc-dimm: no need to pass the memory region"
> in master. In pc_dimm_memory_unplug we now have
>  MemoryRegion *mr = ddc->get_memory_region(dimm, &error_abort);

Right. I will follow that then.

> Thanks
> 
> Eric
> > +    object_unparent(OBJECT(dev));
> > +
> > +    error_propagate(errp, local_err);
> > +}
> > +
> > +static void virt_machinedevice_plug_cb(HotplugHandler *hotplug_dev,
> > +                                      DeviceState *dev, Error **errp)
> > +{
> > +    if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> > +        virt_dimm_plug(hotplug_dev, dev, errp);
> > +    } else {
> > +        error_setg(errp, "device plug request for not supported device"
> > +                   " type: %s", object_get_typename(OBJECT(dev)));
> > +    }
> > +}
> > +
> > +static void virt_machinedevice_unplug_cb(HotplugHandler *hotplug_dev,
> > +                                        DeviceState *dev, Error **errp)
> > +{
> > +    if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> > +        virt_dimm_unplug(hotplug_dev, dev, errp);
> > +    } else {
> > +        error_setg(errp, "device unplug for not supported device"
> > +                   " type: %s", object_get_typename(OBJECT(dev)));
> > +    }
> > +}
> > +
> > +static HotplugHandler *virt_get_hotplug_handler(MachineState *machine,
> > +                                             DeviceState *dev)
> > +{
> > +    if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> > +        return HOTPLUG_HANDLER(machine);
> > +    }
> > +    return NULL;
> > +}
> > +
> >  static void virt_machine_class_init(ObjectClass *oc, void *data)
> >  {
> >      MachineClass *mc = MACHINE_CLASS(oc);
> > +    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(oc);
> >
> >      mc->init = machvirt_init;
> >      /* Start max_cpus at the maximum QEMU supports. We'll further restrict
> > @@ -1573,6 +1646,11 @@ static void virt_machine_class_init(ObjectClass
> *oc, void *data)
> >      mc->cpu_index_to_instance_props = virt_cpu_index_to_props;
> >      mc->default_cpu_type = ARM_CPU_TYPE_NAME("cortex-a15");
> >      mc->get_default_cpu_node_id = virt_get_default_cpu_node_id;
> > +
> > +    mc->get_hotplug_handler = virt_get_hotplug_handler;
> This requires a rebase on top of Igor's and David's series.
> 
Sure.

Thanks,
Shameer

> Thanks
> 
> Eric
> > +    hc->plug = virt_machinedevice_plug_cb;
> > +    hc->unplug = virt_machinedevice_unplug_cb;
> > +
> >  }
> >
> >  static const TypeInfo virt_machine_info = {
> > @@ -1582,6 +1660,10 @@ static const TypeInfo virt_machine_info = {
> >      .instance_size = sizeof(VirtMachineState),
> >      .class_size    = sizeof(VirtMachineClass),
> >      .class_init    = virt_machine_class_init,
> > +    .interfaces = (InterfaceInfo[]) {
> > +         { TYPE_HOTPLUG_HANDLER },
> > +         { }
> > +    },
> >  };
> >
> >  static void machvirt_machine_init(void)
> > diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> > index fc24f3a..a39f29e 100644
> > --- a/include/hw/arm/virt.h
> > +++ b/include/hw/arm/virt.h
> > @@ -35,6 +35,7 @@
> >  #include "qemu/notify.h"
> >  #include "hw/boards.h"
> >  #include "hw/arm/arm.h"
> > +#include "hw/mem/pc-dimm.h"
> >
> >  #define NUM_GICV2M_SPIS       64
> >  #define NUM_VIRTIO_TRANSPORTS 32
> > @@ -108,6 +109,7 @@ typedef struct {
> >      uint32_t gic_phandle;
> >      uint32_t msi_phandle;
> >      int psci_conduit;
> > +    MemoryHotplugState hotplug_memory;
> >  } VirtMachineState;
> >
> >  #define TYPE_VIRT_MACHINE   MACHINE_TYPE_NAME("virt")
> >


* Re: [Qemu-devel] [RFC v2 4/6] hw/arm: Changes required to accommodate non-contiguous DT mem nodes
  2018-05-28 14:21   ` Auger Eric
@ 2018-05-30 14:46     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2018-05-30 14:46 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, qemu-arm
  Cc: peter.maydell, drjones, Jonathan Cameron, Linuxarm,
	alex.williamson, Zhaoshenglong, imammedo

Hi Eric,

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Monday, May 28, 2018 3:22 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> qemu-devel@nongnu.org; qemu-arm@nongnu.org
> Cc: peter.maydell@linaro.org; drjones@redhat.com; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>;
> alex.williamson@redhat.com; Zhaoshenglong <zhaoshenglong@huawei.com>;
> imammedo@redhat.com
> Subject: Re: [Qemu-devel] [RFC v2 4/6] hw/arm: Changes required to
> accommodate non-contiguous DT mem nodes
> 
> Hi Shameer,
> 
> On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> > This makes changes to the DT mem node creation such that it's easier
> > to add non-contiguous mem modeled as non-pluggable and a pc-dimm
> > mem later.
> See comments below. I think you should augment the description here with
> what the patch exactly adds:
> - a new helper function
> - a new dimm node?
> 
> and if possible split functional changes into separate patches?

Agree. It is better to split this.

> > Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  hw/arm/boot.c        | 91 ++++++++++++++++++++++++++++++++++++----------
> ------
> >  include/hw/arm/arm.h | 12 +++++++
> >  2 files changed, 75 insertions(+), 28 deletions(-)
> >
> > diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> > index 26184bc..73db0aa 100644
> > --- a/hw/arm/boot.c
> > +++ b/hw/arm/boot.c
> > @@ -486,6 +486,27 @@ static void fdt_add_psci_node(void *fdt)
> >      qemu_fdt_setprop_cell(fdt, "/psci", "migrate", migrate_fn);
> >  }
> >
> > +static char *create_memory_fdt(void *fdt, uint32_t acells, hwaddr mem_base,
> > +                                          uint32_t scells, hwaddr mem_len)
> Other dt node creation functions are named fdt_add_*_node

Ok.

> > +{
> > +    char *nodename = NULL;
> > +    int rc;
> > +
> > +    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
> > +    qemu_fdt_add_subnode(fdt, nodename);
> > +    qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
> > +    rc = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,
> > +                                      scells, mem_len);
> > +    if (rc < 0) {
> > +        fprintf(stderr, "couldn't set %s/reg\n", nodename);
> > +        g_free(nodename);
> > +        return NULL;
> > +    }
> > +
> > +    return nodename;
> > +}
> > +
> > +
> >  /**
> >   * load_dtb() - load a device tree binary image into memory
> >   * @addr:       the address to load the image at
> > @@ -567,50 +588,64 @@ static int load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
> >          goto fail;
> >      }
> >
> > +    /*
> > +     * Turn the /memory node created before into a NOP node, then create
> > +     * /memory@addr nodes for all numa nodes respectively.
> > +     */
> > +    qemu_fdt_nop_node(fdt, "/memory");
> I don't really understand why this gets moved outside of the if
> (nb_numa_nodes > 0) { check. Also the comment above mentions numa nodes
> whereas we are not necessarily in the numa case anymore.

Hmm.. I think this is because virt.c has a create_fdt() where the "/memory"
subnode is created. Maybe I can keep that, update it with the non-pluggable
mem, and create a new "/memory@" node for the pc-dimm case. I will double check.

> > +
> >      if (nb_numa_nodes > 0) {
> > -        /*
> > -         * Turn the /memory node created before into a NOP node, then create
> > -         * /memory@addr nodes for all numa nodes respectively.
> > -         */
> > -        qemu_fdt_nop_node(fdt, "/memory");
> > +        hwaddr mem_sz;
> > +
> >          mem_base = binfo->loader_start;
> > +        mem_sz = binfo->ram_size;
> >          for (i = 0; i < nb_numa_nodes; i++) {
> > -            mem_len = numa_info[i].node_mem;
> > -            nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
> > -            qemu_fdt_add_subnode(fdt, nodename);
> > -            qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
> > -            rc = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg",
> > -                                              acells, mem_base,
> > +            mem_len = MIN(numa_info[i].node_mem, mem_sz);
> I fail to understand how this change relates to the topic of this patch.
> If this adds a consistency check, this may be put in another patch?

Yes, it is not related to this patch per se. But without this and some of the
related mem_len/mem_sz changes below, it will end up creating more memory
nodes than expected, as the numa mem info is populated much earlier than we
modify the ram_size (non-pluggable) info. I can move this, along with your
related comment on the acpi patch (5/6), to a separate patch.

> > +
> > +            nodename = create_memory_fdt(fdt, acells, mem_base,
> >                                                scells, mem_len);
> You could simplify the review by just introducing the new dt node
> creation function in a 1st patch and then introduce the changes to
> support non contiguous DT mem nodes.

Ok. I will split.

> > -            if (rc < 0) {
> > -                fprintf(stderr, "couldn't set %s/reg for node %d\n", nodename,
> > -                        i);
> > +            if (!nodename) {
> >                  goto fail;
> >              }
> >
> >              qemu_fdt_setprop_cell(fdt, nodename, "numa-node-id", i);
> > -            mem_base += mem_len;
> >              g_free(nodename);
> > +            mem_base += mem_len;
> > +            mem_sz -= mem_len;
> > +            if (!mem_sz) {
> > +                break;
> So what does it mean practically if we break here? Not all the numa nodes
> will function? Outputting a message for the end user may be useful.

The break here means we have populated the mem nodes for the non-pluggable
ram case, and the remaining mem (if any) will be covered by the pc-dimm case.
Yes, it is possible that the Guest kernel will report memory-less numa nodes.
I will add a msg here.
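For reference, the clamping logic under discussion can be sketched as a
self-contained function (distribute_ram and its signature are illustrative
names, not the actual QEMU code):

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Distribute 'ram_size' bytes of non-pluggable RAM across numa nodes,
 * clamping each node to the remaining size. Returns the number of
 * /memory@addr nodes that would be created; any nodes beyond that get
 * no initial RAM (they would be covered by the pc-dimm region, if any).
 */
static int distribute_ram(const uint64_t *node_mem, int nb_nodes,
                          uint64_t ram_size, uint64_t *node_len)
{
    uint64_t mem_sz = ram_size;
    int i;

    for (i = 0; i < nb_nodes; i++) {
        node_len[i] = MIN(node_mem[i], mem_sz);
        mem_sz -= node_len[i];
        if (!mem_sz) {
            return i + 1;   /* remaining numa nodes are memory-less */
        }
    }
    return nb_nodes;
}
```

With a 1GB non-pluggable split and two 1GB numa nodes, only the first node
gets an initial /memory node, which is exactly the memory-less-node case the
suggested user message would flag.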

> > +           }
> >          }
> > -    } else {
> > -        Error *err = NULL;
> >
> > -        rc = fdt_path_offset(fdt, "/memory");
> > -        if (rc < 0) {
> > -            qemu_fdt_add_subnode(fdt, "/memory");
> > -        }
> > +        /* Create the node for initial pc-dimm ram, if any */
> > +        if (binfo->dimm_mem) {
> >
> > -        if (!qemu_fdt_getprop(fdt, "/memory", "device_type", NULL, &err)) {
> > -            qemu_fdt_setprop_string(fdt, "/memory", "device_type", "memory");
> > +            nodename = create_memory_fdt(fdt, acells, binfo->dimm_mem->base,
> > +                                              scells, binfo->dimm_mem->size);
> > +            if (!nodename) {
> > +                goto fail;
> > +            }
> > +            qemu_fdt_setprop_cell(fdt, nodename, "numa-node-id",
> > +                                                 binfo->dimm_mem->node);
> > +            g_free(nodename);
> >          }
> >
> > -        rc = qemu_fdt_setprop_sized_cells(fdt, "/memory", "reg",
> > -                                          acells, binfo->loader_start,
> > -                                          scells, binfo->ram_size);
> > -        if (rc < 0) {
> > -            fprintf(stderr, "couldn't set /memory/reg\n");
> > +    } else {
> > +
> > +        nodename = create_memory_fdt(fdt, acells, binfo->loader_start,
> > +                                         scells, binfo->ram_size);
> > +        if (!nodename) {
> >              goto fail;
> >          }
> > +
> > +        if (binfo->dimm_mem) {
> > +            nodename = create_memory_fdt(fdt, acells, binfo->dimm_mem->base,
> > +                                              scells, binfo->dimm_mem->size);
> > +            if (!nodename) {
> > +                goto fail;
> > +            }
> > +            g_free(nodename);
> > +        }
> as this code gets duplicated, a helper function may be relevant?

Ok.

Thanks,
Shameer

> Thanks
> 
> Eric
> >      }
> >
> >      rc = fdt_path_offset(fdt, "/chosen");
> > diff --git a/include/hw/arm/arm.h b/include/hw/arm/arm.h
> > index ce769bd..0ee3b4e 100644
> > --- a/include/hw/arm/arm.h
> > +++ b/include/hw/arm/arm.h
> > @@ -48,6 +48,12 @@ typedef struct {
> >      ARMCPU *cpu; /* handle to the first cpu object */
> >  } ArmLoadKernelNotifier;
> >
> > +struct dimm_mem_info {
> > +    int node;
> > +    hwaddr base;
> > +    hwaddr size;
> > +};
> > +
> >  /* arm_boot.c */
> >  struct arm_boot_info {
> >      uint64_t ram_size;
> > @@ -124,6 +130,12 @@ struct arm_boot_info {
> >      bool secure_board_setup;
> >
> >      arm_endianness endianness;
> > +
> > +    /* This is used to model a pc-dimm based mem if the valid iova region
> > +     * is non-contiguous.
> > +     */
> > +    struct dimm_mem_info *dimm_mem;
> > +
> >  };
> >
> >  /**
> >


* Re: [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
  2018-05-28 14:21   ` Auger Eric
@ 2018-05-30 14:48     ` Shameerali Kolothum Thodi
  2018-06-05  7:49     ` Shameerali Kolothum Thodi
  1 sibling, 0 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2018-05-30 14:48 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, Zhaoshenglong,
	Jonathan Cameron, Linuxarm



> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Monday, May 28, 2018 3:22 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> qemu-devel@nongnu.org; qemu-arm@nongnu.org
> Cc: drjones@redhat.com; imammedo@redhat.com; peter.maydell@linaro.org;
> alex.williamson@redhat.com; Zhaoshenglong <zhaoshenglong@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; Linuxarm
> <linuxarm@huawei.com>
> Subject: Re: [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
> 
> Hi Shameer,
> 
> On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> > In case valid iova regions are non-contiguous, split the
> > RAM mem into a 1GB non-pluggable dimm and remaining as a
> > single pc-dimm mem.
> 
> Please can you explain where this split comes from? Currently we
> have 254 GB non pluggable RAM. I read the discussion started with Drew
> on RFC v1 where he explained we cannot change the RAM base without
> crashing the FW. However we should at least document this 1GB split.

The comments were mainly to use the pc-dimm model instead of the "mem alias"
method used in RFC v1, as this will help to add mem hot-plug support
in the future.

I am not sure about the motive behind Drew's idea of splitting the first
1GB as non-pluggable and the remaining as a (cold) pc-dimm. Maybe it is to
attempt a best-effort scenario, but as I mentioned in the reply to 0/6, the
current solution will end up changing the base address if 0x40000000 falls
under a reserved region.

Again, I am not sure whether we should enforce a strict check on the base
address start or just warn the user that it will fail for a Guest with UEFI
boot [1].

Hi Drew, 

Please let me know your thoughts on this.

> > Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  hw/arm/virt.c | 261 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 256 insertions(+), 5 deletions(-)
> >
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index be3ad14..562e389 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -58,6 +58,12 @@
> >  #include "hw/smbios/smbios.h"
> >  #include "qapi/visitor.h"
> >  #include "standard-headers/linux/input.h"
> > +#include "hw/vfio/vfio-common.h"
> > +#include "qemu/config-file.h"
> > +#include "monitor/qdev.h"
> > +#include "qom/object_interfaces.h"
> > +#include "qapi/qmp/qdict.h"
> > +#include "qemu/option.h"
> 
> The comment at the beginning of virt.c would need to be reworked with
> your changes.

Ok.

> >  #define DEFINE_VIRT_MACHINE_LATEST(major, minor, latest) \
> >      static void virt_##major##_##minor##_class_init(ObjectClass *oc, \
> > @@ -110,7 +116,10 @@ static ARMPlatformBusSystemParams platform_bus_params;
> >   * terabyte of physical address space.)
> >   */
> >  #define RAMLIMIT_GB 255
> > -#define RAMLIMIT_BYTES (RAMLIMIT_GB * 1024ULL * 1024 * 1024)
> > +#define SZ_1G (1024ULL * 1024 * 1024)
> > +#define RAMLIMIT_BYTES (RAMLIMIT_GB * SZ_1G)
> > +
> > +#define ALIGN_1G (1ULL << 30)
> >
> >  /* Addresses and sizes of our components.
> >   * 0..128MB is space for a flash device so we can run bootrom code such as
> UEFI.
> > @@ -1171,6 +1180,236 @@ void virt_machine_done(Notifier *notifier, void *data)
> >      virt_build_smbios(vms);
> >  }
> >
> > +static void free_iova_copy(struct vfio_iova_head *iova_copy)
> > +{
> > +    VFIOIovaRange *iova, *tmp;
> > +
> > +    QLIST_FOREACH_SAFE(iova, iova_copy, next, tmp) {
> > +        QLIST_REMOVE(iova, next);
> > +        g_free(iova);
> > +    }
> > +}
> > +
> > +static void get_iova_copy(struct vfio_iova_head *iova_copy)
> > +{
> > +    VFIOIovaRange *iova, *new, *prev_iova = NULL;
> > +
> > +    QLIST_FOREACH(iova, &vfio_iova_regions, next) {
> > +        new = g_malloc0(sizeof(*iova));
> > +        new->start = iova->start;
> > +        new->end = iova->end;
> > +
> > +        if (prev_iova) {
> > +            QLIST_INSERT_AFTER(prev_iova, new, next);
> > +        } else {
> > +            QLIST_INSERT_HEAD(iova_copy, new, next);
> > +        }
> > +        prev_iova = new;
> > +    }
> > +}
> > +
> > +static hwaddr find_memory_chunk(VirtMachineState *vms,
> > +                   struct vfio_iova_head *iova_copy,
> > +                   hwaddr req_size, bool pcdimm)
> > +{
> > +    VFIOIovaRange *iova, *tmp;
> > +    hwaddr new_start, new_size, sz_align;
> > +    hwaddr virt_start = vms->memmap[VIRT_MEM].base;
> > +    hwaddr addr_align = ALIGN_1G; /* Set to max ARM64 hugepage size */
> > +
> > +    /* Size alignment */
> > +    sz_align = (pcdimm) ? MAX(TARGET_PAGE_SIZE, QEMU_VMALLOC_ALIGN) :
> > +                                                      TARGET_PAGE_SIZE;
> > +
> > +    QLIST_FOREACH_SAFE(iova, iova_copy, next, tmp) {
> > +        if (virt_start >= iova->end) {
> > +            continue;
> > +        }
> > +
> > +        /* Align addr */
> > +        new_start = ROUND_UP(MAX(virt_start, iova->start), addr_align);
> > +        if (new_start >= iova->end) {
> > +            continue;
> > +        }
> > +
> > +        if (req_size > iova->end - new_start + 1) {
> > +            continue;
> > +        }
> don't you want to check directly with new_size?

Yes. I think that should do.
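The corrected fit test being discussed could be sketched like this
(fit_in_range and its parameters are illustrative, not the actual QEMU
helper; the range end is inclusive, as in the patch):

```c
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define ROUND_UP(x, a)      (((x) + (a) - 1) & ~((uint64_t)(a) - 1))
#define QEMU_ALIGN_UP(x, a) ROUND_UP(x, a)

/*
 * Fit check against one iova range [start, end]: align the base, align
 * the size once, and test the aligned size directly against the range,
 * as suggested in the review. Returns the aligned base address, or
 * UINT64_MAX if the request does not fit.
 */
static uint64_t fit_in_range(uint64_t start, uint64_t end,
                             uint64_t virt_start, uint64_t addr_align,
                             uint64_t req_size, uint64_t sz_align)
{
    uint64_t new_start = ROUND_UP(MAX(virt_start, start), addr_align);
    uint64_t new_size = QEMU_ALIGN_UP(req_size, sz_align);

    if (new_start >= end || new_size > end - new_start + 1) {
        return UINT64_MAX;   /* caller moves on to the next iova range */
    }
    return new_start;
}
```

This folds the two separate `req_size`/`new_size` comparisons in the posted
code into a single check on the aligned size.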

> > +
> > +        /*
> > +         * Check the region can hold any size alignment requirement.
> > +         */
> > +        new_size = QEMU_ALIGN_UP(req_size, sz_align);
> > +
> > +        if ((new_start + new_size - 1 > iova->end) ||
> > +                 (new_start + new_size >= virt_start + RAMLIMIT_BYTES)) {
> > +            continue;
> > +        }
> > +
> > +        /*
> > +         * Modify the iova list entry for non pc-dimm case so that it
> > +         * is not used again for pc-dimm allocation.
> > +         */
> > +        if (!pcdimm) {
> > +            if (new_size - req_size) {
> I don't get this check. Don't you want to check whether the node's range is
> fully consumed and, if so, remove the node?
> 
> (new_size != iova->end - iova->start + 1)

Hmm.. looks like I missed that.
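For clarity, the fix being suggested could look roughly like this (a
simplified sketch with a plain struct standing in for the QLIST entry;
names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t start;
    uint64_t end;      /* inclusive, as in the patch */
    bool in_use;       /* models membership in the iova list */
} IovaRange;

/*
 * After carving [new_start, new_start + new_size) out of 'iova' for the
 * non-pluggable RAM, either shrink the range or drop it entirely so it
 * is not reused for the pc-dimm allocation.
 */
static void consume_range(IovaRange *iova, uint64_t new_start,
                          uint64_t new_size)
{
    if (new_size != iova->end - iova->start + 1) {
        iova->start = new_start + new_size;   /* shrink remaining range */
    } else {
        iova->in_use = false;   /* fully consumed: remove from the list */
    }
}
```

The difference from the posted code is the condition: it compares the
allocated size against the whole range, not against `req_size`, so a range
that is exactly consumed is removed rather than shrunk to zero.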

> > +                iova->start = new_start + new_size;
> > +            } else {
> > +                QLIST_REMOVE(iova, next);
> > +            }
> > +        }
> > +        return new_start;
> > +    }
> > +
> > +    return -1;
> > +}
> > +
> > +static void update_memory_regions(VirtMachineState *vms)
> > +{
> > +    hwaddr addr;
> > +    VFIOIovaRange *iova;
> > +    MachineState *machine = MACHINE(vms);
> > +    hwaddr virt_start = vms->memmap[VIRT_MEM].base;
> > +    hwaddr req_size, ram_size = machine->ram_size;
> > +    struct vfio_iova_head iova_copy = QLIST_HEAD_INITIALIZER(iova_copy);
> > +
> > +    /* No valid iova regions, use default */
> > +    if (QLIST_EMPTY(&vfio_iova_regions)) {
> > +        vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
> > +        vms->bootinfo.ram_size = ram_size;
> > +        return;
> > +    }
> > +
> > +    /*
> > +     * If valid iovas has only one entry, check the req size fits in
> > +     * and can have the loader start < 4GB. This will make sure platforms
> > +     * with no holes in mem will have the same mem model as before.
> > +     */
> > +    req_size = ram_size;
> > +    iova = QLIST_NEXT(QLIST_FIRST(&vfio_iova_regions), next);
> > +    if (!iova) {
> > +        iova = QLIST_FIRST(&vfio_iova_regions);
> > +        addr = ROUND_UP(MAX(virt_start, iova->start), ALIGN_1G);
> > +        if ((addr < 4 * SZ_1G) && (ram_size <= iova->end - addr + 1) &&
> > +                  (addr + ram_size < virt_start + RAMLIMIT_BYTES)) {
> > +            vms->bootinfo.loader_start = addr;
> > +            vms->bootinfo.ram_size = ram_size;
> > +            return;
> > +        }
> > +    }
> > +
> > +    /* Get a copy of valid iovas and work on it */
> > +    get_iova_copy(&iova_copy);
> > +
> > +    /* Split the mem as first 1GB non-pluggable and rest as pc-dimm */
> > +    req_size = MIN(ram_size, SZ_1G);
> > +    addr = find_memory_chunk(vms, &iova_copy, req_size, false);
> According to what Drew reported, the base address of the cold-plug RAM
> must stay at 1GB otherwise the FW will be lost. So I think you must
> check the addr = 1GB

Please see my above comment on the base address.

> > +    if (addr == -1 || addr >= 4 * SZ_1G) {
> > +        goto out;
> > +    }
> > +
> > +    /*Update non-pluggable mem details */
> > +    machine->ram_size = req_size;
> > +    vms->bootinfo.loader_start = addr;
> > +    vms->bootinfo.ram_size = machine->ram_size;
> > +
> > +    req_size = ram_size - req_size;
> > +    if (!req_size) {
> > +        goto done;
> > +    }
> > +
> > +    /* Remaining memory is modeled as a pc-dimm. */
> > +    addr = find_memory_chunk(vms, &iova_copy, req_size, true);
> > +    if (addr == -1) {
> > +        goto out;
> > +    }
> > +
> > +    /*Update pc-dimm mem details */
> > +    vms->bootinfo.dimm_mem = g_new(struct dimm_mem_info, 1);
> > +    vms->bootinfo.dimm_mem->base = addr;
> > +    vms->bootinfo.dimm_mem->size = req_size;
> > +    machine->maxram_size = machine->ram_size + req_size;
> > +    machine->ram_slots += 1;
> > +
> > +done:
> > +    free_iova_copy(&iova_copy);
> > +    return;
> > +
> > +out:
> > +    free_iova_copy(&iova_copy);
> > +    error_report("mach-virt: Not enough contiguous memory to model ram");
> Outputting the requested size would help debugging (and maybe the available
> IOVA ranges).

Agree. I will change that.

Thanks,
Shameer

[1] https://github.com/tianocore/edk2/blob/master/ArmVirtPkg/ArmVirtQemu.dsc#L133 

> Thanks
> 
> Eric
> > +    exit(1);
> > +}
> > +
> > +static void create_pcdimms(VirtMachineState *vms,
> > +                             MemoryRegion *sysmem,
> > +                             MemoryRegion *ram)
> > +{
> > +    hwaddr addr, size;
> > +    Error *local_err = NULL;
> > +    QDict *qdict;
> > +    QemuOpts *opts;
> > +    char *tmp;
> > +
> > +    if (!vms->bootinfo.dimm_mem) {
> > +        return;
> > +    }
> > +
> > +    addr = vms->bootinfo.dimm_mem->base;
> > +    size = vms->bootinfo.dimm_mem->size;
> > +
> > +    /*Create hotplug address space */
> > +    vms->hotplug_memory.base = ROUND_UP(addr, ALIGN_1G);
> > +    size = ROUND_UP(size, MAX(TARGET_PAGE_SIZE, QEMU_VMALLOC_ALIGN));
> > +
> > +    memory_region_init(&vms->hotplug_memory.mr, OBJECT(vms),
> > +                                      "hotplug-memory", size);
> > +    memory_region_add_subregion(sysmem, vms->hotplug_memory.base,
> > +                                        &vms->hotplug_memory.mr);
> > +    /* Create backend mem object */
> > +    qdict = qdict_new();
> > +    qdict_put_str(qdict, "qom-type", "memory-backend-ram");
> > +    qdict_put_str(qdict, "id", "mem1");
> > +    tmp = g_strdup_printf("%"PRIu64 "M", size / (1024 * 1024));
> > +    qdict_put_str(qdict, "size", tmp);
> > +    g_free(tmp);
> > +
> > +    opts = qemu_opts_from_qdict(qemu_find_opts("object"), qdict, &local_err);
> > +    if (local_err) {
> > +        goto err;
> > +    }
> > +
> > +    user_creatable_add_opts(opts, &local_err);
> > +    qemu_opts_del(opts);
> > +    QDECREF(qdict);
> > +    if (local_err) {
> > +        goto err;
> > +    }
> > +
> > +    /* Create pc-dimm dev*/
> > +    qdict = qdict_new();
> > +    qdict_put_str(qdict, "driver", "pc-dimm");
> > +    qdict_put_str(qdict, "id", "dimm1");
> > +    qdict_put_str(qdict, "memdev", "mem1");
> > +
> > +    opts = qemu_opts_from_qdict(qemu_find_opts("device"), qdict, &local_err);
> > +    if (local_err) {
> > +        goto err;
> > +    }
> > +
> > +    qdev_device_add(opts, &local_err);
> > +    qemu_opts_del(opts);
> > +    QDECREF(qdict);
> > +    if (local_err) {
> > +        goto err;
> > +    }
> > +
> > +    return;
> > +
> > +err:
> > +    error_report_err(local_err);
> > +    exit(1);
> > +}
> > +
> >  static void virt_ram_memory_region_init(Notifier *notifier, void *data)
> >  {
> >      MemoryRegion *sysmem = get_system_memory();
> > @@ -1179,9 +1418,14 @@ static void virt_ram_memory_region_init(Notifier *notifier, void *data)
> >                                           ram_memory_region_init);
> >      MachineState *machine = MACHINE(vms);
> >
> > +    update_memory_regions(vms);
> >      memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> >                                           machine->ram_size);
> > -    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
> > +    memory_region_add_subregion(sysmem, vms->bootinfo.loader_start, ram);
> > +
> > +    if (vms->bootinfo.dimm_mem) {
> > +        create_pcdimms(vms, sysmem, ram);
> > +    }
> >  }
> >
> >  static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
> > @@ -1404,13 +1648,11 @@ static void machvirt_init(MachineState *machine)
> >      vms->machine_done.notify = virt_machine_done;
> >      qemu_add_machine_init_done_notifier(&vms->machine_done);
> >
> > -    vms->bootinfo.ram_size = machine->ram_size;
> >      vms->bootinfo.kernel_filename = machine->kernel_filename;
> >      vms->bootinfo.kernel_cmdline = machine->kernel_cmdline;
> >      vms->bootinfo.initrd_filename = machine->initrd_filename;
> >      vms->bootinfo.nb_cpus = smp_cpus;
> >      vms->bootinfo.board_id = -1;
> > -    vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
> >      vms->bootinfo.get_dtb = machvirt_dtb;
> >      vms->bootinfo.firmware_loaded = firmware_loaded;
> >
> > @@ -1559,7 +1801,7 @@ static void virt_dimm_plug(HotplugHandler *hotplug_dev,
> >      PCDIMMDevice *dimm = PC_DIMM(dev);
> >      PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
> >      MemoryRegion *mr;
> > -    uint64_t align;
> > +    uint64_t align, addr;
> >      Error *local_err = NULL;
> >
> >      mr = ddc->get_memory_region(dimm, &local_err);
> > @@ -1573,6 +1815,15 @@ static void virt_dimm_plug(HotplugHandler *hotplug_dev,
> >          goto out;
> >      }
> >
> > +    addr = object_property_get_uint(OBJECT(dimm), PC_DIMM_ADDR_PROP,
> > +                                                       &error_fatal);
> > +    /* Assign the node for pc-dimm initial ram */
> > +    if (vms->bootinfo.dimm_mem && (addr == vms->bootinfo.dimm_mem->base)
> > +                                                 && (nb_numa_nodes > 0)) {
> > +        vms->bootinfo.dimm_mem->node = object_property_get_uint(OBJECT(dev),
> > +                                           PC_DIMM_NODE_PROP, &error_fatal);
> > +    }
> > +
> >  out:
> >      error_propagate(errp, local_err);
> >  }
> >


* Re: [Qemu-devel] [RFC v2 2/6] hw/arm/virt: Enable dynamic generation of guest RAM memory regions
  2018-05-28 16:47   ` Andrew Jones
@ 2018-05-30 14:50     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2018-05-30 14:50 UTC (permalink / raw)
  To: Andrew Jones
  Cc: qemu-devel, qemu-arm, peter.maydell, Jonathan Cameron, Linuxarm,
	eric.auger, alex.williamson, Zhaoshenglong, imammedo

Hi Drew,

Thanks for going through this.

> -----Original Message-----
> From: Andrew Jones [mailto:drjones@redhat.com]
> Sent: Monday, May 28, 2018 5:47 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: qemu-devel@nongnu.org; qemu-arm@nongnu.org;
> peter.maydell@linaro.org; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>;
> eric.auger@redhat.com; alex.williamson@redhat.com; Zhaoshenglong
> <zhaoshenglong@huawei.com>; imammedo@redhat.com
> Subject: Re: [Qemu-devel] [RFC v2 2/6] hw/arm/virt: Enable dynamic
> generation of guest RAM memory regions
> 
> On Wed, May 16, 2018 at 04:20:22PM +0100, Shameer Kolothum wrote:
> > Register ram_memory_region_init notifier to allocate memory region
> > from system memory.
> >
> > Signed-off-by: Zhu Yijun <zhuyijun@huawei.com>
> > Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  hw/arm/virt.c         | 28 ++++++++++++++++++++++------
> >  include/hw/arm/virt.h |  1 +
> >  2 files changed, 23 insertions(+), 6 deletions(-)
> >
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 94dcb12..05fcb62 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -1171,6 +1171,19 @@ void virt_machine_done(Notifier *notifier, void *data)
> >      virt_build_smbios(vms);
> >  }
> >
> > +static void virt_ram_memory_region_init(Notifier *notifier, void *data)
> > +{
> > +    MemoryRegion *sysmem = get_system_memory();
> > +    MemoryRegion *ram = g_new(MemoryRegion, 1);
> > +    VirtMachineState *vms = container_of(notifier, VirtMachineState,
> > +                                         ram_memory_region_init);
> > +    MachineState *machine = MACHINE(vms);
> > +
> > +    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> > +                                         machine->ram_size);
> > +    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
> > +}
> > +
> >  static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
> >  {
> >      uint8_t clustersz = ARM_DEFAULT_CPUS_PER_CLUSTER;
> > @@ -1204,7 +1217,6 @@ static void machvirt_init(MachineState *machine)
> >      MemoryRegion *sysmem = get_system_memory();
> >      MemoryRegion *secure_sysmem = NULL;
> >      int n, virt_max_cpus;
> > -    MemoryRegion *ram = g_new(MemoryRegion, 1);
> >      bool firmware_loaded = bios_name || drive_get(IF_PFLASH, 0, 0);
> >
> >      /* We can probe only here because during property set
> > @@ -1361,10 +1373,6 @@ static void machvirt_init(MachineState *machine)
> >      fdt_add_timer_nodes(vms);
> >      fdt_add_cpu_nodes(vms);
> >
> > -    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> > -                                         machine->ram_size);
> > -    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
> > -
> >      create_flash(vms, sysmem, secure_sysmem ? secure_sysmem : sysmem);
> >
> >      create_gic(vms, pic);
> > @@ -1405,15 +1413,23 @@ static void machvirt_init(MachineState *machine)
> >      vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
> >      vms->bootinfo.get_dtb = machvirt_dtb;
> >      vms->bootinfo.firmware_loaded = firmware_loaded;
> > +
> > +    /* Register notifiers. They are executed in registration reverse order */
> >      arm_load_kernel(ARM_CPU(first_cpu), &vms->bootinfo);
> >
> >      /*
> >       * arm_load_kernel machine init done notifier registration must
> >       * happen before the platform_bus_create call. In this latter,
> >       * another notifier is registered which adds platform bus nodes.
> > -     * Notifiers are executed in registration reverse order.
> >       */
> >      create_platform_bus(vms, pic);
> > +
> > +    /*
> > +     * Register memory region notifier last as this has to be executed
> > +     * first.
> > +     */
> > +    vms->ram_memory_region_init.notify = virt_ram_memory_region_init;
> > +    qemu_add_machine_init_done_notifier(&vms->ram_memory_region_init);
> 
> I realize we've been slow to get to this for reviewing, but a lot has
> changed here since this patch was written. virt now only has a single
> machine done notifier. On rebase we should attempt to integrate with
> that one rather than introduce a new one.

Ok, I will take a look at that one.

Thanks,
Shameer

> Thanks,
> drew
> 
> 
> >  }
> >
> >  static bool virt_get_secure(Object *obj, Error **errp)
> > diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> > index ba0c1a4..fc24f3a 100644
> > --- a/include/hw/arm/virt.h
> > +++ b/include/hw/arm/virt.h
> > @@ -91,6 +91,7 @@ typedef struct {
> >  typedef struct {
> >      MachineState parent;
> >      Notifier machine_done;
> > +    Notifier ram_memory_region_init;
> >      FWCfgState *fw_cfg;
> >      bool secure;
> >      bool highmem;
> > --
> > 2.7.4
> >
> >
> >


* Re: [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to accommodate non-contiguous mem
  2018-05-28 17:02   ` Andrew Jones
@ 2018-05-30 14:51     ` Shameerali Kolothum Thodi
  2018-05-31 20:15     ` Auger Eric
  1 sibling, 0 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2018-05-30 14:51 UTC (permalink / raw)
  To: Andrew Jones
  Cc: qemu-devel, qemu-arm, peter.maydell, Jonathan Cameron, Linuxarm,
	eric.auger, alex.williamson, Zhaoshenglong, imammedo



> -----Original Message-----
> From: Andrew Jones [mailto:drjones@redhat.com]
> Sent: Monday, May 28, 2018 6:02 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: qemu-devel@nongnu.org; qemu-arm@nongnu.org;
> peter.maydell@linaro.org; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>;
> eric.auger@redhat.com; alex.williamson@redhat.com; Zhaoshenglong
> <zhaoshenglong@huawei.com>; imammedo@redhat.com
> Subject: Re: [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to
> accommodate non-contiguous mem
> 
> On Wed, May 16, 2018 at 04:20:25PM +0100, Shameer Kolothum wrote:
> > This is in preparation for the next patch where initial ram is split
> > into a non-pluggable chunk and a pc-dimm modeled mem if the valid
> > iova regions are non-contiguous.
> >
> > Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  hw/arm/virt-acpi-build.c | 24 ++++++++++++++++++++----
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> >
> > diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> > index c7c6a57..8d17b40 100644
> > --- a/hw/arm/virt-acpi-build.c
> > +++ b/hw/arm/virt-acpi-build.c
> > @@ -488,7 +488,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> >      AcpiSratProcessorGiccAffinity *core;
> >      AcpiSratMemoryAffinity *numamem;
> >      int i, srat_start;
> > -    uint64_t mem_base;
> > +    uint64_t mem_base, mem_sz, mem_len;
> >      MachineClass *mc = MACHINE_GET_CLASS(vms);
> >      const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(MACHINE(vms));
> >
> > @@ -505,12 +505,28 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> >          core->flags = cpu_to_le32(1);
> >      }
> >
> > -    mem_base = vms->memmap[VIRT_MEM].base;
> > +    mem_base = vms->bootinfo.loader_start;
> > +    mem_sz = vms->bootinfo.loader_start;
> 
> mem_sz = vms->bootinfo.ram_size;
> 
> Assuming the DT generator was correct, meaning bootinfo.ram_size will
> be the size of the non-pluggable dimm.

Oops! I will correct that.

Thanks,
Shameer


> >      for (i = 0; i < nb_numa_nodes; ++i) {
> >          numamem = acpi_data_push(table_data, sizeof(*numamem));
> > -        build_srat_memory(numamem, mem_base, numa_info[i].node_mem, i,
> > +        mem_len = MIN(numa_info[i].node_mem, mem_sz);
> > +        build_srat_memory(numamem, mem_base, mem_len, i,
> >                            MEM_AFFINITY_ENABLED);
> > -        mem_base += numa_info[i].node_mem;
> > +        mem_base += mem_len;
> > +        mem_sz -= mem_len;
> > +        if (!mem_sz) {
> > +            break;
> > +        }
> > +    }
> > +
> > +    /* Create table for initial pc-dimm ram, if any */
> > +    if (vms->bootinfo.dimm_mem) {
> > +        numamem = acpi_data_push(table_data, sizeof(*numamem));
> > +        build_srat_memory(numamem, vms->bootinfo.dimm_mem->base,
> > +                          vms->bootinfo.dimm_mem->size,
> > +                          vms->bootinfo.dimm_mem->node,
> > +                          MEM_AFFINITY_ENABLED);
> > +
> >      }
> >
> >      build_header(linker, table_data, (void *)(table_data->data + srat_start),
> > --
> > 2.7.4
> >
> >
> >


* Re: [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions
  2018-05-30 14:39   ` Shameerali Kolothum Thodi
@ 2018-05-30 15:24     ` Auger Eric
  0 siblings, 0 replies; 34+ messages in thread
From: Auger Eric @ 2018-05-30 15:24 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, Zhaoshenglong,
	Jonathan Cameron, Linuxarm

Hi Shameer,

On 05/30/2018 04:39 PM, Shameerali Kolothum Thodi wrote:
> Hi Eric,
> 
>> -----Original Message-----
>> From: Auger Eric [mailto:eric.auger@redhat.com]
>> Sent: Monday, May 28, 2018 3:22 PM
>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
>> qemu-devel@nongnu.org; qemu-arm@nongnu.org
>> Cc: drjones@redhat.com; imammedo@redhat.com; peter.maydell@linaro.org;
>> alex.williamson@redhat.com; Zhaoshenglong <zhaoshenglong@huawei.com>;
>> Jonathan Cameron <jonathan.cameron@huawei.com>; Linuxarm
>> <linuxarm@huawei.com>
>> Subject: Re: [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions
>>
>> Hi Shameer,
>>
>> On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
>>> When the kernel reports valid iova ranges as non-contiguous,
>>> memory should be allocated to Guest in such a way that
>>> reserved regions(holes) are not visible by Guest.
>>>
>>> This series retrieves the valid iova ranges based on the new
>>> proposed VFIO interface for kernel [1]. It then populates the
>>> first 1GB ram as a non-pluggable dimm and rest as a pc-dimm if
>>> the valid iova ranges are non-contiguous.
>>
>> Some general comments:
> 
> Thanks for going through this series.
> 
>> - what are your plans with respect to VFIO device hot-plug handling?
>> IIUC, at the moment, any collision between reserved regions induced by
>> hot-plugged devices is not detected/handled. I understand mem hotplug
>> is not supported on aarch64. Would you simply reject the vfio device
>> hotplug?
> 
> As per the new kernel changes, _attach_group() will fail if the device's
> reserved regions conflict with any existing DMA mappings of the VM.
> So my understanding is that the host kernel will in effect reject any
> hot-plugged device that has a reserved region conflicting with the DMA mappings.

Oh you're right, as the whole guest RAM space may be dma mapped, this
should be detected.
> 
>> - IIUC any reserved window colliding with the [0x40000000, 1GB] cold-plug RAM
>> segment is a show-stopper. How was this 1GB size chosen exactly?
> 
> Yes, any reserved window in that region might prevent the Guest from booting
> with UEFI, as the current solution will move the base start address. I think this
> is because EDK2 UEFI expects the base to start at 0x40000000 [1]. I am not sure
> why this is required by the UEFI, as without the "-bios" option the Guest can be
> booted as long as base addr < 4GB.
> 
>> - Currently you create a single PC-DIMM node whereas several ones may be
>> possible & necessary? Do you plan to support multiple PC-DIMMS node?
> 
> Yes, that is correct. This is mainly to keep it simple, with the expectation that
> the valid host iova regions may not be that fragmented. I can look into extending
> this to support multiple pc-dimms if there is an immediate requirement to do so.
>  
>> - I have started looking at RAM extension on machvirt. I think adding
>> support of PC-DIMMs through the qemu cmd line is something that we need
>> to work on, in parallel.
> 
> Ok. Do you have more info that you can share on this? Please let me know.
Not really. We just plan to support more than 255GB RAM and support this
through PC-DIMM. Some of your patches can be reused, I think.

Thanks

Eric
> 
> Thanks,
> Shameer
> 
> [1] https://github.com/tianocore/edk2/blob/master/ArmVirtPkg/ArmVirtQemu.dsc#L133
> 
>> Thanks
>>
>> Eric
>>>
>>> Patch #3 of this series is loosely based on an earlier attempt
>>> by Kwangwoo Lee to add hotplug/pc-dimm support to arm64[2]
>>>
>>> RFC v1[3] --> RFCv2
>>>  -Based on new VFIO kernel interface
>>>  -Part of Mem modelled as pc-dimm
>>>  -Rebased to qemu 2.12.0
>>>
>>> [1] https://lkml.org/lkml/2018/4/18/293
>>> [2] https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04600.html
>>> [3] https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg02412.html
>>>
>>> Shameer Kolothum (6):
>>>   hw/vfio: Retrieve valid iova ranges from kernel
>>>   hw/arm/virt: Enable dynamic generation of guest RAM memory regions
>>>   hw/arm/virt: Add pc-dimm mem hotplug framework
>>>   hw/arm: Changes required to accommodate non-contiguous DT mem nodes
>>>   hw/arm: ACPI SRAT changes to accommodate non-contiguous mem
>>>   hw/arm: Populate non-contiguous memory regions
>>>
>>>  default-configs/aarch64-softmmu.mak |   1 +
>>>  hw/arm/boot.c                       |  91 ++++++---
>>>  hw/arm/virt-acpi-build.c            |  24 ++-
>>>  hw/arm/virt.c                       | 367
>> +++++++++++++++++++++++++++++++++++-
>>>  hw/vfio/common.c                    | 108 ++++++++++-
>>>  include/hw/arm/arm.h                |  12 ++
>>>  include/hw/arm/virt.h               |   3 +
>>>  include/hw/vfio/vfio-common.h       |   7 +
>>>  linux-headers/linux/vfio.h          |  23 +++
>>>  9 files changed, 589 insertions(+), 47 deletions(-)
>>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to accommodate non-contiguous mem
  2018-05-28 17:02   ` Andrew Jones
  2018-05-30 14:51     ` Shameerali Kolothum Thodi
@ 2018-05-31 20:15     ` Auger Eric
  2018-06-01 11:09       ` Shameerali Kolothum Thodi
  1 sibling, 1 reply; 34+ messages in thread
From: Auger Eric @ 2018-05-31 20:15 UTC (permalink / raw)
  To: Andrew Jones, Shameer Kolothum
  Cc: peter.maydell, zhaoshenglong, linuxarm, qemu-devel,
	alex.williamson, qemu-arm, jonathan.cameron, imammedo

Hi Shameer,

On 05/28/2018 07:02 PM, Andrew Jones wrote:
> On Wed, May 16, 2018 at 04:20:25PM +0100, Shameer Kolothum wrote:
>> This is in preparation for the next patch where initial ram is split
>> into a non-pluggable chunk and a pc-dimm modeled mem if the valid
>> iova regions are non-contiguous.
>>
>> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
>> ---
>>  hw/arm/virt-acpi-build.c | 24 ++++++++++++++++++++----
>>  1 file changed, 20 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
>> index c7c6a57..8d17b40 100644
>> --- a/hw/arm/virt-acpi-build.c
>> +++ b/hw/arm/virt-acpi-build.c
>> @@ -488,7 +488,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>>      AcpiSratProcessorGiccAffinity *core;
>>      AcpiSratMemoryAffinity *numamem;
>>      int i, srat_start;
>> -    uint64_t mem_base;
>> +    uint64_t mem_base, mem_sz, mem_len;
>>      MachineClass *mc = MACHINE_GET_CLASS(vms);
>>      const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(MACHINE(vms));
>>  
>> @@ -505,12 +505,28 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>>          core->flags = cpu_to_le32(1);
>>      }
>>  
>> -    mem_base = vms->memmap[VIRT_MEM].base;
>> +    mem_base = vms->bootinfo.loader_start;
>> +    mem_sz = vms->bootinfo.loader_start;
> 
> mem_sz = vms->bootinfo.ram_size;
> 
> Assuming the DT generator was correct, meaning bootinfo.ram_size will
> be the size of the non-pluggable dimm.
> 
> 
>>      for (i = 0; i < nb_numa_nodes; ++i) {
>>          numamem = acpi_data_push(table_data, sizeof(*numamem));
>> -        build_srat_memory(numamem, mem_base, numa_info[i].node_mem, i,
>> +        mem_len = MIN(numa_info[i].node_mem, mem_sz);
>> +        build_srat_memory(numamem, mem_base, mem_len, i,
>>                            MEM_AFFINITY_ENABLED);
>> -        mem_base += numa_info[i].node_mem;
>> +        mem_base += mem_len;
>> +        mem_sz -= mem_len;
>> +        if (!mem_sz) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    /* Create table for initial pc-dimm ram, if any */
>> +    if (vms->bootinfo.dimm_mem) {
>> +        numamem = acpi_data_push(table_data, sizeof(*numamem));
>> +        build_srat_memory(numamem, vms->bootinfo.dimm_mem->base,
>> +                          vms->bootinfo.dimm_mem->size,
>> +                          vms->bootinfo.dimm_mem->node,
>> +                          MEM_AFFINITY_ENABLED);
If my understanding is correct, the SRAT table is built only if
nb_numa_nodes > 0. I don't get how the PC-DIMM region is exposed if NUMA
nodes are not set?

Thanks

Eric
>> +
>>      }
>>  
>>      build_header(linker, table_data, (void *)(table_data->data + srat_start),
>> -- 
>> 2.7.4
>>
>>
>>
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to accommodate non-contiguous mem
  2018-05-31 20:15     ` Auger Eric
@ 2018-06-01 11:09       ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2018-06-01 11:09 UTC (permalink / raw)
  To: Auger Eric, Andrew Jones
  Cc: peter.maydell, Zhaoshenglong, Linuxarm, qemu-devel,
	alex.williamson, qemu-arm, Jonathan Cameron, imammedo

Hi Eric,

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Thursday, May 31, 2018 9:16 PM
> To: Andrew Jones <drjones@redhat.com>; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>
> Cc: peter.maydell@linaro.org; Zhaoshenglong <zhaoshenglong@huawei.com>;
> Linuxarm <linuxarm@huawei.com>; qemu-devel@nongnu.org;
> alex.williamson@redhat.com; qemu-arm@nongnu.org; Jonathan Cameron
> <jonathan.cameron@huawei.com>; imammedo@redhat.com
> Subject: Re: [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to
> accommodate non-contiguous mem
> 
> Hi Shameer,
> 
> On 05/28/2018 07:02 PM, Andrew Jones wrote:
> > On Wed, May 16, 2018 at 04:20:25PM +0100, Shameer Kolothum wrote:
> >> This is in preparation for the next patch where initial ram is split
> >> into a non-pluggable chunk and a pc-dimm modeled mem if the valid
> >> iova regions are non-contiguous.
> >>
> >> Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> >> ---
> >>  hw/arm/virt-acpi-build.c | 24 ++++++++++++++++++++----
> >>  1 file changed, 20 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> >> index c7c6a57..8d17b40 100644
> >> --- a/hw/arm/virt-acpi-build.c
> >> +++ b/hw/arm/virt-acpi-build.c
> >> @@ -488,7 +488,7 @@ build_srat(GArray *table_data, BIOSLinker *linker,
> VirtMachineState *vms)
> >>      AcpiSratProcessorGiccAffinity *core;
> >>      AcpiSratMemoryAffinity *numamem;
> >>      int i, srat_start;
> >> -    uint64_t mem_base;
> >> +    uint64_t mem_base, mem_sz, mem_len;
> >>      MachineClass *mc = MACHINE_GET_CLASS(vms);
> >>      const CPUArchIdList *cpu_list = mc-
> >possible_cpu_arch_ids(MACHINE(vms));
> >>
> >> @@ -505,12 +505,28 @@ build_srat(GArray *table_data, BIOSLinker
> *linker, VirtMachineState *vms)
> >>          core->flags = cpu_to_le32(1);
> >>      }
> >>
> >> -    mem_base = vms->memmap[VIRT_MEM].base;
> >> +    mem_base = vms->bootinfo.loader_start;
> >> +    mem_sz = vms->bootinfo.loader_start;
> >
> > mem_sz = vms->bootinfo.ram_size;
> >
> > Assuming the DT generator was correct, meaning bootinfo.ram_size will
> > be the size of the non-pluggable dimm.
> >
> >
> >>      for (i = 0; i < nb_numa_nodes; ++i) {
> >>          numamem = acpi_data_push(table_data, sizeof(*numamem));
> >> -        build_srat_memory(numamem, mem_base, numa_info[i].node_mem,
> i,
> >> +        mem_len = MIN(numa_info[i].node_mem, mem_sz);
> >> +        build_srat_memory(numamem, mem_base, mem_len, i,
> >>                            MEM_AFFINITY_ENABLED);
> >> -        mem_base += numa_info[i].node_mem;
> >> +        mem_base += mem_len;
> >> +        mem_sz -= mem_len;
> >> +        if (!mem_sz) {
> >> +            break;
> >> +        }
> >> +    }
> >> +
> >> +    /* Create table for initial pc-dimm ram, if any */
> >> +    if (vms->bootinfo.dimm_mem) {
> >> +        numamem = acpi_data_push(table_data, sizeof(*numamem));
> >> +        build_srat_memory(numamem, vms->bootinfo.dimm_mem->base,
> >> +                          vms->bootinfo.dimm_mem->size,
> >> +                          vms->bootinfo.dimm_mem->node,
> >> +                          MEM_AFFINITY_ENABLED);
> If my understanding is correct, the SRAT table is built only if
> nb_numa_nodes > 0. I don't get how the PC-DIMM region is exposed if NUMA
> nodes are not set?

Yes, SRAT is only built when nb_numa_nodes > 0. I had the same doubt about how the Guest
would see the pc-dimm node on an ACPI boot without NUMA nodes, but during my tests it did.

These are my qemu command options; please find below the boot logs with and without the "-numa node,nodeid=0" option:

./qemu-system-aarch64 -machine virt,kernel_irqchip=on,gic-version=3 -cpu host \
-kernel Image \
-initrd rootfs-iperf.cpio \
-device vfio-pci,host=000a:11:10.0 \
-net none \
-m 12G \
-numa node,nodeid=0 \
-nographic -D -d -enable-kvm \
-smp 4 \
-bios QEMU_EFI.fd \
-append "console=ttyAMA0 root=/dev/vda -m 4096 rw earlycon=pl011,0x9000000 acpi=force"
  

1. Guest Boot log (without -numa node,nodeid=0 )
---------------------------------------------------------------

[    0.000000] Boot CPU: AArch64 Processor [410fd082]
[    0.000000] earlycon: pl11 at MMIO 0x0000000009000000 (options '')
[    0.000000] bootconsole [pl11] enabled
[    0.000000] efi: Getting EFI parameters from FDT:
[    0.000000] efi: EFI v2.60 by EDK II
[    0.000000] efi:  SMBIOS 3.0=0x78710000  ACPI 2.0=0x789b0000  MEMATTR=0x7ba44018 
[    0.000000] cma: Reserved 16 MiB at 0x000000007f000000
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: RSDP 0x00000000789B0000 000024 (v02 BOCHS )
[    0.000000] ACPI: XSDT 0x00000000789A0000 000054 (v01 BOCHS  BXPCFACP 00000001      01000013)
[    0.000000] ACPI: FACP 0x0000000078610000 00010C (v05 BOCHS  BXPCFACP 00000001 BXPC 00000001)
[    0.000000] ACPI: DSDT 0x0000000078620000 0011F7 (v02 BOCHS  BXPCDSDT 00000001 BXPC 00000001)
[    0.000000] ACPI: APIC 0x0000000078600000 000198 (v03 BOCHS  BXPCAPIC 00000001 BXPC 00000001)
[    0.000000] ACPI: GTDT 0x00000000785F0000 000060 (v02 BOCHS  BXPCGTDT 00000001 BXPC 00000001)
[    0.000000] ACPI: MCFG 0x00000000785E0000 00003C (v01 BOCHS  BXPCMCFG 00000001 BXPC 00000001)
[    0.000000] ACPI: SPCR 0x00000000785D0000 000050 (v02 BOCHS  BXPCSPCR 00000001 BXPC 00000001)
[    0.000000] ACPI: IORT 0x00000000785C0000 00007C (v00 BOCHS  BXPCIORT 00000001 BXPC 00000001)
[    0.000000] ACPI: SPCR: console: pl011,mmio,0x9000000,9600
[    0.000000] ACPI: NUMA: Failed to initialise from firmware
[    0.000000] NUMA: Faking a node at [mem 0x0000000000000000-0x00000003bfffffff]
[    0.000000] NUMA: Adding memblock [0x40000000 - 0x785bffff] on node 0
[    0.000000] NUMA: Adding memblock [0x785c0000 - 0x7862ffff] on node 0
[    0.000000] NUMA: Adding memblock [0x78630000 - 0x786fffff] on node 0
[    0.000000] NUMA: Adding memblock [0x78700000 - 0x78b63fff] on node 0
[    0.000000] NUMA: Adding memblock [0x78b64000 - 0x7be3ffff] on node 0
[    0.000000] NUMA: Adding memblock [0x7be40000 - 0x7becffff] on node 0
[    0.000000] NUMA: Adding memblock [0x7bed0000 - 0x7bedffff] on node 0
[    0.000000] NUMA: Adding memblock [0x7bee0000 - 0x7bffffff] on node 0
[    0.000000] NUMA: Adding memblock [0x7c000000 - 0x7fffffff] on node 0
[    0.000000] NUMA: Adding memblock [0x100000000 - 0x3bfffffff] on node 0
[    0.000000] NUMA: Initmem setup node 0 [mem 0x40000000-0x3bfffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x3bffef500-0x3bfff0fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000040000000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x00000003bfffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000040000000-0x00000000785bffff]
[    0.000000]   node   0: [mem 0x00000000785c0000-0x000000007862ffff]
[    0.000000]   node   0: [mem 0x0000000078630000-0x00000000786fffff]
[    0.000000]   node   0: [mem 0x0000000078700000-0x0000000078b63fff]
[    0.000000]   node   0: [mem 0x0000000078b64000-0x000000007be3ffff]
[    0.000000]   node   0: [mem 0x000000007be40000-0x000000007becffff]
[    0.000000]   node   0: [mem 0x000000007bed0000-0x000000007bedffff]
[    0.000000]   node   0: [mem 0x000000007bee0000-0x000000007bffffff]
[    0.000000]   node   0: [mem 0x000000007c000000-0x000000007fffffff]
[    0.000000]   node   0: [mem 0x0000000100000000-0x00000003bfffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000040000000-0x00000003bfffffff]
[    0.000000] psci: probing for conduit method from ACPI.


2. Guest Boot log (with -numa node,nodeid=0 )

[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.11.0-rc1-g7426f0c (shameer@shameer-ubuntu) (gcc version 4.9.2 20140904 (prerelease) (crosstool-NG linaro-1.13.1-4.9-2014.09 - Linaro GCC 4.9-2014.09) ) #228 SMP PREEMPT Mon Apr 24 14:51:06 BST 2017
[    0.000000] Boot CPU: AArch64 Processor [410fd082]
[    0.000000] earlycon: pl11 at MMIO 0x0000000009000000 (options '')
[    0.000000] bootconsole [pl11] enabled
[    0.000000] efi: Getting EFI parameters from FDT:
[    0.000000] efi: EFI v2.60 by EDK II
[    0.000000] efi:  SMBIOS 3.0=0x78710000  ACPI 2.0=0x789b0000  MEMATTR=0x7ba44018 
[    0.000000] cma: Reserved 16 MiB at 0x000000007f000000
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: RSDP 0x00000000789B0000 000024 (v02 BOCHS )
[    0.000000] ACPI: XSDT 0x00000000789A0000 00005C (v01 BOCHS  BXPCFACP 00000001      01000013)
[    0.000000] ACPI: FACP 0x0000000078610000 00010C (v05 BOCHS  BXPCFACP 00000001 BXPC 00000001)
[    0.000000] ACPI: DSDT 0x0000000078620000 0011F7 (v02 BOCHS  BXPCDSDT 00000001 BXPC 00000001)
[    0.000000] ACPI: APIC 0x0000000078600000 000198 (v03 BOCHS  BXPCAPIC 00000001 BXPC 00000001)
[    0.000000] ACPI: GTDT 0x00000000785F0000 000060 (v02 BOCHS  BXPCGTDT 00000001 BXPC 00000001)
[    0.000000] ACPI: MCFG 0x00000000785E0000 00003C (v01 BOCHS  BXPCMCFG 00000001 BXPC 00000001)
[    0.000000] ACPI: SPCR 0x00000000785D0000 000050 (v02 BOCHS  BXPCSPCR 00000001 BXPC 00000001)
[    0.000000] ACPI: SRAT 0x00000000785C0000 0000C8 (v03 BOCHS  BXPCSRAT 00000001 BXPC 00000001)
[    0.000000] ACPI: IORT 0x00000000785B0000 00007C (v00 BOCHS  BXPCIORT 00000001 BXPC 00000001)
[    0.000000] ACPI: SPCR: console: pl011,mmio,0x9000000,9600
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x0 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x1 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x2 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x3 -> Node 0
[    0.000000] NUMA: Adding memblock [0x40000000 - 0x7fffffff] on node 0
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x40000000-0x7fffffff]
[    0.000000] NUMA: Adding memblock [0x100000000 - 0x3bfffffff] on node 0
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x3bfffffff]
[    0.000000] NUMA: Initmem setup node 0 [mem 0x40000000-0x3bfffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x3bffef500-0x3bfff0fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000040000000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x00000003bfffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000040000000-0x00000000785affff]
[    0.000000]   node   0: [mem 0x00000000785b0000-0x000000007862ffff]
[    0.000000]   node   0: [mem 0x0000000078630000-0x00000000786fffff]
[    0.000000]   node   0: [mem 0x0000000078700000-0x0000000078b63fff]
[    0.000000]   node   0: [mem 0x0000000078b64000-0x000000007be3ffff]
[    0.000000]   node   0: [mem 0x000000007be40000-0x000000007becffff]
[    0.000000]   node   0: [mem 0x000000007bed0000-0x000000007bedffff]
[    0.000000]   node   0: [mem 0x000000007bee0000-0x000000007bffffff]
[    0.000000]   node   0: [mem 0x000000007c000000-0x000000007fffffff]
[    0.000000]   node   0: [mem 0x0000000100000000-0x00000003bfffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000040000000-0x00000003bfffffff]
[    0.000000] psci: probing for conduit method from ACPI.


In both cases the memblock [0x100000000 - 0x3bfffffff] is present, which corresponds to the
pc-dimm slot. My guess is that this is because the guest kernel retrieves the UEFI params from
the FDT when an EFI boot is detected.

[    0.000000] efi: Getting EFI parameters from FDT:
[    0.000000] efi: EFI v2.60 by EDK II

Maybe I am missing something here, or there are other boot scenarios where this is not the case.

Please let me know your thoughts.

Thanks,
Shameer

> Thanks
> 
> Eric
> >> +
> >>      }
> >>
> >>      build_header(linker, table_data, (void *)(table_data->data + srat_start),
> >> --
> >> 2.7.4
> >>
> >>
> >>
> >

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
  2018-05-28 14:21   ` Auger Eric
  2018-05-30 14:48     ` Shameerali Kolothum Thodi
@ 2018-06-05  7:49     ` Shameerali Kolothum Thodi
  2018-06-15 15:44       ` Andrew Jones
  1 sibling, 1 reply; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2018-06-05  7:49 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, qemu-arm
  Cc: drjones, imammedo, peter.maydell, alex.williamson, Zhaoshenglong,
	Jonathan Cameron, Linuxarm

Hi Drew,

> -----Original Message-----
> From: Shameerali Kolothum Thodi
> Sent: 30 May 2018 15:49
> To: 'Auger Eric' <eric.auger@redhat.com>; qemu-devel@nongnu.org; qemu-
> arm@nongnu.org
> Cc: drjones@redhat.com; imammedo@redhat.com; peter.maydell@linaro.org;
> alex.williamson@redhat.com; Zhaoshenglong <zhaoshenglong@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; Linuxarm
> <linuxarm@huawei.com>
> Subject: RE: [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
> 
> 
> 
> > -----Original Message-----
> > From: Auger Eric [mailto:eric.auger@redhat.com]
> > Sent: Monday, May 28, 2018 3:22 PM
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> > qemu-devel@nongnu.org; qemu-arm@nongnu.org
> > Cc: drjones@redhat.com; imammedo@redhat.com;
> peter.maydell@linaro.org;
> > alex.williamson@redhat.com; Zhaoshenglong
> <zhaoshenglong@huawei.com>;
> > Jonathan Cameron <jonathan.cameron@huawei.com>; Linuxarm
> > <linuxarm@huawei.com>
> > Subject: Re: [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
> >
> > Hi Shameer,
> >
> > On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> > > In case valid iova regions are non-contiguous, split the
> > > RAM mem into a 1GB non-pluggable dimm and the rest as a
> > > single pc-dimm mem.
> >
> > Please can you explain where this split comes from? Currently we
> > have 254 GB non-pluggable RAM. I read the discussion started with Drew
> > on RFC v1 where he explained we cannot change the RAM base without
> > crashing the FW. However we should at least document this 1GB split.
> 
> The comments were mainly to use the pc-dimm model instead of the "mem alias"
> method used in RFC v1, as this will help to add mem hot-plug support
> in the future.
> 
> I am not sure about the motive behind Drew's idea of splitting the first
> 1GB as non-pluggable and the remainder as a (cold-plugged) pc-dimm. Maybe it
> is to attempt a best-effort scenario, but as I mentioned in the reply to 0/6,
> the current solution will end up changing the base address if 0x40000000 falls
> under a reserved region.
> 
> Again, not sure whether we should enforce a strict check on the base address
> start or just warn the user that the Guest will fail to boot with UEFI [1].
> 
> Hi Drew,
> 
> Please let me know your thoughts on this.

Could you please take a look at the above discussion and let us
know your thoughts on the split of the mem regions into a 1GB
non-pluggable chunk and the rest as pc-dimm.

Thanks,
Shameer
 
> > > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > > ---
> > >  hw/arm/virt.c | 261
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 256 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > > index be3ad14..562e389 100644
> > > --- a/hw/arm/virt.c
> > > +++ b/hw/arm/virt.c
> > > @@ -58,6 +58,12 @@
> > >  #include "hw/smbios/smbios.h"
> > >  #include "qapi/visitor.h"
> > >  #include "standard-headers/linux/input.h"
> > > +#include "hw/vfio/vfio-common.h"
> > > +#include "qemu/config-file.h"
> > > +#include "monitor/qdev.h"
> > > +#include "qom/object_interfaces.h"
> > > +#include "qapi/qmp/qdict.h"
> > > +#include "qemu/option.h"
> >
> > The comment at the beginning of virt.c would need to be reworked with
> > your changes.
> 
> Ok.
> 
> > >  #define DEFINE_VIRT_MACHINE_LATEST(major, minor, latest) \
> > >      static void virt_##major##_##minor##_class_init(ObjectClass *oc, \
> > > @@ -110,7 +116,10 @@ static ARMPlatformBusSystemParams
> > platform_bus_params;
> > >   * terabyte of physical address space.)
> > >   */
> > >  #define RAMLIMIT_GB 255
> > > -#define RAMLIMIT_BYTES (RAMLIMIT_GB * 1024ULL * 1024 * 1024)
> > > +#define SZ_1G (1024ULL * 1024 * 1024)
> > > +#define RAMLIMIT_BYTES (RAMLIMIT_GB * SZ_1G)
> > > +
> > > +#define ALIGN_1G (1ULL << 30)
> > >
> > >  /* Addresses and sizes of our components.
> > >   * 0..128MB is space for a flash device so we can run bootrom code such as
> > UEFI.
> > > @@ -1171,6 +1180,236 @@ void virt_machine_done(Notifier *notifier,
> void
> > *data)
> > >      virt_build_smbios(vms);
> > >  }
> > >
> > > +static void free_iova_copy(struct vfio_iova_head *iova_copy)
> > > +{
> > > +    VFIOIovaRange *iova, *tmp;
> > > +
> > > +    QLIST_FOREACH_SAFE(iova, iova_copy, next, tmp) {
> > > +        QLIST_REMOVE(iova, next);
> > > +        g_free(iova);
> > > +    }
> > > +}
> > > +
> > > +static void get_iova_copy(struct vfio_iova_head *iova_copy)
> > > +{
> > > +    VFIOIovaRange *iova, *new, *prev_iova = NULL;
> > > +
> > > +    QLIST_FOREACH(iova, &vfio_iova_regions, next) {
> > > +        new = g_malloc0(sizeof(*iova));
> > > +        new->start = iova->start;
> > > +        new->end = iova->end;
> > > +
> > > +        if (prev_iova) {
> > > +            QLIST_INSERT_AFTER(prev_iova, new, next);
> > > +        } else {
> > > +            QLIST_INSERT_HEAD(iova_copy, new, next);
> > > +        }
> > > +        prev_iova = new;
> > > +    }
> > > +}
> > > +
> > > +static hwaddr find_memory_chunk(VirtMachineState *vms,
> > > +                   struct vfio_iova_head *iova_copy,
> > > +                   hwaddr req_size, bool pcdimm)
> > > +{
> > > +    VFIOIovaRange *iova, *tmp;
> > > +    hwaddr new_start, new_size, sz_align;
> > > +    hwaddr virt_start = vms->memmap[VIRT_MEM].base;
> > > +    hwaddr addr_align = ALIGN_1G; /* Set to max ARM64 hugepage size */
> > > +
> > > +    /* Size alignment */
> > > +    sz_align = (pcdimm) ? MAX(TARGET_PAGE_SIZE,
> > QEMU_VMALLOC_ALIGN) :
> > > +                                                      TARGET_PAGE_SIZE;
> > > +
> > > +    QLIST_FOREACH_SAFE(iova, iova_copy, next, tmp) {
> > > +        if (virt_start >= iova->end) {
> > > +            continue;
> > > +        }
> > > +
> > > +        /* Align addr */
> > > +        new_start = ROUND_UP(MAX(virt_start, iova->start), addr_align);
> > > +        if (new_start >= iova->end) {
> > > +            continue;
> > > +        }
> > > +
> > > +        if (req_size > iova->end - new_start + 1) {
> > > +            continue;
> > > +        }
> > don't you want to check directly with new_size?
> 
> Yes. I think that should do.
> 
> > > +
> > > +        /*
> > > +         * Check the region can hold any size alignment requirement.
> > > +         */
> > > +        new_size = QEMU_ALIGN_UP(req_size, sz_align);
> > > +
> > > +        if ((new_start + new_size - 1 > iova->end) ||
> > > +                 (new_start + new_size >= virt_start + RAMLIMIT_BYTES)) {
> > > +            continue;
> > > +        }
> > > +
> > > +        /*
> > > +         * Modify the iova list entry for non pc-dimm case so that it
> > > +         * is not used again for pc-dimm allocation.
> > > +         */
> > > +        if (!pcdimm) {
> > > +            if (new_size - req_size) {
> > I don't get this check. Don't you want to check whether the node's range is
> > fully consumed and, if so, remove the node?
> >
> > (new_size != iova->end - iova-> start + 1)
> 
> Hmm..looks like I missed that.
> 
> > > +                iova->start = new_start + new_size;
> > > +            } else {
> > > +                QLIST_REMOVE(iova, next);
> > > +            }
> > > +        }
> > > +        return new_start;
> > > +    }
> > > +
> > > +    return -1;
> > > +}
> > > +
> > > +static void update_memory_regions(VirtMachineState *vms)
> > > +{
> > > +    hwaddr addr;
> > > +    VFIOIovaRange *iova;
> > > +    MachineState *machine = MACHINE(vms);
> > > +    hwaddr virt_start = vms->memmap[VIRT_MEM].base;
> > > +    hwaddr req_size, ram_size = machine->ram_size;
> > > +    struct vfio_iova_head iova_copy = QLIST_HEAD_INITIALIZER(iova_copy);
> > > +
> > > +    /* No valid iova regions, use default */
> > > +    if (QLIST_EMPTY(&vfio_iova_regions)) {
> > > +        vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
> > > +        vms->bootinfo.ram_size = ram_size;
> > > +        return;
> > > +    }
> > > +
> > > +    /*
> > > +     * If valid iovas has only one entry, check the req size fits in
> > > +     * and can have the loader start < 4GB. This will make sure platforms
> > > +     * with no holes in mem will have the same mem model as before.
> > > +     */
> > > +    req_size = ram_size;
> > > +    iova = QLIST_NEXT(QLIST_FIRST(&vfio_iova_regions), next);
> > > +    if (!iova) {
> > > +        iova = QLIST_FIRST(&vfio_iova_regions);
> > > +        addr = ROUND_UP(MAX(virt_start, iova->start), ALIGN_1G);
> > > +        if ((addr < 4 * SZ_1G) && (ram_size <= iova->end - addr + 1) &&
> > > +                  (addr + ram_size < virt_start + RAMLIMIT_BYTES)) {
> > > +            vms->bootinfo.loader_start = addr;
> > > +            vms->bootinfo.ram_size = ram_size;
> > > +            return;
> > > +        }
> > > +    }
> > > +
> > > +    /* Get a copy of valid iovas and work on it */
> > > +    get_iova_copy(&iova_copy);
> > > +
> > > +    /* Split the mem as first 1GB non-pluggable and rest as pc-dimm */
> > > +    req_size = MIN(ram_size, SZ_1G);
> > > +    addr = find_memory_chunk(vms, &iova_copy, req_size, false);
> > According to what Drew reported, the base address of the cold-plug RAM
> > must stay at 1GB, otherwise the FW will be lost. So I think you must
> > check that addr == 1GB.
> 
> Please see my above comment on the base address.
> 
> > > +    if (addr == -1 || addr >= 4 * SZ_1G) {
> > > +        goto out;
> > > +    }
> > > +
> > > +    /*Update non-pluggable mem details */
> > > +    machine->ram_size = req_size;
> > > +    vms->bootinfo.loader_start = addr;
> > > +    vms->bootinfo.ram_size = machine->ram_size;
> > > +
> > > +    req_size = ram_size - req_size;
> > > +    if (!req_size) {
> > > +        goto done;
> > > +    }
> > > +
> > > +    /* Remaining memory is modeled as a pc-dimm. */
> > > +    addr = find_memory_chunk(vms, &iova_copy, req_size, true);
> > > +    if (addr == -1) {
> > > +        goto out;
> > > +    }
> > > +
> > > +    /*Update pc-dimm mem details */
> > > +    vms->bootinfo.dimm_mem = g_new(struct dimm_mem_info, 1);
> > > +    vms->bootinfo.dimm_mem->base = addr;
> > > +    vms->bootinfo.dimm_mem->size = req_size;
> > > +    machine->maxram_size = machine->ram_size + req_size;
> > > +    machine->ram_slots += 1;
> > > +
> > > +done:
> > > +    free_iova_copy(&iova_copy);
> > > +    return;
> > > +
> > > +out:
> > > +    free_iova_copy(&iova_copy);
> > > +    error_report("mach-virt: Not enough contiguous memory to model
> ram");
> > Outputting the requested size would help debugging (and maybe the available
> > IOVA ranges too).
> 
> Agree. I will change that.
> 
> Thanks,
> Shameer
> 
> [1] https://github.com/tianocore/edk2/blob/master/ArmVirtPkg/ArmVirtQemu.dsc#L133
> 
> > Thanks
> >
> > Eric
> > > +    exit(1);
> > > +}
> > > +
> > > +static void create_pcdimms(VirtMachineState *vms,
> > > +                             MemoryRegion *sysmem,
> > > +                             MemoryRegion *ram)
> > > +{
> > > +    hwaddr addr, size;
> > > +    Error *local_err = NULL;
> > > +    QDict *qdict;
> > > +    QemuOpts *opts;
> > > +    char *tmp;
> > > +
> > > +    if (!vms->bootinfo.dimm_mem) {
> > > +        return;
> > > +    }
> > > +
> > > +    addr = vms->bootinfo.dimm_mem->base;
> > > +    size = vms->bootinfo.dimm_mem->size;
> > > +
> > > +    /* Create hotplug address space */
> > > +    vms->hotplug_memory.base = ROUND_UP(addr, ALIGN_1G);
> > > +    size = ROUND_UP(size, MAX(TARGET_PAGE_SIZE, QEMU_VMALLOC_ALIGN));
> > > +
> > > +    memory_region_init(&vms->hotplug_memory.mr, OBJECT(vms),
> > > +                                      "hotplug-memory", size);
> > > +    memory_region_add_subregion(sysmem, vms->hotplug_memory.base,
> > > +                                        &vms->hotplug_memory.mr);
> > > +    /* Create backend mem object */
> > > +    qdict = qdict_new();
> > > +    qdict_put_str(qdict, "qom-type", "memory-backend-ram");
> > > +    qdict_put_str(qdict, "id", "mem1");
> > > +    tmp = g_strdup_printf("%"PRIu64 "M", size / (1024 * 1024));
> > > +    qdict_put_str(qdict, "size", tmp);
> > > +    g_free(tmp);
> > > +
> > > +    opts = qemu_opts_from_qdict(qemu_find_opts("object"), qdict, &local_err);
> > > +    if (local_err) {
> > > +        goto err;
> > > +    }
> > > +
> > > +    user_creatable_add_opts(opts, &local_err);
> > > +    qemu_opts_del(opts);
> > > +    QDECREF(qdict);
> > > +    if (local_err) {
> > > +        goto err;
> > > +    }
> > > +
> > > +    /* Create pc-dimm device */
> > > +    qdict = qdict_new();
> > > +    qdict_put_str(qdict, "driver", "pc-dimm");
> > > +    qdict_put_str(qdict, "id", "dimm1");
> > > +    qdict_put_str(qdict, "memdev", "mem1");
> > > +
> > > +    opts = qemu_opts_from_qdict(qemu_find_opts("device"), qdict, &local_err);
> > > +    if (local_err) {
> > > +        goto err;
> > > +    }
> > > +
> > > +    qdev_device_add(opts, &local_err);
> > > +    qemu_opts_del(opts);
> > > +    QDECREF(qdict);
> > > +    if (local_err) {
> > > +        goto err;
> > > +    }
> > > +
> > > +    return;
> > > +
> > > +err:
> > > +    error_report_err(local_err);
> > > +    exit(1);
> > > +}
> > > +
> > >  static void virt_ram_memory_region_init(Notifier *notifier, void *data)
> > >  {
> > >      MemoryRegion *sysmem = get_system_memory();
> > > @@ -1179,9 +1418,14 @@ static void virt_ram_memory_region_init(Notifier *notifier, void *data)
> > >                                           ram_memory_region_init);
> > >      MachineState *machine = MACHINE(vms);
> > >
> > > +    update_memory_regions(vms);
> > >      memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> > >                                           machine->ram_size);
> > > -    memory_region_add_subregion(sysmem, vms->memmap[VIRT_MEM].base, ram);
> > > +    memory_region_add_subregion(sysmem, vms->bootinfo.loader_start, ram);
> > > +
> > > +    if (vms->bootinfo.dimm_mem) {
> > > +        create_pcdimms(vms, sysmem, ram);
> > > +    }
> > >  }
> > >
> > >  static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
> > > @@ -1404,13 +1648,11 @@ static void machvirt_init(MachineState *machine)
> > >      vms->machine_done.notify = virt_machine_done;
> > >      qemu_add_machine_init_done_notifier(&vms->machine_done);
> > >
> > > -    vms->bootinfo.ram_size = machine->ram_size;
> > >      vms->bootinfo.kernel_filename = machine->kernel_filename;
> > >      vms->bootinfo.kernel_cmdline = machine->kernel_cmdline;
> > >      vms->bootinfo.initrd_filename = machine->initrd_filename;
> > >      vms->bootinfo.nb_cpus = smp_cpus;
> > >      vms->bootinfo.board_id = -1;
> > > -    vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
> > >      vms->bootinfo.get_dtb = machvirt_dtb;
> > >      vms->bootinfo.firmware_loaded = firmware_loaded;
> > >
> > > @@ -1559,7 +1801,7 @@ static void virt_dimm_plug(HotplugHandler *hotplug_dev,
> > >      PCDIMMDevice *dimm = PC_DIMM(dev);
> > >      PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
> > >      MemoryRegion *mr;
> > > -    uint64_t align;
> > > +    uint64_t align, addr;
> > >      Error *local_err = NULL;
> > >
> > >      mr = ddc->get_memory_region(dimm, &local_err);
> > > @@ -1573,6 +1815,15 @@ static void virt_dimm_plug(HotplugHandler *hotplug_dev,
> > >          goto out;
> > >      }
> > >
> > > +    addr = object_property_get_uint(OBJECT(dimm), PC_DIMM_ADDR_PROP,
> > > +                                    &error_fatal);
> > > +    /* Assign the node for pc-dimm initial ram */
> > > +    if (vms->bootinfo.dimm_mem && (addr == vms->bootinfo.dimm_mem->base)
> > > +                               && (nb_numa_nodes > 0)) {
> > > +        vms->bootinfo.dimm_mem->node = object_property_get_uint(OBJECT(dev),
> > > +                                           PC_DIMM_NODE_PROP, &error_fatal);
> > > +    }
> > > +
> > >  out:
> > >      error_propagate(errp, local_err);
> > >  }
> > >

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
  2018-06-05  7:49     ` Shameerali Kolothum Thodi
@ 2018-06-15 15:44       ` Andrew Jones
  2018-06-15 15:54         ` Peter Maydell
  2018-06-15 15:55         ` Auger Eric
  0 siblings, 2 replies; 34+ messages in thread
From: Andrew Jones @ 2018-06-15 15:44 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Auger Eric, qemu-devel, qemu-arm, peter.maydell,
	Jonathan Cameron, Linuxarm, alex.williamson, Zhaoshenglong,
	imammedo

On Tue, Jun 05, 2018 at 07:49:44AM +0000, Shameerali Kolothum Thodi wrote:
> > > On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
> > > > In case valid iova regions are non-contiguous, split the
> > > > RAM mem into a 1GB non-pluggable dimm and remaining as a
> > > > single pc-dimm mem.
> > >
> > > Can you please explain where this split comes from? Currently we
> > > have 254 GB of non-pluggable RAM. I read the discussion started with Drew
> > > on RFC v1 where he explained we cannot change the RAM base without
> > > crashing the FW. However we should at least document this 1GB split.
> > 
> > The comments were mainly about using the pc-dimm model instead of the
> > "mem alias" method used in RFC v1, as this will help to add mem
> > hotplug support in the future.
> > 
> > I am not sure about the motive behind Drew's idea of splitting the first
> > 1GB as non-pluggable and the remainder as a pc-dimm (cold). Maybe it is to
> > attempt a best-effort scenario, but as I mentioned in the reply to 0/6,
> > the current solution will end up changing the base address if 0x40000000
> > falls under a reserved region.
> > 
> > Again, not sure whether we should enforce a strict check on base address
> > start or just warn the user that it will fail on Guest with UEFI boot[1].
> > 
> > Hi Drew,
> > 
> > Please let me know your thoughts on this.
> 
> Could you please take a look at the above discussion and let us
> know your thoughts on the split of mem regions as 1GB non-pluggable
> and rest as pc-dimm.
> 

Hi Shameer,

Sorry for the slow reply - I've been slowly catching back up after
vacation. There are two reasons to have one non-pluggable memory region
and the rest of memory in a single pluggable pc-dimm.

1) For mach-virt we must have the base of memory at the 1G boundary,
   both because otherwise we'll break ArmVirt UEFI and because that's
   a guarantee that Peter has stated he'd like to keep for mach-virt.

2) Memory split in this way already has precedent in the x86 PC
   machine models.

It's debatable how much memory we should allocate to the non-pluggable
region. It doesn't need to be 1G, it could be smaller. The default
memory size allocated to a mach-virt guest that doesn't provide '-m'
on the command line is 128M. Maybe we should use that size?
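
The split described above can be sketched as a small, self-contained fragment. This is illustrative only: the names `split_ram` and `MemSplit`, and treating the non-pluggable limit as a parameter, are assumptions for the sketch, not the series' actual code.

```c
#include <stdint.h>
#include <assert.h>

#define SZ_1G         (1ULL << 30)
#define VIRT_MEM_BASE 0x40000000ULL  /* mach-virt RAM base: the 1 GiB boundary */

typedef struct {
    uint64_t boot_base;  /* non-pluggable (cold-plug) region base */
    uint64_t boot_size;  /* non-pluggable region size */
    uint64_t dimm_size;  /* remainder modeled as a single pc-dimm; 0 if none */
} MemSplit;

/* Split guest RAM: up to boot_limit stays non-pluggable at the fixed
 * base, anything beyond that becomes one pluggable pc-dimm. */
static MemSplit split_ram(uint64_t ram_size, uint64_t boot_limit)
{
    MemSplit s;
    s.boot_base = VIRT_MEM_BASE;
    s.boot_size = ram_size < boot_limit ? ram_size : boot_limit;
    s.dimm_size = ram_size - s.boot_size;
    return s;
}
```

With a 1 GiB limit, a 4 GiB guest gets 1 GiB of cold-plug RAM at 0x40000000 and a 3 GiB pc-dimm; with the 128M default Drew mentions, `boot_limit` would simply be `128 * 1024 * 1024` instead.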

If 0x40000000 falls into a reserved address region on some host, then
I'm afraid the mach-virt model won't work with those devices unless the
guest doesn't use UEFI and Peter is open to providing a machine property
that one can enable in order to move the base address of memory.

I know Eric is looking at this problem too. I hope you guys are
coordinating your efforts.

Thanks,
drew

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
  2018-06-15 15:44       ` Andrew Jones
@ 2018-06-15 15:54         ` Peter Maydell
  2018-06-15 16:13           ` Auger Eric
  2018-06-15 15:55         ` Auger Eric
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Maydell @ 2018-06-15 15:54 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Shameerali Kolothum Thodi, Auger Eric, qemu-devel, qemu-arm,
	Jonathan Cameron, Linuxarm, alex.williamson, Zhaoshenglong,
	imammedo

On 15 June 2018 at 16:44, Andrew Jones <drjones@redhat.com> wrote:
> If 0x40000000 falls into a reserved address region on some host, then
> I'm afraid the mach-virt model won't work with those devices unless the
> guest doesn't use UEFI and Peter is open to providing a machine property
> that one can enable in order to move the base address of memory.

Why should the VM ever care about where the address regions in the
host happen to be when it comes to where it's putting its RAM
in the VM's address layout? I feel like I'm missing something here...

thanks
-- PMM

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
  2018-06-15 15:44       ` Andrew Jones
  2018-06-15 15:54         ` Peter Maydell
@ 2018-06-15 15:55         ` Auger Eric
  1 sibling, 0 replies; 34+ messages in thread
From: Auger Eric @ 2018-06-15 15:55 UTC (permalink / raw)
  To: Andrew Jones, Shameerali Kolothum Thodi
  Cc: qemu-devel, qemu-arm, peter.maydell, Jonathan Cameron, Linuxarm,
	alex.williamson, Zhaoshenglong, imammedo

Hi Drew,

On 06/15/2018 05:44 PM, Andrew Jones wrote:
> On Tue, Jun 05, 2018 at 07:49:44AM +0000, Shameerali Kolothum Thodi wrote:
>>>> On 05/16/2018 05:20 PM, Shameer Kolothum wrote:
>>>>> In case valid iova regions are non-contiguous, split the
>>>>> RAM mem into a 1GB non-pluggable dimm and remaining as a
>>>>> single pc-dimm mem.
>>>>
>>>> Can you please explain where this split comes from? Currently we
>>>> have 254 GB of non-pluggable RAM. I read the discussion started with Drew
>>>> on RFC v1 where he explained we cannot change the RAM base without
>>>> crashing the FW. However we should at least document this 1GB split.
>>>
>>> The comments were mainly about using the pc-dimm model instead of the
>>> "mem alias" method used in RFC v1, as this will help to add mem
>>> hotplug support in the future.
>>>
>>> I am not sure about the motive behind Drew's idea of splitting the first
>>> 1GB as non-pluggable and the remainder as a pc-dimm (cold). Maybe it is to
>>> attempt a best-effort scenario, but as I mentioned in the reply to 0/6,
>>> the current solution will end up changing the base address if 0x40000000
>>> falls under a reserved region.
>>>
>>> Again, not sure whether we should enforce a strict check on base address
>>> start or just warn the user that it will fail on Guest with UEFI boot[1].
>>>
>>> Hi Drew,
>>>
>>> Please let me know your thoughts on this.
>>
>> Could you please take a look at the above discussion and let us
>> know your thoughts on the split of mem regions as 1GB non-pluggable
>> and rest as pc-dimm.
>>
> 
> Hi Shameer,
> 
> Sorry for the slow reply - I've been slowly catching back up after
> vacation. There are two reasons to have one non-pluggable memory region
> and the rest of memory in a single pluggable pc-dimm.
> 
> 1) For mach-virt we must have the base of memory at the 1G boundary,
>    both because otherwise we'll break ArmVirt UEFI and because that's
>    a guarantee that Peter has stated he'd like to keep for mach-virt.
> 
> 2) Memory split in this way already has precedent in the x86 PC
>    machine models.
> 
> It's debatable how much memory we should allocate to the non-pluggable
> region. It doesn't need to be 1G, it could be smaller. The default
> memory size allocated to a mach-virt guest that doesn't provide '-m'
> on the command line is 128M. Maybe we should use that size?
> 
> If 0x40000000 falls into a reserved address region on some host, then
> I'm afraid the mach-virt model won't work with those devices unless the
> guest doesn't use UEFI and Peter is open to providing a machine property
> that one can enable in order to move the base address of memory.
> 
> I know Eric is looking at this problem too. I hope you guys are
> coordinating your efforts.

Yes, we sync'ed together. I will send an RFC at the beginning of next week
addressing both:
- support of >40b PA VM (based on Suzuki's series)
- addition of DIMM at 2TB, reusing the standard PC-DIMM hotplug
framework. This is something standard and similar to what is done on PC
Q35. I am reusing some of Shameer's patches, rebased on top of Igor's and
David's recent work.

Then we need to understand how we can use two hotplug regions. This is not
obvious to me, as the framework currently uses a dedicated MemoryRegion,
if I understand it correctly.

Thanks

Eric
> 
> Thanks,
> drew
> 


* Re: [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
  2018-06-15 15:54         ` Peter Maydell
@ 2018-06-15 16:13           ` Auger Eric
  2018-06-15 16:33             ` Peter Maydell
  0 siblings, 1 reply; 34+ messages in thread
From: Auger Eric @ 2018-06-15 16:13 UTC (permalink / raw)
  To: Peter Maydell, Andrew Jones
  Cc: Shameerali Kolothum Thodi, qemu-devel, qemu-arm,
	Jonathan Cameron, Linuxarm, alex.williamson, Zhaoshenglong,
	imammedo

Hi,

On 06/15/2018 05:54 PM, Peter Maydell wrote:
> Why should the VM ever care about where the address regions in the
> host happen to be when it comes to where it's putting its RAM
> in the VM's address layout? I feel like I'm missing something here...
> 
> thanks

The VM cannot use RAM GPAs that match an assigned device's reserved IOVA
regions. When you assign a device, the whole VM RAM is mapped in the
physical IOMMU and the IOVA corresponds to the GPA. But sometimes some
IOVAs cannot be used due to host specificities and especially IOMMU
peculiarities. Some IOVAs may simply be "bypassed" by the IOMMU, as on
x86 and also with some ARM SMMU integrations. In such a case the DMA
accesses have no chance to reach the guest RAM as expected. Hope this
clarifies.
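
The constraint Eric describes can be illustrated with a minimal overlap check: since guest RAM at `[gpa, gpa + size)` is DMA-mapped 1:1 as IOVA, it must not intersect any host reserved IOVA region. `gpa_range_usable` and `ReservedRegion` are hypothetical names for this sketch, not QEMU or VFIO API.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

typedef struct {
    uint64_t start, end;  /* inclusive reserved IOVA range */
} ReservedRegion;

/* Return true if [gpa, gpa + size) avoids every reserved region. */
static bool gpa_range_usable(uint64_t gpa, uint64_t size,
                             const ReservedRegion *rsv, int n)
{
    for (int i = 0; i < n; i++) {
        /* Two ranges overlap unless one ends before the other starts. */
        if (gpa <= rsv[i].end && rsv[i].start <= gpa + size - 1) {
            return false;
        }
    }
    return true;
}
```

For example, 1 GiB of RAM at 0x40000000 is fine next to a reserved MSI doorbell window below it, but becomes unusable the moment a reserved region lands inside that gigabyte, which is exactly the case the series has to work around.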

Thanks

Eric


* Re: [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
  2018-06-15 16:13           ` Auger Eric
@ 2018-06-15 16:33             ` Peter Maydell
  2018-06-18  9:46               ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Maydell @ 2018-06-15 16:33 UTC (permalink / raw)
  To: Auger Eric
  Cc: Andrew Jones, Shameerali Kolothum Thodi, qemu-devel, qemu-arm,
	Jonathan Cameron, Linuxarm, alex.williamson, Zhaoshenglong,
	imammedo

On 15 June 2018 at 17:13, Auger Eric <eric.auger@redhat.com> wrote:
> On 06/15/2018 05:54 PM, Peter Maydell wrote:
>> Why should the VM ever care about where the address regions in the
>> host happen to be when it comes to where it's putting its RAM
>> in the VM's address layout? I feel like I'm missing something here...

> The VM cannot use RAM GPAs that match an assigned device's reserved IOVA
> regions. When you assign a device, the whole VM RAM is mapped in the
> physical IOMMU and the IOVA corresponds to the GPA. But sometimes some
> IOVAs cannot be used due to host specificities and especially IOMMU
> peculiarities. Some IOVAs may simply be "bypassed" by the IOMMU, as on
> x86 and also with some ARM SMMU integrations. In such a case the DMA
> accesses have no chance to reach the guest RAM as expected. Hope this
> clarifies.

Hmm. We don't want to have the address layout of
the VM have to cope with all the various oddities that host
systems might have. Is this kind of thing common, or rare?
I'd much rather just say "we don't support device passthrough
on that sort of system".

thanks
-- PMM


* Re: [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions
  2018-06-15 16:33             ` Peter Maydell
@ 2018-06-18  9:46               ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2018-06-18  9:46 UTC (permalink / raw)
  To: Peter Maydell, Auger Eric
  Cc: Andrew Jones, qemu-devel, qemu-arm, Jonathan Cameron, Linuxarm,
	alex.williamson, Zhaoshenglong, imammedo



> -----Original Message-----
> From: Peter Maydell [mailto:peter.maydell@linaro.org]
> Sent: 15 June 2018 17:33
> To: Auger Eric <eric.auger@redhat.com>
> Cc: Andrew Jones <drjones@redhat.com>; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; qemu-devel@nongnu.org; qemu-
> arm@nongnu.org; Jonathan Cameron <jonathan.cameron@huawei.com>;
> Linuxarm <linuxarm@huawei.com>; alex.williamson@redhat.com;
> Zhaoshenglong <zhaoshenglong@huawei.com>; imammedo@redhat.com
> Subject: Re: [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous
> memory regions
> 
> On 15 June 2018 at 17:13, Auger Eric <eric.auger@redhat.com> wrote:
> > On 06/15/2018 05:54 PM, Peter Maydell wrote:
> >> Why should the VM ever care about where the address regions in the
> >> host happen to be when it comes to where it's putting its RAM
> >> in the VM's address layout? I feel like I'm missing something here...
> 
> > The VM cannot use RAM GPAs that match an assigned device's reserved IOVA
> > regions. When you assign a device, the whole VM RAM is mapped in the
> > physical IOMMU and the IOVA corresponds to the GPA. But sometimes some
> > IOVAs cannot be used due to host specificities and especially IOMMU
> > peculiarities. Some IOVAs may simply be "bypassed" by the IOMMU, as on
> > x86 and also with some ARM SMMU integrations. In such a case the DMA
> > accesses have no chance to reach the guest RAM as expected. Hope this
> > clarifies.
> 
> Hmm. We don't want to have the address layout of
> the VM have to cope with all the various oddities that host
> systems might have. Is this kind of thing common, or rare?
> I'd much rather just say "we don't support device passthrough
> on that sort of system".
>

As far as I know, on ARM64 platforms this is not very rare, mainly because
of the physical MSI doorbell and the fact that the memory map is not
standardized. This was discussed at LPC some time back[1] and a sysfs
interface was provided to report these reserved regions. Later, Alex
commented that VFIO needs a more robust way of identifying and reporting
these regions, and that is how the VFIO kernel patch series[2], on which
this QEMU series is based, was introduced.

Thanks,
Shameer

[1] https://lkml.org/lkml/2016/11/7/935

[2] https://lkml.org/lkml/2018/4/18/293
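
Once the kernel reports the valid IOVA ranges, placing guest RAM reduces to the kind of lookup the series' `find_memory_chunk()` performs. The sketch below is illustrative only (names `find_chunk` and `IovaRange` are assumptions), and it assumes inclusive ranges and a power-of-two alignment:

```c
#include <stdint.h>
#include <assert.h>

#define SZ_1G     (1ULL << 30)
#define IOVA_NONE ((uint64_t)-1)

typedef struct {
    uint64_t start, end;  /* inclusive valid IOVA range reported by the kernel */
} IovaRange;

/* Return the first aligned base inside a valid range that can hold
 * `size` bytes, or IOVA_NONE if no range is large enough. */
static uint64_t find_chunk(const IovaRange *r, int n,
                           uint64_t size, uint64_t align)
{
    for (int i = 0; i < n; i++) {
        uint64_t base = (r[i].start + align - 1) & ~(align - 1);
        if (base <= r[i].end && r[i].end - base + 1 >= size) {
            return base;
        }
    }
    return IOVA_NONE;
}
```

With two valid ranges, say 1-2 GiB and 4-8 GiB, a 1 GiB request fits in the first range while a 2 GiB request has to move up to the second, which is why the cold-plug/pc-dimm split can end up non-contiguous.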




end of thread, other threads:[~2018-06-18  9:47 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-16 15:20 [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Shameer Kolothum
2018-05-16 15:20 ` [Qemu-devel] [RFC v2 1/6] hw/vfio: Retrieve valid iova ranges from kernel Shameer Kolothum
2018-05-28 14:21   ` Auger Eric
2018-05-30 14:43     ` Shameerali Kolothum Thodi
2018-05-16 15:20 ` [Qemu-devel] [RFC v2 2/6] hw/arm/virt: Enable dynamic generation of guest RAM memory regions Shameer Kolothum
2018-05-28 14:21   ` Auger Eric
2018-05-30 14:43     ` Shameerali Kolothum Thodi
2018-05-28 16:47   ` Andrew Jones
2018-05-30 14:50     ` Shameerali Kolothum Thodi
2018-05-16 15:20 ` [Qemu-devel] [RFC v2 3/6] hw/arm/virt: Add pc-dimm mem hotplug framework Shameer Kolothum
2018-05-28 14:21   ` Auger Eric
2018-05-30 14:46     ` Shameerali Kolothum Thodi
2018-05-16 15:20 ` [Qemu-devel] [RFC v2 4/6] hw/arm: Changes required to accommodate non-contiguous DT mem nodes Shameer Kolothum
2018-05-28 14:21   ` Auger Eric
2018-05-30 14:46     ` Shameerali Kolothum Thodi
2018-05-16 15:20 ` [Qemu-devel] [RFC v2 5/6] hw/arm: ACPI SRAT changes to accommodate non-contiguous mem Shameer Kolothum
2018-05-28 14:21   ` Auger Eric
2018-05-28 17:02   ` Andrew Jones
2018-05-30 14:51     ` Shameerali Kolothum Thodi
2018-05-31 20:15     ` Auger Eric
2018-06-01 11:09       ` Shameerali Kolothum Thodi
2018-05-16 15:20 ` [Qemu-devel] [RFC v2 6/6] hw/arm: Populate non-contiguous memory regions Shameer Kolothum
2018-05-28 14:21   ` Auger Eric
2018-05-30 14:48     ` Shameerali Kolothum Thodi
2018-06-05  7:49     ` Shameerali Kolothum Thodi
2018-06-15 15:44       ` Andrew Jones
2018-06-15 15:54         ` Peter Maydell
2018-06-15 16:13           ` Auger Eric
2018-06-15 16:33             ` Peter Maydell
2018-06-18  9:46               ` Shameerali Kolothum Thodi
2018-06-15 15:55         ` Auger Eric
2018-05-28 14:22 ` [Qemu-devel] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions Auger Eric
2018-05-30 14:39   ` Shameerali Kolothum Thodi
2018-05-30 15:24     ` Auger Eric
