* [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
@ 2018-04-20 12:34 David Hildenbrand
  2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface David Hildenbrand
                   ` (4 more replies)
  0 siblings, 5 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-04-20 12:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-s390x, Michael S . Tsirkin, Igor Mammedov, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost, David Gibson,
	Markus Armbruster, qemu-ppc, Pankaj Gupta, David Hildenbrand

Right now we can only map PCDIMM/NVDIMM into guest address space. In the
future, we might want to do the same for virtio devices - e.g.
virtio-pmem or virtio-mem. In particular, they should be able to live
side by side with each other.

For example, the regions of virtio-based memory devices will not be exposed
via ACPI and friends. They will be detected just like other virtio devices
and indicate the applicable memory region. This makes it possible to also use
them on architectures without memory device detection support (e.g. s390x).

Let's factor out the memory device code into a MemoryDevice interface.


v2 -> v3:
- "pc-dimm: factor out MemoryDevice interface"
--> Lookup both classes when comparing (David Gibson)

v1 -> v2:
- Fix compile issues on ppc (still untested)


David Hildenbrand (3):
  pc-dimm: factor out MemoryDevice interface
  machine: make MemoryHotplugState accessible via the machine
  pc-dimm: factor out address space logic into MemoryDevice code

 hw/i386/acpi-build.c                         |   3 +-
 hw/i386/pc.c                                 |  24 ++-
 hw/mem/Makefile.objs                         |   1 +
 hw/mem/memory-device.c                       | 282 +++++++++++++++++++++++++
 hw/mem/pc-dimm.c                             | 304 +++++++--------------------
 hw/ppc/spapr.c                               |  24 ++-
 hw/ppc/spapr_hcall.c                         |   1 +
 include/hw/boards.h                          |  16 ++
 include/hw/mem/memory-device.h               |  48 +++++
 include/hw/mem/pc-dimm.h                     |  26 +--
 numa.c                                       |   3 +-
 qmp.c                                        |   4 +-
 stubs/Makefile.objs                          |   2 +-
 stubs/{qmp_pc_dimm.c => qmp_memory_device.c} |   4 +-
 14 files changed, 465 insertions(+), 277 deletions(-)
 create mode 100644 hw/mem/memory-device.c
 create mode 100644 include/hw/mem/memory-device.h
 rename stubs/{qmp_pc_dimm.c => qmp_memory_device.c} (61%)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface
  2018-04-20 12:34 [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice David Hildenbrand
@ 2018-04-20 12:34 ` David Hildenbrand
  2018-04-22  4:26   ` David Gibson
  2018-04-22  5:09   ` Pankaj Gupta
  2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 2/3] machine: make MemoryHotplugState accessible via the machine David Hildenbrand
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-04-20 12:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-s390x, Michael S . Tsirkin, Igor Mammedov, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost, David Gibson,
	Markus Armbruster, qemu-ppc, Pankaj Gupta, David Hildenbrand

On the qmp level, we already have the concept of memory devices:
    "query-memory-devices"
Right now, we only support NVDIMM and PCDIMM.

We want to map other devices into the address space of the guest later
on. Such devices could e.g. be virtio devices. These devices will have a
guest memory range assigned but won't be exposed via e.g. ACPI. We want
to make them look like memory devices, without gluing them to pc-dimm.

In particular, it will not always be possible to have TYPE_PC_DIMM as a
parent class (e.g. for virtio devices). Let's use an interface instead. As
a first part, convert the handling of
- qmp_pc_dimm_device_list
- get_plugged_memory_size
to our new model. Plug/unplug handling etc. will follow later.

A memory device will have to provide the following functions:
- get_addr(): Necessary, as the property "addr" cannot be used for e.g.
              virtio devices (it is already defined there).
- get_plugged_size(): The amount this device offers to the guest as of
                      now.
- get_region_size(): Because this can later on be bigger than the
                     plugged size.
- fill_device_info(): Fill MemoryDeviceInfo, e.g. for qmp.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/i386/acpi-build.c                         |   3 +-
 hw/mem/Makefile.objs                         |   1 +
 hw/mem/memory-device.c                       | 120 +++++++++++++++++++++++++++
 hw/mem/pc-dimm.c                             | 119 +++++++++++++-------------
 hw/ppc/spapr.c                               |   3 +-
 hw/ppc/spapr_hcall.c                         |   1 +
 include/hw/mem/memory-device.h               |  44 ++++++++++
 include/hw/mem/pc-dimm.h                     |   2 -
 numa.c                                       |   3 +-
 qmp.c                                        |   4 +-
 stubs/Makefile.objs                          |   2 +-
 stubs/{qmp_pc_dimm.c => qmp_memory_device.c} |   4 +-
 12 files changed, 240 insertions(+), 66 deletions(-)
 create mode 100644 hw/mem/memory-device.c
 create mode 100644 include/hw/mem/memory-device.h
 rename stubs/{qmp_pc_dimm.c => qmp_memory_device.c} (61%)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 3cf2a1679c..ca3645d57b 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -46,6 +46,7 @@
 #include "hw/acpi/vmgenid.h"
 #include "sysemu/tpm_backend.h"
 #include "hw/timer/mc146818rtc_regs.h"
+#include "hw/mem/memory-device.h"
 #include "sysemu/numa.h"
 
 /* Supported chipsets: */
@@ -2253,7 +2254,7 @@ build_tpm2(GArray *table_data, BIOSLinker *linker, GArray *tcpalog)
 static void build_srat_hotpluggable_memory(GArray *table_data, uint64_t base,
                                            uint64_t len, int default_node)
 {
-    MemoryDeviceInfoList *info_list = qmp_pc_dimm_device_list();
+    MemoryDeviceInfoList *info_list = qmp_memory_device_list();
     MemoryDeviceInfoList *info;
     MemoryDeviceInfo *mi;
     PCDIMMDeviceInfo *di;
diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index f12f8b97a2..10be4df2a2 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1,2 +1,3 @@
 common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
+common-obj-$(CONFIG_MEM_HOTPLUG) += memory-device.o
 common-obj-$(CONFIG_NVDIMM) += nvdimm.o
diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
new file mode 100644
index 0000000000..b860c9c582
--- /dev/null
+++ b/hw/mem/memory-device.c
@@ -0,0 +1,120 @@
+/*
+ * Memory Device Interface
+ *
+ * Copyright ProfitBricks GmbH 2012
+ * Copyright (C) 2014 Red Hat Inc
+ * Copyright (c) 2018 Red Hat Inc
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/mem/memory-device.h"
+#include "hw/qdev.h"
+#include "qapi/error.h"
+#include "hw/boards.h"
+#include "qemu/range.h"
+
+static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
+{
+    MemoryDeviceState *md_a = MEMORY_DEVICE(a);
+    MemoryDeviceState *md_b = MEMORY_DEVICE(b);
+    MemoryDeviceClass *mdc_a = MEMORY_DEVICE_GET_CLASS(a);
+    MemoryDeviceClass *mdc_b = MEMORY_DEVICE_GET_CLASS(b);
+    const uint64_t addr_a = mdc_a->get_addr(md_a);
+    const uint64_t addr_b = mdc_b->get_addr(md_b);
+
+    if (addr_a > addr_b) {
+        return 1;
+    } else if (addr_a < addr_b) {
+        return -1;
+    }
+    return 0;
+}
+
+static int memory_device_built_list(Object *obj, void *opaque)
+{
+    GSList **list = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
+        DeviceState *dev = DEVICE(obj);
+        if (dev->realized) { /* only realized memory devices matter */
+            *list = g_slist_insert_sorted(*list, dev, memory_device_addr_sort);
+        }
+    }
+
+    object_child_foreach(obj, memory_device_built_list, opaque);
+    return 0;
+}
+
+MemoryDeviceInfoList *qmp_memory_device_list(void)
+{
+    GSList *devices = NULL, *item;
+    MemoryDeviceInfoList *list = NULL, *prev = NULL;
+
+    object_child_foreach(qdev_get_machine(), memory_device_built_list,
+                         &devices);
+
+    for (item = devices; item; item = g_slist_next(item)) {
+        MemoryDeviceState *md = MEMORY_DEVICE(item->data);
+        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(item->data);
+        MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
+        MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
+
+        mdc->fill_device_info(md, info);
+
+        elem->value = info;
+        elem->next = NULL;
+        if (prev) {
+            prev->next = elem;
+        } else {
+            list = elem;
+        }
+        prev = elem;
+    }
+
+    g_slist_free(devices);
+
+    return list;
+}
+
+static int memory_device_plugged_size(Object *obj, void *opaque)
+{
+    uint64_t *size = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
+        DeviceState *dev = DEVICE(obj);
+        MemoryDeviceState *md = MEMORY_DEVICE(obj);
+        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj);
+
+        if (dev->realized) {
+            *size += mdc->get_plugged_size(md, &error_abort);
+        }
+    }
+
+    object_child_foreach(obj, memory_device_plugged_size, opaque);
+    return 0;
+}
+
+uint64_t get_plugged_memory_size(void)
+{
+    uint64_t size = 0;
+
+    memory_device_plugged_size(qdev_get_machine(), &size);
+
+    return size;
+}
+
+static const TypeInfo memory_device_info = {
+    .name          = TYPE_MEMORY_DEVICE,
+    .parent        = TYPE_INTERFACE,
+    .class_size = sizeof(MemoryDeviceClass),
+};
+
+static void memory_device_register_types(void)
+{
+    type_register_static(&memory_device_info);
+}
+
+type_init(memory_device_register_types)
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 51350d9c2d..1dbf699e02 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -21,6 +21,7 @@
 #include "qemu/osdep.h"
 #include "hw/mem/pc-dimm.h"
 #include "hw/mem/nvdimm.h"
+#include "hw/mem/memory-device.h"
 #include "qapi/error.h"
 #include "qemu/config-file.h"
 #include "qapi/visitor.h"
@@ -158,11 +159,6 @@ uint64_t pc_existing_dimms_capacity(Error **errp)
     return cap.size;
 }
 
-uint64_t get_plugged_memory_size(void)
-{
-    return pc_existing_dimms_capacity(&error_abort);
-}
-
 static int pc_dimm_slot2bitmap(Object *obj, void *opaque)
 {
     unsigned long *bitmap = opaque;
@@ -238,57 +234,6 @@ static int pc_dimm_built_list(Object *obj, void *opaque)
     return 0;
 }
 
-MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
-{
-    GSList *dimms = NULL, *item;
-    MemoryDeviceInfoList *list = NULL, *prev = NULL;
-
-    object_child_foreach(qdev_get_machine(), pc_dimm_built_list, &dimms);
-
-    for (item = dimms; item; item = g_slist_next(item)) {
-        PCDIMMDevice *dimm = PC_DIMM(item->data);
-        Object *obj = OBJECT(dimm);
-        MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
-        MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
-        PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
-        bool is_nvdimm = object_dynamic_cast(obj, TYPE_NVDIMM);
-        DeviceClass *dc = DEVICE_GET_CLASS(obj);
-        DeviceState *dev = DEVICE(obj);
-
-        if (dev->id) {
-            di->has_id = true;
-            di->id = g_strdup(dev->id);
-        }
-        di->hotplugged = dev->hotplugged;
-        di->hotpluggable = dc->hotpluggable;
-        di->addr = dimm->addr;
-        di->slot = dimm->slot;
-        di->node = dimm->node;
-        di->size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
-        di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
-
-        if (!is_nvdimm) {
-            info->u.dimm.data = di;
-            info->type = MEMORY_DEVICE_INFO_KIND_DIMM;
-        } else {
-            info->u.nvdimm.data = di;
-            info->type = MEMORY_DEVICE_INFO_KIND_NVDIMM;
-        }
-        elem->value = info;
-        elem->next = NULL;
-        if (prev) {
-            prev->next = elem;
-        } else {
-            list = elem;
-        }
-        prev = elem;
-    }
-
-    g_slist_free(dimms);
-
-    return list;
-}
-
 uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
                                uint64_t address_space_size,
                                uint64_t *hint, uint64_t align, uint64_t size,
@@ -445,10 +390,62 @@ static MemoryRegion *pc_dimm_get_vmstate_memory_region(PCDIMMDevice *dimm)
     return host_memory_backend_get_memory(dimm->hostmem, &error_abort);
 }
 
+static uint64_t pc_dimm_md_get_addr(MemoryDeviceState *md)
+{
+    PCDIMMDevice *dimm = PC_DIMM(md);
+
+    return dimm->addr;
+}
+
+static uint64_t pc_dimm_md_get_region_size(MemoryDeviceState *md, Error **errp)
+{
+    PCDIMMDevice *dimm = PC_DIMM(md);
+    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(md);
+    MemoryRegion *mr;
+
+    mr = ddc->get_memory_region(dimm, errp);
+    if (!mr) {
+        return 0;
+    }
+
+    return memory_region_size(mr);
+}
+
+static void pc_dimm_md_fill_device_info(MemoryDeviceState *md,
+                                        MemoryDeviceInfo *info)
+{
+    PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+    DeviceClass *dc = DEVICE_GET_CLASS(md);
+    PCDIMMDevice *dimm = PC_DIMM(md);
+    DeviceState *dev = DEVICE(md);
+
+    if (dev->id) {
+        di->has_id = true;
+        di->id = g_strdup(dev->id);
+    }
+    di->hotplugged = dev->hotplugged;
+    di->hotpluggable = dc->hotpluggable;
+    di->addr = dimm->addr;
+    di->slot = dimm->slot;
+    di->node = dimm->node;
+    di->size = object_property_get_uint(OBJECT(dimm), PC_DIMM_SIZE_PROP,
+                                        NULL);
+    di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
+
+    if (object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM)) {
+        info->u.nvdimm.data = di;
+        info->type = MEMORY_DEVICE_INFO_KIND_NVDIMM;
+    } else {
+        info->u.dimm.data = di;
+        info->type = MEMORY_DEVICE_INFO_KIND_DIMM;
+    }
+}
+
 static void pc_dimm_class_init(ObjectClass *oc, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(oc);
     PCDIMMDeviceClass *ddc = PC_DIMM_CLASS(oc);
+    MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(oc);
 
     dc->realize = pc_dimm_realize;
     dc->unrealize = pc_dimm_unrealize;
@@ -457,6 +454,12 @@ static void pc_dimm_class_init(ObjectClass *oc, void *data)
 
     ddc->get_memory_region = pc_dimm_get_memory_region;
     ddc->get_vmstate_memory_region = pc_dimm_get_vmstate_memory_region;
+
+    mdc->get_addr = pc_dimm_md_get_addr;
+    /* for a dimm plugged_size == region_size */
+    mdc->get_plugged_size = pc_dimm_md_get_region_size;
+    mdc->get_region_size = pc_dimm_md_get_region_size;
+    mdc->fill_device_info = pc_dimm_md_fill_device_info;
 }
 
 static TypeInfo pc_dimm_info = {
@@ -466,6 +469,10 @@ static TypeInfo pc_dimm_info = {
     .instance_init = pc_dimm_init,
     .class_init    = pc_dimm_class_init,
     .class_size    = sizeof(PCDIMMDeviceClass),
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_MEMORY_DEVICE },
+        { }
+    },
 };
 
 static void pc_dimm_register_types(void)
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index a81570e7c8..a7428f7da7 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -74,6 +74,7 @@
 #include "hw/compat.h"
 #include "qemu/cutils.h"
 #include "hw/ppc/spapr_cpu_core.h"
+#include "hw/mem/memory-device.h"
 
 #include <libfdt.h>
 
@@ -722,7 +723,7 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
     }
 
     if (hotplug_lmb_start) {
-        dimms = qmp_pc_dimm_device_list();
+        dimms = qmp_memory_device_list();
     }
 
     /* ibm,dynamic-memory */
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 16bccdd5c0..4cdae3ca3a 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -14,6 +14,7 @@
 #include "kvm_ppc.h"
 #include "hw/ppc/spapr_ovec.h"
 #include "mmu-book3s-v3.h"
+#include "hw/mem/memory-device.h"
 
 struct SPRSyncState {
     int spr;
diff --git a/include/hw/mem/memory-device.h b/include/hw/mem/memory-device.h
new file mode 100644
index 0000000000..3e498b2e61
--- /dev/null
+++ b/include/hw/mem/memory-device.h
@@ -0,0 +1,44 @@
+/*
+ * Memory Device Interface
+ *
+ * Copyright (c) 2018 Red Hat, Inc.
+ *
+ * Authors:
+ *  David Hildenbrand <david@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef MEMORY_DEVICE_H
+#define MEMORY_DEVICE_H
+
+#include "qom/object.h"
+#include "hw/qdev.h"
+
+#define TYPE_MEMORY_DEVICE "memory-device"
+
+#define MEMORY_DEVICE_CLASS(klass) \
+     OBJECT_CLASS_CHECK(MemoryDeviceClass, (klass), TYPE_MEMORY_DEVICE)
+#define MEMORY_DEVICE_GET_CLASS(obj) \
+    OBJECT_GET_CLASS(MemoryDeviceClass, (obj), TYPE_MEMORY_DEVICE)
+#define MEMORY_DEVICE(obj) \
+     INTERFACE_CHECK(MemoryDeviceState, (obj), TYPE_MEMORY_DEVICE)
+
+typedef struct MemoryDeviceState {
+    Object parent_obj;
+} MemoryDeviceState;
+
+typedef struct MemoryDeviceClass {
+    InterfaceClass parent_class;
+
+    uint64_t (*get_addr)(MemoryDeviceState *md);
+    uint64_t (*get_plugged_size)(MemoryDeviceState *md, Error **errp);
+    uint64_t (*get_region_size)(MemoryDeviceState *md, Error **errp);
+    void (*fill_device_info)(MemoryDeviceState *md, MemoryDeviceInfo *info);
+} MemoryDeviceClass;
+
+MemoryDeviceInfoList *qmp_memory_device_list(void);
+uint64_t get_plugged_memory_size(void);
+
+#endif
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index 1fc479281c..e88073321f 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -93,9 +93,7 @@ uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
 
 int pc_dimm_get_free_slot(const int *hint, int max_slots, Error **errp);
 
-MemoryDeviceInfoList *qmp_pc_dimm_device_list(void);
 uint64_t pc_existing_dimms_capacity(Error **errp);
-uint64_t get_plugged_memory_size(void);
 void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
                          MemoryRegion *mr, uint64_t align, Error **errp);
 void pc_dimm_memory_unplug(DeviceState *dev, MemoryHotplugState *hpms,
diff --git a/numa.c b/numa.c
index 1116c90af9..6a0eaebc01 100644
--- a/numa.c
+++ b/numa.c
@@ -36,6 +36,7 @@
 #include "hw/boards.h"
 #include "sysemu/hostmem.h"
 #include "hw/mem/pc-dimm.h"
+#include "hw/mem/memory-device.h"
 #include "qemu/option.h"
 #include "qemu/config-file.h"
 #include "qemu/cutils.h"
@@ -520,7 +521,7 @@ void memory_region_allocate_system_memory(MemoryRegion *mr, Object *owner,
 
 static void numa_stat_memory_devices(NumaNodeMem node_mem[])
 {
-    MemoryDeviceInfoList *info_list = qmp_pc_dimm_device_list();
+    MemoryDeviceInfoList *info_list = qmp_memory_device_list();
     MemoryDeviceInfoList *info;
     PCDIMMDeviceInfo     *pcdimm_info;
 
diff --git a/qmp.c b/qmp.c
index f72261667f..3de029946a 100644
--- a/qmp.c
+++ b/qmp.c
@@ -39,7 +39,7 @@
 #include "qapi/qobject-input-visitor.h"
 #include "hw/boards.h"
 #include "qom/object_interfaces.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/memory-device.h"
 #include "hw/acpi/acpi_dev_interface.h"
 
 NameInfo *qmp_query_name(Error **errp)
@@ -731,7 +731,7 @@ void qmp_object_del(const char *id, Error **errp)
 
 MemoryDeviceInfoList *qmp_query_memory_devices(Error **errp)
 {
-    return qmp_pc_dimm_device_list();
+    return qmp_memory_device_list();
 }
 
 ACPIOSTInfoList *qmp_query_acpi_ospm_status(Error **errp)
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 2d59d84091..53d3f32cb2 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -34,7 +34,7 @@ stub-obj-y += uuid.o
 stub-obj-y += vm-stop.o
 stub-obj-y += vmstate.o
 stub-obj-$(CONFIG_WIN32) += fd-register.o
-stub-obj-y += qmp_pc_dimm.o
+stub-obj-y += qmp_memory_device.o
 stub-obj-y += target-monitor-defs.o
 stub-obj-y += target-get-monitor-def.o
 stub-obj-y += pc_madt_cpu_entry.o
diff --git a/stubs/qmp_pc_dimm.c b/stubs/qmp_memory_device.c
similarity index 61%
rename from stubs/qmp_pc_dimm.c
rename to stubs/qmp_memory_device.c
index b6b2cca89e..85ff8f2d7e 100644
--- a/stubs/qmp_pc_dimm.c
+++ b/stubs/qmp_memory_device.c
@@ -1,8 +1,8 @@
 #include "qemu/osdep.h"
 #include "qom/object.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/memory-device.h"
 
-MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
+MemoryDeviceInfoList *qmp_memory_device_list(void)
 {
    return NULL;
 }
-- 
2.14.3


* [Qemu-devel] [PATCH v3 2/3] machine: make MemoryHotplugState accessible via the machine
  2018-04-20 12:34 [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice David Hildenbrand
  2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface David Hildenbrand
@ 2018-04-20 12:34 ` David Hildenbrand
  2018-04-23  3:28   ` David Gibson
  2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code David Hildenbrand
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-04-20 12:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-s390x, Michael S . Tsirkin, Igor Mammedov, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost, David Gibson,
	Markus Armbruster, qemu-ppc, Pankaj Gupta, David Hildenbrand

Let's allow querying the MemoryHotplugState from the machine.

This allows us to generically detect whether a certain machine has support
for memory devices, and to manage them generically (find a free address
range, plug/unplug a memory region).

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/i386/pc.c             | 12 ++++++++++++
 hw/ppc/spapr.c           | 12 ++++++++++++
 include/hw/boards.h      | 16 ++++++++++++++++
 include/hw/mem/pc-dimm.h | 12 +-----------
 4 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index d36bac8c89..fa8862af33 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -2337,6 +2337,17 @@ static void x86_nmi(NMIState *n, int cpu_index, Error **errp)
     }
 }
 
+static MemoryHotplugState *pc_machine_get_memory_hotplug_state(MachineState *ms)
+{
+    PCMachineState *pcms = PC_MACHINE(ms);
+
+    if (!pcms->hotplug_memory.base) {
+        /* hotplug not supported */
+        return NULL;
+    }
+    return &pcms->hotplug_memory;
+}
+
 static void pc_machine_class_init(ObjectClass *oc, void *data)
 {
     MachineClass *mc = MACHINE_CLASS(oc);
@@ -2376,6 +2387,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
     hc->unplug = pc_machine_device_unplug_cb;
     nc->nmi_monitor_handler = x86_nmi;
     mc->default_cpu_type = TARGET_DEFAULT_CPU_TYPE;
+    mc->get_memory_hotplug_state = pc_machine_get_memory_hotplug_state;
 
     object_class_property_add(oc, PC_MACHINE_MEMHP_REGION_SIZE, "int",
         pc_machine_get_hotplug_memory_region_size, NULL,
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index a7428f7da7..7ccdb705b3 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3868,6 +3868,17 @@ PowerPCCPU *spapr_find_cpu(int vcpu_id)
     return NULL;
 }
 
+static MemoryHotplugState *spapr_get_memory_hotplug_state(MachineState *ms)
+{
+    sPAPRMachineState *spapr = SPAPR_MACHINE(ms);
+
+    if (!spapr->hotplug_memory.base) {
+        /* hotplug not supported */
+        return NULL;
+    }
+    return &spapr->hotplug_memory;
+}
+
 static void spapr_machine_class_init(ObjectClass *oc, void *data)
 {
     MachineClass *mc = MACHINE_CLASS(oc);
@@ -3927,6 +3938,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
      * in which LMBs are represented and hot-added
      */
     mc->numa_mem_align_shift = 28;
+    mc->get_memory_hotplug_state = spapr_get_memory_hotplug_state;
 
     smc->default_caps.caps[SPAPR_CAP_HTM] = SPAPR_CAP_OFF;
     smc->default_caps.caps[SPAPR_CAP_VSX] = SPAPR_CAP_ON;
diff --git a/include/hw/boards.h b/include/hw/boards.h
index a609239112..447bdc7219 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -105,6 +105,17 @@ typedef struct {
     CPUArchId cpus[0];
 } CPUArchIdList;
 
+/**
+ * MemoryHotplugState:
+ * @base: address in guest physical address space where hotplug memory
+ * address space begins.
+ * @mr: hotplug memory address space container
+ */
+typedef struct MemoryHotplugState {
+    hwaddr base;
+    MemoryRegion mr;
+} MemoryHotplugState;
+
 /**
  * MachineClass:
  * @max_cpus: maximum number of CPUs supported. Default: 1
@@ -156,6 +167,10 @@ typedef struct {
  *    should instead use "unimplemented-device" for all memory ranges where
  *    the guest will attempt to probe for a device that QEMU doesn't
  *    implement and a stub device is required.
+ * @get_memory_hotplug_state:
 *    If a machine supports memory hotplug, the returned data contains
+ *    information about the portion of guest physical address space
+ *    where memory devices can be mapped to (e.g. to hotplug a pc-dimm).
  */
 struct MachineClass {
     /*< private >*/
@@ -212,6 +227,7 @@ struct MachineClass {
                                                          unsigned cpu_index);
     const CPUArchIdList *(*possible_cpu_arch_ids)(MachineState *machine);
     int64_t (*get_default_cpu_node_id)(const MachineState *ms, int idx);
+    MemoryHotplugState *(*get_memory_hotplug_state)(MachineState *ms);
 };
 
 /**
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index e88073321f..8bda37adab 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -19,6 +19,7 @@
 #include "exec/memory.h"
 #include "sysemu/hostmem.h"
 #include "hw/qdev.h"
+#include "hw/boards.h"
 
 #define TYPE_PC_DIMM "pc-dimm"
 #define PC_DIMM(obj) \
@@ -75,17 +76,6 @@ typedef struct PCDIMMDeviceClass {
     MemoryRegion *(*get_vmstate_memory_region)(PCDIMMDevice *dimm);
 } PCDIMMDeviceClass;
 
-/**
- * MemoryHotplugState:
- * @base: address in guest physical address space where hotplug memory
- * address space begins.
- * @mr: hotplug memory address space container
- */
-typedef struct MemoryHotplugState {
-    hwaddr base;
-    MemoryRegion mr;
-} MemoryHotplugState;
-
 uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
                                uint64_t address_space_size,
                                uint64_t *hint, uint64_t align, uint64_t size,
-- 
2.14.3


* [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-20 12:34 [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice David Hildenbrand
  2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface David Hildenbrand
  2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 2/3] machine: make MemoryHotplugState accessible via the machine David Hildenbrand
@ 2018-04-20 12:34 ` David Hildenbrand
  2018-04-23 12:19   ` Igor Mammedov
  2018-04-22  4:58 ` [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice Pankaj Gupta
  2018-04-23 12:31 ` Igor Mammedov
  4 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-04-20 12:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-s390x, Michael S . Tsirkin, Igor Mammedov, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost, David Gibson,
	Markus Armbruster, qemu-ppc, Pankaj Gupta, David Hildenbrand

To be able to reuse MemoryDevice logic from other devices besides
pc-dimm, factor the relevant stuff out into the MemoryDevice code.

As we don't care about slots for memory devices that are not pc-dimm,
don't factor that part out.

Most of this patch just moves checks and logic around. While at it, make
the code detect certain error conditions (e.g. fragmented memory) more
reliably.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/i386/pc.c                   |  12 +--
 hw/mem/memory-device.c         | 162 ++++++++++++++++++++++++++++++++++++
 hw/mem/pc-dimm.c               | 185 +++--------------------------------------
 hw/ppc/spapr.c                 |   9 +-
 include/hw/mem/memory-device.h |   4 +
 include/hw/mem/pc-dimm.h       |  14 +---
 6 files changed, 185 insertions(+), 201 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index fa8862af33..1c25546a0c 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1711,7 +1711,7 @@ static void pc_dimm_plug(HotplugHandler *hotplug_dev,
         goto out;
     }
 
-    pc_dimm_memory_plug(dev, &pcms->hotplug_memory, mr, align, &local_err);
+    pc_dimm_memory_plug(dev, align, &local_err);
     if (local_err) {
         goto out;
     }
@@ -1761,17 +1761,9 @@ static void pc_dimm_unplug(HotplugHandler *hotplug_dev,
                            DeviceState *dev, Error **errp)
 {
     PCMachineState *pcms = PC_MACHINE(hotplug_dev);
-    PCDIMMDevice *dimm = PC_DIMM(dev);
-    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
-    MemoryRegion *mr;
     HotplugHandlerClass *hhc;
     Error *local_err = NULL;
 
-    mr = ddc->get_memory_region(dimm, &local_err);
-    if (local_err) {
-        goto out;
-    }
-
     hhc = HOTPLUG_HANDLER_GET_CLASS(pcms->acpi_dev);
     hhc->unplug(HOTPLUG_HANDLER(pcms->acpi_dev), dev, &local_err);
 
@@ -1779,7 +1771,7 @@ static void pc_dimm_unplug(HotplugHandler *hotplug_dev,
         goto out;
     }
 
-    pc_dimm_memory_unplug(dev, &pcms->hotplug_memory, mr);
+    pc_dimm_memory_unplug(dev);
     object_unparent(OBJECT(dev));
 
  out:
diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index b860c9c582..b96efa3bf4 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -15,6 +15,8 @@
 #include "qapi/error.h"
 #include "hw/boards.h"
 #include "qemu/range.h"
+#include "hw/virtio/vhost.h"
+#include "sysemu/kvm.h"
 
 static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
 {
@@ -106,6 +108,166 @@ uint64_t get_plugged_memory_size(void)
     return size;
 }
 
+static int memory_device_used_region_size_internal(Object *obj, void *opaque)
+{
+    uint64_t *size = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
+        DeviceState *dev = DEVICE(obj);
+        MemoryDeviceState *md = MEMORY_DEVICE(obj);
+        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj);
+
+        if (dev->realized) {
+            *size += mdc->get_region_size(md, &error_abort);
+        }
+    }
+
+    object_child_foreach(obj, memory_device_used_region_size_internal, opaque);
+    return 0;
+}
+
+static uint64_t memory_device_used_region_size(void)
+{
+    uint64_t size = 0;
+
+    memory_device_used_region_size_internal(qdev_get_machine(), &size);
+
+    return size;
+}
+
+uint64_t memory_device_get_free_addr(uint64_t *hint, uint64_t align,
+                                     uint64_t size, Error **errp)
+{
+    const uint64_t used_region_size = memory_device_used_region_size();
+    uint64_t address_space_start, address_space_end;
+    MachineState *machine = MACHINE(qdev_get_machine());
+    MachineClass *mc = MACHINE_GET_CLASS(machine);
+    MemoryHotplugState *hpms;
+    GSList *list = NULL, *item;
+    uint64_t new_addr = 0;
+
+    if (!mc->get_memory_hotplug_state) {
+        error_setg(errp, "memory devices (e.g. for memory hotplug) are not "
+                         "supported by the machine");
+        return 0;
+    }
+
+    hpms = mc->get_memory_hotplug_state(machine);
+    if (!hpms || !memory_region_size(&hpms->mr)) {
+        error_setg(errp, "memory devices (e.g. for memory hotplug) are not "
+                         "enabled, please specify the maxmem option");
+        return 0;
+    }
+    address_space_start = hpms->base;
+    address_space_end = address_space_start + memory_region_size(&hpms->mr);
+    g_assert(address_space_end >= address_space_start);
+
+    if (used_region_size + size > machine->maxram_size - machine->ram_size) {
+        error_setg(errp, "not enough space, currently 0x%" PRIx64
+                   " in use of total hot pluggable 0x" RAM_ADDR_FMT,
+                   used_region_size, machine->maxram_size - machine->ram_size);
+        return 0;
+    }
+
+    if (hint && QEMU_ALIGN_UP(*hint, align) != *hint) {
+        error_setg(errp, "address must be aligned to 0x%" PRIx64 " bytes",
+                   align);
+        return 0;
+    }
+
+    if (QEMU_ALIGN_UP(size, align) != size) {
+        error_setg(errp, "backend memory size must be multiple of 0x%"
+                   PRIx64, align);
+        return 0;
+    }
+
+    if (hint) {
+        new_addr = *hint;
+        if (new_addr < address_space_start) {
+            error_setg(errp, "can't add memory [0x%" PRIx64 ":0x%" PRIx64
+                       "] at 0x%" PRIx64, new_addr, size, address_space_start);
+            return 0;
+        } else if ((new_addr + size) > address_space_end) {
+            error_setg(errp, "can't add memory [0x%" PRIx64 ":0x%" PRIx64
+                       "] beyond 0x%" PRIx64, new_addr, size,
+                       address_space_end);
+            return 0;
+        }
+    } else {
+        new_addr = address_space_start;
+    }
+
+    /* find address range that will fit new memory device */
+    object_child_foreach(qdev_get_machine(), memory_device_built_list, &list);
+    for (item = list; item; item = g_slist_next(item)) {
+        MemoryDeviceState *md = item->data;
+        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(OBJECT(md));
+        uint64_t md_size, md_addr;
+
+        md_addr = mdc->get_addr(md);
+        md_size = mdc->get_region_size(md, errp);
+        if (errp && *errp) {
+            goto out;
+        }
+
+        if (ranges_overlap(md_addr, md_size, new_addr, size)) {
+            if (hint) {
+                DeviceState *d = DEVICE(md);
+                error_setg(errp, "address range conflicts with '%s'", d->id);
+                goto out;
+            }
+            new_addr = QEMU_ALIGN_UP(md_addr + md_size, align);
+        }
+    }
+
+    if (new_addr + size > address_space_end) {
+        error_setg(errp, "could not find position in guest address space for "
+                   "memory device - memory fragmented due to alignments");
+        goto out;
+    }
+out:
+    g_slist_free(list);
+    return new_addr;
+}
+
+void memory_device_plug_region(MemoryRegion *mr, uint64_t addr, Error **errp)
+{
+    MachineState *machine = MACHINE(qdev_get_machine());
+    MachineClass *mc = MACHINE_GET_CLASS(machine);
+    MemoryHotplugState *hpms;
+
+    /* we expect a previous call to memory_device_get_free_addr() */
+    g_assert(mc->get_memory_hotplug_state);
+    hpms = mc->get_memory_hotplug_state(machine);
+    g_assert(hpms);
+
+    /* we will need a new memory slot for kvm and vhost */
+    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
+        error_setg(errp, "hypervisor has no free memory slots left");
+        return;
+    }
+    if (!vhost_has_free_slot()) {
+        error_setg(errp, "a used vhost backend has no free memory slots left");
+        return;
+    }
+
+    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
+}
+
+void memory_device_unplug_region(MemoryRegion *mr)
+{
+    MachineState *machine = MACHINE(qdev_get_machine());
+    MachineClass *mc = MACHINE_GET_CLASS(machine);
+    MemoryHotplugState *hpms;
+
+    /* we expect a previous call to memory_device_get_free_addr() */
+    g_assert(mc->get_memory_hotplug_state);
+    hpms = mc->get_memory_hotplug_state(machine);
+    g_assert(hpms);
+
+    memory_region_del_subregion(&hpms->mr, mr);
+}
+
 static const TypeInfo memory_device_info = {
     .name          = TYPE_MEMORY_DEVICE,
     .parent        = TYPE_INTERFACE,
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 1dbf699e02..cf23ab5d76 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -25,19 +25,10 @@
 #include "qapi/error.h"
 #include "qemu/config-file.h"
 #include "qapi/visitor.h"
-#include "qemu/range.h"
 #include "sysemu/numa.h"
-#include "sysemu/kvm.h"
 #include "trace.h"
-#include "hw/virtio/vhost.h"
 
-typedef struct pc_dimms_capacity {
-     uint64_t size;
-     Error    **errp;
-} pc_dimms_capacity;
-
-void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
-                         MemoryRegion *mr, uint64_t align, Error **errp)
+void pc_dimm_memory_plug(DeviceState *dev, uint64_t align, Error **errp)
 {
     int slot;
     MachineState *machine = MACHINE(qdev_get_machine());
@@ -45,37 +36,26 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
     PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
     MemoryRegion *vmstate_mr = ddc->get_vmstate_memory_region(dimm);
     Error *local_err = NULL;
-    uint64_t existing_dimms_capacity = 0;
+    MemoryRegion *mr;
     uint64_t addr;
 
-    addr = object_property_get_uint(OBJECT(dimm),
-                                    PC_DIMM_ADDR_PROP, &local_err);
+    mr = ddc->get_memory_region(dimm, &local_err);
     if (local_err) {
         goto out;
     }
 
-    addr = pc_dimm_get_free_addr(hpms->base,
-                                 memory_region_size(&hpms->mr),
-                                 !addr ? NULL : &addr, align,
-                                 memory_region_size(mr), &local_err);
+    addr = object_property_get_uint(OBJECT(dimm),
+                                    PC_DIMM_ADDR_PROP, &local_err);
     if (local_err) {
         goto out;
     }
 
-    existing_dimms_capacity = pc_existing_dimms_capacity(&local_err);
+    addr = memory_device_get_free_addr(!addr ? NULL : &addr, align,
+                                       memory_region_size(mr), &local_err);
     if (local_err) {
         goto out;
     }
 
-    if (existing_dimms_capacity + memory_region_size(mr) >
-        machine->maxram_size - machine->ram_size) {
-        error_setg(&local_err, "not enough space, currently 0x%" PRIx64
-                   " in use of total hot pluggable 0x" RAM_ADDR_FMT,
-                   existing_dimms_capacity,
-                   machine->maxram_size - machine->ram_size);
-        goto out;
-    }
-
     object_property_set_uint(OBJECT(dev), addr, PC_DIMM_ADDR_PROP, &local_err);
     if (local_err) {
         goto out;
@@ -98,67 +78,27 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
     }
     trace_mhp_pc_dimm_assigned_slot(slot);
 
-    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
-        error_setg(&local_err, "hypervisor has no free memory slots left");
-        goto out;
-    }
-
-    if (!vhost_has_free_slot()) {
-        error_setg(&local_err, "a used vhost backend has no free"
-                               " memory slots left");
+    memory_device_plug_region(mr, addr, &local_err);
+    if (local_err) {
         goto out;
     }
-
-    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
     vmstate_register_ram(vmstate_mr, dev);
 
 out:
     error_propagate(errp, local_err);
 }
 
-void pc_dimm_memory_unplug(DeviceState *dev, MemoryHotplugState *hpms,
-                           MemoryRegion *mr)
+void pc_dimm_memory_unplug(DeviceState *dev)
 {
     PCDIMMDevice *dimm = PC_DIMM(dev);
     PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
     MemoryRegion *vmstate_mr = ddc->get_vmstate_memory_region(dimm);
+    MemoryRegion *mr = ddc->get_memory_region(dimm, &error_abort);
 
-    memory_region_del_subregion(&hpms->mr, mr);
+    memory_device_unplug_region(mr);
     vmstate_unregister_ram(vmstate_mr, dev);
 }
 
-static int pc_existing_dimms_capacity_internal(Object *obj, void *opaque)
-{
-    pc_dimms_capacity *cap = opaque;
-    uint64_t *size = &cap->size;
-
-    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
-        DeviceState *dev = DEVICE(obj);
-
-        if (dev->realized) {
-            (*size) += object_property_get_uint(obj, PC_DIMM_SIZE_PROP,
-                cap->errp);
-        }
-
-        if (cap->errp && *cap->errp) {
-            return 1;
-        }
-    }
-    object_child_foreach(obj, pc_existing_dimms_capacity_internal, opaque);
-    return 0;
-}
-
-uint64_t pc_existing_dimms_capacity(Error **errp)
-{
-    pc_dimms_capacity cap;
-
-    cap.size = 0;
-    cap.errp = errp;
-
-    pc_existing_dimms_capacity_internal(qdev_get_machine(), &cap);
-    return cap.size;
-}
-
 static int pc_dimm_slot2bitmap(Object *obj, void *opaque)
 {
     unsigned long *bitmap = opaque;
@@ -205,107 +145,6 @@ out:
     return slot;
 }
 
-static gint pc_dimm_addr_sort(gconstpointer a, gconstpointer b)
-{
-    PCDIMMDevice *x = PC_DIMM(a);
-    PCDIMMDevice *y = PC_DIMM(b);
-    Int128 diff = int128_sub(int128_make64(x->addr), int128_make64(y->addr));
-
-    if (int128_lt(diff, int128_zero())) {
-        return -1;
-    } else if (int128_gt(diff, int128_zero())) {
-        return 1;
-    }
-    return 0;
-}
-
-static int pc_dimm_built_list(Object *obj, void *opaque)
-{
-    GSList **list = opaque;
-
-    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
-        DeviceState *dev = DEVICE(obj);
-        if (dev->realized) { /* only realized DIMMs matter */
-            *list = g_slist_insert_sorted(*list, dev, pc_dimm_addr_sort);
-        }
-    }
-
-    object_child_foreach(obj, pc_dimm_built_list, opaque);
-    return 0;
-}
-
-uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
-                               uint64_t address_space_size,
-                               uint64_t *hint, uint64_t align, uint64_t size,
-                               Error **errp)
-{
-    GSList *list = NULL, *item;
-    uint64_t new_addr, ret = 0;
-    uint64_t address_space_end = address_space_start + address_space_size;
-
-    g_assert(QEMU_ALIGN_UP(address_space_start, align) == address_space_start);
-
-    if (!address_space_size) {
-        error_setg(errp, "memory hotplug is not enabled, "
-                         "please add maxmem option");
-        goto out;
-    }
-
-    if (hint && QEMU_ALIGN_UP(*hint, align) != *hint) {
-        error_setg(errp, "address must be aligned to 0x%" PRIx64 " bytes",
-                   align);
-        goto out;
-    }
-
-    if (QEMU_ALIGN_UP(size, align) != size) {
-        error_setg(errp, "backend memory size must be multiple of 0x%"
-                   PRIx64, align);
-        goto out;
-    }
-
-    assert(address_space_end > address_space_start);
-    object_child_foreach(qdev_get_machine(), pc_dimm_built_list, &list);
-
-    if (hint) {
-        new_addr = *hint;
-    } else {
-        new_addr = address_space_start;
-    }
-
-    /* find address range that will fit new DIMM */
-    for (item = list; item; item = g_slist_next(item)) {
-        PCDIMMDevice *dimm = item->data;
-        uint64_t dimm_size = object_property_get_uint(OBJECT(dimm),
-                                                      PC_DIMM_SIZE_PROP,
-                                                      errp);
-        if (errp && *errp) {
-            goto out;
-        }
-
-        if (ranges_overlap(dimm->addr, dimm_size, new_addr, size)) {
-            if (hint) {
-                DeviceState *d = DEVICE(dimm);
-                error_setg(errp, "address range conflicts with '%s'", d->id);
-                goto out;
-            }
-            new_addr = QEMU_ALIGN_UP(dimm->addr + dimm_size, align);
-        }
-    }
-    ret = new_addr;
-
-    if (new_addr < address_space_start) {
-        error_setg(errp, "can't add memory [0x%" PRIx64 ":0x%" PRIx64
-                   "] at 0x%" PRIx64, new_addr, size, address_space_start);
-    } else if ((new_addr + size) > address_space_end) {
-        error_setg(errp, "can't add memory [0x%" PRIx64 ":0x%" PRIx64
-                   "] beyond 0x%" PRIx64, new_addr, size, address_space_end);
-    }
-
-out:
-    g_slist_free(list);
-    return ret;
-}
-
 static Property pc_dimm_properties[] = {
     DEFINE_PROP_UINT64(PC_DIMM_ADDR_PROP, PCDIMMDevice, addr, 0),
     DEFINE_PROP_UINT32(PC_DIMM_NODE_PROP, PCDIMMDevice, node, 0),
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 7ccdb705b3..7757a49335 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3041,7 +3041,7 @@ static void spapr_memory_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
     align = memory_region_get_alignment(mr);
     size = memory_region_size(mr);
 
-    pc_dimm_memory_plug(dev, &ms->hotplug_memory, mr, align, &local_err);
+    pc_dimm_memory_plug(dev, align, &local_err);
     if (local_err) {
         goto out;
     }
@@ -3062,7 +3062,7 @@ static void spapr_memory_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
     return;
 
 out_unplug:
-    pc_dimm_memory_unplug(dev, &ms->hotplug_memory, mr);
+    pc_dimm_memory_unplug(dev);
 out:
     error_propagate(errp, local_err);
 }
@@ -3180,9 +3180,6 @@ static sPAPRDIMMState *spapr_recover_pending_dimm_state(sPAPRMachineState *ms,
 void spapr_lmb_release(DeviceState *dev)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_hotplug_handler(dev));
-    PCDIMMDevice *dimm = PC_DIMM(dev);
-    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
-    MemoryRegion *mr = ddc->get_memory_region(dimm, &error_abort);
     sPAPRDIMMState *ds = spapr_pending_dimm_unplugs_find(spapr, PC_DIMM(dev));
 
     /* This information will get lost if a migration occurs
@@ -3202,7 +3199,7 @@ void spapr_lmb_release(DeviceState *dev)
      * Now that all the LMBs have been removed by the guest, call the
      * pc-dimm unplug handler to cleanup up the pc-dimm device.
      */
-    pc_dimm_memory_unplug(dev, &spapr->hotplug_memory, mr);
+    pc_dimm_memory_unplug(dev);
     object_unparent(OBJECT(dev));
     spapr_pending_dimm_unplugs_remove(spapr, ds);
 }
diff --git a/include/hw/mem/memory-device.h b/include/hw/mem/memory-device.h
index 3e498b2e61..722620da24 100644
--- a/include/hw/mem/memory-device.h
+++ b/include/hw/mem/memory-device.h
@@ -40,5 +40,9 @@ typedef struct MemoryDeviceClass {
 
 MemoryDeviceInfoList *qmp_memory_device_list(void);
 uint64_t get_plugged_memory_size(void);
+uint64_t memory_device_get_free_addr(uint64_t *hint, uint64_t align,
+                                     uint64_t size, Error **errp);
+void memory_device_plug_region(MemoryRegion *mr, uint64_t addr, Error **errp);
+void memory_device_unplug_region(MemoryRegion *mr);
 
 #endif
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index 8bda37adab..2e7c2abe35 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -19,7 +19,6 @@
 #include "exec/memory.h"
 #include "sysemu/hostmem.h"
 #include "hw/qdev.h"
-#include "hw/boards.h"
 
 #define TYPE_PC_DIMM "pc-dimm"
 #define PC_DIMM(obj) \
@@ -76,16 +75,7 @@ typedef struct PCDIMMDeviceClass {
     MemoryRegion *(*get_vmstate_memory_region)(PCDIMMDevice *dimm);
 } PCDIMMDeviceClass;
 
-uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
-                               uint64_t address_space_size,
-                               uint64_t *hint, uint64_t align, uint64_t size,
-                               Error **errp);
-
 int pc_dimm_get_free_slot(const int *hint, int max_slots, Error **errp);
-
-uint64_t pc_existing_dimms_capacity(Error **errp);
-void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
-                         MemoryRegion *mr, uint64_t align, Error **errp);
-void pc_dimm_memory_unplug(DeviceState *dev, MemoryHotplugState *hpms,
-                           MemoryRegion *mr);
+void pc_dimm_memory_plug(DeviceState *dev, uint64_t align, Error **errp);
+void pc_dimm_memory_unplug(DeviceState *dev);
 #endif
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface
  2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface David Hildenbrand
@ 2018-04-22  4:26   ` David Gibson
  2018-04-22  8:21     ` David Hildenbrand
  2018-04-22  5:09   ` Pankaj Gupta
  1 sibling, 1 reply; 49+ messages in thread
From: David Gibson @ 2018-04-22  4:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Igor Mammedov,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Markus Armbruster, qemu-ppc, Pankaj Gupta


On Fri, Apr 20, 2018 at 02:34:54PM +0200, David Hildenbrand wrote:
> On the qmp level, we already have the concept of memory devices:
>     "query-memory-devices"
> Right now, we only support NVDIMM and PCDIMM.
> 
> We want to map other devices later into the address space of the guest.
> Such devices could e.g. be virtio devices. These devices will have a
> guest memory range assigned but won't be exposed via e.g. ACPI. We want
> to make them look like memory devices, but not glued to pc-dimm.
> 
> Especially, it will not always be possible to have TYPE_PC_DIMM as a parent
> class (e.g. virtio devices). Let's use an interface instead. As a first
> part, convert handling of
> - qmp_pc_dimm_device_list
> - get_plugged_memory_size
> to our new model. plug/unplug stuff etc. will follow later.
> 
> A memory device will have to provide the following functions:
> - get_addr(): Necessary, as the property "addr" can e.g. not be used for
>               virtio devices (already defined).
> - get_plugged_size(): The amount this device offers to the guest as of
>                       now.
> - get_region_size(): Because this can later on be bigger than the
>                      plugged size.
> - fill_device_info(): Fill MemoryDeviceInfo, e.g. for qmp.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

with the exception of some tiny nits..

[snip]
> +static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
> +{
> +    MemoryDeviceState *md_a = MEMORY_DEVICE(a);
> +    MemoryDeviceState *md_b = MEMORY_DEVICE(b);

These probably should be const MemoryDeviceState *.

> +    MemoryDeviceClass *mdc_a = MEMORY_DEVICE_GET_CLASS(a);
> +    MemoryDeviceClass *mdc_b = MEMORY_DEVICE_GET_CLASS(b);
> +    const uint64_t addr_a = mdc_a->get_addr(md_a);
> +    const uint64_t addr_b = mdc_b->get_addr(md_b);
> +
> +    if (addr_a > addr_b) {
> +        return 1;
> +    } else if (addr_a < addr_b) {
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +static int memory_device_built_list(Object *obj, void *opaque)

s/built/build/ will read a bit more clearly I think.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-20 12:34 [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice David Hildenbrand
                   ` (2 preceding siblings ...)
  2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code David Hildenbrand
@ 2018-04-22  4:58 ` Pankaj Gupta
  2018-04-22  8:20   ` David Hildenbrand
  2018-04-23 12:31 ` Igor Mammedov
  4 siblings, 1 reply; 49+ messages in thread
From: Pankaj Gupta @ 2018-04-22  4:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Eduardo Habkost, Michael S . Tsirkin,
	Markus Armbruster, qemu-s390x, qemu-ppc, Paolo Bonzini,
	Marcel Apfelbaum, Igor Mammedov, David Gibson, Richard Henderson


Hello,

This is good refactoring, and it is needed for 'virtio-pmem' as well, to
reserve a memory region in the system address space.

I have tested this code with virtio-pmem and it's working fine. Thank you
for the work.

I just have a small suggestion: when functions in the interface (e.g.
get_addr(), get_plugged_size(), etc.) are not provided by the derived class,
QEMU crashes.

I think having a contract for must-override functions, with a NULL check and
an error at the calling sites, would be better?

Thanks,
Pankaj  

> Right now we can only map PCDIMM/NVDIMM into guest address space. In the
> future, we might want to do the same for virtio devices - e.g.
> virtio-pmem or virtio-mem. Especially, they should be able to live side
> by side with each other.
> 
> E.g. the virtio based memory device regions will not be exposed via ACPI
> and friends. They will be detected just like other virtio devices and
> indicate the applicable memory region. This makes it possible to also use
> them on architectures without memory device detection support (e.g. s390x).
> 
> Let's factor out the memory device code into a MemoryDevice interface.
> 
> 
> v2 -> v3:
> - "pc-dimm: factor out MemoryDevice interface"
> --> Lookup both classes when comparing (David Gibson)
> 
> v1 -> v2:
> - Fix compile issues on ppc (still untested)
> 
> 
> David Hildenbrand (3):
>   pc-dimm: factor out MemoryDevice interface
>   machine: make MemoryHotplugState accessible via the machine
>   pc-dimm: factor out address space logic into MemoryDevice code
> 
>  hw/i386/acpi-build.c                         |   3 +-
>  hw/i386/pc.c                                 |  24 ++-
>  hw/mem/Makefile.objs                         |   1 +
>  hw/mem/memory-device.c                       | 282 +++++++++++++++++++++++++
>  hw/mem/pc-dimm.c                             | 304
>  +++++++--------------------
>  hw/ppc/spapr.c                               |  24 ++-
>  hw/ppc/spapr_hcall.c                         |   1 +
>  include/hw/boards.h                          |  16 ++
>  include/hw/mem/memory-device.h               |  48 +++++
>  include/hw/mem/pc-dimm.h                     |  26 +--
>  numa.c                                       |   3 +-
>  qmp.c                                        |   4 +-
>  stubs/Makefile.objs                          |   2 +-
>  stubs/{qmp_pc_dimm.c => qmp_memory_device.c} |   4 +-
>  14 files changed, 465 insertions(+), 277 deletions(-)
>  create mode 100644 hw/mem/memory-device.c
>  create mode 100644 include/hw/mem/memory-device.h
>  rename stubs/{qmp_pc_dimm.c => qmp_memory_device.c} (61%)
> 
> --
> 2.14.3
> 
> 
> 


* Re: [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface
  2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface David Hildenbrand
  2018-04-22  4:26   ` David Gibson
@ 2018-04-22  5:09   ` Pankaj Gupta
  2018-04-22  8:26     ` David Hildenbrand
  1 sibling, 1 reply; 49+ messages in thread
From: Pankaj Gupta @ 2018-04-22  5:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Eduardo Habkost, Michael S . Tsirkin,
	Markus Armbruster, qemu-s390x, qemu-ppc, Paolo Bonzini,
	Marcel Apfelbaum, Igor Mammedov, David Gibson, Richard Henderson


> 
> On the qmp level, we already have the concept of memory devices:
>     "query-memory-devices"
> Right now, we only support NVDIMM and PCDIMM.
> 
> We want to map other devices later into the address space of the guest.
> Such devices could e.g. be virtio devices. These devices will have a
> guest memory range assigned but won't be exposed via e.g. ACPI. We want
> to make them look like memory devices, but not glued to pc-dimm.
> 
> Especially, it will not always be possible to have TYPE_PC_DIMM as a parent
> class (e.g. virtio devices). Let's use an interface instead. As a first
> part, convert handling of
> - qmp_pc_dimm_device_list
> - get_plugged_memory_size
> to our new model. plug/unplug stuff etc. will follow later.
> 
> A memory device will have to provide the following functions:
> - get_addr(): Necessary, as the property "addr" can e.g. not be used for
>               virtio devices (already defined).
> - get_plugged_size(): The amount this device offers to the guest as of
>                       now.
> - get_region_size(): Because this can later on be bigger than the
>                      plugged size.
> - fill_device_info(): Fill MemoryDeviceInfo, e.g. for qmp.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/i386/acpi-build.c                         |   3 +-
>  hw/mem/Makefile.objs                         |   1 +
>  hw/mem/memory-device.c                       | 120
>  +++++++++++++++++++++++++++
>  hw/mem/pc-dimm.c                             | 119
>  +++++++++++++-------------
>  hw/ppc/spapr.c                               |   3 +-
>  hw/ppc/spapr_hcall.c                         |   1 +
>  include/hw/mem/memory-device.h               |  44 ++++++++++
>  include/hw/mem/pc-dimm.h                     |   2 -
>  numa.c                                       |   3 +-
>  qmp.c                                        |   4 +-
>  stubs/Makefile.objs                          |   2 +-
>  stubs/{qmp_pc_dimm.c => qmp_memory_device.c} |   4 +-
>  12 files changed, 240 insertions(+), 66 deletions(-)
>  create mode 100644 hw/mem/memory-device.c
>  create mode 100644 include/hw/mem/memory-device.h
>  rename stubs/{qmp_pc_dimm.c => qmp_memory_device.c} (61%)
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 3cf2a1679c..ca3645d57b 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -46,6 +46,7 @@
>  #include "hw/acpi/vmgenid.h"
>  #include "sysemu/tpm_backend.h"
>  #include "hw/timer/mc146818rtc_regs.h"
> +#include "hw/mem/memory-device.h"
>  #include "sysemu/numa.h"
>  
>  /* Supported chipsets: */
> @@ -2253,7 +2254,7 @@ build_tpm2(GArray *table_data, BIOSLinker *linker,
> GArray *tcpalog)
>  static void build_srat_hotpluggable_memory(GArray *table_data, uint64_t
>  base,
>                                             uint64_t len, int default_node)
>  {
> -    MemoryDeviceInfoList *info_list = qmp_pc_dimm_device_list();
> +    MemoryDeviceInfoList *info_list = qmp_memory_device_list();
>      MemoryDeviceInfoList *info;
>      MemoryDeviceInfo *mi;
>      PCDIMMDeviceInfo *di;
> diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
> index f12f8b97a2..10be4df2a2 100644
> --- a/hw/mem/Makefile.objs
> +++ b/hw/mem/Makefile.objs
> @@ -1,2 +1,3 @@
>  common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
> +common-obj-$(CONFIG_MEM_HOTPLUG) += memory-device.o
>  common-obj-$(CONFIG_NVDIMM) += nvdimm.o
> diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
> new file mode 100644
> index 0000000000..b860c9c582
> --- /dev/null
> +++ b/hw/mem/memory-device.c
> @@ -0,0 +1,120 @@
> +/*
> + * Memory Device Interface
> + *
> + * Copyright ProfitBricks GmbH 2012
> + * Copyright (C) 2014 Red Hat Inc
> + * Copyright (c) 2018 Red Hat Inc
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/mem/memory-device.h"
> +#include "hw/qdev.h"
> +#include "qapi/error.h"
> +#include "hw/boards.h"
> +#include "qemu/range.h"
> +
> +static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
> +{
> +    MemoryDeviceState *md_a = MEMORY_DEVICE(a);
> +    MemoryDeviceState *md_b = MEMORY_DEVICE(b);
> +    MemoryDeviceClass *mdc_a = MEMORY_DEVICE_GET_CLASS(a);
> +    MemoryDeviceClass *mdc_b = MEMORY_DEVICE_GET_CLASS(b);
> +    const uint64_t addr_a = mdc_a->get_addr(md_a);
> +    const uint64_t addr_b = mdc_b->get_addr(md_b);
> +
> +    if (addr_a > addr_b) {
> +        return 1;
> +    } else if (addr_a < addr_b) {
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +static int memory_device_built_list(Object *obj, void *opaque)
> +{
> +    GSList **list = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
> +        DeviceState *dev = DEVICE(obj);
> +        if (dev->realized) { /* only realized memory devices matter */
> +            *list = g_slist_insert_sorted(*list, dev,
> memory_device_addr_sort);
> +        }
> +    }
> +
> +    object_child_foreach(obj, memory_device_built_list, opaque);
> +    return 0;
> +}
> +
> +MemoryDeviceInfoList *qmp_memory_device_list(void)
> +{
> +    GSList *devices = NULL, *item;
> +    MemoryDeviceInfoList *list = NULL, *prev = NULL;
> +
> +    object_child_foreach(qdev_get_machine(), memory_device_built_list,
> +                         &devices);
> +
> +    for (item = devices; item; item = g_slist_next(item)) {
> +        MemoryDeviceState *md = MEMORY_DEVICE(item->data);
> +        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(item->data);
> +        MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
> +        MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
> +
> +        mdc->fill_device_info(md, info);
> +
> +        elem->value = info;
> +        elem->next = NULL;
> +        if (prev) {
> +            prev->next = elem;
> +        } else {
> +            list = elem;
> +        }
> +        prev = elem;
> +    }
> +
> +    g_slist_free(devices);
> +
> +    return list;
> +}
> +
> +static int memory_device_plugged_size(Object *obj, void *opaque)
> +{
> +    uint64_t *size = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
> +        DeviceState *dev = DEVICE(obj);
> +        MemoryDeviceState *md = MEMORY_DEVICE(obj);
> +        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj);
> +
> +        if (dev->realized) {
> +            *size += mdc->get_plugged_size(md, &error_abort);
> +        }
> +    }
> +
> +    object_child_foreach(obj, memory_device_plugged_size, opaque);
> +    return 0;
> +}
> +
> +uint64_t get_plugged_memory_size(void)
> +{
> +    uint64_t size = 0;
> +
> +    memory_device_plugged_size(qdev_get_machine(), &size);
> +
> +    return size;
> +}
> +
> +static const TypeInfo memory_device_info = {
> +    .name          = TYPE_MEMORY_DEVICE,
> +    .parent        = TYPE_INTERFACE,
> +    .class_size = sizeof(MemoryDeviceClass),
> +};
> +
> +static void memory_device_register_types(void)
> +{
> +    type_register_static(&memory_device_info);
> +}
> +
> +type_init(memory_device_register_types)
> diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
> index 51350d9c2d..1dbf699e02 100644
> --- a/hw/mem/pc-dimm.c
> +++ b/hw/mem/pc-dimm.c
> @@ -21,6 +21,7 @@
>  #include "qemu/osdep.h"
>  #include "hw/mem/pc-dimm.h"
>  #include "hw/mem/nvdimm.h"
> +#include "hw/mem/memory-device.h"
>  #include "qapi/error.h"
>  #include "qemu/config-file.h"
>  #include "qapi/visitor.h"
> @@ -158,11 +159,6 @@ uint64_t pc_existing_dimms_capacity(Error **errp)
>      return cap.size;
>  }
>  
> -uint64_t get_plugged_memory_size(void)
> -{
> -    return pc_existing_dimms_capacity(&error_abort);
> -}
> -
>  static int pc_dimm_slot2bitmap(Object *obj, void *opaque)
>  {
>      unsigned long *bitmap = opaque;
> @@ -238,57 +234,6 @@ static int pc_dimm_built_list(Object *obj, void *opaque)
>      return 0;
>  }
>  
> -MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
> -{
> -    GSList *dimms = NULL, *item;
> -    MemoryDeviceInfoList *list = NULL, *prev = NULL;
> -
> -    object_child_foreach(qdev_get_machine(), pc_dimm_built_list, &dimms);
> -
> -    for (item = dimms; item; item = g_slist_next(item)) {
> -        PCDIMMDevice *dimm = PC_DIMM(item->data);
> -        Object *obj = OBJECT(dimm);
> -        MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
> -        MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
> -        PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
> -        bool is_nvdimm = object_dynamic_cast(obj, TYPE_NVDIMM);
> -        DeviceClass *dc = DEVICE_GET_CLASS(obj);
> -        DeviceState *dev = DEVICE(obj);
> -
> -        if (dev->id) {
> -            di->has_id = true;
> -            di->id = g_strdup(dev->id);
> -        }
> -        di->hotplugged = dev->hotplugged;
> -        di->hotpluggable = dc->hotpluggable;
> -        di->addr = dimm->addr;
> -        di->slot = dimm->slot;
> -        di->node = dimm->node;
> -        di->size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
> -        di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
> -
> -        if (!is_nvdimm) {
> -            info->u.dimm.data = di;
> -            info->type = MEMORY_DEVICE_INFO_KIND_DIMM;
> -        } else {
> -            info->u.nvdimm.data = di;
> -            info->type = MEMORY_DEVICE_INFO_KIND_NVDIMM;
> -        }
> -        elem->value = info;
> -        elem->next = NULL;
> -        if (prev) {
> -            prev->next = elem;
> -        } else {
> -            list = elem;
> -        }
> -        prev = elem;
> -    }
> -
> -    g_slist_free(dimms);
> -
> -    return list;
> -}
> -
>  uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
>                                 uint64_t address_space_size,
>                                 uint64_t *hint, uint64_t align, uint64_t size,
> @@ -445,10 +390,62 @@ static MemoryRegion *pc_dimm_get_vmstate_memory_region(PCDIMMDevice *dimm)
>      return host_memory_backend_get_memory(dimm->hostmem, &error_abort);
>  }
>  
> +static uint64_t pc_dimm_md_get_addr(MemoryDeviceState *md)
> +{
> +    PCDIMMDevice *dimm = PC_DIMM(md);
> +
> +    return dimm->addr;
> +}
> +
> +static uint64_t pc_dimm_md_get_region_size(MemoryDeviceState *md, Error **errp)
> +{
> +    PCDIMMDevice *dimm = PC_DIMM(md);
> +    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(md);
> +    MemoryRegion *mr;
> +
> +    mr = ddc->get_memory_region(dimm, errp);
> +    if (!mr) {
> +        return 0;
> +    }
> +
> +    return memory_region_size(mr);
> +}
> +
> +static void pc_dimm_md_fill_device_info(MemoryDeviceState *md,
> +                                        MemoryDeviceInfo *info)
> +{
> +    PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
> +    DeviceClass *dc = DEVICE_GET_CLASS(md);
> +    PCDIMMDevice *dimm = PC_DIMM(md);
> +    DeviceState *dev = DEVICE(md);
> +
> +    if (dev->id) {
> +        di->has_id = true;
> +        di->id = g_strdup(dev->id);
> +    }
> +    di->hotplugged = dev->hotplugged;
> +    di->hotpluggable = dc->hotpluggable;
> +    di->addr = dimm->addr;
> +    di->slot = dimm->slot;
> +    di->node = dimm->node;
> +    di->size = object_property_get_uint(OBJECT(dimm), PC_DIMM_SIZE_PROP,
> +                                        NULL);
> +    di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
> +
> +    if (object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM)) {
> +        info->u.nvdimm.data = di;
> +        info->type = MEMORY_DEVICE_INFO_KIND_NVDIMM;
> +    } else {
> +        info->u.dimm.data = di;
> +        info->type = MEMORY_DEVICE_INFO_KIND_DIMM;
> +    }
> +}
> +
>  static void pc_dimm_class_init(ObjectClass *oc, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(oc);
>      PCDIMMDeviceClass *ddc = PC_DIMM_CLASS(oc);
> +    MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(oc);
>  
>      dc->realize = pc_dimm_realize;
>      dc->unrealize = pc_dimm_unrealize;
> @@ -457,6 +454,12 @@ static void pc_dimm_class_init(ObjectClass *oc, void *data)
>  
>      ddc->get_memory_region = pc_dimm_get_memory_region;
>      ddc->get_vmstate_memory_region = pc_dimm_get_vmstate_memory_region;
> +
> +    mdc->get_addr = pc_dimm_md_get_addr;
> +    /* for a dimm plugged_size == region_size */
> +    mdc->get_plugged_size = pc_dimm_md_get_region_size;
> +    mdc->get_region_size = pc_dimm_md_get_region_size;
> +    mdc->fill_device_info = pc_dimm_md_fill_device_info;
>  }
>  
>  static TypeInfo pc_dimm_info = {
> @@ -466,6 +469,10 @@ static TypeInfo pc_dimm_info = {
>      .instance_init = pc_dimm_init,
>      .class_init    = pc_dimm_class_init,
>      .class_size    = sizeof(PCDIMMDeviceClass),
> +    .interfaces = (InterfaceInfo[]) {
> +        { TYPE_MEMORY_DEVICE },
> +        { }
> +    },
>  };
>  
>  static void pc_dimm_register_types(void)
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index a81570e7c8..a7428f7da7 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -74,6 +74,7 @@
>  #include "hw/compat.h"
>  #include "qemu/cutils.h"
>  #include "hw/ppc/spapr_cpu_core.h"
> +#include "hw/mem/memory-device.h"
>  
>  #include <libfdt.h>
>  
> @@ -722,7 +723,7 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>      }
>  
>      if (hotplug_lmb_start) {
> -        dimms = qmp_pc_dimm_device_list();
> +        dimms = qmp_memory_device_list();
>      }
>  
>      /* ibm,dynamic-memory */
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 16bccdd5c0..4cdae3ca3a 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -14,6 +14,7 @@
>  #include "kvm_ppc.h"
>  #include "hw/ppc/spapr_ovec.h"
>  #include "mmu-book3s-v3.h"
> +#include "hw/mem/memory-device.h"
>  
>  struct SPRSyncState {
>      int spr;
> diff --git a/include/hw/mem/memory-device.h b/include/hw/mem/memory-device.h
> new file mode 100644
> index 0000000000..3e498b2e61
> --- /dev/null
> +++ b/include/hw/mem/memory-device.h
> @@ -0,0 +1,44 @@
> +/*
> + * Memory Device Interface
> + *
> + * Copyright (c) 2018 Red Hat, Inc.
> + *
> + * Authors:
> + *  David Hildenbrand <david@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef MEMORY_DEVICE_H
> +#define MEMORY_DEVICE_H
> +
> +#include "qom/object.h"
> +#include "hw/qdev.h"
> +
> +#define TYPE_MEMORY_DEVICE "memory-device"
> +
> +#define MEMORY_DEVICE_CLASS(klass) \
> +     OBJECT_CLASS_CHECK(MemoryDeviceClass, (klass), TYPE_MEMORY_DEVICE)
> +#define MEMORY_DEVICE_GET_CLASS(obj) \
> +    OBJECT_GET_CLASS(MemoryDeviceClass, (obj), TYPE_MEMORY_DEVICE)
> +#define MEMORY_DEVICE(obj) \
> +     INTERFACE_CHECK(MemoryDeviceState, (obj), TYPE_MEMORY_DEVICE)
> +
> +typedef struct MemoryDeviceState {
> +    Object parent_obj;
> +} MemoryDeviceState;
> +
> +typedef struct MemoryDeviceClass {
> +    InterfaceClass parent_class;
> +
> +    uint64_t (*get_addr)(MemoryDeviceState *md);
> +    uint64_t (*get_plugged_size)(MemoryDeviceState *md, Error **errp);
> +    uint64_t (*get_region_size)(MemoryDeviceState *md, Error **errp);

Just not sure if we need the second argument 'Error **errp'? Or should
all function declarations have this?

> +    void (*fill_device_info)(MemoryDeviceState *md, MemoryDeviceInfo *info);
> +} MemoryDeviceClass;
> +
> +MemoryDeviceInfoList *qmp_memory_device_list(void);
> +uint64_t get_plugged_memory_size(void);
> +
> +#endif
> diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
> index 1fc479281c..e88073321f 100644
> --- a/include/hw/mem/pc-dimm.h
> +++ b/include/hw/mem/pc-dimm.h
> @@ -93,9 +93,7 @@ uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
>  
>  int pc_dimm_get_free_slot(const int *hint, int max_slots, Error **errp);
>  
> -MemoryDeviceInfoList *qmp_pc_dimm_device_list(void);
>  uint64_t pc_existing_dimms_capacity(Error **errp);
> -uint64_t get_plugged_memory_size(void);
>  void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
>                           MemoryRegion *mr, uint64_t align, Error **errp);
>  void pc_dimm_memory_unplug(DeviceState *dev, MemoryHotplugState *hpms,
> diff --git a/numa.c b/numa.c
> index 1116c90af9..6a0eaebc01 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -36,6 +36,7 @@
>  #include "hw/boards.h"
>  #include "sysemu/hostmem.h"
>  #include "hw/mem/pc-dimm.h"
> +#include "hw/mem/memory-device.h"
>  #include "qemu/option.h"
>  #include "qemu/config-file.h"
>  #include "qemu/cutils.h"
> @@ -520,7 +521,7 @@ void memory_region_allocate_system_memory(MemoryRegion *mr, Object *owner,
>  
>  static void numa_stat_memory_devices(NumaNodeMem node_mem[])
>  {
> -    MemoryDeviceInfoList *info_list = qmp_pc_dimm_device_list();
> +    MemoryDeviceInfoList *info_list = qmp_memory_device_list();
>      MemoryDeviceInfoList *info;
>      PCDIMMDeviceInfo     *pcdimm_info;
>  
> diff --git a/qmp.c b/qmp.c
> index f72261667f..3de029946a 100644
> --- a/qmp.c
> +++ b/qmp.c
> @@ -39,7 +39,7 @@
>  #include "qapi/qobject-input-visitor.h"
>  #include "hw/boards.h"
>  #include "qom/object_interfaces.h"
> -#include "hw/mem/pc-dimm.h"
> +#include "hw/mem/memory-device.h"
>  #include "hw/acpi/acpi_dev_interface.h"
>  
>  NameInfo *qmp_query_name(Error **errp)
> @@ -731,7 +731,7 @@ void qmp_object_del(const char *id, Error **errp)
>  
>  MemoryDeviceInfoList *qmp_query_memory_devices(Error **errp)
>  {
> -    return qmp_pc_dimm_device_list();
> +    return qmp_memory_device_list();
>  }
>  
>  ACPIOSTInfoList *qmp_query_acpi_ospm_status(Error **errp)
> diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
> index 2d59d84091..53d3f32cb2 100644
> --- a/stubs/Makefile.objs
> +++ b/stubs/Makefile.objs
> @@ -34,7 +34,7 @@ stub-obj-y += uuid.o
>  stub-obj-y += vm-stop.o
>  stub-obj-y += vmstate.o
>  stub-obj-$(CONFIG_WIN32) += fd-register.o
> -stub-obj-y += qmp_pc_dimm.o
> +stub-obj-y += qmp_memory_device.o
>  stub-obj-y += target-monitor-defs.o
>  stub-obj-y += target-get-monitor-def.o
>  stub-obj-y += pc_madt_cpu_entry.o
> diff --git a/stubs/qmp_pc_dimm.c b/stubs/qmp_memory_device.c
> similarity index 61%
> rename from stubs/qmp_pc_dimm.c
> rename to stubs/qmp_memory_device.c
> index b6b2cca89e..85ff8f2d7e 100644
> --- a/stubs/qmp_pc_dimm.c
> +++ b/stubs/qmp_memory_device.c
> @@ -1,8 +1,8 @@
>  #include "qemu/osdep.h"
>  #include "qom/object.h"
> -#include "hw/mem/pc-dimm.h"
> +#include "hw/mem/memory-device.h"
>  
> -MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
> +MemoryDeviceInfoList *qmp_memory_device_list(void)
>  {
>     return NULL;
>  }
> --
> 2.14.3
> 
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-22  4:58 ` [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice Pankaj Gupta
@ 2018-04-22  8:20   ` David Hildenbrand
  2018-04-23  4:58     ` Pankaj Gupta
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-04-22  8:20 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: qemu-devel, Eduardo Habkost, Michael S . Tsirkin,
	Markus Armbruster, qemu-s390x, qemu-ppc, Paolo Bonzini,
	Marcel Apfelbaum, Igor Mammedov, David Gibson, Richard Henderson

On 22.04.2018 06:58, Pankaj Gupta wrote:
> 
> Hello,
> 
> This is good refactoring, and it is needed for 'virtio-pmem' as well to
> reserve a memory region in the system address space.
> 
> I have tested this code with virtio-pmem and it's working fine. Thank you
> for the work.
> 
> I just have a small suggestion: when functions in the interface (like
> get_addr() or get_plugged_size()) are not provided by the derived class,
> QEMU crashes.
> 
> I think having a contract for must-override functions, with a NULL check
> and an error at the calling sites, would be better?

I expect all of these functions to be implemented. It's a contract
for devices that are mapped into the address space. We might later add
functions that are not required to be provided; those will be checked
for NULL.

So for the current set of functions, I don't think it makes sense to
make them optional.

Thanks!

> 
> Thanks,
> Pankaj  


-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface
  2018-04-22  4:26   ` David Gibson
@ 2018-04-22  8:21     ` David Hildenbrand
  2018-04-22 10:10       ` David Gibson
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-04-22  8:21 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Igor Mammedov,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Markus Armbruster, qemu-ppc, Pankaj Gupta

On 22.04.2018 06:26, David Gibson wrote:
> On Fri, Apr 20, 2018 at 02:34:54PM +0200, David Hildenbrand wrote:
>> On the qmp level, we already have the concept of memory devices:
>>     "query-memory-devices"
>> Right now, we only support NVDIMM and PCDIMM.
>>
>> We want to map other devices later into the address space of the guest.
>> Such device could e.g. be virtio devices. These devices will have a
>> guest memory range assigned but won't be exposed via e.g. ACPI. We want
>> to make them look like memory device, but not glued to pc-dimm.
>>
>> Especially, it will not always be possible to have TYPE_PC_DIMM as a parent
>> class (e.g. virtio devices). Let's use an interface instead. As a first
>> part, convert handling of
>> - qmp_pc_dimm_device_list
>> - get_plugged_memory_size
>> to our new model. plug/unplug stuff etc. will follow later.
>>
>> A memory device will have to provide the following functions:
>> - get_addr(): Necessary, as the property "addr" can e.g. not be used for
>>               virtio devices (already defined).
>> - get_plugged_size(): The amount this device offers to the guest as of
>>                       now.
>> - get_region_size(): Because this can later on be bigger than the
>>                      plugged size.
>> - fill_device_info(): Fill MemoryDeviceInfo, e.g. for qmp.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> with the exception of some tiny nits..
> 
> [snip]
>> +static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
>> +{
>> +    MemoryDeviceState *md_a = MEMORY_DEVICE(a);
>> +    MemoryDeviceState *md_b = MEMORY_DEVICE(b);
> 
> These probably should be const MemoryDeviceState *.

Yes, that could work (it depends on what the functions called
internally expect). Will look into it.

> 
>> +    MemoryDeviceClass *mdc_a = MEMORY_DEVICE_GET_CLASS(a);
>> +    MemoryDeviceClass *mdc_b = MEMORY_DEVICE_GET_CLASS(b);
>> +    const uint64_t addr_a = mdc_a->get_addr(md_a);
>> +    const uint64_t addr_b = mdc_b->get_addr(md_b);
>> +
>> +    if (addr_a > addr_b) {
>> +        return 1;
>> +    } else if (addr_a < addr_b) {
>> +        return -1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int memory_device_built_list(Object *obj, void *opaque)
> 
> s/built/build/ will read a bit more clearly I think.
> 

Indeed, thanks!

-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface
  2018-04-22  5:09   ` Pankaj Gupta
@ 2018-04-22  8:26     ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-04-22  8:26 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: qemu-devel, Eduardo Habkost, Michael S . Tsirkin,
	Markus Armbruster, qemu-s390x, qemu-ppc, Paolo Bonzini,
	Marcel Apfelbaum, Igor Mammedov, David Gibson, Richard Henderson

>> +typedef struct MemoryDeviceClass {
>> +    InterfaceClass parent_class;
>> +
>> +    uint64_t (*get_addr)(MemoryDeviceState *md);
>> +    uint64_t (*get_plugged_size)(MemoryDeviceState *md, Error **errp);
>> +    uint64_t (*get_region_size)(MemoryDeviceState *md, Error **errp);
> 
> Just not sure if we need the second argument 'Error **errp'? Or should
> all function declarations have this?

We need them right now due to the existing implementation of PCDIMM. I
don't think we will ever hit such a case (unless something else is
seriously wrong). If nobody has a problem with it, I'll drop these two
parameters and directly use &error_abort internally (PCDIMM).

Adding Error **errp to functions where we don't expect errors
doesn't feel right.

Thanks for pointing this out!

-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface
  2018-04-22  8:21     ` David Hildenbrand
@ 2018-04-22 10:10       ` David Gibson
  2018-04-23  9:52         ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: David Gibson @ 2018-04-22 10:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Igor Mammedov,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Markus Armbruster, qemu-ppc, Pankaj Gupta

[-- Attachment #1: Type: text/plain, Size: 3096 bytes --]

On Sun, Apr 22, 2018 at 10:21:34AM +0200, David Hildenbrand wrote:
> On 22.04.2018 06:26, David Gibson wrote:
> > On Fri, Apr 20, 2018 at 02:34:54PM +0200, David Hildenbrand wrote:
> >> On the qmp level, we already have the concept of memory devices:
> >>     "query-memory-devices"
> >> Right now, we only support NVDIMM and PCDIMM.
> >>
> >> We want to map other devices later into the address space of the guest.
> >> Such device could e.g. be virtio devices. These devices will have a
> >> guest memory range assigned but won't be exposed via e.g. ACPI. We want
> >> to make them look like memory device, but not glued to pc-dimm.
> >>
> >> Especially, it will not always be possible to have TYPE_PC_DIMM as a parent
> >> class (e.g. virtio devices). Let's use an interface instead. As a first
> >> part, convert handling of
> >> - qmp_pc_dimm_device_list
> >> - get_plugged_memory_size
> >> to our new model. plug/unplug stuff etc. will follow later.
> >>
> >> A memory device will have to provide the following functions:
> >> - get_addr(): Necessary, as the property "addr" can e.g. not be used for
> >>               virtio devices (already defined).
> >> - get_plugged_size(): The amount this device offers to the guest as of
> >>                       now.
> >> - get_region_size(): Because this can later on be bigger than the
> >>                      plugged size.
> >> - fill_device_info(): Fill MemoryDeviceInfo, e.g. for qmp.
> >>
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> > 
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > with the exception of some tiny nits..
> > 
> > [snip]
> >> +static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
> >> +{
> >> +    MemoryDeviceState *md_a = MEMORY_DEVICE(a);
> >> +    MemoryDeviceState *md_b = MEMORY_DEVICE(b);
> > 
> > These probably should be const MemoryDeviceState *.
> 
> Yes, that could work (it depends on what the functions called
> internally expect). Will look into it.

Well, the only thing being called here is the ->get_addr() hook, which
you can redefine if necessary.  Discarding the const in the passed in
pointers seems like bad idea, even though it won't explicitly warn due
to the existing MEMORY_DEVICE() cast.

> >> +    MemoryDeviceClass *mdc_a = MEMORY_DEVICE_GET_CLASS(a);
> >> +    MemoryDeviceClass *mdc_b = MEMORY_DEVICE_GET_CLASS(b);
> >> +    const uint64_t addr_a = mdc_a->get_addr(md_a);
> >> +    const uint64_t addr_b = mdc_b->get_addr(md_b);
> >> +
> >> +    if (addr_a > addr_b) {
> >> +        return 1;
> >> +    } else if (addr_a < addr_b) {
> >> +        return -1;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +static int memory_device_built_list(Object *obj, void *opaque)
> > 
> > s/built/build/ will read a bit more clearly I think.
> > 
> 
> Indeed, thanks!
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v3 2/3] machine: make MemoryHotplugState accessible via the machine
  2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 2/3] machine: make MemoryHotplugState accessible via the machine David Hildenbrand
@ 2018-04-23  3:28   ` David Gibson
  2018-04-23  9:36     ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: David Gibson @ 2018-04-23  3:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Igor Mammedov,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Markus Armbruster, qemu-ppc, Pankaj Gupta

[-- Attachment #1: Type: text/plain, Size: 5940 bytes --]

On Fri, Apr 20, 2018 at 02:34:55PM +0200, David Hildenbrand wrote:
> Let's allow querying the MemoryHotplugState from the machine.
> 
> This allows us to generically detect if a certain machine has support
> for memory devices, and to generically manage it (find free address
> range, plug/unplug a memory region).
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

So, we're creating a hook where it seems very likely that the only
implementations will simply retrieve the right field from the
machine-specific structure.

So.. should we instead just move the hotplug_memory structure to the
base MachineState type?

> ---
>  hw/i386/pc.c             | 12 ++++++++++++
>  hw/ppc/spapr.c           | 12 ++++++++++++
>  include/hw/boards.h      | 16 ++++++++++++++++
>  include/hw/mem/pc-dimm.h | 12 +-----------
>  4 files changed, 41 insertions(+), 11 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index d36bac8c89..fa8862af33 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -2337,6 +2337,17 @@ static void x86_nmi(NMIState *n, int cpu_index, Error **errp)
>      }
>  }
>  
> +static MemoryHotplugState *pc_machine_get_memory_hotplug_state(MachineState *ms)
> +{
> +    PCMachineState *pcms = PC_MACHINE(ms);
> +
> +    if (!pcms->hotplug_memory.base) {
> +        /* hotplug not supported */
> +        return NULL;
> +    }
> +    return &pcms->hotplug_memory;
> +}
> +
>  static void pc_machine_class_init(ObjectClass *oc, void *data)
>  {
>      MachineClass *mc = MACHINE_CLASS(oc);
> @@ -2376,6 +2387,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>      hc->unplug = pc_machine_device_unplug_cb;
>      nc->nmi_monitor_handler = x86_nmi;
>      mc->default_cpu_type = TARGET_DEFAULT_CPU_TYPE;
> +    mc->get_memory_hotplug_state = pc_machine_get_memory_hotplug_state;
>  
>      object_class_property_add(oc, PC_MACHINE_MEMHP_REGION_SIZE, "int",
>          pc_machine_get_hotplug_memory_region_size, NULL,
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index a7428f7da7..7ccdb705b3 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -3868,6 +3868,17 @@ PowerPCCPU *spapr_find_cpu(int vcpu_id)
>      return NULL;
>  }
>  
> +static MemoryHotplugState *spapr_get_memory_hotplug_state(MachineState *ms)
> +{
> +    sPAPRMachineState *spapr = SPAPR_MACHINE(ms);
> +
> +    if (!spapr->hotplug_memory.base) {
> +        /* hotplug not supported */
> +        return NULL;
> +    }
> +    return &spapr->hotplug_memory;
> +}
> +
>  static void spapr_machine_class_init(ObjectClass *oc, void *data)
>  {
>      MachineClass *mc = MACHINE_CLASS(oc);
> @@ -3927,6 +3938,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>       * in which LMBs are represented and hot-added
>       */
>      mc->numa_mem_align_shift = 28;
> +    mc->get_memory_hotplug_state = spapr_get_memory_hotplug_state;
>  
>      smc->default_caps.caps[SPAPR_CAP_HTM] = SPAPR_CAP_OFF;
>      smc->default_caps.caps[SPAPR_CAP_VSX] = SPAPR_CAP_ON;
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index a609239112..447bdc7219 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -105,6 +105,17 @@ typedef struct {
>      CPUArchId cpus[0];
>  } CPUArchIdList;
>  
> +/**
> + * MemoryHotplugState:
> + * @base: address in guest physical address space where hotplug memory
> + * address space begins.
> + * @mr: hotplug memory address space container
> + */
> +typedef struct MemoryHotplugState {
> +    hwaddr base;
> +    MemoryRegion mr;
> +} MemoryHotplugState;
> +
>  /**
>   * MachineClass:
>   * @max_cpus: maximum number of CPUs supported. Default: 1
> @@ -156,6 +167,10 @@ typedef struct {
>   *    should instead use "unimplemented-device" for all memory ranges where
>   *    the guest will attempt to probe for a device that QEMU doesn't
>   *    implement and a stub device is required.
> + * @get_memory_hotplug_state:
> + *    If a machine supports memory hotplug, the returned data contains
> + *    information about the portion of guest physical address space
> + *    where memory devices can be mapped to (e.g. to hotplug a pc-dimm).
>   */
>  struct MachineClass {
>      /*< private >*/
> @@ -212,6 +227,7 @@ struct MachineClass {
>                                                           unsigned cpu_index);
>      const CPUArchIdList *(*possible_cpu_arch_ids)(MachineState *machine);
>      int64_t (*get_default_cpu_node_id)(const MachineState *ms, int idx);
> +    MemoryHotplugState *(*get_memory_hotplug_state)(MachineState *ms);
>  };
>  
>  /**
> diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
> index e88073321f..8bda37adab 100644
> --- a/include/hw/mem/pc-dimm.h
> +++ b/include/hw/mem/pc-dimm.h
> @@ -19,6 +19,7 @@
>  #include "exec/memory.h"
>  #include "sysemu/hostmem.h"
>  #include "hw/qdev.h"
> +#include "hw/boards.h"
>  
>  #define TYPE_PC_DIMM "pc-dimm"
>  #define PC_DIMM(obj) \
> @@ -75,17 +76,6 @@ typedef struct PCDIMMDeviceClass {
>      MemoryRegion *(*get_vmstate_memory_region)(PCDIMMDevice *dimm);
>  } PCDIMMDeviceClass;
>  
> -/**
> - * MemoryHotplugState:
> - * @base: address in guest physical address space where hotplug memory
> - * address space begins.
> - * @mr: hotplug memory address space container
> - */
> -typedef struct MemoryHotplugState {
> -    hwaddr base;
> -    MemoryRegion mr;
> -} MemoryHotplugState;
> -
>  uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
>                                 uint64_t address_space_size,
>                                 uint64_t *hint, uint64_t align, uint64_t size,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-22  8:20   ` David Hildenbrand
@ 2018-04-23  4:58     ` Pankaj Gupta
  0 siblings, 0 replies; 49+ messages in thread
From: Pankaj Gupta @ 2018-04-23  4:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Eduardo Habkost, Michael S . Tsirkin,
	Markus Armbruster, qemu-s390x, qemu-ppc, Paolo Bonzini,
	Marcel Apfelbaum, Igor Mammedov, David Gibson, Richard Henderson


> > 
> > Hello,
> > 
> > This is good refactoring, and it is needed for 'virtio-pmem' as well to
> > reserve a memory region in the system address space.
> > 
> > I have tested this code with virtio-pmem and it's working fine. Thank you
> > for the work.
> > 
> > I just have a small suggestion: when functions in the interface (like
> > get_addr() or get_plugged_size()) are not provided by the derived class,
> > QEMU crashes.
> > 
> > I think having a contract for must-override functions, with a NULL check
> > and an error at the calling sites, would be better?
> 
> I expect all of these functions to be implemented. It's a contract
> for devices that are mapped into the address space. We might later add
> functions that are not required to be provided; those will be checked
> for NULL.
> 
> So for the current set of functions, I don't think it makes sense to
> make them optional.

o.k. that's reasonable.

Thanks,
Pankaj
> 
> Thanks!
> 
> > 
> > Thanks,
> > Pankaj
> 
> 
> --
> 
> Thanks,
> 
> David / dhildenb
> 


* Re: [Qemu-devel] [PATCH v3 2/3] machine: make MemoryHotplugState accessible via the machine
  2018-04-23  3:28   ` David Gibson
@ 2018-04-23  9:36     ` David Hildenbrand
  2018-04-23 10:44       ` David Gibson
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-04-23  9:36 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Igor Mammedov,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Markus Armbruster, qemu-ppc, Pankaj Gupta

On 23.04.2018 05:28, David Gibson wrote:
> On Fri, Apr 20, 2018 at 02:34:55PM +0200, David Hildenbrand wrote:
>> Let's allow querying the MemoryHotplugState from the machine.
>>
>> This allows us to generically detect if a certain machine has support
>> for memory devices, and to generically manage it (find free address
>> range, plug/unplug a memory region).
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> So, we're creating a hook where it seems very likely that the only
> implementations will simply retrieve the right field from the
> machine-specific structure.
> 
> So.. should we instead just move the hotplug_memory structure to the
> base MachineState type?

It allows us, in patch #3, to report different error messages:

"Not supported" vs. "Not enabled (maxmem)".

We could also handle that via a simple boolean flag inside the
struct. What do you think?


-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface
  2018-04-22 10:10       ` David Gibson
@ 2018-04-23  9:52         ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-04-23  9:52 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Igor Mammedov,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Markus Armbruster, qemu-ppc, Pankaj Gupta

On 22.04.2018 12:10, David Gibson wrote:
> On Sun, Apr 22, 2018 at 10:21:34AM +0200, David Hildenbrand wrote:
>> On 22.04.2018 06:26, David Gibson wrote:
>>> On Fri, Apr 20, 2018 at 02:34:54PM +0200, David Hildenbrand wrote:
>>>> On the qmp level, we already have the concept of memory devices:
>>>>     "query-memory-devices"
>>>> Right now, we only support NVDIMM and PCDIMM.
>>>>
>>>> We want to map other devices later into the address space of the guest.
>>>> Such device could e.g. be virtio devices. These devices will have a
>>>> guest memory range assigned but won't be exposed via e.g. ACPI. We want
>>>> to make them look like memory device, but not glued to pc-dimm.
>>>>
>>>> Especially, it will not always be possible to have TYPE_PC_DIMM as a parent
>>>> class (e.g. virtio devices). Let's use an interface instead. As a first
>>>> part, convert handling of
>>>> - qmp_pc_dimm_device_list
>>>> - get_plugged_memory_size
>>>> to our new model. plug/unplug stuff etc. will follow later.
>>>>
>>>> A memory device will have to provide the following functions:
>>>> - get_addr(): Necessary, as the property "addr" can e.g. not be used for
>>>>               virtio devices (already defined).
>>>> - get_plugged_size(): The amount this device offers to the guest as of
>>>>                       now.
>>>> - get_region_size(): Because this can later on be bigger than the
>>>>                      plugged size.
>>>> - fill_device_info(): Fill MemoryDeviceInfo, e.g. for qmp.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>
>>> with the exception of some tiny nits..
>>>
>>> [snip]
>>>> +static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
>>>> +{
>>>> +    MemoryDeviceState *md_a = MEMORY_DEVICE(a);
>>>> +    MemoryDeviceState *md_b = MEMORY_DEVICE(b);
>>>
>>> These probably should be const MemoryDeviceState *.
>>
>> Yes, that could work (it depends on what the functions called
>> internally expect). Will look into it.
> 
> Well, the only thing being called here is the ->get_addr() hook, which
> you can redefine if necessary.  Discarding the const in the passed in
> pointers seems like bad idea, even though it won't explicitly warn due
> to the existing MEMORY_DEVICE() cast.

I converted all MemoryDevice functions to take "const MemoryDeviceState *md".

The only problematic part now is

static uint64_t pc_dimm_md_get_region_size(const MemoryDeviceState *md)

It calls "ddc->get_memory_region(dimm, &error_abort);", and it looks
like we cannot convert this to eat a const DIMM pointer (as it has to
return a !const pointer to the memory region).

So we will actually drop the const here - which is fine because we don't
modify anything.
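The problem can be reduced to a few lines (toy types for illustration, not the QEMU code): a getter that hands out a mutable pointer into the object cannot honestly take a const object pointer, so the const interface hook has to cast the qualifier away internally:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: a MemoryRegion and the DIMM that owns it. */
typedef struct MemoryRegion { uint64_t size; } MemoryRegion;
typedef struct PCDIMMDevice { MemoryRegion mr; } PCDIMMDevice;

/* Returns a non-const pointer into the device, so it cannot take const... */
static MemoryRegion *dimm_get_memory_region(PCDIMMDevice *dimm)
{
    return &dimm->mr;
}

/* ...hence the const interface hook drops const internally.  That is fine
 * here because nothing is actually modified through the pointer. */
static uint64_t dimm_md_get_region_size(const PCDIMMDevice *md)
{
    PCDIMMDevice *dimm = (PCDIMMDevice *)md; /* deliberate const cast */

    return dimm_get_memory_region(dimm)->size;
}
```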

> 
>>>> +    MemoryDeviceClass *mdc_a = MEMORY_DEVICE_GET_CLASS(a);
>>>> +    MemoryDeviceClass *mdc_b = MEMORY_DEVICE_GET_CLASS(b);
>>>> +    const uint64_t addr_a = mdc_a->get_addr(md_a);
>>>> +    const uint64_t addr_b = mdc_b->get_addr(md_b);
>>>> +
>>>> +    if (addr_a > addr_b) {
>>>> +        return 1;
>>>> +    } else if (addr_a < addr_b) {
>>>> +        return -1;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int memory_device_built_list(Object *obj, void *opaque)
>>>
>>> s/built/build/ will read a bit more clearly I think.
>>>
>>
>> Indeed, thanks!
>>
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/3] machine: make MemoryHotplugState accessible via the machine
  2018-04-23  9:36     ` David Hildenbrand
@ 2018-04-23 10:44       ` David Gibson
  2018-04-23 11:11         ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: David Gibson @ 2018-04-23 10:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Igor Mammedov,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Markus Armbruster, qemu-ppc, Pankaj Gupta

On Mon, Apr 23, 2018 at 11:36:48AM +0200, David Hildenbrand wrote:
> On 23.04.2018 05:28, David Gibson wrote:
> > On Fri, Apr 20, 2018 at 02:34:55PM +0200, David Hildenbrand wrote:
> >> Let's allow querying the MemoryHotplugState from the machine.
> >>
> >> This allows us to generically detect if a certain machine has support
> >> for memory devices, and to generically manage it (find free address
> >> range, plug/unplug a memory region).
> >>
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> > 
> > So, we're creating a hook where it seems very likely that the only
> > implementations will simply be to retrieve the right field from the
> > machine specific structure.
> > 
> > So.. should we instead just move the hotplug_memory structure to the
> > base MachineState type?
> 
> It allows us in patch #3 to report different error messages.
> 
> "Not supported" vs. "Not enabled (maxmem)".
> 
> We could also handle that via a simple boolean flag inside of the
> struct. What do you think?

A third option would be to make it a pointer, rather than directly
embedded in the MachineState.  That would also avoid the extra
allocation for machines that don't use it.
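A minimal sketch of that third option (hypothetical names; the field is called `device_memory` here purely for illustration): with a pointer member, "not enabled" naturally falls out of a NULL check, and machines that never enable it pay nothing:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct MemoryHotplugState { uint64_t base, size; } MemoryHotplugState;

typedef struct MachineState {
    MemoryHotplugState *device_memory; /* NULL unless the machine enables it */
} MachineState;

/* Machines that support memory devices allocate the state at init time. */
static int machine_enable_device_memory(MachineState *ms,
                                        uint64_t base, uint64_t size)
{
    ms->device_memory = calloc(1, sizeof(*ms->device_memory));
    if (!ms->device_memory) {
        return -1;
    }
    ms->device_memory->base = base;
    ms->device_memory->size = size;
    return 0;
}

/* "not enabled / not supported" is simply a NULL check. */
static int machine_supports_memory_devices(const MachineState *ms)
{
    return ms->device_memory != NULL;
}
```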

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v3 2/3] machine: make MemoryHotplugState accessible via the machine
  2018-04-23 10:44       ` David Gibson
@ 2018-04-23 11:11         ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-04-23 11:11 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Igor Mammedov,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Markus Armbruster, qemu-ppc, Pankaj Gupta

On 23.04.2018 12:44, David Gibson wrote:
> On Mon, Apr 23, 2018 at 11:36:48AM +0200, David Hildenbrand wrote:
>> On 23.04.2018 05:28, David Gibson wrote:
>>> On Fri, Apr 20, 2018 at 02:34:55PM +0200, David Hildenbrand wrote:
>>>> Let's allow querying the MemoryHotplugState from the machine.
>>>>
>>>> This allows us to generically detect if a certain machine has support
>>>> for memory devices, and to generically manage it (find free address
>>>> range, plug/unplug a memory region).
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> So, we're creating a hook where it seems very likely that the only
>>> implementations will simply be to retrieve the right field from the
>>> machine specific structure.
>>>
>>> So.. should we instead just move the hotplug_memory structure to the
>>> base MachineState type?
>>
>> It allows us in patch #3 to report different error messages.
>>
>> "Not supported" vs. "Not enabled (maxmem)".
>>
>> We could also handle that via a simple boolean flag inside of the
>> struct. What do you think?
> 
> A third option would be to make it a pointer, rather than directly
> embedded in the MachineState.  That would also avoid the extra
> allocation for machines that don't use it.
> 

That sounds like a good alternative, will look into it! Thanks!

-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code David Hildenbrand
@ 2018-04-23 12:19   ` Igor Mammedov
  2018-04-23 12:44     ` David Hildenbrand
                       ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Igor Mammedov @ 2018-04-23 12:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost, David Gibson,
	Markus Armbruster, qemu-ppc, Pankaj Gupta

On Fri, 20 Apr 2018 14:34:56 +0200
David Hildenbrand <david@redhat.com> wrote:

> To be able to reuse MemoryDevice logic from other devices besides
> pc-dimm, factor the relevant stuff out into the MemoryDevice code.
> 
> As we don't care about slots for memory devices that are not pc-dimm,
> don't factor that part out.
that's not really true; you still consume kvm and vhost slots (whatever it is)
whenever you map the device into the address space as a ram memory region.

Also, ram_slots are currently (ab)used as a flag that the user enabled
memory hotplug via the CLI.
 
> Most of this patch just moves checks and logic around. While at it, make
> the code detect certain error conditions (e.g. fragmented memory) more
> reliably.
I'd suggest splitting the patch into several smaller ones if possible,
especially the parts that do anything more than just move code around.


> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/i386/pc.c                   |  12 +--
>  hw/mem/memory-device.c         | 162 ++++++++++++++++++++++++++++++++++++
>  hw/mem/pc-dimm.c               | 185 +++--------------------------------------
>  hw/ppc/spapr.c                 |   9 +-
>  include/hw/mem/memory-device.h |   4 +
>  include/hw/mem/pc-dimm.h       |  14 +---
>  6 files changed, 185 insertions(+), 201 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index fa8862af33..1c25546a0c 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1711,7 +1711,7 @@ static void pc_dimm_plug(HotplugHandler *hotplug_dev,
>          goto out;
>      }
>  
> -    pc_dimm_memory_plug(dev, &pcms->hotplug_memory, mr, align, &local_err);
> +    pc_dimm_memory_plug(dev, align, &local_err);
Is there a reason why you are dropping the pcms->hotplug_memory argument
and falling back to qdev_get_machine()?

I'd rather see it going in the other direction,
i.e. move hotplug_memory from the PC
machine to MachineState and then pass it down as an argument whenever it's needed.

>      if (local_err) {
>          goto out;
>      }
> @@ -1761,17 +1761,9 @@ static void pc_dimm_unplug(HotplugHandler *hotplug_dev,
>                             DeviceState *dev, Error **errp)
>  {
>      PCMachineState *pcms = PC_MACHINE(hotplug_dev);
> -    PCDIMMDevice *dimm = PC_DIMM(dev);
> -    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
> -    MemoryRegion *mr;
>      HotplugHandlerClass *hhc;
>      Error *local_err = NULL;
>  
> -    mr = ddc->get_memory_region(dimm, &local_err);
> -    if (local_err) {
> -        goto out;
> -    }
> -
>      hhc = HOTPLUG_HANDLER_GET_CLASS(pcms->acpi_dev);
>      hhc->unplug(HOTPLUG_HANDLER(pcms->acpi_dev), dev, &local_err);
>  
> @@ -1779,7 +1771,7 @@ static void pc_dimm_unplug(HotplugHandler *hotplug_dev,
>          goto out;
>      }
>  
> -    pc_dimm_memory_unplug(dev, &pcms->hotplug_memory, mr);
> +    pc_dimm_memory_unplug(dev);
ditto

>      object_unparent(OBJECT(dev));
>  
>   out:
> diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
> index b860c9c582..b96efa3bf4 100644
> --- a/hw/mem/memory-device.c
> +++ b/hw/mem/memory-device.c
> @@ -15,6 +15,8 @@
>  #include "qapi/error.h"
>  #include "hw/boards.h"
>  #include "qemu/range.h"
> +#include "hw/virtio/vhost.h"
> +#include "sysemu/kvm.h"
>  
>  static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
>  {
> @@ -106,6 +108,166 @@ uint64_t get_plugged_memory_size(void)
>      return size;
>  }
>  
> +static int memory_device_used_region_size_internal(Object *obj, void *opaque)
> +{
> +    uint64_t *size = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
> +        DeviceState *dev = DEVICE(obj);
> +        MemoryDeviceState *md = MEMORY_DEVICE(obj);
> +        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj);
> +
> +        if (dev->realized) {
> +            *size += mdc->get_region_size(md, &error_abort);
> +        }
> +    }
> +
> +    object_child_foreach(obj, memory_device_used_region_size_internal, opaque);
> +    return 0;
> +}
> +
> +static uint64_t memory_device_used_region_size(void)
> +{
> +    uint64_t size = 0;
> +
> +    memory_device_used_region_size_internal(qdev_get_machine(), &size);
> +
> +    return size;
> +}
> +
> +uint64_t memory_device_get_free_addr(uint64_t *hint, uint64_t align,
> +                                     uint64_t size, Error **errp)
I'd suggest reworking pc_dimm_memory_plug/pc_dimm_get_free_addr first:
most of the stuff they do, like checks and assigning default values,
should go to a pre_plug (pre-realize) handler, so that only the
actual mapping is left for the plug (post-realize) handler to deal with:

    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);               
    vmstate_register_ram(vmstate_mr, dev); 

> +{
> +    const uint64_t used_region_size = memory_device_used_region_size();
> +    uint64_t address_space_start, address_space_end;
> +    MachineState *machine = MACHINE(qdev_get_machine());
> +    MachineClass *mc = MACHINE_GET_CLASS(machine);
> +    MemoryHotplugState *hpms;
> +    GSList *list = NULL, *item;
> +    uint64_t new_addr = 0;
> +
> +    if (!mc->get_memory_hotplug_state) {
> +        error_setg(errp, "memory devices (e.g. for memory hotplug) are not "
> +                         "supported by the machine");
> +        return 0;
> +    }
> +
> +    hpms = mc->get_memory_hotplug_state(machine);
> +    if (!hpms || !memory_region_size(&hpms->mr)) {
> +        error_setg(errp, "memory devices (e.g. for memory hotplug) are not "
> +                         "enabled, please specify the maxmem option");
> +        return 0;
> +    }
the above 2 checks are repeated multiple times; a separate helper to do it
would be better.

PS:
there is no need to check for this every time a device is plugged;
doing the check once (at machine_init time) is sufficient.
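Such a helper might look like the following toy sketch (hypothetical names, error reporting simplified to a plain string instead of QEMU's Error **): both error cases collapse into one lookup that callers use everywhere.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef struct MemoryHotplugState { uint64_t base, size; } MemoryHotplugState;

typedef struct MachineState MachineState;
struct MachineState {
    /* class hook; NULL if the machine knows nothing about memory devices */
    MemoryHotplugState *(*get_memory_hotplug_state)(MachineState *ms);
    MemoryHotplugState hpms;
};

/* toy implementation of the hook, for the test below */
static MemoryHotplugState *toy_get_hpms(MachineState *ms)
{
    return &ms->hpms;
}

/* One place for both checks; returns NULL and sets *errp on failure. */
static MemoryHotplugState *memory_device_get_hotplug_state(MachineState *ms,
                                                           const char **errp)
{
    MemoryHotplugState *hpms;

    if (!ms->get_memory_hotplug_state) {
        *errp = "memory devices are not supported by the machine";
        return NULL;
    }
    hpms = ms->get_memory_hotplug_state(ms);
    if (!hpms || !hpms->size) {
        *errp = "memory devices are not enabled, please specify maxmem";
        return NULL;
    }
    return hpms;
}
```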

> +    address_space_start = hpms->base;
> +    address_space_end = address_space_start + memory_region_size(&hpms->mr);
> +    g_assert(address_space_end >= address_space_start);
> +
> +    if (used_region_size + size > machine->maxram_size - machine->ram_size) {
> +        error_setg(errp, "not enough space, currently 0x%" PRIx64
> +                   " in use of total hot pluggable 0x" RAM_ADDR_FMT,
> +                   used_region_size, machine->maxram_size - machine->ram_size);
> +        return 0;
> +    }
> +
> +    if (hint && QEMU_ALIGN_UP(*hint, align) != *hint) {
> +        error_setg(errp, "address must be aligned to 0x%" PRIx64 " bytes",
> +                   align);
> +        return 0;
> +    }
> +
> +    if (QEMU_ALIGN_UP(size, align) != size) {
> +        error_setg(errp, "backend memory size must be multiple of 0x%"
> +                   PRIx64, align);
> +        return 0;
> +    }
> +
> +    if (hint) {
> +        new_addr = *hint;
> +        if (new_addr < address_space_start) {
> +            error_setg(errp, "can't add memory [0x%" PRIx64 ":0x%" PRIx64
> +                       "] at 0x%" PRIx64, new_addr, size, address_space_start);
> +            return 0;
> +        } else if ((new_addr + size) > address_space_end) {
> +            error_setg(errp, "can't add memory [0x%" PRIx64 ":0x%" PRIx64
> +                       "] beyond 0x%" PRIx64, new_addr, size,
> +                       address_space_end);
> +            return 0;
> +        }
> +    } else {
> +        new_addr = address_space_start;
> +    }
> +
> +    /* find address range that will fit new memory device */
> +    object_child_foreach(qdev_get_machine(), memory_device_built_list, &list);
> +    for (item = list; item; item = g_slist_next(item)) {
> +        MemoryDeviceState *md = item->data;
> +        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(OBJECT(md));
> +        uint64_t md_size, md_addr;
> +
> +        md_addr = mdc->get_addr(md);
> +        md_size = mdc->get_region_size(md, errp);
> +        if (*errp) {
> +            goto out;
> +        }
> +
> +        if (ranges_overlap(md_addr, md_size, new_addr, size)) {
> +            if (hint) {
> +                DeviceState *d = DEVICE(md);
> +                error_setg(errp, "address range conflicts with '%s'", d->id);
> +                goto out;
> +            }
> +            new_addr = QEMU_ALIGN_UP(md_addr + md_size, align);
> +        }
> +    }
> +
> +    if (new_addr + size > address_space_end) {
> +        error_setg(errp, "could not find position in guest address space for "
> +                   "memory device - memory fragmented due to alignments");
> +        goto out;
> +    }
> +out:
> +    g_slist_free(list);
> +    return new_addr;
> +}
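The search loop in memory_device_get_free_addr() above can be modeled in isolation. This toy version (plain C, not the QEMU code) assumes the used list is sorted by address, as memory_device_built_list() guarantees; the `TOY_*` names are stand-ins:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

typedef struct Range { uint64_t addr, size; } Range;

/* same semantics as QEMU's ranges_overlap(): true if the ranges intersect */
static int toy_ranges_overlap(uint64_t a, uint64_t as, uint64_t b, uint64_t bs)
{
    return a < b + bs && b < a + as;
}

/* First-fit search mirroring the loop above: start at the bottom of the
 * hotplug region and hop over each plugged device, re-aligning after every
 * collision.  'used' must be sorted by address; returns 0 if nothing fits. */
static uint64_t toy_find_free_addr(const Range *used, size_t n,
                                   uint64_t start, uint64_t end,
                                   uint64_t align, uint64_t size)
{
    uint64_t new_addr = TOY_ALIGN_UP(start, align);
    size_t i;

    for (i = 0; i < n; i++) {
        if (toy_ranges_overlap(used[i].addr, used[i].size, new_addr, size)) {
            new_addr = TOY_ALIGN_UP(used[i].addr + used[i].size, align);
        }
    }
    return new_addr + size > end ? 0 : new_addr;
}
```

Note that the final bounds check is what turns fragmentation (holes too small after alignment) into an error instead of silently mapping past the region.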
> +
> +void memory_device_plug_region(MemoryRegion *mr, uint64_t addr, Error **errp)
> +{
[...]

> +    /* we will need a new memory slot for kvm and vhost */
> +    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
> +        error_setg(errp, "hypervisor has no free memory slots left");
> +        return;
> +    }
> +    if (!vhost_has_free_slot()) {
> +        error_setg(errp, "a used vhost backend has no free memory slots left");
> +        return;
> +    }
move these checks to pre_plug time

> +
> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
missing vmstate registration?

> +}
> +
> +void memory_device_unplug_region(MemoryRegion *mr)
> +{
> +    MachineState *machine = MACHINE(qdev_get_machine());
> +    MachineClass *mc = MACHINE_GET_CLASS(machine);
> +    MemoryHotplugState *hpms;
> +
> +    /* we expect a previous call to memory_device_get_free_addr() */
> +    g_assert(mc->get_memory_hotplug_state);
> +    hpms = mc->get_memory_hotplug_state(machine);
> +    g_assert(hpms);
> +
> +    memory_region_del_subregion(&hpms->mr, mr);
> +}
> +
>  static const TypeInfo memory_device_info = {
>      .name          = TYPE_MEMORY_DEVICE,
>      .parent        = TYPE_INTERFACE,
> diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
> index 1dbf699e02..cf23ab5d76 100644
> --- a/hw/mem/pc-dimm.c
> +++ b/hw/mem/pc-dimm.c
> @@ -25,19 +25,10 @@
>  #include "qapi/error.h"
>  #include "qemu/config-file.h"
>  #include "qapi/visitor.h"
> -#include "qemu/range.h"
>  #include "sysemu/numa.h"
> -#include "sysemu/kvm.h"
>  #include "trace.h"
> -#include "hw/virtio/vhost.h"
>  
> -typedef struct pc_dimms_capacity {
> -     uint64_t size;
> -     Error    **errp;
> -} pc_dimms_capacity;
> -
> -void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
> -                         MemoryRegion *mr, uint64_t align, Error **errp)
> +void pc_dimm_memory_plug(DeviceState *dev, uint64_t align, Error **errp)
>  {
>      int slot;
>      MachineState *machine = MACHINE(qdev_get_machine());
> @@ -45,37 +36,26 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
>      PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
>      MemoryRegion *vmstate_mr = ddc->get_vmstate_memory_region(dimm);
>      Error *local_err = NULL;
> -    uint64_t existing_dimms_capacity = 0;
> +    MemoryRegion *mr;
>      uint64_t addr;
>  
> -    addr = object_property_get_uint(OBJECT(dimm),
> -                                    PC_DIMM_ADDR_PROP, &local_err);
> +    mr = ddc->get_memory_region(dimm, &local_err);
>      if (local_err) {
>          goto out;
>      }
>  
> -    addr = pc_dimm_get_free_addr(hpms->base,
> -                                 memory_region_size(&hpms->mr),
> -                                 !addr ? NULL : &addr, align,
> -                                 memory_region_size(mr), &local_err);
> +    addr = object_property_get_uint(OBJECT(dimm),
> +                                    PC_DIMM_ADDR_PROP, &local_err);
>      if (local_err) {
>          goto out;
>      }
>  
> -    existing_dimms_capacity = pc_existing_dimms_capacity(&local_err);
> +    addr = memory_device_get_free_addr(!addr ? NULL : &addr, align,
> +                                       memory_region_size(mr), &local_err);
>      if (local_err) {
>          goto out;
>      }
>  
> -    if (existing_dimms_capacity + memory_region_size(mr) >
> -        machine->maxram_size - machine->ram_size) {
> -        error_setg(&local_err, "not enough space, currently 0x%" PRIx64
> -                   " in use of total hot pluggable 0x" RAM_ADDR_FMT,
> -                   existing_dimms_capacity,
> -                   machine->maxram_size - machine->ram_size);
> -        goto out;
> -    }
> -
>      object_property_set_uint(OBJECT(dev), addr, PC_DIMM_ADDR_PROP, &local_err);
>      if (local_err) {
>          goto out;
> @@ -98,67 +78,27 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
>      }
>      trace_mhp_pc_dimm_assigned_slot(slot);
>  
> -    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
> -        error_setg(&local_err, "hypervisor has no free memory slots left");
> -        goto out;
> -    }
> -
> -    if (!vhost_has_free_slot()) {
> -        error_setg(&local_err, "a used vhost backend has no free"
> -                               " memory slots left");
> +    memory_device_plug_region(mr, addr, &local_err);
> +    if (local_err) {
>          goto out;
>      }
> -
> -    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
>      vmstate_register_ram(vmstate_mr, dev);
>  
>  out:
>      error_propagate(errp, local_err);
>  }
>  
> -void pc_dimm_memory_unplug(DeviceState *dev, MemoryHotplugState *hpms,
> -                           MemoryRegion *mr)
> +void pc_dimm_memory_unplug(DeviceState *dev)
>  {
>      PCDIMMDevice *dimm = PC_DIMM(dev);
>      PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
>      MemoryRegion *vmstate_mr = ddc->get_vmstate_memory_region(dimm);
> +    MemoryRegion *mr = ddc->get_memory_region(dimm, &error_abort);
>  
> -    memory_region_del_subregion(&hpms->mr, mr);
> +    memory_device_unplug_region(mr);
>      vmstate_unregister_ram(vmstate_mr, dev);
>  }
>  
> -static int pc_existing_dimms_capacity_internal(Object *obj, void *opaque)
> -{
> -    pc_dimms_capacity *cap = opaque;
> -    uint64_t *size = &cap->size;
> -
> -    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
> -        DeviceState *dev = DEVICE(obj);
> -
> -        if (dev->realized) {
> -            (*size) += object_property_get_uint(obj, PC_DIMM_SIZE_PROP,
> -                cap->errp);
> -        }
> -
> -        if (cap->errp && *cap->errp) {
> -            return 1;
> -        }
> -    }
> -    object_child_foreach(obj, pc_existing_dimms_capacity_internal, opaque);
> -    return 0;
> -}
> -
> -uint64_t pc_existing_dimms_capacity(Error **errp)
> -{
> -    pc_dimms_capacity cap;
> -
> -    cap.size = 0;
> -    cap.errp = errp;
> -
> -    pc_existing_dimms_capacity_internal(qdev_get_machine(), &cap);
> -    return cap.size;
> -}
> -
>  static int pc_dimm_slot2bitmap(Object *obj, void *opaque)
>  {
>      unsigned long *bitmap = opaque;
> @@ -205,107 +145,6 @@ out:
>      return slot;
>  }
>  
> -static gint pc_dimm_addr_sort(gconstpointer a, gconstpointer b)
> -{
> -    PCDIMMDevice *x = PC_DIMM(a);
> -    PCDIMMDevice *y = PC_DIMM(b);
> -    Int128 diff = int128_sub(int128_make64(x->addr), int128_make64(y->addr));
> -
> -    if (int128_lt(diff, int128_zero())) {
> -        return -1;
> -    } else if (int128_gt(diff, int128_zero())) {
> -        return 1;
> -    }
> -    return 0;
> -}
> -
> -static int pc_dimm_built_list(Object *obj, void *opaque)
> -{
> -    GSList **list = opaque;
> -
> -    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
> -        DeviceState *dev = DEVICE(obj);
> -        if (dev->realized) { /* only realized DIMMs matter */
> -            *list = g_slist_insert_sorted(*list, dev, pc_dimm_addr_sort);
> -        }
> -    }
> -
> -    object_child_foreach(obj, pc_dimm_built_list, opaque);
> -    return 0;
> -}
> -
> -uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
> -                               uint64_t address_space_size,
> -                               uint64_t *hint, uint64_t align, uint64_t size,
> -                               Error **errp)
> -{
> -    GSList *list = NULL, *item;
> -    uint64_t new_addr, ret = 0;
> -    uint64_t address_space_end = address_space_start + address_space_size;
> -
> -    g_assert(QEMU_ALIGN_UP(address_space_start, align) == address_space_start);
> -
> -    if (!address_space_size) {
> -        error_setg(errp, "memory hotplug is not enabled, "
> -                         "please add maxmem option");
> -        goto out;
> -    }
> -
> -    if (hint && QEMU_ALIGN_UP(*hint, align) != *hint) {
> -        error_setg(errp, "address must be aligned to 0x%" PRIx64 " bytes",
> -                   align);
> -        goto out;
> -    }
> -
> -    if (QEMU_ALIGN_UP(size, align) != size) {
> -        error_setg(errp, "backend memory size must be multiple of 0x%"
> -                   PRIx64, align);
> -        goto out;
> -    }
> -
> -    assert(address_space_end > address_space_start);
> -    object_child_foreach(qdev_get_machine(), pc_dimm_built_list, &list);
> -
> -    if (hint) {
> -        new_addr = *hint;
> -    } else {
> -        new_addr = address_space_start;
> -    }
> -
> -    /* find address range that will fit new DIMM */
> -    for (item = list; item; item = g_slist_next(item)) {
> -        PCDIMMDevice *dimm = item->data;
> -        uint64_t dimm_size = object_property_get_uint(OBJECT(dimm),
> -                                                      PC_DIMM_SIZE_PROP,
> -                                                      errp);
> -        if (errp && *errp) {
> -            goto out;
> -        }
> -
> -        if (ranges_overlap(dimm->addr, dimm_size, new_addr, size)) {
> -            if (hint) {
> -                DeviceState *d = DEVICE(dimm);
> -                error_setg(errp, "address range conflicts with '%s'", d->id);
> -                goto out;
> -            }
> -            new_addr = QEMU_ALIGN_UP(dimm->addr + dimm_size, align);
> -        }
> -    }
> -    ret = new_addr;
> -
> -    if (new_addr < address_space_start) {
> -        error_setg(errp, "can't add memory [0x%" PRIx64 ":0x%" PRIx64
> -                   "] at 0x%" PRIx64, new_addr, size, address_space_start);
> -    } else if ((new_addr + size) > address_space_end) {
> -        error_setg(errp, "can't add memory [0x%" PRIx64 ":0x%" PRIx64
> -                   "] beyond 0x%" PRIx64, new_addr, size, address_space_end);
> -    }
> -
> -out:
> -    g_slist_free(list);
> -    return ret;
> -}
> -
>  static Property pc_dimm_properties[] = {
>      DEFINE_PROP_UINT64(PC_DIMM_ADDR_PROP, PCDIMMDevice, addr, 0),
>      DEFINE_PROP_UINT32(PC_DIMM_NODE_PROP, PCDIMMDevice, node, 0),
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 7ccdb705b3..7757a49335 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -3041,7 +3041,7 @@ static void spapr_memory_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>      align = memory_region_get_alignment(mr);
>      size = memory_region_size(mr);
>  
> -    pc_dimm_memory_plug(dev, &ms->hotplug_memory, mr, align, &local_err);
> +    pc_dimm_memory_plug(dev, align, &local_err);
>      if (local_err) {
>          goto out;
>      }
> @@ -3062,7 +3062,7 @@ static void spapr_memory_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>      return;
>  
>  out_unplug:
> -    pc_dimm_memory_unplug(dev, &ms->hotplug_memory, mr);
> +    pc_dimm_memory_unplug(dev);
>  out:
>      error_propagate(errp, local_err);
>  }
> @@ -3180,9 +3180,6 @@ static sPAPRDIMMState *spapr_recover_pending_dimm_state(sPAPRMachineState *ms,
>  void spapr_lmb_release(DeviceState *dev)
>  {
>      sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_hotplug_handler(dev));
> -    PCDIMMDevice *dimm = PC_DIMM(dev);
> -    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
> -    MemoryRegion *mr = ddc->get_memory_region(dimm, &error_abort);
>      sPAPRDIMMState *ds = spapr_pending_dimm_unplugs_find(spapr, PC_DIMM(dev));
>  
>      /* This information will get lost if a migration occurs
> @@ -3202,7 +3199,7 @@ void spapr_lmb_release(DeviceState *dev)
>       * Now that all the LMBs have been removed by the guest, call the
>       * pc-dimm unplug handler to cleanup up the pc-dimm device.
>       */
> -    pc_dimm_memory_unplug(dev, &spapr->hotplug_memory, mr);
> +    pc_dimm_memory_unplug(dev);
>      object_unparent(OBJECT(dev));
>      spapr_pending_dimm_unplugs_remove(spapr, ds);
>  }
> diff --git a/include/hw/mem/memory-device.h b/include/hw/mem/memory-device.h
> index 3e498b2e61..722620da24 100644
> --- a/include/hw/mem/memory-device.h
> +++ b/include/hw/mem/memory-device.h
> @@ -40,5 +40,9 @@ typedef struct MemoryDeviceClass {
>  
>  MemoryDeviceInfoList *qmp_memory_device_list(void);
>  uint64_t get_plugged_memory_size(void);
> +uint64_t memory_device_get_free_addr(uint64_t *hint, uint64_t align,
> +                                     uint64_t size, Error **errp);
> +void memory_device_plug_region(MemoryRegion *mr, uint64_t addr, Error **errp);
> +void memory_device_unplug_region(MemoryRegion *mr);
>  
>  #endif
> diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
> index 8bda37adab..2e7c2abe35 100644
> --- a/include/hw/mem/pc-dimm.h
> +++ b/include/hw/mem/pc-dimm.h
> @@ -19,7 +19,6 @@
>  #include "exec/memory.h"
>  #include "sysemu/hostmem.h"
>  #include "hw/qdev.h"
> -#include "hw/boards.h"
>  
>  #define TYPE_PC_DIMM "pc-dimm"
>  #define PC_DIMM(obj) \
> @@ -76,16 +75,7 @@ typedef struct PCDIMMDeviceClass {
>      MemoryRegion *(*get_vmstate_memory_region)(PCDIMMDevice *dimm);
>  } PCDIMMDeviceClass;
>  
> -uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
> -                               uint64_t address_space_size,
> -                               uint64_t *hint, uint64_t align, uint64_t size,
> -                               Error **errp);
> -
>  int pc_dimm_get_free_slot(const int *hint, int max_slots, Error **errp);
> -
> -uint64_t pc_existing_dimms_capacity(Error **errp);
> -void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
> -                         MemoryRegion *mr, uint64_t align, Error **errp);
> -void pc_dimm_memory_unplug(DeviceState *dev, MemoryHotplugState *hpms,
> -                           MemoryRegion *mr);
> +void pc_dimm_memory_plug(DeviceState *dev, uint64_t align, Error **errp);
> +void pc_dimm_memory_unplug(DeviceState *dev);
>  #endif


* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-20 12:34 [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice David Hildenbrand
                   ` (3 preceding siblings ...)
  2018-04-22  4:58 ` [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice Pankaj Gupta
@ 2018-04-23 12:31 ` Igor Mammedov
  2018-04-23 12:50   ` David Hildenbrand
  2018-04-23 15:32   ` Pankaj Gupta
  4 siblings, 2 replies; 49+ messages in thread
From: Igor Mammedov @ 2018-04-23 12:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin,
	Markus Armbruster, qemu-s390x, qemu-ppc, Paolo Bonzini,
	Marcel Apfelbaum, David Gibson, Richard Henderson

On Fri, 20 Apr 2018 14:34:53 +0200
David Hildenbrand <david@redhat.com> wrote:

> Right now we can only map PCDIMM/NVDIMM into guest address space. In the
> future, we might want to do the same for virtio devices - e.g.
> virtio-pmem or virtio-mem. In particular, they should be able to live side
> by side with each other.
> 
> E.g. the virtio-based memory device regions will not be exposed via ACPI
> and friends. They will be detected just like other virtio devices and
> indicate the applicable memory region. This makes it possible to also use
> them on architectures without memory device detection support (e.g. s390x).
> 
> Let's factor out the memory device code into a MemoryDevice interface.
A couple of high-level questions, as the relevant code is not here:

  1. what would the hotplug/unplug call chain look like in the case of a virtio-pmem device
     (the reason I'm asking is that pmem, being a PCI device, would trigger
      the PCI bus hotplug controller and then somehow would have to piggyback
      on the Machine-provided hotplug handlers, so I wonder what kind of
      havoc it would cause in the hotplug infrastructure)

  2. why not use the PCI BAR mapping mechanism for the mapping, since pmem is a PCI device?

 
> v2 -> v3:
> - "pc-dimm: factor out MemoryDevice interface"
> --> Lookup both classes when comparing (David Gibson)  
> 
> v1 -> v2:
> - Fix compile issues on ppc (still untested)
> 
> 
> David Hildenbrand (3):
>   pc-dimm: factor out MemoryDevice interface
>   machine: make MemoryHotplugState accessible via the machine
>   pc-dimm: factor out address space logic into MemoryDevice code
> 
>  hw/i386/acpi-build.c                         |   3 +-
>  hw/i386/pc.c                                 |  24 ++-
>  hw/mem/Makefile.objs                         |   1 +
>  hw/mem/memory-device.c                       | 282 +++++++++++++++++++++++++
>  hw/mem/pc-dimm.c                             | 304 +++++++--------------------
>  hw/ppc/spapr.c                               |  24 ++-
>  hw/ppc/spapr_hcall.c                         |   1 +
>  include/hw/boards.h                          |  16 ++
>  include/hw/mem/memory-device.h               |  48 +++++
>  include/hw/mem/pc-dimm.h                     |  26 +--
>  numa.c                                       |   3 +-
>  qmp.c                                        |   4 +-
>  stubs/Makefile.objs                          |   2 +-
>  stubs/{qmp_pc_dimm.c => qmp_memory_device.c} |   4 +-
>  14 files changed, 465 insertions(+), 277 deletions(-)
>  create mode 100644 hw/mem/memory-device.c
>  create mode 100644 include/hw/mem/memory-device.h
>  rename stubs/{qmp_pc_dimm.c => qmp_memory_device.c} (61%)
> 


* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-23 12:19   ` Igor Mammedov
@ 2018-04-23 12:44     ` David Hildenbrand
  2018-04-24 13:28       ` Igor Mammedov
  2018-04-23 12:52     ` David Hildenbrand
  2018-04-23 14:44     ` [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code David Hildenbrand
  2 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-04-23 12:44 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost, David Gibson,
	Markus Armbruster, qemu-ppc, Pankaj Gupta

On 23.04.2018 14:19, Igor Mammedov wrote:
> On Fri, 20 Apr 2018 14:34:56 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> To be able to reuse MemoryDevice logic from other devices besides
>> pc-dimm, factor the relevant stuff out into the MemoryDevice code.
>>
>> As we don't care about slots for memory devices that are not pc-dimm,
>> don't factor that part out.
> that's not really true, you still consume kvm and vhost slots (whatever it is)
> whenever you map it into address space as ram memory region.

Let me rephrase: ACPI slots are not of interest. That user-visible part
is not needed for other memory devices.

KVM and vhost slots are different (they are just memory regions; we don't
care about which specific slot is taken).

> 
> Also ram_slots currently are (ab)used as flag that user enabled memory
> hotplug via CLI.

Yes, I have a patch for this :)

>  
>> Most of this patch just moves checks and logic around. While at it, make
>> the code properly detect certain error conditions better (e.g. fragmented
>> memory).
> I'd suggest splitting patch in several smaller ones if possible,
> especially parts that do anything more than just moving code around.

I tried to do it in smaller steps, but most of them turned out to just
introduce and delete temporary code. I will have a look at whether this
can be done once we have a common understanding of what the end result
should look like.

> 
> 
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  hw/i386/pc.c                   |  12 +--
>>  hw/mem/memory-device.c         | 162 ++++++++++++++++++++++++++++++++++++
>>  hw/mem/pc-dimm.c               | 185 +++--------------------------------------
>>  hw/ppc/spapr.c                 |   9 +-
>>  include/hw/mem/memory-device.h |   4 +
>>  include/hw/mem/pc-dimm.h       |  14 +---
>>  6 files changed, 185 insertions(+), 201 deletions(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index fa8862af33..1c25546a0c 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -1711,7 +1711,7 @@ static void pc_dimm_plug(HotplugHandler *hotplug_dev,
>>          goto out;
>>      }
>>  
>> -    pc_dimm_memory_plug(dev, &pcms->hotplug_memory, mr, align, &local_err);
>> +    pc_dimm_memory_plug(dev, align, &local_err);
> Is there a reason why you are dropping pcms->hotplug_memory argument
> and fall back to qdev_get_machine()?

Yes, because
a) we access the machine internally anyway (a couple of times, even).
b) this only works if we have a hotplug handler on the machine (esp. not
for virtio devices). Otherwise we would have to get the machine from the
virtio realize function - also ugly.

> 
> I'd rather see it going other direction,
> i.e. move hotplug_memory from PC
> machine to MachineState and then pass it down as argument whenever it's needed.

As said, ugly for virtio devices. And I don't see a benefit, given that
we already access the machine internally either way.

> 
>>      if (local_err) {
>>          goto out;
>>      }
>> @@ -1761,17 +1761,9 @@ static void pc_dimm_unplug(HotplugHandler *hotplug_dev,
>>                             DeviceState *dev, Error **errp)
>>  {
>>      PCMachineState *pcms = PC_MACHINE(hotplug_dev);
>> -    PCDIMMDevice *dimm = PC_DIMM(dev);
>> -    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
>> -    MemoryRegion *mr;
>>      HotplugHandlerClass *hhc;
>>      Error *local_err = NULL;
>>  
>> -    mr = ddc->get_memory_region(dimm, &local_err);
>> -    if (local_err) {
>> -        goto out;
>> -    }
>> -
>>      hhc = HOTPLUG_HANDLER_GET_CLASS(pcms->acpi_dev);
>>      hhc->unplug(HOTPLUG_HANDLER(pcms->acpi_dev), dev, &local_err);
>>  
>> @@ -1779,7 +1771,7 @@ static void pc_dimm_unplug(HotplugHandler *hotplug_dev,
>>          goto out;
>>      }
>>  
>> -    pc_dimm_memory_unplug(dev, &pcms->hotplug_memory, mr);
>> +    pc_dimm_memory_unplug(dev);
> ditto

(and I still think it looks cleaner, but we can discuss)

> 
>>      object_unparent(OBJECT(dev));
>>  
>>   out:
>> diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
>> index b860c9c582..b96efa3bf4 100644
>> --- a/hw/mem/memory-device.c
>> +++ b/hw/mem/memory-device.c
>> @@ -15,6 +15,8 @@
>>  #include "qapi/error.h"
>>  #include "hw/boards.h"
>>  #include "qemu/range.h"
>> +#include "hw/virtio/vhost.h"
>> +#include "sysemu/kvm.h"
>>  
>>  static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
>>  {
>> @@ -106,6 +108,166 @@ uint64_t get_plugged_memory_size(void)
>>      return size;
>>  }
>>  
>> +static int memory_device_used_region_size_internal(Object *obj, void *opaque)
>> +{
>> +    uint64_t *size = opaque;
>> +
>> +    if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
>> +        DeviceState *dev = DEVICE(obj);
>> +        MemoryDeviceState *md = MEMORY_DEVICE(obj);
>> +        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj);
>> +
>> +        if (dev->realized) {
>> +            *size += mdc->get_region_size(md, &error_abort);
>> +        }
>> +    }
>> +
>> +    object_child_foreach(obj, memory_device_used_region_size_internal, opaque);
>> +    return 0;
>> +}
>> +
>> +static uint64_t memory_device_used_region_size(void)
>> +{
>> +    uint64_t size = 0;
>> +
>> +    memory_device_used_region_size_internal(qdev_get_machine(), &size);
>> +
>> +    return size;
>> +}
>> +
>> +uint64_t memory_device_get_free_addr(uint64_t *hint, uint64_t align,
>> +                                     uint64_t size, Error **errp)
> I'd suggest to rework pc_dimm_memory_plug/pc_dimm_get_free_addr first,
> namely most of the stuff it does like checks and assigning default
> values should go to pre_plug (pre realize) handler and then only
> actual mapping is left for plug (after realize) handler to deal with:
> 

Can you elaborate what you mean by pre-plug? If this is about pre plug
handler of the (machine) hotplug handler, it might be problematic for
virtio devices.

>     memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);               
>     vmstate_register_ram(vmstate_mr, dev); 
> 
>> +{
>> +    const uint64_t used_region_size = memory_device_used_region_size();
>> +    uint64_t address_space_start, address_space_end;
>> +    MachineState *machine = MACHINE(qdev_get_machine());
>> +    MachineClass *mc = MACHINE_GET_CLASS(machine);
>> +    MemoryHotplugState *hpms;
>> +    GSList *list = NULL, *item;
>> +    uint64_t new_addr = 0;
>> +
>> +    if (!mc->get_memory_hotplug_state) {
>> +        error_setg(errp, "memory devices (e.g. for memory hotplug) are not "
>> +                         "supported by the machine");
>> +        return 0;
>> +    }
>> +
>> +    hpms = mc->get_memory_hotplug_state(machine);
>> +    if (!hpms || !memory_region_size(&hpms->mr)) {
>> +        error_setg(errp, "memory devices (e.g. for memory hotplug) are not "
>> +                         "enabled, please specify the maxmem option");
>> +        return 0;
>> +    }
> above 2 checks are repeated multiple times separate helper to do it
> would be better.

In the current code there is only one place left, namely here.

> 
> PS:
> there is no need to check for it every time device is plugged,
> doing this check once (at machine_init time) is sufficient.

Can you elaborate? There has to be one central place where we check if
we can hotplug a memory device. E.g. virtio devices don't go via the
machine hotplug handler.

> 
>> +    address_space_start = hpms->base;
>> +    address_space_end = address_space_start + memory_region_size(&hpms->mr);
>> +    g_assert(address_space_end >= address_space_start);
>> +
>> +    if (used_region_size + size > machine->maxram_size - machine->ram_size) {
>> +        error_setg(errp, "not enough space, currently 0x%" PRIx64
>> +                   " in use of total hot pluggable 0x" RAM_ADDR_FMT,
>> +                   used_region_size, machine->maxram_size - machine->ram_size);
>> +        return 0;
>> +    }
>> +
>> +    if (hint && QEMU_ALIGN_UP(*hint, align) != *hint) {
>> +        error_setg(errp, "address must be aligned to 0x%" PRIx64 " bytes",
>> +                   align);
>> +        return 0;
>> +    }
>> +
>> +    if (QEMU_ALIGN_UP(size, align) != size) {
>> +        error_setg(errp, "backend memory size must be multiple of 0x%"
>> +                   PRIx64, align);
>> +        return 0;
>> +    }
>> +
>> +    if (hint) {
>> +        new_addr = *hint;
>> +        if (new_addr < address_space_start) {
>> +            error_setg(errp, "can't add memory [0x%" PRIx64 ":0x%" PRIx64
>> +                       "] at 0x%" PRIx64, new_addr, size, address_space_start);
>> +            return 0;
>> +        } else if ((new_addr + size) > address_space_end) {
>> +            error_setg(errp, "can't add memory [0x%" PRIx64 ":0x%" PRIx64
>> +                       "] beyond 0x%" PRIx64, new_addr, size,
>> +                       address_space_end);
>> +            return 0;
>> +        }
>> +    } else {
>> +        new_addr = address_space_start;
>> +    }
>> +
>> +    /* find address range that will fit new memory device */
>> +    object_child_foreach(qdev_get_machine(), memory_device_built_list, &list);
>> +    for (item = list; item; item = g_slist_next(item)) {
>> +        MemoryDeviceState *md = item->data;
>> +        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(OBJECT(md));
>> +        uint64_t md_size, md_addr;
>> +
>> +        md_addr = mdc->get_addr(md);
>> +        md_size = mdc->get_region_size(md, errp);
>> +        if (*errp) {
>> +            goto out;
>> +        }
>> +
>> +        if (ranges_overlap(md_addr, md_size, new_addr, size)) {
>> +            if (hint) {
>> +                DeviceState *d = DEVICE(md);
>> +                error_setg(errp, "address range conflicts with '%s'", d->id);
>> +                goto out;
>> +            }
>> +            new_addr = QEMU_ALIGN_UP(md_addr + md_size, align);
>> +        }
>> +    }
>> +
>> +    if (new_addr + size > address_space_end) {
>> +        error_setg(errp, "could not find position in guest address space for "
>> +                   "memory device - memory fragmented due to alignments");
>> +        goto out;
>> +    }
>> +out:
>> +    g_slist_free(list);
>> +    return new_addr;
>> +}
>> +
>> +void memory_device_plug_region(MemoryRegion *mr, uint64_t addr, Error **errp)
>> +{
> [...]
> 
>> +    /* we will need a new memory slot for kvm and vhost */
>> +    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
>> +        error_setg(errp, "hypervisor has no free memory slots left");
>> +        return;
>> +    }
>> +    if (!vhost_has_free_slot()) {
>> +        error_setg(errp, "a used vhost backend has no free memory slots left");
>> +        return;
>> +    }
> move these checks to pre_plug time

That would then be a different flow than we have right now. But it sounds
sane.
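
The free-address search in the hunk above (alignment checks, optional hint, scan over already-plugged devices, gap detection) can be modeled in isolation. Below is a simplified, self-contained sketch of the same idea - it is not QEMU's actual memory_device_get_free_addr(): the hint handling is dropped, the device list is a plain sorted array, and all names (Region, find_free_addr) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t addr;
    uint64_t size;
} Region;

/* align must be non-zero */
static uint64_t align_up(uint64_t v, uint64_t align)
{
    return (v + align - 1) / align * align;
}

static int ranges_overlap(uint64_t a, uint64_t a_size,
                          uint64_t b, uint64_t b_size)
{
    return a < b + b_size && b < a + a_size;
}

/*
 * Find an aligned address for a new region of 'size' bytes inside
 * [as_start, as_end), skipping past every already-used region.
 * 'used' must be sorted by address. Returns UINT64_MAX on error.
 */
static uint64_t find_free_addr(const Region *used, size_t n,
                               uint64_t as_start, uint64_t as_end,
                               uint64_t align, uint64_t size)
{
    uint64_t new_addr = align_up(as_start, align);
    size_t i;

    /* backend memory size must be a multiple of the alignment */
    if (align_up(size, align) != size) {
        return UINT64_MAX;
    }
    for (i = 0; i < n; i++) {
        if (ranges_overlap(used[i].addr, used[i].size, new_addr, size)) {
            /* candidate collides: retry right after this region */
            new_addr = align_up(used[i].addr + used[i].size, align);
        }
    }
    if (new_addr + size > as_end) {
        /* memory fragmented due to alignments */
        return UINT64_MAX;
    }
    return new_addr;
}
```

The second assertion in the usage below shows the "fragmented due to alignments" failure mode the patch reports: enough total space exists, but no single hole fits the request.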


-- 

Thanks,

David / dhildenb

* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-23 12:31 ` Igor Mammedov
@ 2018-04-23 12:50   ` David Hildenbrand
  2018-04-23 15:32   ` Pankaj Gupta
  1 sibling, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-04-23 12:50 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin,
	Markus Armbruster, qemu-s390x, qemu-ppc, Paolo Bonzini,
	Marcel Apfelbaum, David Gibson, Richard Henderson

On 23.04.2018 14:31, Igor Mammedov wrote:
> On Fri, 20 Apr 2018 14:34:53 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Right now we can only map PCDIMM/NVDIMM into guest address space. In the
>> future, we might want to do the same for virtio devices - e.g.
>> virtio-pmem or virtio-mem. Especially, they should be able to live side
>> by side to each other.
>>
>> E.g. the virtio-based memory device regions will not be exposed via ACPI
>> and friends. They will be detected just like other virtio devices and
>> indicate the applicable memory region. This makes it possible to also use
>> them on architectures without memory device detection support (e.g. s390x).
>>
>> Let's factor out the memory device code into a MemoryDevice interface.
> A couple of high level questions as relevant code is not here:

Important remark first: We also have s390x with virtio-ccw.

This essentially will also allow s390x guests to have
- memory hotplug via virtio-mem
- fake dax devices via virtio-pmem

> 
>   1. what would hotplug/unplug call chain look like in case of virtio-pmem device
>      (reason I'm asking is that pmem being PCI device would trigger
>       PCI bus hotplug controller and then it somehow should piggyback
>       to Machine provided hotplug handlers, so I wonder what kind of
>       havoc it would cause on hotplug infrastructure)
> 
>   2. why not use PCI bar mapping mechanism to do mapping since pmem is PCI device?  

pmem might be a PCI device, but virtio-pmem is a virtio device (however,
it will be exposed via a proxy).

> 

I think both questions are best answered by Pankaj (already on CC).

-- 

Thanks,

David / dhildenb

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-23 12:19   ` Igor Mammedov
  2018-04-23 12:44     ` David Hildenbrand
@ 2018-04-23 12:52     ` David Hildenbrand
  2018-04-24 13:31       ` Igor Mammedov
  2018-04-23 14:44     ` [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code David Hildenbrand
  2 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-04-23 12:52 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost, David Gibson,
	Markus Armbruster, qemu-ppc, Pankaj Gupta


> 
>> +    /* we will need a new memory slot for kvm and vhost */
>> +    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
>> +        error_setg(errp, "hypervisor has no free memory slots left");
>> +        return;
>> +    }
>> +    if (!vhost_has_free_slot()) {
>> +        error_setg(errp, "a used vhost backend has no free memory slots left");
>> +        return;
>> +    }
> move these checks to pre_plug time
> 
>> +
>> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
> missing vmstate registration?

Missed this one: To be called by the caller. Important because e.g. for
virtio-pmem we don't want this (I assume :) ).

Thanks!


-- 

Thanks,

David / dhildenb

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-23 12:19   ` Igor Mammedov
  2018-04-23 12:44     ` David Hildenbrand
  2018-04-23 12:52     ` David Hildenbrand
@ 2018-04-23 14:44     ` David Hildenbrand
  2 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-04-23 14:44 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, qemu-s390x, Michael S . Tsirkin, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost, David Gibson,
	Markus Armbruster, qemu-ppc, Pankaj Gupta

On 23.04.2018 14:19, Igor Mammedov wrote:
> On Fri, 20 Apr 2018 14:34:56 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> To be able to reuse MemoryDevice logic from other devices besides
>> pc-dimm, factor the relevant stuff out into the MemoryDevice code.
>>
>> As we don't care about slots for memory devices that are not pc-dimm,
>> don't factor that part out.
> that's not really true, you still consume kvm and vhost slots (whatever it is)
> whenever you map it into address space as ram memory region.
> 
> Also ram_slots currently are (ab)used as flag that user enabled memory
> hotplug via CLI.
>  
>> Most of this patch just moves checks and logic around. While at it, make
>> the code properly detect certain error conditions better (e.g. fragmented
>> memory).
> I'd suggest splitting patch in several smaller ones if possible,
> especially parts that do anything more than just moving code around.
> 
> 
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  hw/i386/pc.c                   |  12 +--
>>  hw/mem/memory-device.c         | 162 ++++++++++++++++++++++++++++++++++++
>>  hw/mem/pc-dimm.c               | 185 +++--------------------------------------
>>  hw/ppc/spapr.c                 |   9 +-
>>  include/hw/mem/memory-device.h |   4 +
>>  include/hw/mem/pc-dimm.h       |  14 +---
>>  6 files changed, 185 insertions(+), 201 deletions(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index fa8862af33..1c25546a0c 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -1711,7 +1711,7 @@ static void pc_dimm_plug(HotplugHandler *hotplug_dev,
>>          goto out;
>>      }
>>  
>> -    pc_dimm_memory_plug(dev, &pcms->hotplug_memory, mr, align, &local_err);
>> +    pc_dimm_memory_plug(dev, align, &local_err);
> Is there a reason why you are dropping pcms->hotplug_memory argument
> and fall back to qdev_get_machine()?
> 
> I'd rather see it going other direction,
> i.e. move hotplug_memory from PC
> machine to MachineState and then pass it down as argument whenever it's needed.

FWIW, I think I found a way to split this into smaller patches.

The current prototypes will look like this for pc_dimm

void pc_dimm_memory_plug(DeviceState *dev, MachineState *machine,
                         uint64_t align, Error **errp);
void pc_dimm_memory_unplug(DeviceState *dev, MachineState *machine);

I am not sure yet if I'll work on the pre-plug stuff for pc-dimm (I want
to get memory devices running not rewrite all of the pc-dimm memory
hotplug code :) ), but that can be reworked later on easily.

-- 

Thanks,

David / dhildenb

* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-23 12:31 ` Igor Mammedov
  2018-04-23 12:50   ` David Hildenbrand
@ 2018-04-23 15:32   ` Pankaj Gupta
  2018-04-23 16:35     ` David Hildenbrand
  2018-04-24 14:00     ` Igor Mammedov
  1 sibling, 2 replies; 49+ messages in thread
From: Pankaj Gupta @ 2018-04-23 15:32 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: David Hildenbrand, Eduardo Habkost, Michael S . Tsirkin,
	qemu-devel, Markus Armbruster, qemu-s390x, qemu-ppc,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson, David Gibson


Hi Igor,

> 
> > Right now we can only map PCDIMM/NVDIMM into guest address space. In the
> > future, we might want to do the same for virtio devices - e.g.
> > virtio-pmem or virtio-mem. Especially, they should be able to live side
> > by side to each other.
> > 
> > E.g. the virtio-based memory device regions will not be exposed via ACPI
> > and friends. They will be detected just like other virtio devices and
> > indicate the applicable memory region. This makes it possible to also use
> > them on architectures without memory device detection support (e.g. s390x).
> > 
> > Let's factor out the memory device code into a MemoryDevice interface.
> A couple of high level questions as relevant code is not here:
> 
>   1. what would hotplug/unplug call chain look like in case of virtio-pmem
>   device
>      (reason I'm asking is that pmem being PCI device would trigger
>       PCI bus hotplug controller and then it somehow should piggyback
>       to Machine provided hotplug handlers, so I wonder what kind of
>       havoc it would cause on hotplug infrastructure)

For the first phase we are using 'virtio-pmem' as a cold-added device. AFAIU,
'VirtioDeviceClass' is the parent class, and the 'hotplug/unplug' methods are
implemented for the virtio-pmem device. So PCI bus hotplug/unplug should call
the corresponding functions?

> 
>   2. why not use PCI bar mapping mechanism to do mapping since pmem is PCI
>   device?

I think even if we use PCI BAR mapping, we still need a free guest physical
address to provide to the VM for mapping the memory range. For that, there
needs to be coordination between the PCDIMM and the virtio PCI device? Also,
if we use RAM from the QEMU address space tied to the big region
(system_memory), memory accounting gets easier and stays in a single place.

Honestly speaking, I don't think there will be much difference between the
two approaches, unless I am missing something important?

> 
>  
> > v2 -> v3:
> > - "pc-dimm: factor out MemoryDevice interface"
> > --> Lookup both classes when comparing (David Gibson)
> > 
> > v1 -> v2:
> > - Fix compile issues on ppc (still untested  )
> > 
> > 
> > David Hildenbrand (3):
> >   pc-dimm: factor out MemoryDevice interface
> >   machine: make MemoryHotplugState accessible via the machine
> >   pc-dimm: factor out address space logic into MemoryDevice code
> > 
> >  hw/i386/acpi-build.c                         |   3 +-
> >  hw/i386/pc.c                                 |  24 ++-
> >  hw/mem/Makefile.objs                         |   1 +
> >  hw/mem/memory-device.c                       | 282
> >  +++++++++++++++++++++++++
> >  hw/mem/pc-dimm.c                             | 304
> >  +++++++--------------------
> >  hw/ppc/spapr.c                               |  24 ++-
> >  hw/ppc/spapr_hcall.c                         |   1 +
> >  include/hw/boards.h                          |  16 ++
> >  include/hw/mem/memory-device.h               |  48 +++++
> >  include/hw/mem/pc-dimm.h                     |  26 +--
> >  numa.c                                       |   3 +-
> >  qmp.c                                        |   4 +-
> >  stubs/Makefile.objs                          |   2 +-
> >  stubs/{qmp_pc_dimm.c => qmp_memory_device.c} |   4 +-
> >  14 files changed, 465 insertions(+), 277 deletions(-)
> >  create mode 100644 hw/mem/memory-device.c
> >  create mode 100644 include/hw/mem/memory-device.h
> >  rename stubs/{qmp_pc_dimm.c => qmp_memory_device.c} (61%)
> > 
> 
> 
> 

* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-23 15:32   ` Pankaj Gupta
@ 2018-04-23 16:35     ` David Hildenbrand
  2018-04-24 14:00     ` Igor Mammedov
  1 sibling, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-04-23 16:35 UTC (permalink / raw)
  To: Pankaj Gupta, Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, David Gibson

On 23.04.2018 17:32, Pankaj Gupta wrote:
> 
> Hi Igor,
> 
>>
>>> Right now we can only map PCDIMM/NVDIMM into guest address space. In the
>>> future, we might want to do the same for virtio devices - e.g.
>>> virtio-pmem or virtio-mem. Especially, they should be able to live side
>>> by side to each other.
>>>
>>> E.g. the virto based memory devices regions will not be exposed via ACPI
>>> and friends. They will be detected just like other virtio devices and
>>> indicate the applicable memory region. This makes it possible to also use
>>> them on architectures without memory device detection support (e.g. s390x).
>>>
>>> Let's factor out the memory device code into a MemoryDevice interface.
>> A couple of high level questions as relevant code is not here:
>>
>>   1. what would hotplug/unplug call chain look like in case of virtio-pmem
>>   device
>>      (reason I'm asking is that pmem being PCI device would trigger
>>       PCI bus hotplug controller and then it somehow should piggyback
>>       to Machine provided hotplug handlers, so I wonder what kind of
>>       havoc it would cause on hotplug infrastructure)
> 
> For the first phase we are using 'virtio-pmem' as a cold-added device. AFAIU,
> 'VirtioDeviceClass' is the parent class, and the 'hotplug/unplug' methods are
> implemented for the virtio-pmem device. So PCI bus hotplug/unplug should call
> the corresponding functions?
> 
>>
>>   2. why not use PCI bar mapping mechanism to do mapping since pmem is PCI
>>   device?
> 
> I think even if we use PCI BAR mapping, we still need a free guest physical
> address to provide to the VM for mapping the memory range. For that, there
> needs to be coordination between the PCDIMM and the virtio PCI device? Also,
> if we use RAM from the QEMU address space tied to the big region
> (system_memory), memory accounting gets easier and stays in a single place.
> 
> Honestly speaking, I don't think there will be much difference between the
> two approaches, unless I am missing something important?

Gluing virtio devices to architecture-specific technologies will be
unnecessarily complicated. (my humble opinion)

-- 

Thanks,

David / dhildenb

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-23 12:44     ` David Hildenbrand
@ 2018-04-24 13:28       ` Igor Mammedov
  2018-04-24 13:39         ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Igor Mammedov @ 2018-04-24 13:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, David Gibson, Richard Henderson

On Mon, 23 Apr 2018 14:44:34 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 23.04.2018 14:19, Igor Mammedov wrote:
> > On Fri, 20 Apr 2018 14:34:56 +0200
> > David Hildenbrand <david@redhat.com> wrote:
considering v4 is queued, I'm dropping the mostly not relevant points here.
wrt virtio, I'll elaborate more in a reply to Pankaj

[...]

> >> diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
> >> index b860c9c582..b96efa3bf4 100644
> >> --- a/hw/mem/memory-device.c
> >> +++ b/hw/mem/memory-device.c
> >> @@ -15,6 +15,8 @@
> >>  #include "qapi/error.h"
> >>  #include "hw/boards.h"
> >>  #include "qemu/range.h"
> >> +#include "hw/virtio/vhost.h"
> >> +#include "sysemu/kvm.h"
> >>  
> >>  static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
> >>  {
> >> @@ -106,6 +108,166 @@ uint64_t get_plugged_memory_size(void)
> >>      return size;
> >>  }
> >>  
> >> +static int memory_device_used_region_size_internal(Object *obj, void *opaque)
> >> +{
> >> +    uint64_t *size = opaque;
> >> +
> >> +    if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
> >> +        DeviceState *dev = DEVICE(obj);
> >> +        MemoryDeviceState *md = MEMORY_DEVICE(obj);
> >> +        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj);
> >> +
> >> +        if (dev->realized) {
> >> +            *size += mdc->get_region_size(md, &error_abort);
> >> +        }
> >> +    }
> >> +
> >> +    object_child_foreach(obj, memory_device_used_region_size_internal, opaque);
> >> +    return 0;
> >> +}
> >> +
> >> +static uint64_t memory_device_used_region_size(void)
> >> +{
> >> +    uint64_t size = 0;
> >> +
> >> +    memory_device_used_region_size_internal(qdev_get_machine(), &size);
> >> +
> >> +    return size;
> >> +}
> >> +
> >> +uint64_t memory_device_get_free_addr(uint64_t *hint, uint64_t align,
> >> +                                     uint64_t size, Error **errp)  
> > I'd suggest to rework pc_dimm_memory_plug/pc_dimm_get_free_addr first,
> > namely most of the stuff it does like checks and assigning default
> > values should go to pre_plug (pre realize) handler and then only
> > actual mapping is left for plug (after realize) handler to deal with:
> >   
> 
> Can you elaborate what you mean by pre-plug? If this is about pre plug
> handler of the (machine) hotplug handler, it might be problematic for
> virtio devices.
yes, something along these lines: c871bc70b

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-23 12:52     ` David Hildenbrand
@ 2018-04-24 13:31       ` Igor Mammedov
  2018-04-24 13:41         ` David Hildenbrand
  2018-04-25  5:45         ` Pankaj Gupta
  0 siblings, 2 replies; 49+ messages in thread
From: Igor Mammedov @ 2018-04-24 13:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, David Gibson, Richard Henderson

On Mon, 23 Apr 2018 14:52:37 +0200
David Hildenbrand <david@redhat.com> wrote:

> >   
> >> +    /* we will need a new memory slot for kvm and vhost */
> >> +    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
> >> +        error_setg(errp, "hypervisor has no free memory slots left");
> >> +        return;
> >> +    }
> >> +    if (!vhost_has_free_slot()) {
> >> +        error_setg(errp, "a used vhost backend has no free memory slots left");
> >> +        return;
> >> +    }  
> > move these checks to pre_plug time
> >   
> >> +
> >> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);  
> > missing vmstate registration?  
> 
> Missed this one: To be called by the caller. Important because e.g. for
> virtio-pmem we don't want this (I assume :) ).
If pmem isn't on shared storage, then we'd probably want to migrate
it as well, otherwise the target would experience data loss.
Anyway, I'd just treat it as normal RAM in the migration case.

> 
> Thanks!
> 
> 

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-24 13:28       ` Igor Mammedov
@ 2018-04-24 13:39         ` David Hildenbrand
  2018-04-24 14:38           ` Igor Mammedov
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-04-24 13:39 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, David Gibson, Richard Henderson

On 24.04.2018 15:28, Igor Mammedov wrote:
> On Mon, 23 Apr 2018 14:44:34 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 23.04.2018 14:19, Igor Mammedov wrote:
>>> On Fri, 20 Apr 2018 14:34:56 +0200
>>> David Hildenbrand <david@redhat.com> wrote:
> considering v4 queued, I'm dropping mostly nor relevant points at this point.
> wrt, virtio I'll elaborate more in reply to Pankaj
> 
> [...]
> 
>>>> diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
>>>> index b860c9c582..b96efa3bf4 100644
>>>> --- a/hw/mem/memory-device.c
>>>> +++ b/hw/mem/memory-device.c
>>>> @@ -15,6 +15,8 @@
>>>>  #include "qapi/error.h"
>>>>  #include "hw/boards.h"
>>>>  #include "qemu/range.h"
>>>> +#include "hw/virtio/vhost.h"
>>>> +#include "sysemu/kvm.h"
>>>>  
>>>>  static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
>>>>  {
>>>> @@ -106,6 +108,166 @@ uint64_t get_plugged_memory_size(void)
>>>>      return size;
>>>>  }
>>>>  
>>>> +static int memory_device_used_region_size_internal(Object *obj, void *opaque)
>>>> +{
>>>> +    uint64_t *size = opaque;
>>>> +
>>>> +    if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
>>>> +        DeviceState *dev = DEVICE(obj);
>>>> +        MemoryDeviceState *md = MEMORY_DEVICE(obj);
>>>> +        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj);
>>>> +
>>>> +        if (dev->realized) {
>>>> +            *size += mdc->get_region_size(md, &error_abort);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    object_child_foreach(obj, memory_device_used_region_size_internal, opaque);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static uint64_t memory_device_used_region_size(void)
>>>> +{
>>>> +    uint64_t size = 0;
>>>> +
>>>> +    memory_device_used_region_size_internal(qdev_get_machine(), &size);
>>>> +
>>>> +    return size;
>>>> +}
>>>> +
>>>> +uint64_t memory_device_get_free_addr(uint64_t *hint, uint64_t align,
>>>> +                                     uint64_t size, Error **errp)  
>>> I'd suggest to rework pc_dimm_memory_plug/pc_dimm_get_free_addr first,
>>> namely most of the stuff it does like checks and assigning default
>>> values should go to pre_plug (pre realize) handler and then only
>>> actual mapping is left for plug (after realize) handler to deal with:
>>>   
>>
>> Can you elaborate what you mean by pre-plug? If this is about pre plug
>> handler of the (machine) hotplug handler, it might be problematic for
>> virtio devices.
> yes, something along these lines: c871bc70b
> 
> 

Yes, we can factor that out (at least) for pc-dimm later on easily.
Seems to be just about moving a couple of calls.
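The pre_plug/plug split being discussed can be sketched as a self-contained toy (none of the names below are real QEMU APIs; this is only an illustration of the idea): pre_plug performs only the checks and default-value assignment and may fail without any unwind, while plug performs the actual mapping and is expected not to fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins for machine and memory-device state. */
typedef struct {
    int free_slots;          /* e.g. free KVM/vhost memslots */
    uint64_t next_free_addr; /* next free address in the hotplug region */
    uint64_t mapped_addr;    /* where the last device got mapped */
    bool mapped;
} ToyMachine;

typedef struct {
    uint64_t size;
    uint64_t addr;           /* assigned during pre_plug */
} ToyDimm;

/* pre_plug: checks and address assignment, before realize.
 * On failure nothing has to be rolled back. */
static bool toy_pre_plug(ToyMachine *m, ToyDimm *d, const char **errp)
{
    if (m->free_slots == 0) {
        *errp = "hypervisor has no free memory slots left";
        return false;
    }
    d->addr = m->next_free_addr; /* address assignment moved here */
    return true;
}

/* plug: only the actual mapping, after realize; must not fail. */
static void toy_plug(ToyMachine *m, ToyDimm *d)
{
    m->free_slots--;
    m->next_free_addr = d->addr + d->size;
    m->mapped_addr = d->addr;
    m->mapped = true;
}
```

With this ordering, a failing pre_plug means realize was never called, so no unrealize-on-error cleanup is needed.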

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-24 13:31       ` Igor Mammedov
@ 2018-04-24 13:41         ` David Hildenbrand
  2018-04-24 14:44           ` Igor Mammedov
  2018-04-25  5:45         ` Pankaj Gupta
  1 sibling, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-04-24 13:41 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, David Gibson, Richard Henderson

On 24.04.2018 15:31, Igor Mammedov wrote:
> On Mon, 23 Apr 2018 14:52:37 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>>>   
>>>> +    /* we will need a new memory slot for kvm and vhost */
>>>> +    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
>>>> +        error_setg(errp, "hypervisor has no free memory slots left");
>>>> +        return;
>>>> +    }
>>>> +    if (!vhost_has_free_slot()) {
>>>> +        error_setg(errp, "a used vhost backend has no free memory slots left");
>>>> +        return;
>>>> +    }  
>>> move these checks to pre_plug time
>>>   
>>>> +
>>>> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);  
>>> missing vmstate registration?  
>>
>> Missed this one: To be called by the caller. Important because e.g. for
>> virtio-pmem we don't want this (I assume :) ).
> if pmem isn't on shared storage, then we'd probably want to migrate
> it as well, otherwise the target would experience data loss.
> Anyways, I'd just treat it as normal RAM in the migration case

Yes, if we realize that all MemoryDevices need this call, we can move it
to that place, too.

Wonder if we might want to make this configurable for virtio-pmem later
on (via a flag or sth like that).

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-23 15:32   ` Pankaj Gupta
  2018-04-23 16:35     ` David Hildenbrand
@ 2018-04-24 14:00     ` Igor Mammedov
  2018-04-24 15:42       ` David Hildenbrand
  1 sibling, 1 reply; 49+ messages in thread
From: Igor Mammedov @ 2018-04-24 14:00 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: David Hildenbrand, Eduardo Habkost, Michael S . Tsirkin,
	qemu-devel, Markus Armbruster, qemu-s390x, qemu-ppc,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson, David Gibson

On Mon, 23 Apr 2018 11:32:25 -0400 (EDT)
Pankaj Gupta <pagupta@redhat.com> wrote:

> Hi Igor,
> 
> >   
> > > Right now we can only map PCDIMM/NVDIMM into guest address space. In the
> > > future, we might want to do the same for virtio devices - e.g.
> > > virtio-pmem or virtio-mem. Especially, they should be able to live side
> > > by side to each other.
> > > 
> > > E.g. the virtio based memory device regions will not be exposed via ACPI
> > > and friends. They will be detected just like other virtio devices and
> > > indicate the applicable memory region. This makes it possible to also use
> > > them on architectures without memory device detection support (e.g. s390x).
> > > 
> > > Let's factor out the memory device code into a MemoryDevice interface.  
> > A couple of high level questions as relevant code is not here:
> > 
> >   1. what would hotplug/unplug call chain look like in case of virtio-pmem
> >   device
> >      (reason I'm asking is that pmem being PCI device would trigger
> >       PCI bus hotplug controller and then it somehow should piggyback
> >       to Machine provided hotplug handlers, so I wonder what kind of
> >       havoc it would cause on hotplug infrastructure)  
> 
> For the first phase we are using 'virtio-pmem' as cold-added devices. AFAIU
> 'VirtioDeviceClass' is the parent class and the 'hotplug/unplug' methods are implemented 
> for the virtio-pmem device. So, PCI bus hotplug/unplug should call the corresponding
> functions?
the problem is with trying to use a PCI bus based device with the bus-less
infrastructure used by (pc|nv)dimms.

The important point which we should not break here while trying to glue
the PCI hotplug handler with the machine hotplug handler is:

the container MachineState::device_memory is owned by the machine and
it's up to the machine plug handler (the container's owner) to map the device's mr
into its address space.
(i.e. neither the device's realize nor the PCI bus hotplug handler should do it)

Not sure about virtio-mem, but if it would use the device_memory container,
it should use the machine's plug handler.

I don't have off-hand ideas on how to glue it cleanly; maybe
MachineState::device_memory is just not the right thing to use
for such devices.

> >   2. why not use PCI bar mapping mechanism to do mapping since pmem is PCI
> >   device?  
> 
> I think even if we use it as a PCI BAR mapping, we still need a free guest physical 
> address to provide to the VM for mapping the memory range. For that there needs to 
> be coordination between the PCDIMM and virtio PCI devices? Also, if we use RAM from the QEMU 
> address space tied to the big region (system_memory), memory accounting gets easier and stays in a single place.
if PCI BAR mapping is used, then during accounting we would need to
additionally scan system_memory for virtio-pmem devices to account
for their memory (not a huge deal, as we even differentiate between
pc-dimm and nvdimm when building a list of memory devices, so
special-casing another device type won't hurt);
at least device management (the most complex part) will be governed
by the respective subsystem the device belongs to.
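Such an accounting pass is essentially a walk over the devices summing region sizes per known memory-device type. A toy version (none of these names are QEMU APIs, purely an illustration of the idea):

```c
#include <assert.h>
#include <stdint.h>

typedef enum { TOY_PC_DIMM, TOY_NVDIMM, TOY_VIRTIO_PMEM, TOY_OTHER } ToyDevType;

typedef struct {
    ToyDevType type;
    int realized;
    uint64_t region_size;
} ToyDev;

/* Sum the regions of all realized devices that provide guest memory,
 * special-casing each known type (mirroring how the real code already
 * differentiates pc-dimm from nvdimm when building the device list). */
static uint64_t toy_used_region_size(const ToyDev *devs, int n)
{
    uint64_t size = 0;
    for (int i = 0; i < n; i++) {
        if (!devs[i].realized) {
            continue; /* only realized devices occupy address space */
        }
        switch (devs[i].type) {
        case TOY_PC_DIMM:
        case TOY_NVDIMM:
        case TOY_VIRTIO_PMEM: /* the additional special case */
            size += devs[i].region_size;
            break;
        default:
            break; /* not a memory device */
        }
    }
    return size;
}
```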

> Honestly speaking, I don't think there will be much difference between the two approaches, unless
> I am missing something important?
> 
> > 
> >    
> > > v2 -> v3:
> > > - "pc-dimm: factor out MemoryDevice interface"  
> > > --> Lookup both classes when comparing (David Gibson)  
> > > 
> > > v1 -> v2:
> > > - Fix compile issues on ppc (still untested  )
> > > 
> > > 
> > > David Hildenbrand (3):
> > >   pc-dimm: factor out MemoryDevice interface
> > >   machine: make MemoryHotplugState accessible via the machine
> > >   pc-dimm: factor out address space logic into MemoryDevice code
> > > 
> > >  hw/i386/acpi-build.c                         |   3 +-
> > >  hw/i386/pc.c                                 |  24 ++-
> > >  hw/mem/Makefile.objs                         |   1 +
> > >  hw/mem/memory-device.c                       | 282
> > >  +++++++++++++++++++++++++
> > >  hw/mem/pc-dimm.c                             | 304
> > >  +++++++--------------------
> > >  hw/ppc/spapr.c                               |  24 ++-
> > >  hw/ppc/spapr_hcall.c                         |   1 +
> > >  include/hw/boards.h                          |  16 ++
> > >  include/hw/mem/memory-device.h               |  48 +++++
> > >  include/hw/mem/pc-dimm.h                     |  26 +--
> > >  numa.c                                       |   3 +-
> > >  qmp.c                                        |   4 +-
> > >  stubs/Makefile.objs                          |   2 +-
> > >  stubs/{qmp_pc_dimm.c => qmp_memory_device.c} |   4 +-
> > >  14 files changed, 465 insertions(+), 277 deletions(-)
> > >  create mode 100644 hw/mem/memory-device.c
> > >  create mode 100644 include/hw/mem/memory-device.h
> > >  rename stubs/{qmp_pc_dimm.c => qmp_memory_device.c} (61%)
> > >   
> > 
> > 
> >   

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-24 13:39         ` David Hildenbrand
@ 2018-04-24 14:38           ` Igor Mammedov
  0 siblings, 0 replies; 49+ messages in thread
From: Igor Mammedov @ 2018-04-24 14:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, David Gibson

On Tue, 24 Apr 2018 15:39:30 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 24.04.2018 15:28, Igor Mammedov wrote:
> > On Mon, 23 Apr 2018 14:44:34 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >> On 23.04.2018 14:19, Igor Mammedov wrote:  
> >>> On Fri, 20 Apr 2018 14:34:56 +0200
> >>> David Hildenbrand <david@redhat.com> wrote:  
> > considering v4 queued, I'm dropping the mostly not relevant points at this point.
> > wrt virtio, I'll elaborate more in reply to Pankaj
> > 
> > [...]
> >   
> >>>> diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
> >>>> index b860c9c582..b96efa3bf4 100644
> >>>> --- a/hw/mem/memory-device.c
> >>>> +++ b/hw/mem/memory-device.c
> >>>> @@ -15,6 +15,8 @@
> >>>>  #include "qapi/error.h"
> >>>>  #include "hw/boards.h"
> >>>>  #include "qemu/range.h"
> >>>> +#include "hw/virtio/vhost.h"
> >>>> +#include "sysemu/kvm.h"
> >>>>  
> >>>>  static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
> >>>>  {
> >>>> @@ -106,6 +108,166 @@ uint64_t get_plugged_memory_size(void)
> >>>>      return size;
> >>>>  }
> >>>>  
> >>>> +static int memory_device_used_region_size_internal(Object *obj, void *opaque)
> >>>> +{
> >>>> +    uint64_t *size = opaque;
> >>>> +
> >>>> +    if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
> >>>> +        DeviceState *dev = DEVICE(obj);
> >>>> +        MemoryDeviceState *md = MEMORY_DEVICE(obj);
> >>>> +        MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj);
> >>>> +
> >>>> +        if (dev->realized) {
> >>>> +            *size += mdc->get_region_size(md, &error_abort);
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    object_child_foreach(obj, memory_device_used_region_size_internal, opaque);
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +static uint64_t memory_device_used_region_size(void)
> >>>> +{
> >>>> +    uint64_t size = 0;
> >>>> +
> >>>> +    memory_device_used_region_size_internal(qdev_get_machine(), &size);
> >>>> +
> >>>> +    return size;
> >>>> +}
> >>>> +
> >>>> +uint64_t memory_device_get_free_addr(uint64_t *hint, uint64_t align,
> >>>> +                                     uint64_t size, Error **errp)    
> >>> I'd suggest to refactor pc_dimm_memory_plug/pc_dimm_get_free_addr first,
> >>> namely most of the stuff it does like checks and assigning default
> >>> values should go to pre_plug (pre realize) handler and then only
> >>> actual mapping is left for plug (after realize) handler to deal with:
> >>>     
> >>
> >> Can you elaborate what you mean by pre-plug? If this is about pre plug
> >> handler of the (machine) hotplug handler, it might be problematic for
> >> virtio devices.  
> > yes, something along these lines: c871bc70b
> > 
> >   
> 
> Yes, we can factor that out (at least) for pc-dimm later on easily.
> Seems to be just about moving a couple of calls.
yep, but there is a nice side effect:
there is no need to call devicefoo::unrealize() on failure since
devicefoo::realize() hasn't been called yet.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-24 13:41         ` David Hildenbrand
@ 2018-04-24 14:44           ` Igor Mammedov
  2018-04-24 15:23             ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Igor Mammedov @ 2018-04-24 14:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, David Gibson, Richard Henderson

On Tue, 24 Apr 2018 15:41:23 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 24.04.2018 15:31, Igor Mammedov wrote:
> > On Mon, 23 Apr 2018 14:52:37 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >>>     
> >>>> +    /* we will need a new memory slot for kvm and vhost */
> >>>> +    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
> >>>> +        error_setg(errp, "hypervisor has no free memory slots left");
> >>>> +        return;
> >>>> +    }
> >>>> +    if (!vhost_has_free_slot()) {
> >>>> +        error_setg(errp, "a used vhost backend has no free memory slots left");
> >>>> +        return;
> >>>> +    }    
> >>> move these checks to pre_plug time
> >>>     
> >>>> +
> >>>> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);    
> >>> missing vmstate registration?    
> >>
> >> Missed this one: To be called by the caller. Important because e.g. for
> >> virtio-pmem we don't want this (I assume :) ).  
> > if pmem isn't on shared storage, then we'd probably want to migrate
> > it as well, otherwise the target would experience data loss.
> > Anyways, I'd just treat it as normal RAM in the migration case  
> 
> Yes, if we realize that all MemoryDevices need this call, we can move it
> to that place, too.
> 
> Wonder if we might want to make this configurable for virtio-pmem later
> on (via a flag or sth like that).
I don't see any reason why we wouldn't like it to be migrated;
it's the same as nvdimm but with another qemu:guest ABI
and an async flush instead of the sync one we have with nvdimm.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-24 14:44           ` Igor Mammedov
@ 2018-04-24 15:23             ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-04-24 15:23 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, David Gibson, Richard Henderson

On 24.04.2018 16:44, Igor Mammedov wrote:
> On Tue, 24 Apr 2018 15:41:23 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 24.04.2018 15:31, Igor Mammedov wrote:
>>> On Mon, 23 Apr 2018 14:52:37 +0200
>>> David Hildenbrand <david@redhat.com> wrote:
>>>   
>>>>>     
>>>>>> +    /* we will need a new memory slot for kvm and vhost */
>>>>>> +    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
>>>>>> +        error_setg(errp, "hypervisor has no free memory slots left");
>>>>>> +        return;
>>>>>> +    }
>>>>>> +    if (!vhost_has_free_slot()) {
>>>>>> +        error_setg(errp, "a used vhost backend has no free memory slots left");
>>>>>> +        return;
>>>>>> +    }    
>>>>> move these checks to pre_plug time
>>>>>     
>>>>>> +
>>>>>> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);    
>>>>> missing vmstate registration?    
>>>>
>>>> Missed this one: To be called by the caller. Important because e.g. for
>>>> virtio-pmem we don't want this (I assume :) ).  
>>> if pmem isn't on shared storage, then we'd probably want to migrate
>>> it as well, otherwise the target would experience data loss.
>>> Anyways, I'd just treat it as normal RAM in the migration case  
>>
>> Yes, if we realize that all MemoryDevices need this call, we can move it
>> to that place, too.
>>
>> Wonder if we might want to make this configurable for virtio-pmem later
>> on (via a flag or sth like that).
> I don't see any reason why we wouldn't like it to be migrated,
> it's the same as nvdimm but with another qemu:guest ABI
> and async flush instead of sync one we have with nvdimm.
> 

Didn't you just mention "shared storage"? :)

Anyhow, I leave such stuff to Pankaj to figure out. I remember him
working on some page cache details. Once clarified, this is easily
refactored later on.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-24 14:00     ` Igor Mammedov
@ 2018-04-24 15:42       ` David Hildenbrand
  2018-04-25 12:15         ` Igor Mammedov
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-04-24 15:42 UTC (permalink / raw)
  To: Igor Mammedov, Pankaj Gupta
  Cc: Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, David Gibson

On 24.04.2018 16:00, Igor Mammedov wrote:
> On Mon, 23 Apr 2018 11:32:25 -0400 (EDT)
> Pankaj Gupta <pagupta@redhat.com> wrote:
> 
>> Hi Igor,
>>
>>>   
>>>> Right now we can only map PCDIMM/NVDIMM into guest address space. In the
>>>> future, we might want to do the same for virtio devices - e.g.
>>>> virtio-pmem or virtio-mem. Especially, they should be able to live side
>>>> by side to each other.
>>>>
> >>>> E.g. the virtio based memory device regions will not be exposed via ACPI
>>>> and friends. They will be detected just like other virtio devices and
>>>> indicate the applicable memory region. This makes it possible to also use
>>>> them on architectures without memory device detection support (e.g. s390x).
>>>>
>>>> Let's factor out the memory device code into a MemoryDevice interface.  
>>> A couple of high level questions as relevant code is not here:
>>>
>>>   1. what would hotplug/unplug call chain look like in case of virtio-pmem
>>>   device
>>>      (reason I'm asking is that pmem being PCI device would trigger
>>>       PCI bus hotplug controller and then it somehow should piggyback
>>>       to Machine provided hotplug handlers, so I wonder what kind of
>>>       havoc it would cause on hotplug infrastructure)  
>>
>> For first phase we are using 'virtio-pmem' as cold added devices. AFAIU
>> 'VirtioDeviceClass' being parent class and 'hotplug/unplug' methods implemented 
>> for virtio-pmem device. So, pci bus hotplug/unplug should call the corresponding
>> functions?
> the problem is with trying to use PCI bus based device with bus-less
> infrastructure used by (pc|nv)dimms.

I can understand your reasoning, but for me these are some QEMU internal details
that should not stop the virtio-(p)mem train from rolling.

In my world, device hotplug is composed of the following steps

1. Resource allocation
2. Attaching the device to a bus (making it accessible by the guest)
3. Notifying the guest

I would e.g. also call ACPI sort of a bus structure. Now, the machine hotplug
handler currently does parts of 1. and then hands off to ACPI to do 2. and 3.

virtio-mem and virtio-pmem do 1. partially in the realize function and then
let 2. and 3. be handled by the proxy device specific hotplug handlers.

Mean people might say that the machine should not call the ACPI code but there
should be a ACPI hotplug handler. So we would end up with the same result.

But anyhow, the resource allocation (getting an address and getting plugged) will
be done in the first step out of the virtio-(p)mem realize function:

static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
{
   ...
   /* try to get a mapping in guest address space */
    vm->phys_addr = memory_device_get_free_addr(MACHINE(qdev_get_machine()))...
    if (local_err) {
        goto out;
    }
    ...

    /* register the memory region */
    memory_device_plug_region(MACHINE(qdev_get_machine()), vm->mr,
                              vm->phys_addr);
   ...
}

So this happens before any hotplug handler is called. Everything works
just fine. What you don't like about this is the qdev_get_machine(). I
also don't like it but in the short term I don't see any problem with
it. It is resource allocation and not a "device plug" in the typical form.

> 
> The important point which we should not break here while trying to glue
> PCI hotplug handler with machine hotplug handler is:

I could later on imagine something like a 2 step approach.

1. resource allocation handler by a machine for MemoryDevices
- assigns address, registers memory region
2. hotplug handler (ACPI, PCI, CCW ...)
- assigns bus specific stuff, attaches device, notifies guest

Importantly the device is not visible to the guest until 2.
Of course, we could also take care of pre-plug things as you mentioned.
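This two-step flow can be mocked up as follows (purely illustrative names, not QEMU APIs): step 1 assigns the address and registers the region under machine control, but the device only becomes guest-visible in step 2, the bus-specific hotplug.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t next_free; /* next free address in the machine's hotplug region */
} ToyMachine;

typedef struct {
    uint64_t addr;
    uint64_t size;
    bool region_registered; /* step 1: machine mapped the region */
    bool guest_visible;     /* step 2: bus attached + guest notified */
} ToyMemDev;

/* Step 1: machine-owned resource allocation for any MemoryDevice. */
static void toy_machine_alloc(ToyMachine *m, ToyMemDev *d)
{
    d->addr = m->next_free;
    m->next_free += d->size;
    d->region_registered = true;
    /* Deliberately not guest visible yet. */
}

/* Step 2: bus-specific hotplug handler (ACPI, PCI, CCW, ...). */
static void toy_bus_plug(ToyMemDev *d)
{
    assert(d->region_registered); /* step 1 must have happened first */
    d->guest_visible = true;      /* attach device, notify guest */
}
```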

> 
> the container MachineState::device_memory is owned by the machine and
> it's up to the machine plug handler (the container's owner) to map the device's mr
> into its address space.
> (i.e. neither the device's realize nor the PCI bus hotplug handler should do it)  

I agree, but I think these are internal details.

> 
> Not sure about virtio-mem, but if it would use the device_memory container,
> it should use the machine's plug handler.
> 
> I don't have off-hand ideas on how to glue it cleanly; maybe
> MachineState::device_memory is just not the right thing to use
> for such devices.

I strongly disagree. From the user point of view it should not matter what
was added/plugged. There is just one guest physical memory and maxmem is
defined for one QEMU instance. Exposing such details to the user should
definitely be avoided.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-24 13:31       ` Igor Mammedov
  2018-04-24 13:41         ` David Hildenbrand
@ 2018-04-25  5:45         ` Pankaj Gupta
  2018-04-25 13:23           ` Igor Mammedov
  1 sibling, 1 reply; 49+ messages in thread
From: Pankaj Gupta @ 2018-04-25  5:45 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: David Hildenbrand, Eduardo Habkost, Michael S . Tsirkin,
	qemu-devel, Markus Armbruster, qemu-s390x, qemu-ppc,
	Marcel Apfelbaum, Paolo Bonzini, David Gibson, Richard Henderson


> 
> > >   
> > >> +    /* we will need a new memory slot for kvm and vhost */
> > >> +    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
> > >> +        error_setg(errp, "hypervisor has no free memory slots left");
> > >> +        return;
> > >> +    }
> > >> +    if (!vhost_has_free_slot()) {
> > >> +        error_setg(errp, "a used vhost backend has no free memory slots
> > >> left");
> > >> +        return;
> > >> +    }
> > > move these checks to pre_plug time
> > >   
> > >> +
> > >> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
> > > missing vmstate registration?
> > 
> > Missed this one: To be called by the caller. Important because e.g. for
> > virtio-pmem we don't want this (I assume :) ).
> if pmem isn't on shared storage, then we'd probably want to migrate
> it as well, otherwise the target would experience data loss.
> Anyways, I'd just treat it as normal RAM in the migration case

The main difference between RAM and pmem is that it acts like a combination of RAM and disk.
That said, in the normal use case the size would be in the 100 GBs to few TBs range. 
I am not sure we really want to migrate it for the non-shared storage use case.

One reason why nvdimm added vmstate info could be: there would still be transient
writes in memory with fake DAX and there is no way (till now) to flush the guest 
writes. But with virtio-pmem we can flush such writes before migration, and at the
destination host with a shared disk we will automatically have updated data.


Thanks,
Pankaj  

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-24 15:42       ` David Hildenbrand
@ 2018-04-25 12:15         ` Igor Mammedov
  2018-04-25 12:46           ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Igor Mammedov @ 2018-04-25 12:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, David Gibson

On Tue, 24 Apr 2018 17:42:44 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 24.04.2018 16:00, Igor Mammedov wrote:
> > On Mon, 23 Apr 2018 11:32:25 -0400 (EDT)
> > Pankaj Gupta <pagupta@redhat.com> wrote:
> >   
> >> Hi Igor,
> >>  
> >>>     
> >>>> Right now we can only map PCDIMM/NVDIMM into guest address space. In the
> >>>> future, we might want to do the same for virtio devices - e.g.
> >>>> virtio-pmem or virtio-mem. Especially, they should be able to live side
> >>>> by side to each other.
> >>>>
> >>>> E.g. the virtio based memory device regions will not be exposed via ACPI
> >>>> and friends. They will be detected just like other virtio devices and
> >>>> indicate the applicable memory region. This makes it possible to also use
> >>>> them on architectures without memory device detection support (e.g. s390x).
> >>>>
> >>>> Let's factor out the memory device code into a MemoryDevice interface.    
> >>> A couple of high level questions as relevant code is not here:
> >>>
> >>>   1. what would hotplug/unplug call chain look like in case of virtio-pmem
> >>>   device
> >>>      (reason I'm asking is that pmem being PCI device would trigger
> >>>       PCI bus hotplug controller and then it somehow should piggyback
> >>>       to Machine provided hotplug handlers, so I wonder what kind of
> >>>       havoc it would cause on hotplug infrastructure)    
> >>
> >> For first phase we are using 'virtio-pmem' as cold added devices. AFAIU
> >> 'VirtioDeviceClass' being parent class and 'hotplug/unplug' methods implemented 
> >> for virtio-pmem device. So, pci bus hotplug/unplug should call the corresponding
> >> functions?  
> > the problem is with trying to use PCI bus based device with bus-less
> > infrastructure used by (pc|nv)dimms.  
> 
> I can understand your reasoning, but for me these are some QEMU internal details
> that should not stop the virtio-(p)mem train from rolling.
If it's quickly hacked-up prototypes to play with then it's fine,
as long as they are not being merged into QEMU.
If one plans to merge them, then the code should be adapted to
whatever the QEMU internal requirements are.

> In my world, device hotplug is composed of the following steps
> 
> 1. Resource allocation
> 2. Attaching the device to a bus (making it accessible by the guest)
> 3. Notifying the guest
> 
> I would e.g. also call ACPI sort of a bus structure. Now, the machine hotplug
> handler currently does parts of 1. and then hands off to ACPI to do 2. and 3.
it's not a bus, it's a concrete device implementing the GPE logic;
on x86 it does the job of #3 in case of hotplug.

> virtio-mem and virtio-pmem do 1. partially in the realize function and then
> let 2. and 3. be handled by the proxy device specific hotplug handlers.
> 
> Mean people might say that the machine should not call the ACPI code but there
> should be a ACPI hotplug handler. So we would end up with the same result.
it should be fine for a parent to manage its children but not the other way around


> But anyhow, the resource allocation (getting an address and getting plugged) will
> be done in the first step out of the virtio-(p)mem realize function:
> 
> static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
> {
>    ...
>    /* try to get a mapping in guest address space */
>     vm->phys_addr = memory_device_get_free_addr(MACHINE(qdev_get_machine()))...
this should be a property, and if it's not set then realize should error out

>     if (local_err) {
>         goto out;
>     }
>     ...
> 
>     /* register the memory region */
>     memory_device_plug_region(MACHINE(qdev_get_machine()), vm->mr,
>                               vm->phys_addr);
>    ...
> }
> 
> So this happens before any hotplug handler is called. Everything works
> just fine. What you don't like about this is the qdev_get_machine(). I
> also don't like it but in the short term I don't see any problem with
> it. It is resource allocation and not a "device plug" in the typical form.

It's not qdev_get_machine() that's the issue, it's the layering violation,
where a child device is allocating and mapping resources of one of its parents.

that's been an issue and a show-stopper for patches in the past,
and that's probably not going to change in this case either.

 
> > The important point which we should not break here while trying to glue
> > PCI hotplug handler with machine hotplug handler is:  
> 
> I could later on imagine something like a 2 step approach.
> 
> 1. resource allocation handler by a machine for MemoryDevices
> - assigns address, registers memory region
> 2. hotplug handler (ACPI, PCI, CCW ...)
> - assigns bus specific stuff, attaches device, notifies guest
>
> Importantly the device is not visible to the guest until 2.
So far it's about how QEMU models and manages the wiring process;
that's why pre_plug/plug handlers were introduced, to allow
the resource owner to attach devices that are plugged into it.

i.e. PCI devices are managed by the PCI subsystem and DIMM
devices are managed by the board, where they are mapped into
a reserved address space by the board code that owns it.

Allowing a random device to manage board resources directly
isn't really acceptable (even as a temporary solution).

In the case of virtio-pmem it might be much cleaner to use the
mapping mechanism provided by the PCI subsystem than trying
to bridge bus and bus-less device wiring, as from the device
modeling point of view (aside from providing RAM to the guest)
they are two quite different devices.

i.e. if you think the new device is RAM, which is governed by
the -m option, then model it as a bus-less device like a DIMM and
plug it directly into the board; if it's plugged into a bus,
it's that bus owner's responsibility to allocate/manage
address space or bridge it to the parent device.

(btw: virtio-pmem looks sort of like ivshmem, maybe they
can share some code on qemu side)

> Of course, we could also take care of pre-plug things as you mentioned.
> 
> > 
> > container MachineState::device_memory is owned by machine and
> > it's up to machine plug handler (container's owner) to map device's mr
> > into its address space.
> > (i.e. nor device's realize nor PCI bus hotplug handler should do it)  
> 
> I agree, but I think these are internal details.
these are internal details that we choose not to violate in QEMU,
and we have been working in that direction, getting rid of places
that do it wrongly.

> > Not sure about virtio-mem but if it would use device_memory container,
> > it should use machine's plug handler.
> > 
> > I don't have off-hand ideas on how to glue it cleanly; maybe
> > MachineState::device_memory is just not the right thing to use
> > for such devices.  
> 
> I strongly disagree. From the user point of view it should not matter what
> was added/plugged. There is just one guest physical memory and maxmem is
> defined for one QEMU instance. Exposing such details to the user should
> definitely be avoided.
the qemu user already has to be exposed to details as he adds
 -device virtio-pmem,....
to the CLI; maxmem accounting is a separate matter and probably
shouldn't be mixed with the device model and how it's mapped into
the guest's address space.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-25 12:15         ` Igor Mammedov
@ 2018-04-25 12:46           ` David Hildenbrand
  2018-04-25 13:15             ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-04-25 12:46 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, David Gibson


>>>> For first phase we are using 'virtio-pmem' as cold added devices. AFAIU
>>>> 'VirtioDeviceClass' being parent class and 'hotplug/unplug' methods implemented 
>>>> for virtio-pmem device. So, pci bus hotplug/unplug should call the corresponding
>>>> functions?  
>>> the problem is with trying to use PCI bus based device with bus-less
>>> infrastructure used by (pc|nv)dimms.  
>>
>> I can understand your reasoning, but for me these are some QEMU internal details
>> that should not stop the virtio-(p)mem train from rolling.
> If it's quickly hacked-up prototypes to play with, then it's fine,
> as long as they are not being merged into QEMU.
> If one plans to merge it, then the code should be adapted to
> whatever the QEMU internal requirements are.

At one point we will have to decide if we want to develop good software
(which tolerates layer violations if there is a good excuse) or build
the perfect internal architecture. And we all know the latter is not the
case right now and never will be.

So yes, I will be looking into ways to make this work "nicer"
internally, but quite frankly, it has very little priority.

> 
>> In my world, device hotplug is composed of the following steps
>>
>> 1. Resource allocation
>> 2. Attaching the device to a bus (making it accessible by the guest)
>> 3. Notifying the guest
>>
>> I would e.g. also call ACPI sort of a bus structure. Now, the machine hotplug
>> handler currently does parts of 1. and then hands off to ACPI to do 2 and 3.
> it's not a bus, it's a concrete device implementing GPE logic;
> on x86 it does the job of step 3 (notifying) in case of hotplug.
> 
>> virtio-mem and virtio-pmem do 1. partially in the realize function and then
>> let 2. and 3. be handled by the proxy device specific hotplug handlers.
>>
>> Mean people might say that the machine should not call the ACPI code but there
>> should be a ACPI hotplug handler. So we would end up with the same result.
> it should be fine for a parent to manage its children, but not the other way around

A virtio-bus (e.g. CCW) also "belongs" to the machine. But we won't
start passing every device down from the machine to the concrete
implementation.

(but I get your point)

> 
> 
>> But anyhow, the resource allocation (getting an address and getting plugged) will
>> be done in the first step out of the virtio-(p)mem realize function:
>>
>> static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>> {
>>    ...
>>    /* try to get a mapping in guest address space */
>>     vm->phys_addr = memory_device_get_free_addr(MACHINE(qdev_get_machine()))...
> this should be a property, and if it's not set then realize should error out

It is a property, but if it is 0 we do auto-detection right now (like a DIMM).

> 
>>     if (local_err) {
>>         goto out;
>>     }
>>     ...
>>
>>     /* register the memory region */
>>     memory_device_plug_region(MACHINE(qdev_get_machine()), vm->mr,
>>                               vm->phys_addr);
>>    ...
>> }
>>
>> So this happens before any hotplug handler is called. Everything works
>> just fine. What you don't like about this is the qdev_get_machine(). I
>> also don't like it but in the short term I don't see any problem with
>> it. It is resource allocation and not a "device plug" in the typical form.
> 
> It's not qdev_get_machine() that's the issue, it's the layer violation,
> where a child device is allocating and mapping resources of one of its parents.

Quite simple: introduce a function at the machine where the child can
"request" to get an address and "request" to plug/unplug a region.

Or what would be wrong with that?

> 
> that's been an issue and show stopper for patches in the past,
> and that's probably not going to change in this case either.
> 

I can see that, but again, for me these are internal details.

>  
>>> The important point which we should not break here while trying to glue
>>> the PCI hotplug handler with the machine hotplug handler is:  
>>
>> I could later on imagine something like a 2 step approach.
>>
>> 1. resource allocation handler by a machine for MemoryDevices
>> - assigns address, registers memory region
>> 2. hotplug handler (ACPI, PCI, CCW ...)
>> - assigns bus specific stuff, attaches device, notifies guest
>>
>> Importantly the device is not visible to the guest until 2.
> So far it's about how QEMU models and manages the wiring process;
> that's why pre_plug/plug handlers were introduced, to allow
> the resource owner to attach devices that are plugged into it.
> 
> i.e. PCI devices are managed by the PCI subsystem, and DIMM
> devices are managed by the board, where they are mapped into
> reserved address space by the board code that owns it.
> 
> Allowing a random device to manage board resources directly
> isn't really acceptable (even as a temporary solution).

I agree to "random" devices. This should not be the design principle.

> 
> In the case of virtio-pmem it might be much cleaner to use
> the mapping mechanism provided by the PCI subsystem than trying
> to bridge bus and bus-less device wiring, as from a device
> modeling point of view (aside from providing RAM to the guest)
> they are 2 quite different devices.

And again: Please don't forget virtio-ccw. We _don't_ want to glue
virtio device specifics to the underlying proxy here.

> 
> i.e. if you think the new device is RAM, which is governed by
> the -m option, then model it as a bus-less device like a DIMM and
> plug it directly into the board; if it's plugged into a bus,
> it's that bus owner's responsibility to allocate/manage
> address space or bridge it to the parent device.
> 
> (btw: virtio-pmem looks sort of like ivshmem, maybe they
> can share some code on qemu side)
> 
>> Of course, we could also take care of pre-plug things as you mentioned.
>>
>>>
>>> the container MachineState::device_memory is owned by the machine and
>>> it's up to the machine plug handler (the container's owner) to map the device's mr
>>> into its address space.
>>> (i.e. neither the device's realize nor the PCI bus hotplug handler should do it)  
>>
>> I agree, but I think these are internal details.
> it's internal details that we choose not to violate in QEMU,
> and we're working towards that direction, getting rid of places
> that do it wrongly.

Yes, and I'll try my best to avoid it.

> 
>>> Not sure about virtio-mem but if it would use device_memory container,
>>> it should use machine's plug handler.
>>>
>>> I don't have off-hand ideas how to glue it cleanly; maybe
>>> MachineState::device_memory is just not the right thing to use
>>> for such devices.  
>>
>> I strongly disagree. From the user point of view it should not matter what
>> was added/plugged. There is just one guest physical memory and maxmem is
>> defined for one QEMU instance. Exposing such details to the user should
>> definitely be avoided.
> The QEMU user already has to be exposed to such details, since he adds
>  -device virtio-pmem,....
> to the CLI; maxmem accounting is a separate matter and probably
> shouldn't be mixed with the device model and how it's mapped into
> the guest's address space.

I can't follow. Please step back and have a look at how it works on the
qemu command line:

1. You specify a maxmem option
2. You plug DIMM/NVDIMM/virtio-mem/virtio-pmem

Some machines (e.g. s390x) use maxmem to setup the maximum possible
guest address space in KVM.

Just because DIMM/NVDIMM was the first user does not mean that it is the
only valid user. That is also the reason why it is named
"query-memory-devices" and not "query-dimm-devices". The abstraction is
there for a reason.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice
  2018-04-25 12:46           ` David Hildenbrand
@ 2018-04-25 13:15             ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-04-25 13:15 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, qemu-ppc, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, David Gibson


>>> So this happens before any hotplug handler is called. Everything works
>>> just fine. What you don't like about this is the qdev_get_machine(). I
>>> also don't like it but in the short term I don't see any problem with
>>> it. It is resource allocation and not a "device plug" in the typical form.
>>
>> It's not qdev_get_machine() that's the issue, it's the layer violation,
>> where a child device is allocating and mapping resources of one of its parents.
> 
> Quite simple: introduce a function at the machine where the child can
> "request" to get an address and "request" to plug/unplug a region.
> 
> Or what would be wrong with that?
> 

To extend on that idea (looking at how virtio_bus_device_plugged()
gets called from the virtio realize functions):

Make the machine implement some new interface "MemoryDeviceManager".
From virtio-mem/pmem realize, call a function like memory_device_plugged().

Look up the machine (via qdev_get_machine(), or walk all the way up until
we find an object that implements MemoryDeviceManager) and call
pre_plugged/plugged/unplugged.

We could hide that in a new device class TYPE_VIRTIO_MEMORY_DEVICE.

Opinions? Complete nonsense? :) Alternatives?

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-25  5:45         ` Pankaj Gupta
@ 2018-04-25 13:23           ` Igor Mammedov
  2018-04-25 13:56             ` Pankaj Gupta
  0 siblings, 1 reply; 49+ messages in thread
From: Igor Mammedov @ 2018-04-25 13:23 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: David Hildenbrand, Eduardo Habkost, Michael S . Tsirkin,
	qemu-devel, Markus Armbruster, qemu-s390x, qemu-ppc,
	Marcel Apfelbaum, Paolo Bonzini, David Gibson, Richard Henderson

On Wed, 25 Apr 2018 01:45:12 -0400 (EDT)
Pankaj Gupta <pagupta@redhat.com> wrote:

> >   
> > > >     
> > > >> +    /* we will need a new memory slot for kvm and vhost */
> > > >> +    if (kvm_enabled() && !kvm_has_free_slot(machine)) {
> > > >> +        error_setg(errp, "hypervisor has no free memory slots left");
> > > >> +        return;
> > > >> +    }
> > > >> +    if (!vhost_has_free_slot()) {
> > > >> +        error_setg(errp, "a used vhost backend has no free memory slots
> > > >> left");
> > > >> +        return;
> > > >> +    }  
> > > > move these checks to pre_plug time
> > > >     
> > > >> +
> > > >> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);  
> > > > missing vmstate registration?  
> > > 
> > > Missed this one: To be called by the caller. Important because e.g. for
> > > virtio-pmem we don't want this (I assume :) ).  
> > if pmem isn't on shared storage, then we'd probably want to migrate
> > it as well, otherwise the target would experience data loss.
> > Anyway, I'd just treat it as normal RAM in the migration case  
> 
The main difference between RAM and pmem is that pmem acts like a combination
of RAM and disk. That said, in the normal use-case the size would be in the
100 GB to few TB range. I am not sure we really want to migrate it for the
non-shared storage use-case.
With non-shared storage you'd have to migrate it to the target host, but
with shared storage it might be possible to flush it and use it directly
from the target host. That probably won't work right out of the box and
would need some sort of synchronization between src/dst hosts.

The same applies to nv/pc-dimm, as the backend file could easily be
on pmem storage as well.

Maybe for now we should migrate everything so it would work in the case of
a non-shared NVDIMM on the host, and then later add a migration-less
capability to all of them.

One reason why nvdimm added vmstate info could be: there would still be
transient writes in memory with fake DAX, and there is no way (until now) to
flush the guest writes. But with virtio-pmem we can flush such writes before
migration, and on the destination host with a shared disk we will
automatically have updated data.
nvdimm has the concept of a flush address hint (maybe not implemented in
QEMU yet), so it can flush. The only reason I'm buying into the virtio-mem
idea is that it would allow async flush queues, which would reduce the
number of vmexits.

> 
> 
> Thanks,
> Pankaj  
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-25 13:23           ` Igor Mammedov
@ 2018-04-25 13:56             ` Pankaj Gupta
  2018-04-25 15:26               ` Igor Mammedov
  0 siblings, 1 reply; 49+ messages in thread
From: Pankaj Gupta @ 2018-04-25 13:56 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: David Hildenbrand, Eduardo Habkost, Michael S . Tsirkin,
	qemu-devel, Markus Armbruster, qemu-s390x, qemu-ppc,
	Marcel Apfelbaum, Paolo Bonzini, David Gibson, Richard Henderson


> > > > >     
> > > > >> +
> > > > >> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
> > > > > missing vmstate registration?
> > > > 
> > > > Missed this one: To be called by the caller. Important because e.g. for
> > > > virtio-pmem we don't want this (I assume :) ).
> > > if pmem isn't on shared storage, then We'd probably want to migrate
> > > it as well, otherwise target would experience data loss.
> > > Anyways, I'd just treat it as normal RAM in migration case
> > 
> > Main difference between RAM and pmem it acts like combination of RAM and
> > disk.
> > Saying this, in normal use-case size would be 100 GB's - few TB's range.
> > I am not sure we really want to migrate it for non-shared storage use-case.
> with non shared storage you'd have to migrate it target host but
> with shared storage it might be possible to flush it and use directly
> from target host. That probably won't work right out of box and would
> need some sort of synchronization between src/dst hosts.

Shared storage should work out of the box. The only thing is that data on the
destination host will be cache cold, and existing pages in the cache should be
invalidated first. But if we migrate the entire fake DAX RAM state, it will
populate the destination host's page cache, including pages which were idle on
the source host. This would unnecessarily create entropy on the destination
host.

To me this feature doesn't make much sense. The problem which we are solving
is: efficiently use guest RAM.

> 
> The same applies to nv/pc-dimm as well, as backend file easily could be
> on pmem storage as well.

Are you saying the backing file is on actual NVDIMM hardware? Then we don't
need emulation at all.

> 
> Maybe for now we should migrate everything so it would work in case of
> non shared NVDIMM on host. And then later add migration-less capability
> to all of them.

not sure I agree.

> 
> > One reason why nvdimm added vmstate info could be: still there would be
> > transient
> > writes in memory with fake DAX and there is no way(till now) to flush the
> > guest
> > writes. But with virtio-pmem we can flush such writes before migration and
> > automatically
> > at destination host with shared disk we will have updated data.
> nvdimm has concept of flush address hint (may be not implemented in qemu yet)
> but it can flush. The only reason I'm buying into virtio-mem idea
> is that would allow async flush queues which would reduce number
> of vmexits.

That's correct.

Thanks,
Pankaj

 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-25 13:56             ` Pankaj Gupta
@ 2018-04-25 15:26               ` Igor Mammedov
  2018-04-26  7:37                 ` Pankaj Gupta
  0 siblings, 1 reply; 49+ messages in thread
From: Igor Mammedov @ 2018-04-25 15:26 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: David Hildenbrand, Eduardo Habkost, Michael S . Tsirkin,
	qemu-devel, Markus Armbruster, qemu-s390x, qemu-ppc,
	Marcel Apfelbaum, Paolo Bonzini, David Gibson, Richard Henderson

On Wed, 25 Apr 2018 09:56:49 -0400 (EDT)
Pankaj Gupta <pagupta@redhat.com> wrote:

> > > > > >       
> > > > > >> +
> > > > > >> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);  
> > > > > > missing vmstate registration?  
> > > > > 
> > > > > Missed this one: To be called by the caller. Important because e.g. for
> > > > > virtio-pmem we don't want this (I assume :) ).  
> > > > if pmem isn't on shared storage, then We'd probably want to migrate
> > > > it as well, otherwise target would experience data loss.
> > > > Anyways, I'd just treat it as normal RAM in migration case  
> > > 
> > > Main difference between RAM and pmem it acts like combination of RAM and
> > > disk.
> > > Saying this, in normal use-case size would be 100 GB's - few TB's range.
> > > I am not sure we really want to migrate it for non-shared storage use-case.  
> > with non shared storage you'd have to migrate it target host but
> > with shared storage it might be possible to flush it and use directly
> > from target host. That probably won't work right out of box and would
> > need some sort of synchronization between src/dst hosts.  
> 
> Shared storage should work out of the box.
> Only thing is data in destination
> host will be cache cold and existing pages in cache should be invalidated first. 
> But if we migrate entire fake DAX RAMstate it will populate destination host page 
> cache including pages while were idle in source host. This would unnecessarily 
> create entropy in destination host. 
> 
> To me this feature don't make much sense. Problem which we are solving is:
> Efficiently use guest RAM.
What would the live migration handover flow look like in the case of a
guest constantly dirtying memory provided by virtio-pmem and sometimes
issuing async flush requests along with it?


> > The same applies to nv/pc-dimm as well, as backend file easily could be
> > on pmem storage as well.  
> 
> Are you saying backing file is in actual actual nvdimm hardware? we don't need 
> emulation at all.
It depends on whether the file is on a DAX filesystem, but your argument
about migrating a huge 100 GB to TB range applies in this case as well.

> 
> > 
> > Maybe for now we should migrate everything so it would work in case of
> > non shared NVDIMM on host. And then later add migration-less capability
> > to all of them.  
> 
> not sure I agree.
So would you inhibit migration in the case of non-shared backend storage,
to avoid losing data since it isn't migrated?


> > > One reason why nvdimm added vmstate info could be: still there would be
> > > transient
> > > writes in memory with fake DAX and there is no way(till now) to flush the
> > > guest
> > > writes. But with virtio-pmem we can flush such writes before migration and
> > > automatically
> > > at destination host with shared disk we will have updated data.  
> > nvdimm has concept of flush address hint (may be not implemented in qemu yet)
> > but it can flush. The only reason I'm buying into virtio-mem idea
> > is that would allow async flush queues which would reduce number
> > of vmexits.  
> 
> That's correct.
> 
> Thanks,
> Pankaj
> 
>  

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code
  2018-04-25 15:26               ` Igor Mammedov
@ 2018-04-26  7:37                 ` Pankaj Gupta
  2018-05-04  9:13                   ` [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable??? Igor Mammedov
  0 siblings, 1 reply; 49+ messages in thread
From: Pankaj Gupta @ 2018-04-26  7:37 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	qemu-devel, Markus Armbruster, qemu-s390x, qemu-ppc,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson, David Gibson,
	Dr. David Alan Gilbert


> > > > > > >> +
> > > > > > >> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base,
> > > > > > >> mr);
> > > > > > > missing vmstate registration?
> > > > > > 
> > > > > > Missed this one: To be called by the caller. Important because e.g.
> > > > > > for
> > > > > > virtio-pmem we don't want this (I assume :) ).
> > > > > if pmem isn't on shared storage, then We'd probably want to migrate
> > > > > it as well, otherwise target would experience data loss.
> > > > > Anyways, I'd just treat it as normal RAM in migration case
> > > > 
> > > > Main difference between RAM and pmem it acts like combination of RAM
> > > > and
> > > > disk.
> > > > Saying this, in normal use-case size would be 100 GB's - few TB's
> > > > range.
> > > > I am not sure we really want to migrate it for non-shared storage
> > > > use-case.
> > > with non shared storage you'd have to migrate it target host but
> > > with shared storage it might be possible to flush it and use directly
> > > from target host. That probably won't work right out of box and would
> > > need some sort of synchronization between src/dst hosts.
> > 
> > Shared storage should work out of the box.
> > Only thing is data in destination
> > host will be cache cold and existing pages in cache should be invalidated
> > first.
> > But if we migrate entire fake DAX RAMstate it will populate destination
> > host page
> > cache including pages while were idle in source host. This would
> > unnecessarily
> > create entropy in destination host.
> > 
> > To me this feature don't make much sense. Problem which we are solving is:
> > Efficiently use guest RAM.
> What would live migration handover flow look like in case of
> a guest constantly dirtying memory provided by virtio-pmem
> and sometimes issuing async flush requests along with it?

Dirtying the entire pmem (disk) at once is not a usual scenario. Some part of
the disk/pmem would get dirty, and we need to handle that. I just want to say
that moving the entire pmem (disk) is not an efficient solution, because we are
using this solution to manage guest memory efficiently. Otherwise it will be
like any block device copy with non-shared storage.
 
> 
> 
> > > The same applies to nv/pc-dimm as well, as backend file easily could be
> > > on pmem storage as well.
> > 
> > Are you saying backing file is in actual actual nvdimm hardware? we don't
> > need
> > emulation at all.
> depends on if file is on DAX filesystem, but your argument about
> migrating huge 100Gb- TB's range applies in this case as well.
> 
> > 
> > > 
> > > Maybe for now we should migrate everything so it would work in case of
> > > non shared NVDIMM on host. And then later add migration-less capability
> > > to all of them.
> > 
> > not sure I agree.
> So would you inhibit migration in case of non shared backend storage,
> to avoid losing data since it isn't migrated?

I am just thinking about what features we want to support with pmem, and live
migration with shared storage is the one which comes to my mind.

If live migration with non-shared storage is what we want to support (I don't
know yet), we can add this? Even with shared storage it would copy the entire
pmem state?

Thanks,
Pankaj
 
> 
> 
> > > > One reason why nvdimm added vmstate info could be: still there would be
> > > > transient
> > > > writes in memory with fake DAX and there is no way(till now) to flush
> > > > the
> > > > guest
> > > > writes. But with virtio-pmem we can flush such writes before migration
> > > > and
> > > > automatically
> > > > at destination host with shared disk we will have updated data.
> > > nvdimm has concept of flush address hint (may be not implemented in qemu
> > > yet)
> > > but it can flush. The only reason I'm buying into virtio-mem idea
> > > is that would allow async flush queues which would reduce number
> > > of vmexits.
> > 
> > That's correct.
> > 
> > Thanks,
> > Pankaj
> > 
> >  
> 
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable???
  2018-04-26  7:37                 ` Pankaj Gupta
@ 2018-05-04  9:13                   ` Igor Mammedov
  2018-05-04  9:30                     ` David Hildenbrand
  2018-05-04 12:26                     ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 49+ messages in thread
From: Igor Mammedov @ 2018-05-04  9:13 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: David Hildenbrand, qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert

On Thu, 26 Apr 2018 03:37:51 -0400 (EDT)
Pankaj Gupta <pagupta@redhat.com> wrote:

Trimming the CC list to the people that might be interested in the topic,
and renaming the thread to reflect it.

> > > > > > > >> +
> > > > > > > >> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base,
> > > > > > > >> mr);  
> > > > > > > > missing vmstate registration?  
> > > > > > > 
> > > > > > > Missed this one: To be called by the caller. Important because e.g.
> > > > > > > for
> > > > > > > virtio-pmem we don't want this (I assume :) ).  
> > > > > > if pmem isn't on shared storage, then We'd probably want to migrate
> > > > > > it as well, otherwise target would experience data loss.
> > > > > > Anyways, I'd just treat it as normal RAM in migration case  
> > > > > 
> > > > > Main difference between RAM and pmem it acts like combination of RAM
> > > > > and
> > > > > disk.
> > > > > Saying this, in normal use-case size would be 100 GB's - few TB's
> > > > > range.
> > > > > I am not sure we really want to migrate it for non-shared storage
> > > > > use-case.  
> > > > with non shared storage you'd have to migrate it target host but
> > > > with shared storage it might be possible to flush it and use directly
> > > > from target host. That probably won't work right out of box and would
> > > > need some sort of synchronization between src/dst hosts.  
> > > 
> > > Shared storage should work out of the box.
> > > Only thing is data in destination
> > > host will be cache cold and existing pages in cache should be invalidated
> > > first.
> > > But if we migrate entire fake DAX RAMstate it will populate destination
> > > host page
> > > cache including pages while were idle in source host. This would
> > > unnecessarily
> > > create entropy in destination host.
> > > 
> > > To me this feature don't make much sense. Problem which we are solving is:
> > > Efficiently use guest RAM.  
> > What would live migration handover flow look like in case of
> > a guest constantly dirtying memory provided by virtio-pmem
> > and sometimes issuing async flush requests along with it?  
> 
> Dirty entire pmem (disk) at once not a usual scenario. Some part of disk/pmem
> would get dirty and we need to handle that. I just want to say moving entire
> pmem (disk) is not efficient solution because we are using this solution to
> manage guest memory efficiently. Otherwise it will be like any block device copy
> with non-shared storage.   
I'm not sure we can use the block layer analogy here.

> > > > The same applies to nv/pc-dimm as well, as backend file easily could be
> > > > on pmem storage as well.  
> > > 
> > > Are you saying backing file is in actual actual nvdimm hardware? we don't
> > > need
> > > emulation at all.  
> > depends on if file is on DAX filesystem, but your argument about
> > migrating huge 100Gb- TB's range applies in this case as well.
> >   
> > >   
> > > > 
> > > > Maybe for now we should migrate everything so it would work in case of
> > > > non shared NVDIMM on host. And then later add migration-less capability
> > > > to all of them.  
> > > 
> > > not sure I agree.  
> > So would you inhibit migration in case of non shared backend storage,
> > to avoid losing data since it isn't migrated?  
> 
> I am just thinking what features we want to support with pmem. And live migration
> with shared storage is the one which comes to my mind.
> 
> If live migration with non-shared storage is what we want to support (I don't know
> yet) we can add this? Even with shared storage it would copy entire pmem state?
Perhaps we should register vmstate as for normal RAM and use something similar to
  http://lists.gnu.org/archive/html/qemu-devel/2018-04/msg00003.html
to skip shared memory on migration.
In that case we could use this for pc-dimms as well.

David,
 what's your take on it?

> Thanks,
> Pankaj
>  
> > 
> >   
> > > > > One reason why nvdimm added vmstate info could be: still there would be
> > > > > transient
> > > > > writes in memory with fake DAX and there is no way(till now) to flush
> > > > > the
> > > > > guest
> > > > > writes. But with virtio-pmem we can flush such writes before migration
> > > > > and
> > > > > automatically
> > > > > at destination host with shared disk we will have updated data.  
> > > > nvdimm has concept of flush address hint (may be not implemented in qemu
> > > > yet)
> > > > but it can flush. The only reason I'm buying into virtio-mem idea
> > > > is that would allow async flush queues which would reduce number
> > > > of vmexits.  
> > > 
> > > That's correct.
> > > 
> > > Thanks,
> > > Pankaj
> > > 
> > >    
> > 
> > 
> >   
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable???
  2018-05-04  9:13                   ` [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable??? Igor Mammedov
@ 2018-05-04  9:30                     ` David Hildenbrand
  2018-05-04 11:59                       ` Pankaj Gupta
  2018-05-04 12:26                     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-05-04  9:30 UTC (permalink / raw)
  To: Igor Mammedov, Pankaj Gupta
  Cc: qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert

On 04.05.2018 11:13, Igor Mammedov wrote:
> On Thu, 26 Apr 2018 03:37:51 -0400 (EDT)
> Pankaj Gupta <pagupta@redhat.com> wrote:
> 
> trimming CC list to keep people that might be interested in the topic
> and renaming thread to reflect it.
> 
>>>>>>>>>> +
>>>>>>>>>> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base,
>>>>>>>>>> mr);  
>>>>>>>>> missing vmstate registration?  
>>>>>>>>
>>>>>>>> Missed this one: To be called by the caller. Important because e.g.
>>>>>>>> for
>>>>>>>> virtio-pmem we don't want this (I assume :) ).  
>>>>>>> if pmem isn't on shared storage, then We'd probably want to migrate
>>>>>>> it as well, otherwise target would experience data loss.
>>>>>>> Anyways, I'd just treat it as normal RAM in migration case  
>>>>>>
>>>>>> Main difference between RAM and pmem it acts like combination of RAM
>>>>>> and
>>>>>> disk.
>>>>>> Saying this, in normal use-case size would be 100 GB's - few TB's
>>>>>> range.
>>>>>> I am not sure we really want to migrate it for non-shared storage
>>>>>> use-case.  
>>>>> with non shared storage you'd have to migrate it target host but
>>>>> with shared storage it might be possible to flush it and use directly
>>>>> from target host. That probably won't work right out of box and would
>>>>> need some sort of synchronization between src/dst hosts.  
>>>>
>>>> Shared storage should work out of the box.
>>>> Only thing is data in destination
>>>> host will be cache cold and existing pages in cache should be invalidated
>>>> first.
>>>> But if we migrate entire fake DAX RAMstate it will populate destination
>>>> host page
>>>> cache including pages while were idle in source host. This would
>>>> unnecessarily
>>>> create entropy in destination host.
>>>>
>>>> To me this feature don't make much sense. Problem which we are solving is:
>>>> Efficiently use guest RAM.  
>>> What would live migration handover flow look like in case of
>>> a guest constantly dirtying memory provided by virtio-pmem
>>> and sometimes issuing async flush requests along with it?  
>>
>> Dirty entire pmem (disk) at once not a usual scenario. Some part of disk/pmem
>> would get dirty and we need to handle that. I just want to say moving entire
>> pmem (disk) is not efficient solution because we are using this solution to
>> manage guest memory efficiently. Otherwise it will be like any block device copy
>> with non-shared storage.   
> not sure if we can use block layer analogy here.
> 
>>>>> The same applies to nv/pc-dimm as well, as backend file easily could be
>>>>> on pmem storage as well.  
>>>>
>>>> Are you saying backing file is in actual actual nvdimm hardware? we don't
>>>> need
>>>> emulation at all.  
>>> depends on if file is on DAX filesystem, but your argument about
>>> migrating huge 100Gb- TB's range applies in this case as well.
>>>   
>>>>   
>>>>>
>>>>> Maybe for now we should migrate everything so it would work in case of
>>>>> non shared NVDIMM on host. And then later add migration-less capability
>>>>> to all of them.  
>>>>
>>>> not sure I agree.  
>>> So would you inhibit migration in case of non shared backend storage,
>>> to avoid loosing data since they aren't migrated?  
>>
>> I am just thinking what features we want to support with pmem. And live migration
>> with shared storage is the one which comes to my mind.
>>
>> If live migration with non-shared storage is what we want to support (I don't know
>> yet) we can add this? Even with shared storage it would copy entire pmem state?
> Perhaps we should register vmstate like for normal ram and use something similar to
>   http://lists.gnu.org/archive/html/qemu-devel/2018-04/msg00003.html this
> to skip shared memory on migration.
> In this case we could use this for pc-dimms as well.
> 
> David,
>  what's your take on it?

(I assume you were talking to David Gilbert, but still my take on it :))

"shared RAM" use in that context is rather "shared between processes",
no? What would be the benefit of enabling migration for a ramblock but
then again blocking it off?

Anyhow, I think this detailed discussion is way too early right now. Pankaj
has some other problems to solve before moving on to migration (yes,
migration is important to keep in mind, but it is not the top priority right
now). And I would consider migration in this context as the next step
once we have the basics sorted out (flushing, threads ...)

> 
>> Thanks,
>> Pankaj
>>  


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable???
  2018-05-04  9:30                     ` David Hildenbrand
@ 2018-05-04 11:59                       ` Pankaj Gupta
  0 siblings, 0 replies; 49+ messages in thread
From: Pankaj Gupta @ 2018-05-04 11:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Igor Mammedov, Paolo Bonzini, qemu-devel, Dr. David Alan Gilbert



> > 
> >>>>>>>>>> +
> >>>>>>>>>> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base,
> >>>>>>>>>> mr);
> >>>>>>>>> missing vmstate registration?
> >>>>>>>>
> >>>>>>>> Missed this one: To be called by the caller. Important because e.g.
> >>>>>>>> for
> >>>>>>>> virtio-pmem we don't want this (I assume :) ).
> >>>>>>> if pmem isn't on shared storage, then We'd probably want to migrate
> >>>>>>> it as well, otherwise target would experience data loss.
> >>>>>>> Anyways, I'd just reat it as normal RAM in migration case
> >>>>>>
> >>>>>> Main difference between RAM and pmem it acts like combination of RAM
> >>>>>> and
> >>>>>> disk.
> >>>>>> Saying this, in normal use-case size would be 100 GB's - few TB's
> >>>>>> range.
> >>>>>> I am not sure we really want to migrate it for non-shared storage
> >>>>>> use-case.
> >>>>> with non shared storage you'd have to migrate it target host but
> >>>>> with shared storage it might be possible to flush it and use directly
> >>>>> from target host. That probably won't work right out of box and would
> >>>>> need some sort of synchronization between src/dst hosts.
> >>>>
> >>>> Shared storage should work out of the box.
> >>>> Only thing is data in destination
> >>>> host will be cache cold and existing pages in cache should be
> >>>> invalidated
> >>>> first.
> >>>> But if we migrate entire fake DAX RAMstate it will populate destination
> >>>> host page
> >>>> cache including pages while were idle in source host. This would
> >>>> unnecessarily
> >>>> create entropy in destination host.
> >>>>
> >>>> To me this feature don't make much sense. Problem which we are solving
> >>>> is:
> >>>> Efficiently use guest RAM.
> >>> What would live migration handover flow look like in case of
> >>> guest constantly dirting memory provided by virtio-pmem and
> >>> and sometimes issuing async flush req along with it?
> >>
> >> Dirty entire pmem (disk) at once not a usual scenario. Some part of
> >> disk/pmem
> >> would get dirty and we need to handle that. I just want to say moving
> >> entire
> >> pmem (disk) is not efficient solution because we are using this solution
> >> to
> >> manage guest memory efficiently. Otherwise it will be like any block
> >> device copy
> >> with non-shared storage.
> > not sure if we can use block layer analogy here.
> > 
> >>>>> The same applies to nv/pc-dimm as well, as backend file easily could be
> >>>>> on pmem storage as well.
> >>>>
> >>>> Are you saying backing file is in actual actual nvdimm hardware? we
> >>>> don't
> >>>> need
> >>>> emulation at all.
> >>> depends on if file is on DAX filesystem, but your argument about
> >>> migrating huge 100Gb- TB's range applies in this case as well.
> >>>   
> >>>>   
> >>>>>
> >>>>> Maybe for now we should migrate everything so it would work in case of
> >>>>> non shared NVDIMM on host. And then later add migration-less capability
> >>>>> to all of them.
> >>>>
> >>>> not sure I agree.
> >>> So would you inhibit migration in case of non shared backend storage,
> >>> to avoid loosing data since they aren't migrated?
> >>
> >> I am just thinking what features we want to support with pmem. And live
> >> migration
> >> with shared storage is the one which comes to my mind.
> >>
> >> If live migration with non-shared storage is what we want to support (I
> >> don't know
> >> yet) we can add this? Even with shared storage it would copy entire pmem
> >> state?
> > Perhaps we should register vmstate like for normal ram and use something
> > similar to
> >   http://lists.gnu.org/archive/html/qemu-devel/2018-04/msg00003.html this
> > to skip shared memory on migration.
> > In this case we could use this for pc-dimms as well.
> > 
> > David,
> >  what's your take on it?
> 
> (I assume you were talking to David Gilbert, but still my take on it :))
> 
> "shared RAM" use in that context is rather "shared between processes",
> no? What would be the benefit of enabling migration for a ramblock but
> then again blocking it off?
> 
> Anyhow, I think this detail discussion is way to early right now. Pankaj
> has some other problems to solve before moving on to migration (yes,
> migration is important to keep in mind but not the top priority right
> now). And I would consider migration in this context as the next step
> once we have the basics sorted out (flushing, threads ...)

Right. Still, it would be very helpful to have Dave Gilbert's input on 
this ongoing discussion. That would clear up some doubts regarding what 
live migration expects from this type of memory.

Then we can discuss the details further once the first version is accepted 
and we start optimizing for live migration. 

Thanks,
Pankaj


* Re: [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable???
  2018-05-04  9:13                   ` [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable??? Igor Mammedov
  2018-05-04  9:30                     ` David Hildenbrand
@ 2018-05-04 12:26                     ` Dr. David Alan Gilbert
  2018-05-07  8:12                       ` Igor Mammedov
  1 sibling, 1 reply; 49+ messages in thread
From: Dr. David Alan Gilbert @ 2018-05-04 12:26 UTC (permalink / raw)
  To: Igor Mammedov; +Cc: Pankaj Gupta, David Hildenbrand, qemu-devel, Paolo Bonzini

* Igor Mammedov (imammedo@redhat.com) wrote:
> On Thu, 26 Apr 2018 03:37:51 -0400 (EDT)
> Pankaj Gupta <pagupta@redhat.com> wrote:
> 
> trimming CC list to keep people that might be interested in the topic
> and renaming thread to reflect it.
> 
> > > > > > > > >> +
> > > > > > > > >> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base,
> > > > > > > > >> mr);  
> > > > > > > > > missing vmstate registration?  
> > > > > > > > 
> > > > > > > > Missed this one: To be called by the caller. Important because e.g.
> > > > > > > > for
> > > > > > > > virtio-pmem we don't want this (I assume :) ).  
> > > > > > > if pmem isn't on shared storage, then We'd probably want to migrate
> > > > > > > it as well, otherwise target would experience data loss.
> > > > > > > Anyways, I'd just reat it as normal RAM in migration case  
> > > > > > 
> > > > > > Main difference between RAM and pmem it acts like combination of RAM
> > > > > > and
> > > > > > disk.
> > > > > > Saying this, in normal use-case size would be 100 GB's - few TB's
> > > > > > range.
> > > > > > I am not sure we really want to migrate it for non-shared storage
> > > > > > use-case.  
> > > > > with non shared storage you'd have to migrate it target host but
> > > > > with shared storage it might be possible to flush it and use directly
> > > > > from target host. That probably won't work right out of box and would
> > > > > need some sort of synchronization between src/dst hosts.  
> > > > 
> > > > Shared storage should work out of the box.
> > > > Only thing is data in destination
> > > > host will be cache cold and existing pages in cache should be invalidated
> > > > first.
> > > > But if we migrate entire fake DAX RAMstate it will populate destination
> > > > host page
> > > > cache including pages while were idle in source host. This would
> > > > unnecessarily
> > > > create entropy in destination host.
> > > > 
> > > > To me this feature don't make much sense. Problem which we are solving is:
> > > > Efficiently use guest RAM.  
> > > What would live migration handover flow look like in case of
> > > guest constantly dirting memory provided by virtio-pmem and
> > > and sometimes issuing async flush req along with it?  
> > 
> > Dirty entire pmem (disk) at once not a usual scenario. Some part of disk/pmem
> > would get dirty and we need to handle that. I just want to say moving entire
> > pmem (disk) is not efficient solution because we are using this solution to
> > manage guest memory efficiently. Otherwise it will be like any block device copy
> > with non-shared storage.   
> not sure if we can use block layer analogy here.
> 
> > > > > The same applies to nv/pc-dimm as well, as backend file easily could be
> > > > > on pmem storage as well.  
> > > > 
> > > > Are you saying backing file is in actual actual nvdimm hardware? we don't
> > > > need
> > > > emulation at all.  
> > > depends on if file is on DAX filesystem, but your argument about
> > > migrating huge 100Gb- TB's range applies in this case as well.
> > >   
> > > >   
> > > > > 
> > > > > Maybe for now we should migrate everything so it would work in case of
> > > > > non shared NVDIMM on host. And then later add migration-less capability
> > > > > to all of them.  
> > > > 
> > > > not sure I agree.  
> > > So would you inhibit migration in case of non shared backend storage,
> > > to avoid loosing data since they aren't migrated?  
> > 
> > I am just thinking what features we want to support with pmem. And live migration
> > with shared storage is the one which comes to my mind.
> > 
> > If live migration with non-shared storage is what we want to support (I don't know
> > yet) we can add this? Even with shared storage it would copy entire pmem state?
> Perhaps we should register vmstate like for normal ram and use something similar to
>   http://lists.gnu.org/archive/html/qemu-devel/2018-04/msg00003.html this
> to skip shared memory on migration.
> In this case we could use this for pc-dimms as well.
> 
> David,
>  what's your take on it?

My feel is that something is going to have to migrate it, I'm just not
sure how.
So let me just check I understand:
  a) It's potentially huge
  b) It's a RAMBlock
  c) It's backed by ????
     c1) Something machine local - i.e. a physical lump of flash in a
         socket rather than something sharable by machines?
  d) It can potentially be rapidly changing as the guest writes to it?

Dave

> > Thanks,
> > Pankaj
> >  
> > > 
> > >   
> > > > > > One reason why nvdimm added vmstate info could be: still there would be
> > > > > > transient
> > > > > > writes in memory with fake DAX and there is no way(till now) to flush
> > > > > > the
> > > > > > guest
> > > > > > writes. But with virtio-pmem we can flush such writes before migration
> > > > > > and
> > > > > > automatically
> > > > > > at destination host with shared disk we will have updated data.  
> > > > > nvdimm has concept of flush address hint (may be not implemented in qemu
> > > > > yet)
> > > > > but it can flush. The only reason I'm buying into virtio-mem idea
> > > > > is that would allow async flush queues which would reduce number
> > > > > of vmexits.  
> > > > 
> > > > Thats correct.
> > > > 
> > > > Thanks,
> > > > Pankaj
> > > > 
> > > >    
> > > 
> > > 
> > >   
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable???
  2018-05-04 12:26                     ` Dr. David Alan Gilbert
@ 2018-05-07  8:12                       ` Igor Mammedov
  2018-05-07 11:19                         ` Pankaj Gupta
  2018-05-08  9:44                         ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 49+ messages in thread
From: Igor Mammedov @ 2018-05-07  8:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Pankaj Gupta, David Hildenbrand, qemu-devel, Paolo Bonzini

On Fri, 4 May 2018 13:26:51 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Igor Mammedov (imammedo@redhat.com) wrote:
> > On Thu, 26 Apr 2018 03:37:51 -0400 (EDT)
> > Pankaj Gupta <pagupta@redhat.com> wrote:
> > 
> > trimming CC list to keep people that might be interested in the topic
> > and renaming thread to reflect it.
> >   
> > > > > > > > > >> +
> > > > > > > > > >> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base,
> > > > > > > > > >> mr);    
> > > > > > > > > > missing vmstate registration?    
> > > > > > > > > 
> > > > > > > > > Missed this one: To be called by the caller. Important because e.g.
> > > > > > > > > for
> > > > > > > > > virtio-pmem we don't want this (I assume :) ).    
> > > > > > > > if pmem isn't on shared storage, then We'd probably want to migrate
> > > > > > > > it as well, otherwise target would experience data loss.
> > > > > > > > Anyways, I'd just reat it as normal RAM in migration case    
> > > > > > > 
> > > > > > > Main difference between RAM and pmem it acts like combination of RAM
> > > > > > > and
> > > > > > > disk.
> > > > > > > Saying this, in normal use-case size would be 100 GB's - few TB's
> > > > > > > range.
> > > > > > > I am not sure we really want to migrate it for non-shared storage
> > > > > > > use-case.    
> > > > > > with non shared storage you'd have to migrate it target host but
> > > > > > with shared storage it might be possible to flush it and use directly
> > > > > > from target host. That probably won't work right out of box and would
> > > > > > need some sort of synchronization between src/dst hosts.    
> > > > > 
> > > > > Shared storage should work out of the box.
> > > > > Only thing is data in destination
> > > > > host will be cache cold and existing pages in cache should be invalidated
> > > > > first.
> > > > > But if we migrate entire fake DAX RAMstate it will populate destination
> > > > > host page
> > > > > cache including pages while were idle in source host. This would
> > > > > unnecessarily
> > > > > create entropy in destination host.
> > > > > 
> > > > > To me this feature don't make much sense. Problem which we are solving is:
> > > > > Efficiently use guest RAM.    
> > > > What would live migration handover flow look like in case of
> > > > guest constantly dirting memory provided by virtio-pmem and
> > > > and sometimes issuing async flush req along with it?    
> > > 
> > > Dirty entire pmem (disk) at once not a usual scenario. Some part of disk/pmem
> > > would get dirty and we need to handle that. I just want to say moving entire
> > > pmem (disk) is not efficient solution because we are using this solution to
> > > manage guest memory efficiently. Otherwise it will be like any block device copy
> > > with non-shared storage.     
> > not sure if we can use block layer analogy here.
> >   
> > > > > > The same applies to nv/pc-dimm as well, as backend file easily could be
> > > > > > on pmem storage as well.    
> > > > > 
> > > > > Are you saying backing file is in actual actual nvdimm hardware? we don't
> > > > > need
> > > > > emulation at all.    
> > > > depends on if file is on DAX filesystem, but your argument about
> > > > migrating huge 100Gb- TB's range applies in this case as well.
> > > >     
> > > > >     
> > > > > > 
> > > > > > Maybe for now we should migrate everything so it would work in case of
> > > > > > non shared NVDIMM on host. And then later add migration-less capability
> > > > > > to all of them.    
> > > > > 
> > > > > not sure I agree.    
> > > > So would you inhibit migration in case of non shared backend storage,
> > > > to avoid loosing data since they aren't migrated?    
> > > 
> > > I am just thinking what features we want to support with pmem. And live migration
> > > with shared storage is the one which comes to my mind.
> > > 
> > > If live migration with non-shared storage is what we want to support (I don't know
> > > yet) we can add this? Even with shared storage it would copy entire pmem state?  
> > Perhaps we should register vmstate like for normal ram and use something similar to
> >   http://lists.gnu.org/archive/html/qemu-devel/2018-04/msg00003.html this
> > to skip shared memory on migration.
> > In this case we could use this for pc-dimms as well.
> > 
> > David,
> >  what's your take on it?  
> 
> My feel is that something is going to have to migrate it, I'm just not
> sure how.
> So let me just check I understand:
>   a) It's potentially huge
yep, assume it could be in storage quantities (100s of GB)

>   b) It's a RAMBlock
it is

>   c) It's backed by ????
>      c1) Something machine local - i.e. a physical lump of flash in a
>          socket rather than something sharable by machines?
it's backed by memory-backend-foo, so it could be really anything (RAM,
file on local or shared storage, file descriptor)

>   d) It can potentially be rapidly changing as the guest writes to it?
it's sort of like NVDIMM but without the NVDIMM interface; it uses virtio
to force flushing instead. Otherwise it's directly mapped into guest
address space, so the guest can do anything with it, including fast dirtying.


> Dave
> 
> > > Thanks,
> > > Pankaj
> > >    
> > > > 
> > > >     
> > > > > > > One reason why nvdimm added vmstate info could be: still there would be
> > > > > > > transient
> > > > > > > writes in memory with fake DAX and there is no way(till now) to flush
> > > > > > > the
> > > > > > > guest
> > > > > > > writes. But with virtio-pmem we can flush such writes before migration
> > > > > > > and
> > > > > > > automatically
> > > > > > > at destination host with shared disk we will have updated data.    
> > > > > > nvdimm has concept of flush address hint (may be not implemented in qemu
> > > > > > yet)
> > > > > > but it can flush. The only reason I'm buying into virtio-mem idea
> > > > > > is that would allow async flush queues which would reduce number
> > > > > > of vmexits.    
> > > > > 
> > > > > Thats correct.
> > > > > 
> > > > > Thanks,
> > > > > Pankaj
> > > > > 
> > > > >      
> > > > 
> > > > 
> > > >     
> > >   
> >   
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable???
  2018-05-07  8:12                       ` Igor Mammedov
@ 2018-05-07 11:19                         ` Pankaj Gupta
  2018-05-08  9:44                         ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 49+ messages in thread
From: Pankaj Gupta @ 2018-05-07 11:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Igor Mammedov
  Cc: Paolo Bonzini, qemu-devel, David Hildenbrand


> 
> On Fri, 4 May 2018 13:26:51 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Igor Mammedov (imammedo@redhat.com) wrote:
> > > On Thu, 26 Apr 2018 03:37:51 -0400 (EDT)
> > > Pankaj Gupta <pagupta@redhat.com> wrote:
> > > 
> > > trimming CC list to keep people that might be interested in the topic
> > > and renaming thread to reflect it.
> > >   
> > > > > > > > > > >> +
> > > > > > > > > > >> +    memory_region_add_subregion(&hpms->mr, addr -
> > > > > > > > > > >> hpms->base,
> > > > > > > > > > >> mr);
> > > > > > > > > > > missing vmstate registration?
> > > > > > > > > > 
> > > > > > > > > > Missed this one: To be called by the caller. Important
> > > > > > > > > > because e.g.
> > > > > > > > > > for
> > > > > > > > > > virtio-pmem we don't want this (I assume :) ).
> > > > > > > > > if pmem isn't on shared storage, then We'd probably want to
> > > > > > > > > migrate
> > > > > > > > > it as well, otherwise target would experience data loss.
> > > > > > > > > Anyways, I'd just reat it as normal RAM in migration case
> > > > > > > > 
> > > > > > > > Main difference between RAM and pmem it acts like combination
> > > > > > > > of RAM
> > > > > > > > and
> > > > > > > > disk.
> > > > > > > > Saying this, in normal use-case size would be 100 GB's - few
> > > > > > > > TB's
> > > > > > > > range.
> > > > > > > > I am not sure we really want to migrate it for non-shared
> > > > > > > > storage
> > > > > > > > use-case.
> > > > > > > with non shared storage you'd have to migrate it target host but
> > > > > > > with shared storage it might be possible to flush it and use
> > > > > > > directly
> > > > > > > from target host. That probably won't work right out of box and
> > > > > > > would
> > > > > > > need some sort of synchronization between src/dst hosts.
> > > > > > 
> > > > > > Shared storage should work out of the box.
> > > > > > Only thing is data in destination
> > > > > > host will be cache cold and existing pages in cache should be
> > > > > > invalidated
> > > > > > first.
> > > > > > But if we migrate entire fake DAX RAMstate it will populate
> > > > > > destination
> > > > > > host page
> > > > > > cache including pages while were idle in source host. This would
> > > > > > unnecessarily
> > > > > > create entropy in destination host.
> > > > > > 
> > > > > > To me this feature don't make much sense. Problem which we are
> > > > > > solving is:
> > > > > > Efficiently use guest RAM.
> > > > > What would live migration handover flow look like in case of
> > > > > guest constantly dirting memory provided by virtio-pmem and
> > > > > and sometimes issuing async flush req along with it?
> > > > 
> > > > Dirty entire pmem (disk) at once not a usual scenario. Some part of
> > > > disk/pmem
> > > > would get dirty and we need to handle that. I just want to say moving
> > > > entire
> > > > pmem (disk) is not efficient solution because we are using this
> > > > solution to
> > > > manage guest memory efficiently. Otherwise it will be like any block
> > > > device copy
> > > > with non-shared storage.
> > > not sure if we can use block layer analogy here.
> > >   
> > > > > > > The same applies to nv/pc-dimm as well, as backend file easily
> > > > > > > could be
> > > > > > > on pmem storage as well.
> > > > > > 
> > > > > > Are you saying backing file is in actual actual nvdimm hardware? we
> > > > > > don't
> > > > > > need
> > > > > > emulation at all.
> > > > > depends on if file is on DAX filesystem, but your argument about
> > > > > migrating huge 100Gb- TB's range applies in this case as well.
> > > > >     
> > > > > >     
> > > > > > > 
> > > > > > > Maybe for now we should migrate everything so it would work in
> > > > > > > case of
> > > > > > > non shared NVDIMM on host. And then later add migration-less
> > > > > > > capability
> > > > > > > to all of them.
> > > > > > 
> > > > > > not sure I agree.
> > > > > So would you inhibit migration in case of non shared backend storage,
> > > > > to avoid loosing data since they aren't migrated?
> > > > 
> > > > I am just thinking what features we want to support with pmem. And live
> > > > migration
> > > > with shared storage is the one which comes to my mind.
> > > > 
> > > > If live migration with non-shared storage is what we want to support (I
> > > > don't know
> > > > yet) we can add this? Even with shared storage it would copy entire
> > > > pmem state?
> > > Perhaps we should register vmstate like for normal ram and use something
> > > similar to
> > >   http://lists.gnu.org/archive/html/qemu-devel/2018-04/msg00003.html this
> > > to skip shared memory on migration.
> > > In this case we could use this for pc-dimms as well.
> > > 
> > > David,
> > >  what's your take on it?
> > 
> > My feel is that something is going to have to migrate it, I'm just not
> > sure how.
> > So let me just check I understand:
> >   a) It's potentially huge
> yep, assume it could be in storage quarantines (100s of Gb)
> 
> >   b) It's a RAMBlock
> it is
> 
> >   c) It's backed by ????
> >      c1) Something machine local - i.e. a physical lump of flash in a
> >          socket rather than something sharable by machines?
> it's backed by memory-backend-foo, so it could be really anything (RAM,
> file on local or shared storage, file descriptor)

Just a point I want to add.

Currently, we are proposing file-backed memory which is mmap'd into QEMU's 
address space. This is to achieve a 'persistent' property similar to real 
NVDIMM storage. The latest guest writes are synced to the backing file after 
the guest performs an 'fsync' operation on a DAX-capable file-system.
 
> 
> >   d) It can potentially be rapidly changing as the guest writes to it?
> it's sort of like NVDIMM but without NVDIMM interface, it uses virtio to
> to force flushing instead. Otherwise it's directly mapped into guest
> address space, so guest can do anything with it including fast dirtying.
> 
> 
> > Dave
> > 
> > > > Thanks,
> > > > Pankaj
> > > >    
> > > > > 
> > > > >     
> > > > > > > > One reason why nvdimm added vmstate info could be: still there
> > > > > > > > would be
> > > > > > > > transient
> > > > > > > > writes in memory with fake DAX and there is no way(till now) to
> > > > > > > > flush
> > > > > > > > the
> > > > > > > > guest
> > > > > > > > writes. But with virtio-pmem we can flush such writes before
> > > > > > > > migration
> > > > > > > > and
> > > > > > > > automatically
> > > > > > > > at destination host with shared disk we will have updated data.
> > > > > > > nvdimm has concept of flush address hint (may be not implemented
> > > > > > > in qemu
> > > > > > > yet)
> > > > > > > but it can flush. The only reason I'm buying into virtio-mem idea
> > > > > > > is that would allow async flush queues which would reduce number
> > > > > > > of vmexits.
> > > > > > 
> > > > > > Thats correct.
> > > > > > 
> > > > > > Thanks,
> > > > > > Pankaj
> > > > > > 
> > > > > >      
> > > > > 
> > > > > 
> > > > >     
> > > >   
> > >   
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
> 
> 


* Re: [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable???
  2018-05-07  8:12                       ` Igor Mammedov
  2018-05-07 11:19                         ` Pankaj Gupta
@ 2018-05-08  9:44                         ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 49+ messages in thread
From: Dr. David Alan Gilbert @ 2018-05-08  9:44 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Pankaj Gupta, David Hildenbrand, qemu-devel, Paolo Bonzini,
	eblake, stefanha

* Igor Mammedov (imammedo@redhat.com) wrote:
> On Fri, 4 May 2018 13:26:51 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Igor Mammedov (imammedo@redhat.com) wrote:
> > > On Thu, 26 Apr 2018 03:37:51 -0400 (EDT)
> > > Pankaj Gupta <pagupta@redhat.com> wrote:
> > > 
> > > trimming CC list to keep people that might be interested in the topic
> > > and renaming thread to reflect it.
> > >   
> > > > > > > > > > >> +
> > > > > > > > > > >> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base,
> > > > > > > > > > >> mr);    
> > > > > > > > > > > missing vmstate registration?    
> > > > > > > > > > 
> > > > > > > > > > Missed this one: To be called by the caller. Important because e.g.
> > > > > > > > > > for
> > > > > > > > > > virtio-pmem we don't want this (I assume :) ).    
> > > > > > > > > if pmem isn't on shared storage, then We'd probably want to migrate
> > > > > > > > > it as well, otherwise target would experience data loss.
> > > > > > > > > Anyways, I'd just reat it as normal RAM in migration case    
> > > > > > > > 
> > > > > > > > Main difference between RAM and pmem it acts like combination of RAM
> > > > > > > > and
> > > > > > > > disk.
> > > > > > > > Saying this, in normal use-case size would be 100 GB's - few TB's
> > > > > > > > range.
> > > > > > > > I am not sure we really want to migrate it for non-shared storage
> > > > > > > > use-case.    
> > > > > > > with non shared storage you'd have to migrate it target host but
> > > > > > > with shared storage it might be possible to flush it and use directly
> > > > > > > from target host. That probably won't work right out of box and would
> > > > > > > need some sort of synchronization between src/dst hosts.    
> > > > > > 
> > > > > > Shared storage should work out of the box.
> > > > > > Only thing is data in destination
> > > > > > host will be cache cold and existing pages in cache should be invalidated
> > > > > > first.
> > > > > > But if we migrate entire fake DAX RAMstate it will populate destination
> > > > > > host page
> > > > > > cache including pages while were idle in source host. This would
> > > > > > unnecessarily
> > > > > > create entropy in destination host.
> > > > > > 
> > > > > > To me this feature don't make much sense. Problem which we are solving is:
> > > > > > Efficiently use guest RAM.    
> > > > > What would live migration handover flow look like in case of
> > > > > guest constantly dirting memory provided by virtio-pmem and
> > > > > and sometimes issuing async flush req along with it?    
> > > > 
> > > > Dirty entire pmem (disk) at once not a usual scenario. Some part of disk/pmem
> > > > would get dirty and we need to handle that. I just want to say moving entire
> > > > pmem (disk) is not efficient solution because we are using this solution to
> > > > manage guest memory efficiently. Otherwise it will be like any block device copy
> > > > with non-shared storage.     
> > > not sure if we can use block layer analogy here.
> > >   
> > > > > > > The same applies to nv/pc-dimm as well, as backend file easily could be
> > > > > > > on pmem storage as well.    
> > > > > > 
> > > > > > Are you saying backing file is in actual actual nvdimm hardware? we don't
> > > > > > need
> > > > > > emulation at all.    
> > > > > depends on if file is on DAX filesystem, but your argument about
> > > > > migrating huge 100Gb- TB's range applies in this case as well.
> > > > >     
> > > > > >     
> > > > > > > 
> > > > > > > Maybe for now we should migrate everything so it would work in case of
> > > > > > > non shared NVDIMM on host. And then later add migration-less capability
> > > > > > > to all of them.    
> > > > > > 
> > > > > > not sure I agree.    
> > > > > So would you inhibit migration in case of non shared backend storage,
> > > > > to avoid loosing data since they aren't migrated?    
> > > > 
> > > > I am just thinking what features we want to support with pmem. And live migration
> > > > with shared storage is the one which comes to my mind.
> > > > 
> > > > If live migration with non-shared storage is what we want to support (I don't know
> > > > yet) we can add this? Even with shared storage it would copy entire pmem state?  
> > > Perhaps we should register vmstate like for normal ram and use something similar to
> > >   http://lists.gnu.org/archive/html/qemu-devel/2018-04/msg00003.html this
> > > to skip shared memory on migration.
> > > In this case we could use this for pc-dimms as well.
> > > 
> > > David,
> > >  what's your take on it?  
> > 
> > My feel is that something is going to have to migrate it, I'm just not
> > sure how.
> > So let me just check I understand:
> >   a) It's potentially huge
> yep, assume it could be in storage quantities (100s of GB)
> 
> >   b) It's a RAMBlock
> it is

Well, the good news is migration is going to try and migrate it.
The bad news is migration is going to try and migrate it.

> >   c) It's backed by ????
> >      c1) Something machine local - i.e. a physical lump of flash in a
> >          socket rather than something sharable by machines?
> it's backed by memory-backend-foo, so it could be really anything (RAM,
> file on local or shared storage, file descriptor)

OK, something is going to have to know whether it's on shared storage or
not and do something different in the two cases.   If it's shared
storage then we need to find a way to stop migration trying to migrate
it, because migrating data to the other host when both hosts are really
backed by the same thing ends up with a corrupt mess; we've had block
storage do that when they don't realise they're on an NFS share.
There are a few patches on the list to exclude some RAMBlocks from
migration, so we can build on that once we figure out how we know
whether it's shared or not.

If it is shared, then we've also got to worry about consistency to
ensure that the last few writes on the source make it to the destination
before the destination starts, that the destination hasn't cached
any old stuff, and that a failing migration lands back on the source
without the destination having changed anything.

> >   d) It can potentially be rapidly changing as the guest writes to it?
> it's sort of like NVDIMM but without the NVDIMM interface; it uses virtio
> to force flushing instead. Otherwise it's directly mapped into guest
> address space, so the guest can do anything with it, including fast dirtying.

OK.

Getting Postcopy to work with it might be a solution; but depending on
what the underlying fd looks like, it will probably need some kernel
changes to get userfaultfd to work on it.
Postcopy on huge memories should work, but watch out for downtime
due to sending the discard bitmaps.

(cc'd in Eric and Stefan)

Dave

> 
> > Dave
> > 
> > > > Thanks,
> > > > Pankaj
> > > >    
> > > > > 
> > > > >     
> > > > > > > > One reason why nvdimm added vmstate info could be: there would still
> > > > > > > > be transient writes in memory with fake DAX, and there is no way
> > > > > > > > (until now) to flush the guest writes. But with virtio-pmem we can
> > > > > > > > flush such writes before migration, and at the destination host with
> > > > > > > > a shared disk we will automatically have updated data.    
> > > > > > > nvdimm has the concept of a flush hint address (maybe not implemented
> > > > > > > in qemu yet), but it can flush. The only reason I'm buying into the
> > > > > > > virtio-pmem idea is that it would allow async flush queues, which
> > > > > > > would reduce the number of vmexits.    
> > > > > > 
> > > > > > Thats correct.
> > > > > > 
> > > > > > Thanks,
> > > > > > Pankaj
> > > > > > 
> > > > > >      
> > > > > 
> > > > > 
> > > > >     
> > > >   
> > >   
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



Thread overview: 49+ messages
2018-04-20 12:34 [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice David Hildenbrand
2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 1/3] pc-dimm: factor out MemoryDevice interface David Hildenbrand
2018-04-22  4:26   ` David Gibson
2018-04-22  8:21     ` David Hildenbrand
2018-04-22 10:10       ` David Gibson
2018-04-23  9:52         ` David Hildenbrand
2018-04-22  5:09   ` Pankaj Gupta
2018-04-22  8:26     ` David Hildenbrand
2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 2/3] machine: make MemoryHotplugState accessible via the machine David Hildenbrand
2018-04-23  3:28   ` David Gibson
2018-04-23  9:36     ` David Hildenbrand
2018-04-23 10:44       ` David Gibson
2018-04-23 11:11         ` David Hildenbrand
2018-04-20 12:34 ` [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code David Hildenbrand
2018-04-23 12:19   ` Igor Mammedov
2018-04-23 12:44     ` David Hildenbrand
2018-04-24 13:28       ` Igor Mammedov
2018-04-24 13:39         ` David Hildenbrand
2018-04-24 14:38           ` Igor Mammedov
2018-04-23 12:52     ` David Hildenbrand
2018-04-24 13:31       ` Igor Mammedov
2018-04-24 13:41         ` David Hildenbrand
2018-04-24 14:44           ` Igor Mammedov
2018-04-24 15:23             ` David Hildenbrand
2018-04-25  5:45         ` Pankaj Gupta
2018-04-25 13:23           ` Igor Mammedov
2018-04-25 13:56             ` Pankaj Gupta
2018-04-25 15:26               ` Igor Mammedov
2018-04-26  7:37                 ` Pankaj Gupta
2018-05-04  9:13                   ` [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable??? Igor Mammedov
2018-05-04  9:30                     ` David Hildenbrand
2018-05-04 11:59                       ` Pankaj Gupta
2018-05-04 12:26                     ` Dr. David Alan Gilbert
2018-05-07  8:12                       ` Igor Mammedov
2018-05-07 11:19                         ` Pankaj Gupta
2018-05-08  9:44                         ` Dr. David Alan Gilbert
2018-04-23 14:44     ` [Qemu-devel] [PATCH v3 3/3] pc-dimm: factor out address space logic into MemoryDevice code David Hildenbrand
2018-04-22  4:58 ` [Qemu-devel] [PATCH v3 0/3] pc-dimm: factor out MemoryDevice Pankaj Gupta
2018-04-22  8:20   ` David Hildenbrand
2018-04-23  4:58     ` Pankaj Gupta
2018-04-23 12:31 ` Igor Mammedov
2018-04-23 12:50   ` David Hildenbrand
2018-04-23 15:32   ` Pankaj Gupta
2018-04-23 16:35     ` David Hildenbrand
2018-04-24 14:00     ` Igor Mammedov
2018-04-24 15:42       ` David Hildenbrand
2018-04-25 12:15         ` Igor Mammedov
2018-04-25 12:46           ` David Hildenbrand
2018-04-25 13:15             ` David Hildenbrand
