qemu-devel.nongnu.org archive mirror
* [PATCH v3 0/3] ppc: spapr: virtual NVDIMM support
@ 2019-10-14 18:37 Shivaprasad G Bhat
  2019-10-14 18:37 ` [PATCH v3 1/3] mem: move nvdimm_device_list to utilities Shivaprasad G Bhat
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Shivaprasad G Bhat @ 2019-10-14 18:37 UTC (permalink / raw)
  To: imammedo, david, qemu-ppc, xiaoguangrong.eric, mst; +Cc: sbhat, qemu-devel

The patchset implements virtual NVDIMM support for pseries.

PAPR semantics are such that each NVDIMM device comprises multiple
SCM (Storage Class Memory) blocks. The hypervisor is expected to
prepare the FDT for the NVDIMM device and send the guest a hotplug
interrupt with the new type RTAS_LOG_V6_HP_TYPE_PMEM, which is already
handled by the upstream kernel. In response to that interrupt, the
guest requests the hypervisor to bind each of the SCM blocks of the
NVDIMM device using hcalls. SCM block unbind requests can arrive on
driver errors or unplug (not supported yet). The NVDIMM label
read/writes are done through hcalls.
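
For illustration, the end-to-end flow looks roughly like this (hcall
names and argument conventions are taken from patches 2 and 3 below;
the guest-side steps describe the expected behaviour of the upstream
kernel driver):

  1. qemu: device_add nvdimm -> DT node created, hotplug event of type
     RTAS_LOG_V6_HP_TYPE_PMEM queued to the guest
  2. guest: configures the DRC and parses the ibm,pmemory node
  3. guest: H_SCM_BIND_MEM(drcIndex, startingScmBlockIndex,
     numScmBlocksToBind, -1 /* qemu assigns the address */, 0)
     -> hypervisor returns the bound logical address (the backing
        memory is already mapped at device_add time)
  4. guest: H_SCM_READ_METADATA/H_SCM_WRITE_METADATA for label I/O
  5. guest: H_SCM_UNBIND_MEM/H_SCM_UNBIND_ALL on driver errors
     (unplug is not supported yet)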

Since each virtual NVDIMM device is divided into multiple SCM blocks,
the bind, unbind, and query hcalls on those blocks can arrive
independently. This doesn't fit well with the qemu device semantics,
where map/unmap are done at whole-device/object granularity. The
patchset uses the existing NVDIMM class structures for the
implementation. Bind/unbind therefore happen at device_add/del time
rather than on demand at each hcall.

The guest kernel makes bind/unbind requests for the virtual NVDIMM
device at region granularity. Without interleaving, each virtual
NVDIMM device is presented as a separate region, and there is no way
to configure virtual NVDIMM interleaving for guests today. So a
partial bind/unbind request, i.e. an hcall covering only a subset of
the SCM blocks of a virtual NVDIMM, can never arrive, and it is safe
to bind/unbind everything during object_add/del.

The free device-memory region used for memory hotplug is carved into
multiple LMBs of size 256 MiB, each expected to be 256 MiB aligned.
As the SCM blocks are mapped into the same region, they also need to
be aligned to this size for subsequent memory hotplug to work. The
minimum SCM block size is set to this size for that reason, and can
be made user configurable in the future if required.

The first patch moves an existing static function to the common
utility area for use in the subsequent patches. The second patch adds
the FDT entries and basic device support; the third patch adds the
hcall implementations.

The patches are also available at
https://github.com/ShivaprasadGBhat/qemu.git - pseries-nvdimm-v3 branch
and can be used with the upstream kernel. ndctl can be used for
configuring the nvdimms inside the guest.
Example usage:
For coldplug, add the device on the qemu command line as shown below:
-object memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
-device nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0

For hotplug, add the device from the monitor as below:
object_add memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
device_add nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0
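
A note on the sizes in the examples above: the backend size of
1073872896 bytes is 1 GiB + 128 KiB, i.e. 1 GiB of guest-visible pmem
plus the 128k label area. After the label area is excluded, the
remaining 1 GiB is an exact multiple of the 256 MiB minimum SCM block
size (4 blocks), which satisfies the alignment check added in patch 2.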

---
v2: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg02785.html
Changes from v2:
     - Creating the drc indices for the nvdimm devices in advance, as
       suggested, based on the user-specified max slots property.
     - Removed the hard dependency on -machine nvdimm=on, enabled by
       default on the current latest pseries machine version.
     - Renamed the functions to spapr_dt_X as suggested.
     - Metadata is byteswapped before read/write to take care of endianness
       semantics during the hcall.
v1 : http://lists.nongnu.org/archive/html/qemu-devel/2019-02/msg01545.html
Changes from v1:
     - Rebased to upstream; this required a dt_populate implementation
       for nvdimm hotplug support
     - Added uuid option to nvdimm device
     - Removed the memory region sizing down code as suggested by Igor,
       now erroring out if NVDIMM size excluding the label area is not
       aligned to 256MB, so patch 2 from previous series no longer needed.
     - Removed un-implemented hcalls
     - Changed the hcalls to perform different kinds of checks and
       return different values.
     - Addressed comments for v1

Shivaprasad G Bhat (3):
      mem: move nvdimm_device_list to utilities
      spapr: Add NVDIMM device support
      spapr: Add Hcalls to support PAPR NVDIMM device


 default-configs/ppc64-softmmu.mak |    1 
 hw/acpi/nvdimm.c                  |   28 ---
 hw/mem/Kconfig                    |    2 
 hw/mem/nvdimm.c                   |   40 +++++
 hw/ppc/spapr.c                    |  218 +++++++++++++++++++++++++--
 hw/ppc/spapr_drc.c                |   18 ++
 hw/ppc/spapr_events.c             |    4 
 hw/ppc/spapr_hcall.c              |  300 +++++++++++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h           |    7 +
 include/hw/ppc/spapr.h            |   19 ++
 include/hw/ppc/spapr_drc.h        |    9 +
 include/qemu/nvdimm-utils.h       |    7 +
 util/Makefile.objs                |    1 
 util/nvdimm-utils.c               |   29 ++++
 14 files changed, 638 insertions(+), 45 deletions(-)
 create mode 100644 include/qemu/nvdimm-utils.h
 create mode 100644 util/nvdimm-utils.c





* [PATCH v3 1/3] mem: move nvdimm_device_list to utilities
  2019-10-14 18:37 [PATCH v3 0/3] ppc: spapr: virtual NVDIMM support Shivaprasad G Bhat
@ 2019-10-14 18:37 ` Shivaprasad G Bhat
  2019-11-19  2:58   ` David Gibson
  2019-11-19  7:13   ` Igor Mammedov
  2019-10-14 18:37 ` [PATCH v3 2/3] spapr: Add NVDIMM device support Shivaprasad G Bhat
  2019-10-14 18:38 ` [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device Shivaprasad G Bhat
  2 siblings, 2 replies; 18+ messages in thread
From: Shivaprasad G Bhat @ 2019-10-14 18:37 UTC (permalink / raw)
  To: imammedo, david, qemu-ppc, xiaoguangrong.eric, mst; +Cc: sbhat, qemu-devel

nvdimm_device_list is required for parsing the list for devices
in subsequent patches. Move it to common utility area.
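
A typical caller pattern, mirroring how the sPAPR code in patch 2 uses
the helper, looks like this:

    GSList *iter, *nvdimms = nvdimm_get_device_list();

    for (iter = nvdimms; iter; iter = iter->next) {
        NVDIMMDevice *nvdimm = iter->data;
        /* ... use the device, e.g. emit its DT node ... */
    }
    g_slist_free(nvdimms);    /* the caller owns (and frees) the list */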

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/acpi/nvdimm.c            |   28 +---------------------------
 include/qemu/nvdimm-utils.h |    7 +++++++
 util/Makefile.objs          |    1 +
 util/nvdimm-utils.c         |   29 +++++++++++++++++++++++++++++
 4 files changed, 38 insertions(+), 27 deletions(-)
 create mode 100644 include/qemu/nvdimm-utils.h
 create mode 100644 util/nvdimm-utils.c

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 9fdad6dc3f..5219dd0e2e 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -32,33 +32,7 @@
 #include "hw/acpi/bios-linker-loader.h"
 #include "hw/nvram/fw_cfg.h"
 #include "hw/mem/nvdimm.h"
-
-static int nvdimm_device_list(Object *obj, void *opaque)
-{
-    GSList **list = opaque;
-
-    if (object_dynamic_cast(obj, TYPE_NVDIMM)) {
-        *list = g_slist_append(*list, DEVICE(obj));
-    }
-
-    object_child_foreach(obj, nvdimm_device_list, opaque);
-    return 0;
-}
-
-/*
- * inquire NVDIMM devices and link them into the list which is
- * returned to the caller.
- *
- * Note: it is the caller's responsibility to free the list to avoid
- * memory leak.
- */
-static GSList *nvdimm_get_device_list(void)
-{
-    GSList *list = NULL;
-
-    object_child_foreach(qdev_get_machine(), nvdimm_device_list, &list);
-    return list;
-}
+#include "qemu/nvdimm-utils.h"
 
 #define NVDIMM_UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)             \
    { (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \
diff --git a/include/qemu/nvdimm-utils.h b/include/qemu/nvdimm-utils.h
new file mode 100644
index 0000000000..4b8b198ba7
--- /dev/null
+++ b/include/qemu/nvdimm-utils.h
@@ -0,0 +1,7 @@
+#ifndef NVDIMM_UTILS_H
+#define NVDIMM_UTILS_H
+
+#include "qemu/osdep.h"
+
+GSList *nvdimm_get_device_list(void);
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 41bf59d127..a0f40d26e3 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -20,6 +20,7 @@ util-obj-y += envlist.o path.o module.o
 util-obj-y += host-utils.o
 util-obj-y += bitmap.o bitops.o hbitmap.o
 util-obj-y += fifo8.o
+util-obj-y += nvdimm-utils.o
 util-obj-y += cacheinfo.o
 util-obj-y += error.o qemu-error.o
 util-obj-y += qemu-print.o
diff --git a/util/nvdimm-utils.c b/util/nvdimm-utils.c
new file mode 100644
index 0000000000..5cc768ca47
--- /dev/null
+++ b/util/nvdimm-utils.c
@@ -0,0 +1,29 @@
+#include "qemu/nvdimm-utils.h"
+#include "hw/mem/nvdimm.h"
+
+static int nvdimm_device_list(Object *obj, void *opaque)
+{
+    GSList **list = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_NVDIMM)) {
+        *list = g_slist_append(*list, DEVICE(obj));
+    }
+
+    object_child_foreach(obj, nvdimm_device_list, opaque);
+    return 0;
+}
+
+/*
+ * inquire NVDIMM devices and link them into the list which is
+ * returned to the caller.
+ *
+ * Note: it is the caller's responsibility to free the list to avoid
+ * memory leak.
+ */
+GSList *nvdimm_get_device_list(void)
+{
+    GSList *list = NULL;
+
+    object_child_foreach(qdev_get_machine(), nvdimm_device_list, &list);
+    return list;
+}




* [PATCH v3 2/3] spapr: Add NVDIMM device support
  2019-10-14 18:37 [PATCH v3 0/3] ppc: spapr: virtual NVDIMM support Shivaprasad G Bhat
  2019-10-14 18:37 ` [PATCH v3 1/3] mem: move nvdimm_device_list to utilities Shivaprasad G Bhat
@ 2019-10-14 18:37 ` Shivaprasad G Bhat
  2019-11-22  4:30   ` David Gibson
  2019-10-14 18:38 ` [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device Shivaprasad G Bhat
  2 siblings, 1 reply; 18+ messages in thread
From: Shivaprasad G Bhat @ 2019-10-14 18:37 UTC (permalink / raw)
  To: imammedo, david, qemu-ppc, xiaoguangrong.eric, mst; +Cc: sbhat, qemu-devel

Add support for NVDIMM devices for sPAPR. Piggyback on existing nvdimm
device interface in QEMU to support virtual NVDIMM devices for Power.
Create the required DT entries for the device (some entries have
dummy values right now).

The patch creates the required DT node and sends a hotplug
interrupt to the guest. Guest is expected to undertake the normal
DR resource add path in response and start issuing PAPR SCM hcalls.

Unlike x86, the device support is gated on the machine version.

Example usage:
For coldplug, add the device on the qemu command line as shown below:
-object memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
-device nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0

For hotplug, add the device from the monitor as below:
object_add memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
device_add nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0
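
For a 1 GiB backend with a 128k label, the generated node looks
roughly like the sketch below. The DRC index shown is illustrative
(assuming a PMEM DRC for slot 0); the property names and encodings
follow spapr_dt_nvdimm() in this patch:

    ibm,pmemory@90000000 {
        reg = <0x90000000>;
        compatible = "ibm,pmemory";
        device_type = "ibm,pmemory";
        ibm,associativity = <0x4 0x0 0x0 0x0 0x0>;
        ibm,unit-guid = "75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e";
        ibm,my-drc-index = <0x90000000>;
        ibm,block-size = <0x0 0x10000000>;   /* 256 MiB, u64 */
        ibm,number-of-blocks = <0x0 0x4>;    /* 1 GiB / 256 MiB */
        ibm,metadata-size = <0x20000>;       /* 128 KiB label */
        ibm,pmem-application = "operating-system";
        ibm,cache-flush-required;
    };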

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
               [Early implementation]
---
 default-configs/ppc64-softmmu.mak |    1 
 hw/mem/Kconfig                    |    2 
 hw/mem/nvdimm.c                   |   40 +++++++
 hw/ppc/spapr.c                    |  218 ++++++++++++++++++++++++++++++++++---
 hw/ppc/spapr_drc.c                |   18 +++
 hw/ppc/spapr_events.c             |    4 +
 include/hw/mem/nvdimm.h           |    7 +
 include/hw/ppc/spapr.h            |   11 ++
 include/hw/ppc/spapr_drc.h        |    9 ++
 9 files changed, 293 insertions(+), 17 deletions(-)

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index cca52665d9..ae0841fa3a 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -8,3 +8,4 @@ CONFIG_POWERNV=y
 
 # For pSeries
 CONFIG_PSERIES=y
+CONFIG_NVDIMM=y
diff --git a/hw/mem/Kconfig b/hw/mem/Kconfig
index 620fd4cb59..2ad052a536 100644
--- a/hw/mem/Kconfig
+++ b/hw/mem/Kconfig
@@ -8,4 +8,4 @@ config MEM_DEVICE
 config NVDIMM
     bool
     default y
-    depends on PC
+    depends on (PC || PSERIES)
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 375f9a588a..e1238b5bed 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -69,11 +69,51 @@ out:
     error_propagate(errp, local_err);
 }
 
+static void nvdimm_get_uuid(Object *obj, Visitor *v, const char *name,
+                                  void *opaque, Error **errp)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(obj);
+    char *value = NULL;
+
+    value = qemu_uuid_unparse_strdup(&nvdimm->uuid);
+
+    visit_type_str(v, name, &value, errp);
+    g_free(value);
+}
+
+
+static void nvdimm_set_uuid(Object *obj, Visitor *v, const char *name,
+                                  void *opaque, Error **errp)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(obj);
+    Error *local_err = NULL;
+    char *value;
+
+    visit_type_str(v, name, &value, &local_err);
+    if (local_err) {
+        goto out;
+    }
+
+    if (qemu_uuid_parse(value, &nvdimm->uuid) != 0) {
+        error_setg(errp, "Property '%s.%s' has invalid value",
+                   object_get_typename(obj), name);
+        goto out;
+    }
+    g_free(value);
+
+out:
+    error_propagate(errp, local_err);
+}
+
+
 static void nvdimm_init(Object *obj)
 {
     object_property_add(obj, NVDIMM_LABEL_SIZE_PROP, "int",
                         nvdimm_get_label_size, nvdimm_set_label_size, NULL,
                         NULL, NULL);
+
+    object_property_add(obj, NVDIMM_UUID_PROP, "QemuUUID", nvdimm_get_uuid,
+                        nvdimm_set_uuid, NULL, NULL, NULL);
 }
 
 static void nvdimm_finalize(Object *obj)
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 08a2a5a770..eb5c205078 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -80,6 +80,8 @@
 #include "hw/ppc/spapr_cpu_core.h"
 #include "hw/mem/memory-device.h"
 #include "hw/ppc/spapr_tpm_proxy.h"
+#include "hw/mem/nvdimm.h"
+#include "qemu/nvdimm-utils.h"
 
 #include <libfdt.h>
 
@@ -716,7 +718,8 @@ static int spapr_populate_drmem_v2(SpaprMachineState *spapr, void *fdt,
     uint8_t *int_buf, *cur_index;
     int ret;
     uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
-    uint64_t addr, cur_addr, size;
+    uint64_t addr, cur_addr, size, slot;
+    uint64_t scm_block_size = SPAPR_MINIMUM_SCM_BLOCK_SIZE;
     uint32_t nr_boot_lmbs = (machine->device_memory->base / lmb_size);
     uint64_t mem_end = machine->device_memory->base +
                        memory_region_size(&machine->device_memory->mr);
@@ -741,6 +744,7 @@ static int spapr_populate_drmem_v2(SpaprMachineState *spapr, void *fdt,
         addr = di->addr;
         size = di->size;
         node = di->node;
+        slot = di->slot;
 
         /* Entry for hot-pluggable area */
         if (cur_addr < addr) {
@@ -752,12 +756,20 @@ static int spapr_populate_drmem_v2(SpaprMachineState *spapr, void *fdt,
             nr_entries++;
         }
 
-        /* Entry for DIMM */
-        drc = spapr_drc_by_id(TYPE_SPAPR_DRC_LMB, addr / lmb_size);
-        g_assert(drc);
-        elem = spapr_get_drconf_cell(size / lmb_size, addr,
-                                     spapr_drc_index(drc), node,
-                                     SPAPR_LMB_FLAGS_ASSIGNED);
+        if (info->value->type == MEMORY_DEVICE_INFO_KIND_DIMM) {
+            /* Entry for DIMM */
+            drc = spapr_drc_by_id(TYPE_SPAPR_DRC_LMB, addr / lmb_size);
+            g_assert(drc);
+            elem = spapr_get_drconf_cell(size / lmb_size, addr,
+                                         spapr_drc_index(drc), node,
+                                         SPAPR_LMB_FLAGS_ASSIGNED);
+        } else if (info->value->type == MEMORY_DEVICE_INFO_KIND_NVDIMM) {
+            /* Entry for NVDIMM */
+            drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
+            g_assert(drc);
+            elem = spapr_get_drconf_cell(size / scm_block_size, addr,
+                                         spapr_drc_index(drc), -1, 0);
+        }
         QSIMPLEQ_INSERT_TAIL(&drconf_queue, elem, entry);
         nr_entries++;
         cur_addr = addr + size;
@@ -1261,6 +1273,85 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
     }
 }
 
+static int spapr_dt_nvdimm(void *fdt, int parent_offset,
+                           NVDIMMDevice *nvdimm)
+{
+    int child_offset;
+    char buf[40];
+    SpaprDrc *drc;
+    uint32_t drc_idx;
+    uint32_t node = object_property_get_uint(OBJECT(nvdimm), PC_DIMM_NODE_PROP,
+                                             &error_abort);
+    uint64_t slot = object_property_get_uint(OBJECT(nvdimm), PC_DIMM_SLOT_PROP,
+                                             &error_abort);
+    uint32_t associativity[] = {
+        cpu_to_be32(0x4), /* length */
+        cpu_to_be32(0x0), cpu_to_be32(0x0),
+        cpu_to_be32(0x0), cpu_to_be32(node)
+    };
+    uint64_t lsize = nvdimm->label_size;
+    uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
+                                            NULL);
+
+    drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
+    g_assert(drc);
+
+    drc_idx = spapr_drc_index(drc);
+
+    sprintf(buf, "ibm,pmemory@%x", drc_idx);
+    child_offset = fdt_add_subnode(fdt, parent_offset, buf);
+    _FDT(child_offset);
+
+    _FDT((fdt_setprop_cell(fdt, child_offset, "reg", drc_idx)));
+    _FDT((fdt_setprop_string(fdt, child_offset, "compatible", "ibm,pmemory")));
+    _FDT((fdt_setprop_string(fdt, child_offset, "device_type", "ibm,pmemory")));
+
+    _FDT((fdt_setprop(fdt, child_offset, "ibm,associativity", associativity,
+                      sizeof(associativity))));
+
+    qemu_uuid_unparse(&nvdimm->uuid, buf);
+    _FDT((fdt_setprop_string(fdt, child_offset, "ibm,unit-guid", buf)));
+
+    _FDT((fdt_setprop_cell(fdt, child_offset, "ibm,my-drc-index", drc_idx)));
+
+    _FDT((fdt_setprop_u64(fdt, child_offset, "ibm,block-size",
+                          SPAPR_MINIMUM_SCM_BLOCK_SIZE)));
+    _FDT((fdt_setprop_u64(fdt, child_offset, "ibm,number-of-blocks",
+                          size / SPAPR_MINIMUM_SCM_BLOCK_SIZE)));
+    _FDT((fdt_setprop_cell(fdt, child_offset, "ibm,metadata-size", lsize)));
+
+    _FDT((fdt_setprop_string(fdt, child_offset, "ibm,pmem-application",
+                             "operating-system")));
+    _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
+
+    return child_offset;
+}
+
+static void spapr_dt_persistent_memory(void *fdt)
+{
+    int offset = fdt_subnode_offset(fdt, 0, "persistent-memory");
+    GSList *iter, *nvdimms = nvdimm_get_device_list();
+
+    if (offset < 0) {
+        offset = fdt_add_subnode(fdt, 0, "persistent-memory");
+        _FDT(offset);
+        _FDT((fdt_setprop_cell(fdt, offset, "#address-cells", 0x1)));
+        _FDT((fdt_setprop_cell(fdt, offset, "#size-cells", 0x0)));
+        _FDT((fdt_setprop_string(fdt, offset, "device_type",
+                                 "ibm,persistent-memory")));
+    }
+
+    /* Create DT entries for cold plugged NVDIMM devices */
+    for (iter = nvdimms; iter; iter = iter->next) {
+        NVDIMMDevice *nvdimm = iter->data;
+
+        spapr_dt_nvdimm(fdt, offset, nvdimm);
+    }
+    g_slist_free(nvdimms);
+
+    return;
+}
+
 static void *spapr_build_fdt(SpaprMachineState *spapr)
 {
     MachineState *machine = MACHINE(spapr);
@@ -1392,6 +1483,11 @@ static void *spapr_build_fdt(SpaprMachineState *spapr)
         }
     }
 
+    /* NVDIMM devices */
+    if (mc->nvdimm_supported) {
+        spapr_dt_persistent_memory(fdt);
+    }
+
     return fdt;
 }
 
@@ -2521,6 +2617,16 @@ static void spapr_create_lmb_dr_connectors(SpaprMachineState *spapr)
     }
 }
 
+static void spapr_create_nvdimm_dr_connectors(SpaprMachineState *spapr)
+{
+    MachineState *machine = MACHINE(spapr);
+    int i;
+
+    for (i = 0; i < machine->ram_slots; i++) {
+        spapr_dr_connector_new(OBJECT(spapr), TYPE_SPAPR_DRC_PMEM, i);
+    }
+}
+
 /*
  * If RAM size, maxmem size and individual node mem sizes aren't aligned
  * to SPAPR_MEMORY_BLOCK_SIZE(256MB), then refuse to start the guest
@@ -2734,6 +2840,7 @@ static void spapr_machine_init(MachineState *machine)
 {
     SpaprMachineState *spapr = SPAPR_MACHINE(machine);
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
+    MachineClass *mc = MACHINE_GET_CLASS(machine);
     const char *kernel_filename = machine->kernel_filename;
     const char *initrd_filename = machine->initrd_filename;
     PCIHostState *phb;
@@ -2915,6 +3022,10 @@ static void spapr_machine_init(MachineState *machine)
         spapr_create_lmb_dr_connectors(spapr);
     }
 
+    if (mc->nvdimm_supported) {
+        spapr_create_nvdimm_dr_connectors(spapr);
+    }
+
     filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, "spapr-rtas.bin");
     if (!filename) {
         error_report("Could not find LPAR rtas '%s'", "spapr-rtas.bin");
@@ -3436,6 +3547,16 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp)
     }
 }
 
+int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
+                           void *fdt, int *fdt_start_offset, Error **errp)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(drc->dev);
+
+    *fdt_start_offset = spapr_dt_nvdimm(fdt, 0, nvdimm);
+
+    return 0;
+}
+
 int spapr_lmb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
                           void *fdt, int *fdt_start_offset, Error **errp)
 {
@@ -3498,13 +3619,34 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size,
     }
 }
 
+static void spapr_add_nvdimm(DeviceState *dev, uint64_t slot, Error **errp)
+{
+    SpaprDrc *drc;
+    bool hotplugged = spapr_drc_hotplugged(dev);
+    Error *local_err = NULL;
+
+    drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
+    g_assert(drc);
+
+    spapr_drc_attach(drc, dev, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    if (hotplugged) {
+        spapr_hotplug_req_add_by_index(drc);
+    }
+}
+
 static void spapr_memory_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
                               Error **errp)
 {
     Error *local_err = NULL;
     SpaprMachineState *ms = SPAPR_MACHINE(hotplug_dev);
     PCDIMMDevice *dimm = PC_DIMM(dev);
-    uint64_t size, addr;
+    uint64_t size, addr, slot;
+    bool is_nvdimm = object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM);
 
     size = memory_device_get_region_size(MEMORY_DEVICE(dev), &error_abort);
 
@@ -3513,14 +3655,24 @@ static void spapr_memory_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
         goto out;
     }
 
-    addr = object_property_get_uint(OBJECT(dimm),
-                                    PC_DIMM_ADDR_PROP, &local_err);
-    if (local_err) {
-        goto out_unplug;
+    if (!is_nvdimm) {
+        addr = object_property_get_uint(OBJECT(dimm),
+                                        PC_DIMM_ADDR_PROP, &local_err);
+        if (local_err) {
+            goto out_unplug;
+        }
+        spapr_add_lmbs(dev, addr, size,
+                       spapr_ovec_test(ms->ov5_cas, OV5_HP_EVT),
+                       &local_err);
+    } else {
+        slot = object_property_get_uint(OBJECT(dimm),
+                                        PC_DIMM_SLOT_PROP, &local_err);
+        if (local_err) {
+            goto out_unplug;
+        }
+        spapr_add_nvdimm(dev, slot, &local_err);
     }
 
-    spapr_add_lmbs(dev, addr, size, spapr_ovec_test(ms->ov5_cas, OV5_HP_EVT),
-                   &local_err);
     if (local_err) {
         goto out_unplug;
     }
@@ -3538,6 +3690,8 @@ static void spapr_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 {
     const SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(hotplug_dev);
     SpaprMachineState *spapr = SPAPR_MACHINE(hotplug_dev);
+    const MachineClass *mc = MACHINE_CLASS(smc);
+    bool is_nvdimm = object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM);
     PCDIMMDevice *dimm = PC_DIMM(dev);
     Error *local_err = NULL;
     uint64_t size;
@@ -3549,16 +3703,40 @@ static void spapr_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
         return;
     }
 
+    if (is_nvdimm && !mc->nvdimm_supported) {
+        error_setg(errp, "NVDIMM hotplug not supported for this machine");
+        return;
+    }
+
     size = memory_device_get_region_size(MEMORY_DEVICE(dimm), &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
         return;
     }
 
-    if (size % SPAPR_MEMORY_BLOCK_SIZE) {
+    if (!is_nvdimm && size % SPAPR_MEMORY_BLOCK_SIZE) {
         error_setg(errp, "Hotplugged memory size must be a multiple of "
-                      "%" PRIu64 " MB", SPAPR_MEMORY_BLOCK_SIZE / MiB);
+                   "%" PRIu64 " MB", SPAPR_MEMORY_BLOCK_SIZE / MiB);
         return;
+    } else if (is_nvdimm) {
+        char *uuidstr = NULL;
+        QemuUUID uuid;
+
+        if (size % SPAPR_MINIMUM_SCM_BLOCK_SIZE) {
+            error_setg(errp, "NVDIMM memory size excluding the label area"
+                       " must be a multiple of %" PRIu64 "MB",
+                       SPAPR_MINIMUM_SCM_BLOCK_SIZE / MiB);
+            return;
+        }
+
+        uuidstr = object_property_get_str(OBJECT(dimm), NVDIMM_UUID_PROP, NULL);
+        qemu_uuid_parse(uuidstr, &uuid);
+        g_free(uuidstr);
+
+        if (qemu_uuid_is_null(&uuid)) {
+            error_setg(errp, "NVDIMM device requires the uuid to be set");
+            return;
+        }
     }
 
     memdev = object_property_get_link(OBJECT(dimm), PC_DIMM_MEMDEV_PROP,
@@ -3698,6 +3876,12 @@ static void spapr_memory_unplug_request(HotplugHandler *hotplug_dev,
     int i;
     SpaprDrc *drc;
 
+    if (object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM)) {
+        error_setg(&local_err,
+                   "nvdimm device hot unplug is not supported yet.");
+        goto out;
+    }
+
     size = memory_device_get_region_size(MEMORY_DEVICE(dimm), &error_abort);
     nr_lmbs = size / SPAPR_MEMORY_BLOCK_SIZE;
 
@@ -4453,6 +4637,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
     smc->update_dt_enabled = true;
     mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power9_v2.0");
     mc->has_hotpluggable_cpus = true;
+    mc->nvdimm_supported = true;
     smc->resize_hpt_default = SPAPR_RESIZE_HPT_ENABLED;
     fwc->get_dev_path = spapr_get_fw_dev_path;
     nc->nmi_monitor_handler = spapr_nmi;
@@ -4558,6 +4743,7 @@ static void spapr_machine_4_1_class_options(MachineClass *mc)
     };
 
     spapr_machine_4_2_class_options(mc);
+    mc->nvdimm_supported = false;
     smc->linux_pci_probe = false;
     compat_props_add(mc->compat_props, hw_compat_4_1, hw_compat_4_1_len);
     compat_props_add(mc->compat_props, compat, G_N_ELEMENTS(compat));
diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
index 62f1a42592..815167e42f 100644
--- a/hw/ppc/spapr_drc.c
+++ b/hw/ppc/spapr_drc.c
@@ -708,6 +708,17 @@ static void spapr_drc_phb_class_init(ObjectClass *k, void *data)
     drck->dt_populate = spapr_phb_dt_populate;
 }
 
+static void spapr_drc_pmem_class_init(ObjectClass *k, void *data)
+{
+    SpaprDrcClass *drck = SPAPR_DR_CONNECTOR_CLASS(k);
+
+    drck->typeshift = SPAPR_DR_CONNECTOR_TYPE_SHIFT_PMEM;
+    drck->typename = "MEM";
+    drck->drc_name_prefix = "PMEM ";
+    drck->release = NULL;
+    drck->dt_populate = spapr_pmem_dt_populate;
+}
+
 static const TypeInfo spapr_dr_connector_info = {
     .name          = TYPE_SPAPR_DR_CONNECTOR,
     .parent        = TYPE_DEVICE,
@@ -758,6 +769,12 @@ static const TypeInfo spapr_drc_phb_info = {
     .class_init    = spapr_drc_phb_class_init,
 };
 
+static const TypeInfo spapr_drc_pmem_info = {
+    .name          = TYPE_SPAPR_DRC_PMEM,
+    .parent        = TYPE_SPAPR_DRC_LOGICAL,
+    .class_init    = spapr_drc_pmem_class_init,
+};
+
 /* helper functions for external users */
 
 SpaprDrc *spapr_drc_by_index(uint32_t index)
@@ -1229,6 +1246,7 @@ static void spapr_drc_register_types(void)
     type_register_static(&spapr_drc_pci_info);
     type_register_static(&spapr_drc_lmb_info);
     type_register_static(&spapr_drc_phb_info);
+    type_register_static(&spapr_drc_pmem_info);
 
     spapr_rtas_register(RTAS_SET_INDICATOR, "set-indicator",
                         rtas_set_indicator);
diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
index 0e4c19523a..b9a4d1607c 100644
--- a/hw/ppc/spapr_events.c
+++ b/hw/ppc/spapr_events.c
@@ -194,6 +194,7 @@ struct rtas_event_log_v6_hp {
 #define RTAS_LOG_V6_HP_TYPE_SLOT                         3
 #define RTAS_LOG_V6_HP_TYPE_PHB                          4
 #define RTAS_LOG_V6_HP_TYPE_PCI                          5
+#define RTAS_LOG_V6_HP_TYPE_PMEM                         6
     uint8_t hotplug_action;
 #define RTAS_LOG_V6_HP_ACTION_ADD                        1
 #define RTAS_LOG_V6_HP_ACTION_REMOVE                     2
@@ -530,6 +531,9 @@ static void spapr_hotplug_req_event(uint8_t hp_id, uint8_t hp_action,
     case SPAPR_DR_CONNECTOR_TYPE_PHB:
         hp->hotplug_type = RTAS_LOG_V6_HP_TYPE_PHB;
         break;
+    case SPAPR_DR_CONNECTOR_TYPE_PMEM:
+        hp->hotplug_type = RTAS_LOG_V6_HP_TYPE_PMEM;
+        break;
     default:
         /* we shouldn't be signaling hotplug events for resources
          * that don't support them
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index 523a9b3d4a..4807ca615b 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -25,6 +25,7 @@
 
 #include "hw/mem/pc-dimm.h"
 #include "hw/acpi/bios-linker-loader.h"
+#include "qemu/uuid.h"
 
 #define NVDIMM_DEBUG 0
 #define nvdimm_debug(fmt, ...)                                \
@@ -49,6 +50,7 @@
                                                TYPE_NVDIMM)
 
 #define NVDIMM_LABEL_SIZE_PROP "label-size"
+#define NVDIMM_UUID_PROP       "uuid"
 #define NVDIMM_UNARMED_PROP    "unarmed"
 
 struct NVDIMMDevice {
@@ -83,6 +85,11 @@ struct NVDIMMDevice {
      * the guest write persistence.
      */
     bool unarmed;
+
+    /*
+     * PPC64 (spapr) requires each nvdimm device to have a uuid.
+     */
+    QemuUUID uuid;
 };
 typedef struct NVDIMMDevice NVDIMMDevice;
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 03111fd55b..a8cb3513d0 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -811,6 +811,8 @@ int spapr_core_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
 void spapr_lmb_release(DeviceState *dev);
 int spapr_lmb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
                           void *fdt, int *fdt_start_offset, Error **errp);
+int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
+                           void *fdt, int *fdt_start_offset, Error **errp);
 void spapr_phb_release(DeviceState *dev);
 int spapr_phb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
                           void *fdt, int *fdt_start_offset, Error **errp);
@@ -846,6 +848,15 @@ int spapr_rtc_import_offset(SpaprRtcState *rtc, int64_t legacy_offset);
 #define SPAPR_LMB_FLAGS_DRC_INVALID 0x00000020
 #define SPAPR_LMB_FLAGS_RESERVED 0x00000080
 
+/*
+ * The nvdimm size should be aligned to SCM block size.
+ * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
+ * in order to have SCM regions not overlap with dimm memory regions.
+ * The SCM devices can have variable block sizes. For now, fixing the
+ * block size to the minimum value.
+ */
+#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
+
 void spapr_do_system_reset_on_cpu(CPUState *cs, run_on_cpu_data arg);
 
 #define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
diff --git a/include/hw/ppc/spapr_drc.h b/include/hw/ppc/spapr_drc.h
index 83f03cc577..df3d958a66 100644
--- a/include/hw/ppc/spapr_drc.h
+++ b/include/hw/ppc/spapr_drc.h
@@ -78,6 +78,13 @@
 #define SPAPR_DRC_PHB(obj) OBJECT_CHECK(SpaprDrc, (obj), \
                                         TYPE_SPAPR_DRC_PHB)
 
+#define TYPE_SPAPR_DRC_PMEM "spapr-drc-pmem"
+#define SPAPR_DRC_PMEM_GET_CLASS(obj) \
+        OBJECT_GET_CLASS(SpaprDrcClass, obj, TYPE_SPAPR_DRC_PMEM)
+#define SPAPR_DRC_PMEM_CLASS(klass) \
+        OBJECT_CLASS_CHECK(SpaprDrcClass, klass, TYPE_SPAPR_DRC_PMEM)
+#define SPAPR_DRC_PMEM(obj) OBJECT_CHECK(SpaprDrc, (obj), \
+                                         TYPE_SPAPR_DRC_PMEM)
 /*
  * Various hotplug types managed by SpaprDrc
  *
@@ -95,6 +102,7 @@ typedef enum {
     SPAPR_DR_CONNECTOR_TYPE_SHIFT_VIO = 3,
     SPAPR_DR_CONNECTOR_TYPE_SHIFT_PCI = 4,
     SPAPR_DR_CONNECTOR_TYPE_SHIFT_LMB = 8,
+    SPAPR_DR_CONNECTOR_TYPE_SHIFT_PMEM = 9,
 } SpaprDrcTypeShift;
 
 typedef enum {
@@ -104,6 +112,7 @@ typedef enum {
     SPAPR_DR_CONNECTOR_TYPE_VIO = 1 << SPAPR_DR_CONNECTOR_TYPE_SHIFT_VIO,
     SPAPR_DR_CONNECTOR_TYPE_PCI = 1 << SPAPR_DR_CONNECTOR_TYPE_SHIFT_PCI,
     SPAPR_DR_CONNECTOR_TYPE_LMB = 1 << SPAPR_DR_CONNECTOR_TYPE_SHIFT_LMB,
+    SPAPR_DR_CONNECTOR_TYPE_PMEM = 1 << SPAPR_DR_CONNECTOR_TYPE_SHIFT_PMEM,
 } SpaprDrcType;
 
 /*




* [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device
  2019-10-14 18:37 [PATCH v3 0/3] ppc: spapr: virtual NVDIMM support Shivaprasad G Bhat
  2019-10-14 18:37 ` [PATCH v3 1/3] mem: move nvdimm_device_list to utilities Shivaprasad G Bhat
  2019-10-14 18:37 ` [PATCH v3 2/3] spapr: Add NVDIMM device support Shivaprasad G Bhat
@ 2019-10-14 18:38 ` Shivaprasad G Bhat
  2019-11-22  5:11   ` David Gibson
  2 siblings, 1 reply; 18+ messages in thread
From: Shivaprasad G Bhat @ 2019-10-14 18:38 UTC (permalink / raw)
  To: imammedo, david, qemu-ppc, xiaoguangrong.eric, mst; +Cc: sbhat, qemu-devel

This patch implements a few of the necessary hcalls for nvdimm support.

PAPR semantics are such that each NVDIMM device comprises multiple
SCM (Storage Class Memory) blocks. The guest requests the hypervisor
to bind each of the SCM blocks of the NVDIMM device using hcalls.
SCM block unbind requests can arrive on driver errors or unplug (not
supported yet). The NVDIMM label read/writes are done through hcalls.

Since each virtual NVDIMM device is divided into multiple SCM blocks,
the bind, unbind, and query hcalls on those blocks can arrive
independently. This doesn't fit well with the qemu device semantics,
where map/unmap are done at whole-device/object granularity. The patch
therefore doesn't actually bind/unbind on the hcalls but lets that
happen at the device_add/del phase itself instead.

The guest kernel makes bind/unbind requests for the virtual NVDIMM
device at region granularity. Without interleaving, each virtual
NVDIMM device is presented as a separate region, and there is no way
to configure virtual NVDIMM interleaving for guests today. So a
partial bind/unbind request, i.e. an hcall covering only a subset of
the SCM blocks of a virtual NVDIMM, can never arrive, and it is safe
to bind/unbind everything during device_add/del.
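
For reference, the argument and return conventions as implemented
below (args[] maps to the hcall input/output registers) are:

    H_SCM_BIND_MEM:       in:  drcIndex, startingScmBlockIndex,
                               numScmBlocksToBind,
                               targetAddr (must be -1; qemu assigns),
                               continueToken (must be 0)
                          out: args[1] = bound logical address,
                               args[2] = blocks bound
    H_SCM_UNBIND_MEM:     in:  drcIndex, startingScmLogicalAddr,
                               numScmBlocksToUnbind, continueToken (0)
                          out: args[1] = blocks unbound
    H_SCM_UNBIND_ALL:     in:  scope (DRC or ALL), drcIndex,
                               continueToken (0)
                          out: args[1] = blocks unbound
    H_SCM_READ_METADATA:  in:  drcIndex, offset,
                               numBytesToRead (1/2/4/8)
                          out: args[0] = data
    H_SCM_WRITE_METADATA: in:  drcIndex, offset, data,
                               numBytesToWrite (1/2/4/8)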

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr_hcall.c   |  300 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h |    8 +
 2 files changed, 307 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 23e4bdb829..4e9ad96f7c 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -18,6 +18,10 @@
 #include "hw/ppc/spapr_ovec.h"
 #include "mmu-book3s-v3.h"
 #include "hw/mem/memory-device.h"
+#include "hw/ppc/spapr_drc.h"
+#include "hw/mem/nvdimm.h"
+#include "qemu/range.h"
+#include "qemu/nvdimm-utils.h"
 
 static bool has_spr(PowerPCCPU *cpu, int spr)
 {
@@ -1961,6 +1965,295 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
     return H_SUCCESS;
 }
 
+static target_ulong h_scm_read_metadata(PowerPCCPU *cpu,
+                                        SpaprMachineState *spapr,
+                                        target_ulong opcode,
+                                        target_ulong *args)
+{
+    uint32_t drc_index = args[0];
+    uint64_t offset = args[1];
+    uint64_t numBytesToRead = args[2];
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    NVDIMMDevice *nvdimm;
+    NVDIMMClass *ddc;
+    __be64 data_be = 0;
+    uint64_t data = 0;
+
+    if (!drc || spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    if (numBytesToRead != 1 && numBytesToRead != 2 &&
+        numBytesToRead != 4 && numBytesToRead != 8) {
+        return H_P3;
+    }
+
+    nvdimm = NVDIMM(drc->dev);
+    if ((offset + numBytesToRead < offset) ||
+        (nvdimm->label_size < numBytesToRead + offset)) {
+        return H_P2;
+    }
+
+    ddc = NVDIMM_GET_CLASS(nvdimm);
+    ddc->read_label_data(nvdimm, &data_be, numBytesToRead, offset);
+
+    switch (numBytesToRead) {
+    case 1:
+        data = data_be & 0xff;
+        break;
+    case 2:
+        data = be16_to_cpu(data_be & 0xffff);
+        break;
+    case 4:
+        data = be32_to_cpu(data_be & 0xffffffff);
+        break;
+    case 8:
+        data = be64_to_cpu(data_be);
+        break;
+    default:
+        break;
+    }
+
+    args[0] = data;
+
+    return H_SUCCESS;
+}
+
+static target_ulong h_scm_write_metadata(PowerPCCPU *cpu,
+                                         SpaprMachineState *spapr,
+                                         target_ulong opcode,
+                                         target_ulong *args)
+{
+    uint32_t drc_index = args[0];
+    uint64_t offset = args[1];
+    uint64_t data = args[2];
+    uint64_t numBytesToWrite = args[3];
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    NVDIMMDevice *nvdimm;
+    DeviceState *dev;
+    NVDIMMClass *ddc;
+    __be64 data_be = 0;
+
+    if (!drc || spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    if (numBytesToWrite != 1 && numBytesToWrite != 2 &&
+        numBytesToWrite != 4 && numBytesToWrite != 8) {
+        return H_P4;
+    }
+
+    dev = drc->dev;
+    nvdimm = NVDIMM(dev);
+
+    switch (numBytesToWrite) {
+    case 1:
+        if (data & 0xffffffffffffff00) {
+            return H_P2;
+        }
+        data_be = data & 0xff;
+        break;
+    case 2:
+        if (data & 0xffffffffffff0000) {
+            return H_P2;
+        }
+        data_be = cpu_to_be16(data & 0xffff);
+        break;
+    case 4:
+        if (data & 0xffffffff00000000) {
+            return H_P2;
+        }
+        data_be = cpu_to_be32(data & 0xffffffff);
+        break;
+    case 8:
+        data_be = cpu_to_be64(data);
+        break;
+    default: /* lint */
+        break;
+    }
+
+    ddc = NVDIMM_GET_CLASS(nvdimm);
+    ddc->write_label_data(nvdimm, &data_be, numBytesToWrite, offset);
+
+    return H_SUCCESS;
+}
+
+static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                                   target_ulong opcode, target_ulong *args)
+{
+    uint32_t drc_index = args[0];
+    uint64_t starting_idx = args[1];
+    uint64_t no_of_scm_blocks_to_bind = args[2];
+    uint64_t target_logical_mem_addr = args[3];
+    uint64_t continue_token = args[4];
+    uint64_t size;
+    uint64_t total_no_of_scm_blocks;
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    hwaddr addr;
+    DeviceState *dev;
+    PCDIMMDevice *dimm;
+    Error *local_err = NULL;
+
+    if (!drc || spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    dev = drc->dev;
+    dimm = PC_DIMM(dev);
+
+    size = object_property_get_uint(OBJECT(dimm),
+                                    PC_DIMM_SIZE_PROP, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return H_PARAMETER;
+    }
+
+    total_no_of_scm_blocks = size / SPAPR_MINIMUM_SCM_BLOCK_SIZE;
+
+    if ((starting_idx > total_no_of_scm_blocks) ||
+        (no_of_scm_blocks_to_bind > total_no_of_scm_blocks)) {
+        return H_P2;
+    }
+
+    if (((starting_idx + no_of_scm_blocks_to_bind) < starting_idx) ||
+        ((starting_idx + no_of_scm_blocks_to_bind) > total_no_of_scm_blocks)) {
+        return H_P3;
+    }
+
+    /* Currently qemu assigns the address. */
+    if (target_logical_mem_addr != 0xffffffffffffffff) {
+        return H_OVERLAP;
+    }
+
+    /*
+     * Currently the continue token should be zero: qemu has already
+     * bound everything and this hcall doesn't return H_BUSY.
+     */
+    if (continue_token > 0) {
+        return H_P5;
+    }
+
+    addr = object_property_get_uint(OBJECT(dimm),
+                                    PC_DIMM_ADDR_PROP, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return H_PARAMETER;
+    }
+
+    addr += starting_idx * SPAPR_MINIMUM_SCM_BLOCK_SIZE;
+
+    /* Already bound, Return target logical address in R4 */
+    args[1] = addr;
+    args[2] = no_of_scm_blocks_to_bind;
+
+    return H_SUCCESS;
+}
+
+static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                                     target_ulong opcode, target_ulong *args)
+{
+    uint32_t drc_index = args[0];
+    uint64_t starting_scm_logical_addr = args[1];
+    uint64_t no_of_scm_blocks_to_unbind = args[2];
+    uint64_t continue_token = args[3];
+    uint64_t size_to_unbind;
+    Range blockrange = range_empty;
+    Range nvdimmrange = range_empty;
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    DeviceState *dev;
+    PCDIMMDevice *dimm;
+    uint64_t size, addr;
+
+    if (!drc || spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    /* Check if starting_scm_logical_addr is block aligned */
+    if (!QEMU_IS_ALIGNED(starting_scm_logical_addr,
+                         SPAPR_MINIMUM_SCM_BLOCK_SIZE)) {
+        return H_P2;
+    }
+
+    dev = drc->dev;
+    dimm = PC_DIMM(dev);
+    size = object_property_get_int(OBJECT(dimm), PC_DIMM_SIZE_PROP, NULL);
+    addr = object_property_get_int(OBJECT(dimm), PC_DIMM_ADDR_PROP, NULL);
+
+    range_init_nofail(&nvdimmrange, addr, size);
+
+    size_to_unbind = no_of_scm_blocks_to_unbind * SPAPR_MINIMUM_SCM_BLOCK_SIZE;
+
+
+    range_init_nofail(&blockrange, starting_scm_logical_addr, size_to_unbind);
+
+    if (!range_contains_range(&nvdimmrange, &blockrange)) {
+        return H_P3;
+    }
+
+    /* continue_token should be zero as this hcall doesn't return H_BUSY. */
+    if (continue_token > 0) {
+        return H_P3;
+    }
+
+    args[1] = no_of_scm_blocks_to_unbind;
+
+    /* let unplug take care of actual unbind */
+    return H_SUCCESS;
+}
+
+#define H_UNBIND_SCOPE_ALL 0x1
+#define H_UNBIND_SCOPE_DRC 0x2
+
+static target_ulong h_scm_unbind_all(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                                     target_ulong opcode, target_ulong *args)
+{
+    uint64_t target_scope = args[0];
+    uint32_t drc_index = args[1];
+    uint64_t continue_token = args[2];
+    NVDIMMDevice *nvdimm;
+    uint64_t size;
+    uint64_t no_of_scm_blocks_unbound = 0;
+
+    if (target_scope == H_UNBIND_SCOPE_DRC) {
+        DeviceState *dev;
+        SpaprDrc *drc = spapr_drc_by_index(drc_index);
+
+        if (!drc || spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+            return H_P2;
+        }
+
+        dev = drc->dev;
+        nvdimm = NVDIMM(dev);
+        size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP, NULL);
+
+        no_of_scm_blocks_unbound = size / SPAPR_MINIMUM_SCM_BLOCK_SIZE;
+    } else if (target_scope ==  H_UNBIND_SCOPE_ALL) {
+        GSList *list, *dimms;
+
+        dimms = nvdimm_get_device_list();
+        for (list = dimms; list; list = list->next) {
+            nvdimm = list->data;
+            size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
+                                           NULL);
+
+            no_of_scm_blocks_unbound += size / SPAPR_MINIMUM_SCM_BLOCK_SIZE;
+        }
+        g_slist_free(dimms);
+    } else {
+        return H_PARAMETER;
+    }
+
+    /* continue_token should be zero as this hcall doesn't return H_BUSY. */
+    if (continue_token > 0) {
+        return H_P4;
+    }
+
+    args[1] = no_of_scm_blocks_unbound;
+
+    /* let unplug take care of actual unbind */
+    return H_SUCCESS;
+}
+
 static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
 static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - KVMPPC_HCALL_BASE + 1];
 static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 4 + 1];
@@ -2079,6 +2372,13 @@ static void hypercall_register_types(void)
     /* qemu/KVM-PPC specific hcalls */
     spapr_register_hypercall(KVMPPC_H_RTAS, h_rtas);
 
+    /* qemu/scm specific hcalls */
+    spapr_register_hypercall(H_SCM_READ_METADATA, h_scm_read_metadata);
+    spapr_register_hypercall(H_SCM_WRITE_METADATA, h_scm_write_metadata);
+    spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
+    spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
+    spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
+
     /* ibm,client-architecture-support support */
     spapr_register_hypercall(KVMPPC_H_CAS, h_client_architecture_support);
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index a8cb3513d0..e1933e877d 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -286,6 +286,7 @@ struct SpaprMachineState {
 #define H_P7              -60
 #define H_P8              -61
 #define H_P9              -62
+#define H_OVERLAP         -68
 #define H_UNSUPPORTED_FLAG -256
 #define H_MULTI_THREADS_ACTIVE -9005
 
@@ -493,8 +494,13 @@ struct SpaprMachineState {
 #define H_INT_ESB               0x3C8
 #define H_INT_SYNC              0x3CC
 #define H_INT_RESET             0x3D0
+#define H_SCM_READ_METADATA     0x3E4
+#define H_SCM_WRITE_METADATA    0x3E8
+#define H_SCM_BIND_MEM          0x3EC
+#define H_SCM_UNBIND_MEM        0x3F0
+#define H_SCM_UNBIND_ALL        0x3FC
 
-#define MAX_HCALL_OPCODE        H_INT_RESET
+#define MAX_HCALL_OPCODE        H_SCM_UNBIND_ALL
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.




* Re: [PATCH v3 1/3] mem: move nvdimm_device_list to utilities
  2019-10-14 18:37 ` [PATCH v3 1/3] mem: move nvdimm_device_list to utilities Shivaprasad G Bhat
@ 2019-11-19  2:58   ` David Gibson
  2019-11-19  7:13   ` Igor Mammedov
  1 sibling, 0 replies; 18+ messages in thread
From: David Gibson @ 2019-11-19  2:58 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: xiaoguangrong.eric, mst, sbhat, qemu-devel, qemu-ppc, imammedo


On Mon, Oct 14, 2019 at 01:37:37PM -0500, Shivaprasad G Bhat wrote:
> nvdimm_device_list is required for parsing the list for devices
> in subsequent patches. Move it to common utility area.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>

LGTM, assuming it's needed in the subsequent patches.  An ack from
NVDIMM people would be nice if I'm to merge it via my tree.

> ---
>  hw/acpi/nvdimm.c            |   28 +---------------------------
>  include/qemu/nvdimm-utils.h |    7 +++++++
>  util/Makefile.objs          |    1 +
>  util/nvdimm-utils.c         |   29 +++++++++++++++++++++++++++++
>  4 files changed, 38 insertions(+), 27 deletions(-)
>  create mode 100644 include/qemu/nvdimm-utils.h
>  create mode 100644 util/nvdimm-utils.c
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index 9fdad6dc3f..5219dd0e2e 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -32,33 +32,7 @@
>  #include "hw/acpi/bios-linker-loader.h"
>  #include "hw/nvram/fw_cfg.h"
>  #include "hw/mem/nvdimm.h"
> -
> -static int nvdimm_device_list(Object *obj, void *opaque)
> -{
> -    GSList **list = opaque;
> -
> -    if (object_dynamic_cast(obj, TYPE_NVDIMM)) {
> -        *list = g_slist_append(*list, DEVICE(obj));
> -    }
> -
> -    object_child_foreach(obj, nvdimm_device_list, opaque);
> -    return 0;
> -}
> -
> -/*
> - * inquire NVDIMM devices and link them into the list which is
> - * returned to the caller.
> - *
> - * Note: it is the caller's responsibility to free the list to avoid
> - * memory leak.
> - */
> -static GSList *nvdimm_get_device_list(void)
> -{
> -    GSList *list = NULL;
> -
> -    object_child_foreach(qdev_get_machine(), nvdimm_device_list, &list);
> -    return list;
> -}
> +#include "qemu/nvdimm-utils.h"
>  
>  #define NVDIMM_UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)             \
>     { (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \
> diff --git a/include/qemu/nvdimm-utils.h b/include/qemu/nvdimm-utils.h
> new file mode 100644
> index 0000000000..4b8b198ba7
> --- /dev/null
> +++ b/include/qemu/nvdimm-utils.h
> @@ -0,0 +1,7 @@
> +#ifndef NVDIMM_UTILS_H
> +#define NVDIMM_UTILS_H
> +
> +#include "qemu/osdep.h"
> +
> +GSList *nvdimm_get_device_list(void);
> +#endif
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 41bf59d127..a0f40d26e3 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -20,6 +20,7 @@ util-obj-y += envlist.o path.o module.o
>  util-obj-y += host-utils.o
>  util-obj-y += bitmap.o bitops.o hbitmap.o
>  util-obj-y += fifo8.o
> +util-obj-y += nvdimm-utils.o
>  util-obj-y += cacheinfo.o
>  util-obj-y += error.o qemu-error.o
>  util-obj-y += qemu-print.o
> diff --git a/util/nvdimm-utils.c b/util/nvdimm-utils.c
> new file mode 100644
> index 0000000000..5cc768ca47
> --- /dev/null
> +++ b/util/nvdimm-utils.c
> @@ -0,0 +1,29 @@
> +#include "qemu/nvdimm-utils.h"
> +#include "hw/mem/nvdimm.h"
> +
> +static int nvdimm_device_list(Object *obj, void *opaque)
> +{
> +    GSList **list = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_NVDIMM)) {
> +        *list = g_slist_append(*list, DEVICE(obj));
> +    }
> +
> +    object_child_foreach(obj, nvdimm_device_list, opaque);
> +    return 0;
> +}
> +
> +/*
> + * inquire NVDIMM devices and link them into the list which is
> + * returned to the caller.
> + *
> + * Note: it is the caller's responsibility to free the list to avoid
> + * memory leak.
> + */
> +GSList *nvdimm_get_device_list(void)
> +{
> +    GSList *list = NULL;
> +
> +    object_child_foreach(qdev_get_machine(), nvdimm_device_list, &list);
> +    return list;
> +}
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH v3 1/3] mem: move nvdimm_device_list to utilities
  2019-10-14 18:37 ` [PATCH v3 1/3] mem: move nvdimm_device_list to utilities Shivaprasad G Bhat
  2019-11-19  2:58   ` David Gibson
@ 2019-11-19  7:13   ` Igor Mammedov
  2019-11-20  8:01     ` Shivaprasad G Bhat
  1 sibling, 1 reply; 18+ messages in thread
From: Igor Mammedov @ 2019-11-19  7:13 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: xiaoguangrong.eric, mst, qemu-devel, sbhat, qemu-ppc, david

On Mon, 14 Oct 2019 13:37:37 -0500
Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:

> nvdimm_device_list is required for parsing the list for devices
> in subsequent patches. Move it to common utility area.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
>  hw/acpi/nvdimm.c            |   28 +---------------------------
>  include/qemu/nvdimm-utils.h |    7 +++++++
>  util/Makefile.objs          |    1 +
>  util/nvdimm-utils.c         |   29 +++++++++++++++++++++++++++++
instead of creating new file, why not to move it to existing hw/mem/nvdimm.c?

>  4 files changed, 38 insertions(+), 27 deletions(-)
>  create mode 100644 include/qemu/nvdimm-utils.h
>  create mode 100644 util/nvdimm-utils.c
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index 9fdad6dc3f..5219dd0e2e 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -32,33 +32,7 @@
>  #include "hw/acpi/bios-linker-loader.h"
>  #include "hw/nvram/fw_cfg.h"
>  #include "hw/mem/nvdimm.h"
> -
> -static int nvdimm_device_list(Object *obj, void *opaque)
> -{
> -    GSList **list = opaque;
> -
> -    if (object_dynamic_cast(obj, TYPE_NVDIMM)) {
> -        *list = g_slist_append(*list, DEVICE(obj));
> -    }
> -
> -    object_child_foreach(obj, nvdimm_device_list, opaque);
> -    return 0;
> -}
> -
> -/*
> - * inquire NVDIMM devices and link them into the list which is
> - * returned to the caller.
> - *
> - * Note: it is the caller's responsibility to free the list to avoid
> - * memory leak.
> - */
> -static GSList *nvdimm_get_device_list(void)
> -{
> -    GSList *list = NULL;
> -
> -    object_child_foreach(qdev_get_machine(), nvdimm_device_list, &list);
> -    return list;
> -}
> +#include "qemu/nvdimm-utils.h"
>  
>  #define NVDIMM_UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)             \
>     { (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \
> diff --git a/include/qemu/nvdimm-utils.h b/include/qemu/nvdimm-utils.h
> new file mode 100644
> index 0000000000..4b8b198ba7
> --- /dev/null
> +++ b/include/qemu/nvdimm-utils.h
> @@ -0,0 +1,7 @@
> +#ifndef NVDIMM_UTILS_H
> +#define NVDIMM_UTILS_H
> +
> +#include "qemu/osdep.h"
> +
> +GSList *nvdimm_get_device_list(void);
> +#endif
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 41bf59d127..a0f40d26e3 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -20,6 +20,7 @@ util-obj-y += envlist.o path.o module.o
>  util-obj-y += host-utils.o
>  util-obj-y += bitmap.o bitops.o hbitmap.o
>  util-obj-y += fifo8.o
> +util-obj-y += nvdimm-utils.o
>  util-obj-y += cacheinfo.o
>  util-obj-y += error.o qemu-error.o
>  util-obj-y += qemu-print.o
> diff --git a/util/nvdimm-utils.c b/util/nvdimm-utils.c
> new file mode 100644
> index 0000000000..5cc768ca47
> --- /dev/null
> +++ b/util/nvdimm-utils.c
> @@ -0,0 +1,29 @@
> +#include "qemu/nvdimm-utils.h"
> +#include "hw/mem/nvdimm.h"
> +
> +static int nvdimm_device_list(Object *obj, void *opaque)
> +{
> +    GSList **list = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_NVDIMM)) {
> +        *list = g_slist_append(*list, DEVICE(obj));
> +    }
> +
> +    object_child_foreach(obj, nvdimm_device_list, opaque);
> +    return 0;
> +}
> +
> +/*
> + * inquire NVDIMM devices and link them into the list which is
> + * returned to the caller.
> + *
> + * Note: it is the caller's responsibility to free the list to avoid
> + * memory leak.
> + */
> +GSList *nvdimm_get_device_list(void)
> +{
> +    GSList *list = NULL;
> +
> +    object_child_foreach(qdev_get_machine(), nvdimm_device_list, &list);
> +    return list;
> +}
> 
> 




* Re: [PATCH v3 1/3] mem: move nvdimm_device_list to utilities
  2019-11-19  7:13   ` Igor Mammedov
@ 2019-11-20  8:01     ` Shivaprasad G Bhat
  2019-11-20  9:35       ` Igor Mammedov
  0 siblings, 1 reply; 18+ messages in thread
From: Shivaprasad G Bhat @ 2019-11-20  8:01 UTC (permalink / raw)
  To: Igor Mammedov; +Cc: xiaoguangrong.eric, mst, qemu-devel, sbhat, qemu-ppc, david

Hi Igor,


On 11/19/2019 12:43 PM, Igor Mammedov wrote:
> On Mon, 14 Oct 2019 13:37:37 -0500
> Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:
>
>> nvdimm_device_list is required for parsing the list for devices
>> in subsequent patches. Move it to common utility area.
>>
>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
>> ---
>>   hw/acpi/nvdimm.c            |   28 +---------------------------
>>   include/qemu/nvdimm-utils.h |    7 +++++++
>>   util/Makefile.objs          |    1 +
>>   util/nvdimm-utils.c         |   29 +++++++++++++++++++++++++++++
> instead of creating new file, why not to move it to existing hw/mem/nvdimm.c?

That would break the build for mips-softmmu. mips has
CONFIG_ACPI_NVDIMM=y but not CONFIG_NVDIMM, so the build would fail
to find the definition in hw/mem/nvdimm.c.

I have the patch here from v2 of the series,

https://github.com/ShivaprasadGBhat/qemu/commit/00512a25e4852f174fe6c07bc5acb5ee7027e3de

Thanks,
Shivaprasad




* Re: [PATCH v3 1/3] mem: move nvdimm_device_list to utilities
  2019-11-20  8:01     ` Shivaprasad G Bhat
@ 2019-11-20  9:35       ` Igor Mammedov
  0 siblings, 0 replies; 18+ messages in thread
From: Igor Mammedov @ 2019-11-20  9:35 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: xiaoguangrong.eric, mst, qemu-devel, qemu-ppc, sbhat, david

On Wed, 20 Nov 2019 13:31:34 +0530
Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:

> Hi Igor,
> 
> 
> On 11/19/2019 12:43 PM, Igor Mammedov wrote:
> > On Mon, 14 Oct 2019 13:37:37 -0500
> > Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:
> >  
> >> nvdimm_device_list is required for parsing the list for devices
> >> in subsequent patches. Move it to common utility area.
> >>
> >> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> >> ---
> >>   hw/acpi/nvdimm.c            |   28 +---------------------------
> >>   include/qemu/nvdimm-utils.h |    7 +++++++
> >>   util/Makefile.objs          |    1 +
> >>   util/nvdimm-utils.c         |   29 +++++++++++++++++++++++++++++  
> > instead of creating new file, why not to move it to existing hw/mem/nvdimm.c?  
> 
> That would break the build for mips-softmmu. mips has
> CONFIG_ACPI_NVDIMM=y but not CONFIG_NVDIMM, so the build would fail to
> find the definition in hw/mem/nvdimm.c.

Yes, I forgot that mips doesn't really use any acpi stuff, but it
still pulls in the files as a dependency via piix4, and trying to
decouple it is not worth the effort.

So let's go ahead with your variant using util/nvdimm-utils.c.

Reviewed-by: Igor Mammedov <imammedo@redhat.com>

> I have the patch here from v2 of the series,
> https://github.com/ShivaprasadGBhat/qemu/commit/00512a25e4852f174fe6c07bc5acb5ee7027e3de
>
> 
> Thanks,
> Shivaprasad
> 
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] spapr: Add NVDIMM device support
  2019-10-14 18:37 ` [PATCH v3 2/3] spapr: Add NVDIMM device support Shivaprasad G Bhat
@ 2019-11-22  4:30   ` David Gibson
  2019-11-27  4:20     ` Bharata B Rao
  2019-12-16 11:15     ` Shivaprasad G Bhat
  0 siblings, 2 replies; 18+ messages in thread
From: David Gibson @ 2019-11-22  4:30 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: xiaoguangrong.eric, mst, sbhat, qemu-devel, qemu-ppc, imammedo

[-- Attachment #1: Type: text/plain, Size: 26937 bytes --]

On Mon, Oct 14, 2019 at 01:37:50PM -0500, Shivaprasad G Bhat wrote:
> Add support for NVDIMM devices for sPAPR. Piggyback on the existing
> nvdimm device interface in QEMU to support virtual NVDIMM devices for
> Power.
> Create the required DT entries for the device (some entries have
> dummy values right now).
> 
> The patch creates the required DT node and sends a hotplug
> interrupt to the guest. The guest is expected to undertake the normal
> DR resource add path in response and start issuing PAPR SCM hcalls.
> 
> Unlike on x86, the device support is gated on the machine version.
> 
> This is how it can be used:
> Ex:
> For coldplug, the device to be added in qemu command line as shown below
> -object memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
> -device nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0
> 
> For hotplug, the device to be added from monitor as below
> object_add memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
> device_add nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
>                [Early implementation]
> ---
>  default-configs/ppc64-softmmu.mak |    1 
>  hw/mem/Kconfig                    |    2 
>  hw/mem/nvdimm.c                   |   40 +++++++
>  hw/ppc/spapr.c                    |  218 ++++++++++++++++++++++++++++++++++---
>  hw/ppc/spapr_drc.c                |   18 +++
>  hw/ppc/spapr_events.c             |    4 +
>  include/hw/mem/nvdimm.h           |    7 +
>  include/hw/ppc/spapr.h            |   11 ++
>  include/hw/ppc/spapr_drc.h        |    9 ++
>  9 files changed, 293 insertions(+), 17 deletions(-)
> 
> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> index cca52665d9..ae0841fa3a 100644
> --- a/default-configs/ppc64-softmmu.mak
> +++ b/default-configs/ppc64-softmmu.mak
> @@ -8,3 +8,4 @@ CONFIG_POWERNV=y
>  
>  # For pSeries
>  CONFIG_PSERIES=y
> +CONFIG_NVDIMM=y
> diff --git a/hw/mem/Kconfig b/hw/mem/Kconfig
> index 620fd4cb59..2ad052a536 100644
> --- a/hw/mem/Kconfig
> +++ b/hw/mem/Kconfig
> @@ -8,4 +8,4 @@ config MEM_DEVICE
>  config NVDIMM
>      bool
>      default y
> -    depends on PC
> +    depends on (PC || PSERIES)
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index 375f9a588a..e1238b5bed 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -69,11 +69,51 @@ out:
>      error_propagate(errp, local_err);
>  }
>  
> +static void nvdimm_get_uuid(Object *obj, Visitor *v, const char *name,
> +                                  void *opaque, Error **errp)
> +{
> +    NVDIMMDevice *nvdimm = NVDIMM(obj);
> +    char *value = NULL;
> +
> +    value = qemu_uuid_unparse_strdup(&nvdimm->uuid);
> +
> +    visit_type_str(v, name, &value, errp);
> +    g_free(value);
> +}
> +
> +
> +static void nvdimm_set_uuid(Object *obj, Visitor *v, const char *name,
> +                                  void *opaque, Error **errp)
> +{
> +    NVDIMMDevice *nvdimm = NVDIMM(obj);
> +    Error *local_err = NULL;
> +    char *value;
> +
> +    visit_type_str(v, name, &value, &local_err);
> +    if (local_err) {
> +        goto out;
> +    }
> +
> +    if (qemu_uuid_parse(value, &nvdimm->uuid) != 0) {
> +        error_setg(errp, "Property '%s.%s' has invalid value",
> +                   object_get_typename(obj), name);
> +        goto out;
> +    }
> +    g_free(value);
> +
> +out:
> +    error_propagate(errp, local_err);
> +}
> +
> +
>  static void nvdimm_init(Object *obj)
>  {
>      object_property_add(obj, NVDIMM_LABEL_SIZE_PROP, "int",
>                          nvdimm_get_label_size, nvdimm_set_label_size, NULL,
>                          NULL, NULL);
> +
> +    object_property_add(obj, NVDIMM_UUID_PROP, "QemuUUID", nvdimm_get_uuid,
> +                        nvdimm_set_uuid, NULL, NULL, NULL);

Adding a property to the generic NVDIMM device feels like it should go
in a separate patch.

>  }
>  
>  static void nvdimm_finalize(Object *obj)
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 08a2a5a770..eb5c205078 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -80,6 +80,8 @@
>  #include "hw/ppc/spapr_cpu_core.h"
>  #include "hw/mem/memory-device.h"
>  #include "hw/ppc/spapr_tpm_proxy.h"
> +#include "hw/mem/nvdimm.h"
> +#include "qemu/nvdimm-utils.h"
>  
>  #include <libfdt.h>
>  
> @@ -716,7 +718,8 @@ static int spapr_populate_drmem_v2(SpaprMachineState *spapr, void *fdt,
>      uint8_t *int_buf, *cur_index;
>      int ret;
>      uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
> -    uint64_t addr, cur_addr, size;
> +    uint64_t addr, cur_addr, size, slot;
> +    uint64_t scm_block_size = SPAPR_MINIMUM_SCM_BLOCK_SIZE;
>      uint32_t nr_boot_lmbs = (machine->device_memory->base / lmb_size);
>      uint64_t mem_end = machine->device_memory->base +
>                         memory_region_size(&machine->device_memory->mr);
> @@ -741,6 +744,7 @@ static int spapr_populate_drmem_v2(SpaprMachineState *spapr, void *fdt,
>          addr = di->addr;
>          size = di->size;
>          node = di->node;
> +        slot = di->slot;
>  
>          /* Entry for hot-pluggable area */
>          if (cur_addr < addr) {
> @@ -752,12 +756,20 @@ static int spapr_populate_drmem_v2(SpaprMachineState *spapr, void *fdt,
>              nr_entries++;
>          }
>  
> -        /* Entry for DIMM */
> -        drc = spapr_drc_by_id(TYPE_SPAPR_DRC_LMB, addr / lmb_size);
> -        g_assert(drc);
> -        elem = spapr_get_drconf_cell(size / lmb_size, addr,
> -                                     spapr_drc_index(drc), node,
> -                                     SPAPR_LMB_FLAGS_ASSIGNED);
> +        if (info->value->type == MEMORY_DEVICE_INFO_KIND_DIMM) {
> +            /* Entry for DIMM */
> +            drc = spapr_drc_by_id(TYPE_SPAPR_DRC_LMB, addr / lmb_size);
> +            g_assert(drc);
> +            elem = spapr_get_drconf_cell(size / lmb_size, addr,
> +                                         spapr_drc_index(drc), node,
> +                                         SPAPR_LMB_FLAGS_ASSIGNED);
> +        } else if (info->value->type == MEMORY_DEVICE_INFO_KIND_NVDIMM) {
> +            /* Entry for NVDIMM */
> +            drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
> +            g_assert(drc);
> +            elem = spapr_get_drconf_cell(size / scm_block_size, addr,
> +                                         spapr_drc_index(drc), -1, 0);
> +        }

Ok.  A number of queries about this.

1) The PAPR spec for ibm,dynamic-memory-v2 says that the first word in
each entry is the number of LMBs, but for NVDIMMs you use the
not-necessarily-equal scm_block_size instead.  Does the NVDIMM
amendment for PAPR really specify to use different block sizes for
these cases?  (In which case that's a really stupid spec decision, but
that wouldn't surprise me at this point).

2) Similarly, the ibm,dynamic-memory-v2 description says that the
memory block described by the entry has a whole batch of contiguous
DRCs starting at the DRC index given and continuing for #LMBs DRCs.
For NVDIMMs it appears that you just have one DRC for the whole
NVDIMM.  Is that right?

3) You're not setting *any* extra flags on the entry.  How is the
guest supposed to know which are NVDIMM entries and which are regular
DIMM entries?  AFAICT in this version the NVDIMM slots are
indistinguishable from the unassigned hotplug memory (which makes the
difference in LMB and DRC numbering even more troubling).

4) AFAICT these are _present_ NVDIMMs, so why is
SPAPR_LMB_FLAGS_ASSIGNED not set for them?  (and why is the node
forced to -1, regardless of di->node).

>          QSIMPLEQ_INSERT_TAIL(&drconf_queue, elem, entry);
>          nr_entries++;
>          cur_addr = addr + size;
> @@ -1261,6 +1273,85 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
>      }
>  }
>  
> +static int spapr_dt_nvdimm(void *fdt, int parent_offset,
> +                           NVDIMMDevice *nvdimm)
> +{
> +    int child_offset;
> +    char buf[40];
> +    SpaprDrc *drc;
> +    uint32_t drc_idx;
> +    uint32_t node = object_property_get_uint(OBJECT(nvdimm), PC_DIMM_NODE_PROP,
> +                                             &error_abort);
> +    uint64_t slot = object_property_get_uint(OBJECT(nvdimm), PC_DIMM_SLOT_PROP,
> +                                             &error_abort);
> +    uint32_t associativity[] = {
> +        cpu_to_be32(0x4), /* length */
> +        cpu_to_be32(0x0), cpu_to_be32(0x0),
> +        cpu_to_be32(0x0), cpu_to_be32(node)
> +    };
> +    uint64_t lsize = nvdimm->label_size;
> +    uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
> +                                            NULL);
> +
> +    drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
> +    g_assert(drc);
> +
> +    drc_idx = spapr_drc_index(drc);
> +
> +    sprintf(buf, "ibm,pmemory@%x", drc_idx);
> +    child_offset = fdt_add_subnode(fdt, parent_offset, buf);
> +    _FDT(child_offset);
> +
> +    _FDT((fdt_setprop_cell(fdt, child_offset, "reg", drc_idx)));
> +    _FDT((fdt_setprop_string(fdt, child_offset, "compatible", "ibm,pmemory")));
> +    _FDT((fdt_setprop_string(fdt, child_offset, "device_type", "ibm,pmemory")));
> +
> +    _FDT((fdt_setprop(fdt, child_offset, "ibm,associativity", associativity,
> +                      sizeof(associativity))));
> +
> +    qemu_uuid_unparse(&nvdimm->uuid, buf);
> +    _FDT((fdt_setprop_string(fdt, child_offset, "ibm,unit-guid", buf)));
> +
> +    _FDT((fdt_setprop_cell(fdt, child_offset, "ibm,my-drc-index", drc_idx)));
> +
> +    _FDT((fdt_setprop_u64(fdt, child_offset, "ibm,block-size",
> +                          SPAPR_MINIMUM_SCM_BLOCK_SIZE)));
> +    _FDT((fdt_setprop_u64(fdt, child_offset, "ibm,number-of-blocks",
> +                          size / SPAPR_MINIMUM_SCM_BLOCK_SIZE)));
> +    _FDT((fdt_setprop_cell(fdt, child_offset, "ibm,metadata-size", lsize)));
> +
> +    _FDT((fdt_setprop_string(fdt, child_offset, "ibm,pmem-application",
> +                             "operating-system")));
> +    _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
> +
> +    return child_offset;
> +}
> +
> +static void spapr_dt_persistent_memory(void *fdt)
> +{
> +    int offset = fdt_subnode_offset(fdt, 0, "persistent-memory");
> +    GSList *iter, *nvdimms = nvdimm_get_device_list();
> +
> +    if (offset < 0) {
> +        offset = fdt_add_subnode(fdt, 0, "persistent-memory");
> +        _FDT(offset);
> +        _FDT((fdt_setprop_cell(fdt, offset, "#address-cells", 0x1)));
> +        _FDT((fdt_setprop_cell(fdt, offset, "#size-cells", 0x0)));
> +        _FDT((fdt_setprop_string(fdt, offset, "device_type",
> +                                 "ibm,persistent-memory")));
> +    }
> +
> +    /* Create DT entries for cold plugged NVDIMM devices */
> +    for (iter = nvdimms; iter; iter = iter->next) {
> +        NVDIMMDevice *nvdimm = iter->data;
> +
> +        spapr_dt_nvdimm(fdt, offset, nvdimm);
> +    }
> +    g_slist_free(nvdimms);
> +
> +    return;
> +}
> +
>  static void *spapr_build_fdt(SpaprMachineState *spapr)
>  {
>      MachineState *machine = MACHINE(spapr);
> @@ -1392,6 +1483,11 @@ static void *spapr_build_fdt(SpaprMachineState *spapr)
>          }
>      }
>  
> +    /* NVDIMM devices */
> +    if (mc->nvdimm_supported) {
> +        spapr_dt_persistent_memory(fdt);
> +    }
> +
>      return fdt;
>  }
>  
> @@ -2521,6 +2617,16 @@ static void spapr_create_lmb_dr_connectors(SpaprMachineState *spapr)
>      }
>  }
>  
> +static void spapr_create_nvdimm_dr_connectors(SpaprMachineState *spapr)
> +{
> +    MachineState *machine = MACHINE(spapr);
> +    int i;
> +
> +    for (i = 0; i < machine->ram_slots; i++) {
> +        spapr_dr_connector_new(OBJECT(spapr), TYPE_SPAPR_DRC_PMEM, i);

What happens if you try to plug an NVDIMM to one of these slots, but a
regular DIMM has already taken it?

> +    }
> +}
> +
>  /*
>   * If RAM size, maxmem size and individual node mem sizes aren't aligned
>   * to SPAPR_MEMORY_BLOCK_SIZE(256MB), then refuse to start the guest
> @@ -2734,6 +2840,7 @@ static void spapr_machine_init(MachineState *machine)
>  {
>      SpaprMachineState *spapr = SPAPR_MACHINE(machine);
>      SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> +    MachineClass *mc = MACHINE_GET_CLASS(machine);
>      const char *kernel_filename = machine->kernel_filename;
>      const char *initrd_filename = machine->initrd_filename;
>      PCIHostState *phb;
> @@ -2915,6 +3022,10 @@ static void spapr_machine_init(MachineState *machine)
>          spapr_create_lmb_dr_connectors(spapr);
>      }
>  
> +    if (mc->nvdimm_supported) {
> +        spapr_create_nvdimm_dr_connectors(spapr);
> +    }
> +
>      filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, "spapr-rtas.bin");
>      if (!filename) {
>          error_report("Could not find LPAR rtas '%s'", "spapr-rtas.bin");
> @@ -3436,6 +3547,16 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp)
>      }
>  }
>  
> +int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
> +                           void *fdt, int *fdt_start_offset, Error **errp)
> +{
> +    NVDIMMDevice *nvdimm = NVDIMM(drc->dev);
> +
> +    *fdt_start_offset = spapr_dt_nvdimm(fdt, 0, nvdimm);
> +
> +    return 0;
> +}
> +
>  int spapr_lmb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>                            void *fdt, int *fdt_start_offset, Error **errp)
>  {
> @@ -3498,13 +3619,34 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size,
>      }
>  }
>  
> +static void spapr_add_nvdimm(DeviceState *dev, uint64_t slot, Error **errp)
> +{
> +    SpaprDrc *drc;
> +    bool hotplugged = spapr_drc_hotplugged(dev);
> +    Error *local_err = NULL;
> +
> +    drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
> +    g_assert(drc);
> +
> +    spapr_drc_attach(drc, dev, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    if (hotplugged) {
> +        spapr_hotplug_req_add_by_index(drc);
> +    }
> +}
> +
>  static void spapr_memory_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>                                Error **errp)
>  {
>      Error *local_err = NULL;
>      SpaprMachineState *ms = SPAPR_MACHINE(hotplug_dev);
>      PCDIMMDevice *dimm = PC_DIMM(dev);
> -    uint64_t size, addr;
> +    uint64_t size, addr, slot;
> +    bool is_nvdimm = object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM);
>  
>      size = memory_device_get_region_size(MEMORY_DEVICE(dev), &error_abort);
>  
> @@ -3513,14 +3655,24 @@ static void spapr_memory_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>          goto out;
>      }
>  
> -    addr = object_property_get_uint(OBJECT(dimm),
> -                                    PC_DIMM_ADDR_PROP, &local_err);
> -    if (local_err) {
> -        goto out_unplug;
> +    if (!is_nvdimm) {
> +        addr = object_property_get_uint(OBJECT(dimm),
> +                                        PC_DIMM_ADDR_PROP, &local_err);
> +        if (local_err) {
> +            goto out_unplug;
> +        }
> +        spapr_add_lmbs(dev, addr, size,
> +                       spapr_ovec_test(ms->ov5_cas, OV5_HP_EVT),
> +                       &local_err);
> +    } else {
> +        slot = object_property_get_uint(OBJECT(dimm),
> +                                        PC_DIMM_SLOT_PROP, &local_err);
> +        if (local_err) {
> +            goto out_unplug;
> +        }
> +        spapr_add_nvdimm(dev, slot, &local_err);
>      }
>  
> -    spapr_add_lmbs(dev, addr, size, spapr_ovec_test(ms->ov5_cas, OV5_HP_EVT),
> -                   &local_err);
>      if (local_err) {
>          goto out_unplug;
>      }
> @@ -3538,6 +3690,8 @@ static void spapr_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>  {
>      const SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(hotplug_dev);
>      SpaprMachineState *spapr = SPAPR_MACHINE(hotplug_dev);
> +    const MachineClass *mc = MACHINE_CLASS(smc);
> +    bool is_nvdimm = object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM);
>      PCDIMMDevice *dimm = PC_DIMM(dev);
>      Error *local_err = NULL;
>      uint64_t size;
> @@ -3549,16 +3703,40 @@ static void spapr_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>          return;
>      }
>  
> +    if (is_nvdimm && !mc->nvdimm_supported) {
> +        error_setg(errp, "NVDIMM hotplug not supported for this machine");
> +        return;
> +    }
> +
>      size = memory_device_get_region_size(MEMORY_DEVICE(dimm), &local_err);
>      if (local_err) {
>          error_propagate(errp, local_err);
>          return;
>      }
>  
> -    if (size % SPAPR_MEMORY_BLOCK_SIZE) {
> +    if (!is_nvdimm && size % SPAPR_MEMORY_BLOCK_SIZE) {
>          error_setg(errp, "Hotplugged memory size must be a multiple of "
> -                      "%" PRIu64 " MB", SPAPR_MEMORY_BLOCK_SIZE / MiB);
> +                   "%" PRIu64 " MB", SPAPR_MEMORY_BLOCK_SIZE / MiB);
>          return;
> +    } else if (is_nvdimm) {
> +        char *uuidstr = NULL;
> +        QemuUUID uuid;
> +
> +        if (size % SPAPR_MINIMUM_SCM_BLOCK_SIZE) {
> +            error_setg(errp, "NVDIMM memory size excluding the label area"
> +                       " must be a multiple of %" PRIu64 "MB",
> +                       SPAPR_MINIMUM_SCM_BLOCK_SIZE / MiB);
> +            return;
> +        }
> +
> +        uuidstr = object_property_get_str(OBJECT(dimm), NVDIMM_UUID_PROP, NULL);
> +        qemu_uuid_parse(uuidstr, &uuid);
> +        g_free(uuidstr);
> +
> +        if (qemu_uuid_is_null(&uuid)) {
> +            error_setg(errp, "NVDIMM device requires the uuid to be set");
> +            return;
> +        }
>      }
>  
>      memdev = object_property_get_link(OBJECT(dimm), PC_DIMM_MEMDEV_PROP,
> @@ -3698,6 +3876,12 @@ static void spapr_memory_unplug_request(HotplugHandler *hotplug_dev,
>      int i;
>      SpaprDrc *drc;
>  
> +    if (object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM)) {
> +        error_setg(&local_err,
> +                   "nvdimm device hot unplug is not supported yet.");
> +        goto out;
> +    }
> +
>      size = memory_device_get_region_size(MEMORY_DEVICE(dimm), &error_abort);
>      nr_lmbs = size / SPAPR_MEMORY_BLOCK_SIZE;
>  
> @@ -4453,6 +4637,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>      smc->update_dt_enabled = true;
>      mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power9_v2.0");
>      mc->has_hotpluggable_cpus = true;
> +    mc->nvdimm_supported = true;
>      smc->resize_hpt_default = SPAPR_RESIZE_HPT_ENABLED;
>      fwc->get_dev_path = spapr_get_fw_dev_path;
>      nc->nmi_monitor_handler = spapr_nmi;
> @@ -4558,6 +4743,7 @@ static void spapr_machine_4_1_class_options(MachineClass *mc)
>      };
>  
>      spapr_machine_4_2_class_options(mc);
> +    mc->nvdimm_supported = false;

This is too late for qemu-4.2, so this will have to move to the
machine_4_2 function.

>      smc->linux_pci_probe = false;
>      compat_props_add(mc->compat_props, hw_compat_4_1, hw_compat_4_1_len);
>      compat_props_add(mc->compat_props, compat, G_N_ELEMENTS(compat));
> diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
> index 62f1a42592..815167e42f 100644
> --- a/hw/ppc/spapr_drc.c
> +++ b/hw/ppc/spapr_drc.c
> @@ -708,6 +708,17 @@ static void spapr_drc_phb_class_init(ObjectClass *k, void *data)
>      drck->dt_populate = spapr_phb_dt_populate;
>  }
>  
> +static void spapr_drc_pmem_class_init(ObjectClass *k, void *data)
> +{
> +    SpaprDrcClass *drck = SPAPR_DR_CONNECTOR_CLASS(k);
> +
> +    drck->typeshift = SPAPR_DR_CONNECTOR_TYPE_SHIFT_PMEM;
> +    drck->typename = "MEM";

This is the same as the typename for LMB DRCs.  Doesn't that mean that
ibm,drc-types will end up with a duplicate in it?

> +    drck->drc_name_prefix = "PMEM ";
> +    drck->release = NULL;
> +    drck->dt_populate = spapr_pmem_dt_populate;
> +}
> +
>  static const TypeInfo spapr_dr_connector_info = {
>      .name          = TYPE_SPAPR_DR_CONNECTOR,
>      .parent        = TYPE_DEVICE,
> @@ -758,6 +769,12 @@ static const TypeInfo spapr_drc_phb_info = {
>      .class_init    = spapr_drc_phb_class_init,
>  };
>  
> +static const TypeInfo spapr_drc_pmem_info = {
> +    .name          = TYPE_SPAPR_DRC_PMEM,
> +    .parent        = TYPE_SPAPR_DRC_LOGICAL,
> +    .class_init    = spapr_drc_pmem_class_init,
> +};
> +
>  /* helper functions for external users */
>  
>  SpaprDrc *spapr_drc_by_index(uint32_t index)
> @@ -1229,6 +1246,7 @@ static void spapr_drc_register_types(void)
>      type_register_static(&spapr_drc_pci_info);
>      type_register_static(&spapr_drc_lmb_info);
>      type_register_static(&spapr_drc_phb_info);
> +    type_register_static(&spapr_drc_pmem_info);
>  
>      spapr_rtas_register(RTAS_SET_INDICATOR, "set-indicator",
>                          rtas_set_indicator);
> diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
> index 0e4c19523a..b9a4d1607c 100644
> --- a/hw/ppc/spapr_events.c
> +++ b/hw/ppc/spapr_events.c
> @@ -194,6 +194,7 @@ struct rtas_event_log_v6_hp {
>  #define RTAS_LOG_V6_HP_TYPE_SLOT                         3
>  #define RTAS_LOG_V6_HP_TYPE_PHB                          4
>  #define RTAS_LOG_V6_HP_TYPE_PCI                          5
> +#define RTAS_LOG_V6_HP_TYPE_PMEM                         6
>      uint8_t hotplug_action;
>  #define RTAS_LOG_V6_HP_ACTION_ADD                        1
>  #define RTAS_LOG_V6_HP_ACTION_REMOVE                     2
> @@ -530,6 +531,9 @@ static void spapr_hotplug_req_event(uint8_t hp_id, uint8_t hp_action,
>      case SPAPR_DR_CONNECTOR_TYPE_PHB:
>          hp->hotplug_type = RTAS_LOG_V6_HP_TYPE_PHB;
>          break;
> +    case SPAPR_DR_CONNECTOR_TYPE_PMEM:
> +        hp->hotplug_type = RTAS_LOG_V6_HP_TYPE_PMEM;
> +        break;
>      default:
>          /* we shouldn't be signaling hotplug events for resources
>           * that don't support them
> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> index 523a9b3d4a..4807ca615b 100644
> --- a/include/hw/mem/nvdimm.h
> +++ b/include/hw/mem/nvdimm.h
> @@ -25,6 +25,7 @@
>  
>  #include "hw/mem/pc-dimm.h"
>  #include "hw/acpi/bios-linker-loader.h"
> +#include "qemu/uuid.h"
>  
>  #define NVDIMM_DEBUG 0
>  #define nvdimm_debug(fmt, ...)                                \
> @@ -49,6 +50,7 @@
>                                                 TYPE_NVDIMM)
>  
>  #define NVDIMM_LABEL_SIZE_PROP "label-size"
> +#define NVDIMM_UUID_PROP       "uuid"
>  #define NVDIMM_UNARMED_PROP    "unarmed"
>  
>  struct NVDIMMDevice {
> @@ -83,6 +85,11 @@ struct NVDIMMDevice {
>       * the guest write persistence.
>       */
>      bool unarmed;
> +
> +    /*
> +     * PPC64 sPAPR requires each nvdimm device to have a uuid.
> +     */
> +    QemuUUID uuid;
>  };
>  typedef struct NVDIMMDevice NVDIMMDevice;
>  
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 03111fd55b..a8cb3513d0 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -811,6 +811,8 @@ int spapr_core_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>  void spapr_lmb_release(DeviceState *dev);
>  int spapr_lmb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>                            void *fdt, int *fdt_start_offset, Error **errp);
> +int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
> +                           void *fdt, int *fdt_start_offset, Error **errp);
>  void spapr_phb_release(DeviceState *dev);
>  int spapr_phb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>                            void *fdt, int *fdt_start_offset, Error **errp);
> @@ -846,6 +848,15 @@ int spapr_rtc_import_offset(SpaprRtcState *rtc, int64_t legacy_offset);
>  #define SPAPR_LMB_FLAGS_DRC_INVALID 0x00000020
>  #define SPAPR_LMB_FLAGS_RESERVED 0x00000080
>  
> +/*
> + * The nvdimm size should be aligned to SCM block size.
> + * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
> + * in order to keep SCM regions from overlapping with DIMM memory regions.
> + * The SCM devices can have variable block sizes. For now, fixing the
> + * block size to the minimum value.
> + */
> +#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
> +
>  void spapr_do_system_reset_on_cpu(CPUState *cs, run_on_cpu_data arg);
>  
>  #define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> diff --git a/include/hw/ppc/spapr_drc.h b/include/hw/ppc/spapr_drc.h
> index 83f03cc577..df3d958a66 100644
> --- a/include/hw/ppc/spapr_drc.h
> +++ b/include/hw/ppc/spapr_drc.h
> @@ -78,6 +78,13 @@
>  #define SPAPR_DRC_PHB(obj) OBJECT_CHECK(SpaprDrc, (obj), \
>                                          TYPE_SPAPR_DRC_PHB)
>  
> +#define TYPE_SPAPR_DRC_PMEM "spapr-drc-pmem"
> +#define SPAPR_DRC_PMEM_GET_CLASS(obj) \
> +        OBJECT_GET_CLASS(SpaprDrcClass, obj, TYPE_SPAPR_DRC_PMEM)
> +#define SPAPR_DRC_PMEM_CLASS(klass) \
> +        OBJECT_CLASS_CHECK(SpaprDrcClass, klass, TYPE_SPAPR_DRC_PMEM)
> +#define SPAPR_DRC_PMEM(obj) OBJECT_CHECK(SpaprDrc, (obj), \
> +                                         TYPE_SPAPR_DRC_PMEM)
>  /*
>   * Various hotplug types managed by SpaprDrc
>   *
> @@ -95,6 +102,7 @@ typedef enum {
>      SPAPR_DR_CONNECTOR_TYPE_SHIFT_VIO = 3,
>      SPAPR_DR_CONNECTOR_TYPE_SHIFT_PCI = 4,
>      SPAPR_DR_CONNECTOR_TYPE_SHIFT_LMB = 8,
> +    SPAPR_DR_CONNECTOR_TYPE_SHIFT_PMEM = 9,
>  } SpaprDrcTypeShift;
>  
>  typedef enum {
> @@ -104,6 +112,7 @@ typedef enum {
>      SPAPR_DR_CONNECTOR_TYPE_VIO = 1 << SPAPR_DR_CONNECTOR_TYPE_SHIFT_VIO,
>      SPAPR_DR_CONNECTOR_TYPE_PCI = 1 << SPAPR_DR_CONNECTOR_TYPE_SHIFT_PCI,
>      SPAPR_DR_CONNECTOR_TYPE_LMB = 1 << SPAPR_DR_CONNECTOR_TYPE_SHIFT_LMB,
> +    SPAPR_DR_CONNECTOR_TYPE_PMEM = 1 << SPAPR_DR_CONNECTOR_TYPE_SHIFT_PMEM,
>  } SpaprDrcType;
>  
>  /*
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device
  2019-10-14 18:38 ` [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device Shivaprasad G Bhat
@ 2019-11-22  5:11   ` David Gibson
  2019-12-17  6:10     ` Shivaprasad G Bhat
  0 siblings, 1 reply; 18+ messages in thread
From: David Gibson @ 2019-11-22  5:11 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: xiaoguangrong.eric, mst, sbhat, qemu-devel, qemu-ppc, imammedo

[-- Attachment #1: Type: text/plain, Size: 15628 bytes --]

On Mon, Oct 14, 2019 at 01:38:16PM -0500, Shivaprasad G Bhat wrote:
> This patch implements a few of the necessary hcalls for the nvdimm support.
> 
> PAPR semantics are such that each NVDIMM device comprises multiple
> SCM (Storage Class Memory) blocks. The guest requests the hypervisor to
> bind each of the SCM blocks of the NVDIMM device using hcalls. There can
> be SCM block unbind requests in case of driver errors or unplug (not
> supported now) use cases. The NVDIMM label reads/writes are done through
> hcalls.
> 
> Since each virtual NVDIMM device is divided into multiple SCM blocks,
> the bind, unbind, and queries using hcalls on those blocks can come
> independently. This doesn't fit well into the qemu device semantics,
> where the map/unmap are done at the (whole)device/object level granularity.
> The patch doesn't actually bind/unbind on hcalls but lets it happen at the
> device_add/del phase itself instead.
> 
> The guest kernel makes bind/unbind requests for the virtual NVDIMM device
> at the region level granularity. Without interleaving, each virtual NVDIMM

It's not clear to me what a "region" means in this context.

> device is presented as a separate region. There is no way to configure
> virtual NVDIMM interleaving for the guests today, so there is no way a
> partial bind/unbind request can come for the vNVDIMM in an hcall for a
> subset of SCM blocks of a virtual NVDIMM. Hence it is safe to do the
> bind/unbind of everything during device_add/del.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
> ---
>  hw/ppc/spapr_hcall.c   |  300 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr.h |    8 +
>  2 files changed, 307 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 23e4bdb829..4e9ad96f7c 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c

This is large enough and sufficiently non-core that I'd suggest
putting it in its own file.

> @@ -18,6 +18,10 @@
>  #include "hw/ppc/spapr_ovec.h"
>  #include "mmu-book3s-v3.h"
>  #include "hw/mem/memory-device.h"
> +#include "hw/ppc/spapr_drc.h"
> +#include "hw/mem/nvdimm.h"
> +#include "qemu/range.h"
> +#include "qemu/nvdimm-utils.h"
>  
>  static bool has_spr(PowerPCCPU *cpu, int spr)
>  {
> @@ -1961,6 +1965,295 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +static target_ulong h_scm_read_metadata(PowerPCCPU *cpu,
> +                                        SpaprMachineState *spapr,
> +                                        target_ulong opcode,
> +                                        target_ulong *args)
> +{
> +    uint32_t drc_index = args[0];
> +    uint64_t offset = args[1];
> +    uint64_t numBytesToRead = args[2];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    NVDIMMDevice *nvdimm;
> +    NVDIMMClass *ddc;
> +    __be64 data_be = 0;
> +    uint64_t data = 0;
> +
> +    if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {

I'm pretty sure you want if (!drc || ... ) here.  Otherwise if the
given index is totally bogus you'll crash a few lines down.
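
i.e. a guard along these lines (sketch):

    if (!drc || spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
        return H_PARAMETER;
    }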

> +        return H_PARAMETER;
> +    }
> +
> +    if (numBytesToRead != 1 && numBytesToRead != 2 &&
> +        numBytesToRead != 4 && numBytesToRead != 8) {
> +        return H_P3;
> +    }
> +
> +    nvdimm = NVDIMM(drc->dev);

What if the drc slot is not plugged at present (so drc->dev is NULL)?

> +    if ((offset + numBytesToRead < offset) ||
> +        (nvdimm->label_size < numBytesToRead + offset)) {
> +        return H_P2;
> +    }
> +
> +    ddc = NVDIMM_GET_CLASS(nvdimm);
> +    ddc->read_label_data(nvdimm, &data_be, numBytesToRead, offset);
> +
> +    switch (numBytesToRead) {
> +    case 1:
> +        data = data_be & 0xff;

The read_label_data above is only filling in the first byte of
data_be.  That only corresponds to data_be & 0xff if the host is
little endian which you shouldn't rely on.

Also, I'm not sure in what sense "data_be" is big-endian as the name
suggests.  AFAICT the label data is a byte-addressable array, which
means it doesn't really have an endianness.

I think what you're trying to do here is put the 0th byte of label
into the LSB of the return value ... the 7th byte of label data into
the MSB of the return value.

From the point of view of the 64-bit value you're returning in a
register, that makes the label data effectively *LE* (LSB first), not
BE.  Since you're initializing data_be to 0, you can just load into it
and le64_to_cpu() without reference to numBytesToRead to accomplish
this.
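
A minimal sketch of that simplification (assuming, per the above, that
read_label_data() fills bytes from the start of the buffer):

    uint64_t data = 0;

    ddc = NVDIMM_GET_CLASS(nvdimm);
    ddc->read_label_data(nvdimm, &data, numBytesToRead, offset);

    /* byte 0 of the label data lands in the LSB of the result */
    args[0] = le64_to_cpu(data);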

> +        break;
> +    case 2:
> +        data = be16_to_cpu(data_be & 0xffff);
> +        break;
> +    case 4:
> +        data = be32_to_cpu(data_be & 0xffffffff);
> +        break;
> +    case 8:
> +        data = be64_to_cpu(data_be);
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    args[0] = data;
> +
> +    return H_SUCCESS;
> +}
> +
> +static target_ulong h_scm_write_metadata(PowerPCCPU *cpu,
> +                                         SpaprMachineState *spapr,
> +                                         target_ulong opcode,
> +                                         target_ulong *args)
> +{
> +    uint32_t drc_index = args[0];
> +    uint64_t offset = args[1];
> +    uint64_t data = args[2];
> +    uint64_t numBytesToWrite = args[3];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    NVDIMMDevice *nvdimm;
> +    DeviceState *dev;
> +    NVDIMMClass *ddc;
> +    __be64 data_be = 0;
> +
> +    if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {

Same error for !drc as noted above.

> +        return H_PARAMETER;
> +    }
> +
> +    if (numBytesToWrite != 1 && numBytesToWrite != 2 &&
> +        numBytesToWrite != 4 && numBytesToWrite != 8) {
> +        return H_P4;
> +    }
> +
> +    dev = drc->dev;
> +    nvdimm = NVDIMM(dev);
> +
> +    switch (numBytesToWrite) {
> +    case 1:
> +        if (data & 0xffffffffffffff00) {
> +            return H_P2;
> +        }
> +        data_be = data & 0xff;
> +        break;
> +    case 2:
> +        if (data & 0xffffffffffff0000) {
> +            return H_P2;
> +        }
> +        data_be = cpu_to_be16(data & 0xffff);
> +        break;
> +    case 4:
> +        if (data & 0xffffffff00000000) {
> +            return H_P2;
> +        }
> +        data_be = cpu_to_be32(data & 0xffffffff);
> +        break;
> +    case 8:
> +        data_be = cpu_to_be64(data);
> +        break;
> +    default: /* lint */
> +            break;
> +    }

Euch, you can definitely find a less repetitive way to make these
checks than a big switch.  Also, mirroring the read function, AFAICT
what you actually want here is a cpu_to_le64().
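
For instance, the whole switch collapses to one mask test plus one
conversion (a sketch; data_le is just data_be renamed for the corrected
byte order, and the numBytesToWrite < 8 guard avoids an undefined
64-bit shift):

    uint64_t data_le;

    if (numBytesToWrite < 8 &&
        (data & ~((1ULL << (numBytesToWrite * 8)) - 1))) {
        return H_P2;    /* value wider than the requested write size */
    }
    data_le = cpu_to_le64(data);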

> +
> +    ddc = NVDIMM_GET_CLASS(nvdimm);
> +    ddc->write_label_data(nvdimm, &data_be, numBytesToWrite, offset);
> +
> +    return H_SUCCESS;
> +}
> +
> +static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                   target_ulong opcode, target_ulong *args)
> +{
> +    uint32_t drc_index = args[0];
> +    uint64_t starting_idx = args[1];
> +    uint64_t no_of_scm_blocks_to_bind = args[2];
> +    uint64_t target_logical_mem_addr = args[3];
> +    uint64_t continue_token = args[4];
> +    uint64_t size;
> +    uint64_t total_no_of_scm_blocks;
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    hwaddr addr;
> +    DeviceState *dev;
> +    PCDIMMDevice *dimm;
> +    Error *local_err = NULL;
> +
> +    if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {

Check !drc case again.

> +        return H_PARAMETER;
> +    }
> +
> +    dev = drc->dev;

And the !drc->dev case.

> +    dimm = PC_DIMM(dev);
> +
> +    size = object_property_get_uint(OBJECT(dimm),
> +                                    PC_DIMM_SIZE_PROP, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        return H_PARAMETER;
> +    }
> +
> +    total_no_of_scm_blocks = size / SPAPR_MINIMUM_SCM_BLOCK_SIZE;
> +
> +    if ((starting_idx > total_no_of_scm_blocks) ||
> +        (no_of_scm_blocks_to_bind > total_no_of_scm_blocks)) {

I think the second check is redundant with the check below.
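
The second condition can simply be dropped; the range test below
already rejects an oversized count (sketch):

    if (starting_idx > total_no_of_scm_blocks) {
        return H_P2;
    }

    if (((starting_idx + no_of_scm_blocks_to_bind) < starting_idx) ||
        ((starting_idx + no_of_scm_blocks_to_bind) > total_no_of_scm_blocks)) {
        return H_P3;
    }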

> +        return H_P2;
> +    }
> +
> +    if (((starting_idx + no_of_scm_blocks_to_bind) < starting_idx) ||
> +        ((starting_idx + no_of_scm_blocks_to_bind) > total_no_of_scm_blocks)) {
> +        return H_P3;
> +    }
> +
> +    /* Currently qemu assigns the address. */
> +    if (target_logical_mem_addr != 0xffffffffffffffff) {
> +        return H_OVERLAP;
> +    }
> +
> +    /*
> +     * Currently continue_token should be zero, as qemu has already bound
> +     * everything and this hcall doesn't return H_BUSY.
> +     */
> +    if (continue_token > 0) {
> +        return H_P5;
> +    }
> +
> +    addr = object_property_get_uint(OBJECT(dimm),
> +                                    PC_DIMM_ADDR_PROP, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        return H_PARAMETER;
> +    }
> +
> +    addr += starting_idx * SPAPR_MINIMUM_SCM_BLOCK_SIZE;
> +
> +    /* Already bound, Return target logical address in R4 */
> +    args[1] = addr;
> +    args[2] = no_of_scm_blocks_to_bind;
> +
> +    return H_SUCCESS;
> +}
> +
> +static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                     target_ulong opcode, target_ulong *args)
> +{
> +    uint32_t drc_index = args[0];
> +    uint64_t starting_scm_logical_addr = args[1];
> +    uint64_t no_of_scm_blocks_to_unbind = args[2];
> +    uint64_t continue_token = args[3];
> +    uint64_t size_to_unbind;
> +    Range blockrange = range_empty;
> +    Range nvdimmrange = range_empty;
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    DeviceState *dev;
> +    PCDIMMDevice *dimm;
> +    uint64_t size, addr;
> +
> +    if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {

And again.

> +        return H_PARAMETER;
> +    }
> +
> +    /* Check if starting_scm_logical_addr is block aligned */
> +    if (!QEMU_IS_ALIGNED(starting_scm_logical_addr,
> +                         SPAPR_MINIMUM_SCM_BLOCK_SIZE)) {
> +        return H_P2;
> +    }
> +
> +    dev = drc->dev;

And again the drc->dev case.

> +    dimm = PC_DIMM(dev);
> +    size = object_property_get_int(OBJECT(dimm), PC_DIMM_SIZE_PROP, NULL);
> +    addr = object_property_get_int(OBJECT(dimm), PC_DIMM_ADDR_PROP, NULL);
> +
> +    range_init_nofail(&nvdimmrange, addr, size);
> +
> +    size_to_unbind = no_of_scm_blocks_to_unbind * SPAPR_MINIMUM_SCM_BLOCK_SIZE;

Check for overflow in this multiply.
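
One way to do that (a sketch; the choice of error code is
illustrative):

    if (no_of_scm_blocks_to_unbind >
        UINT64_MAX / SPAPR_MINIMUM_SCM_BLOCK_SIZE) {
        return H_P3;    /* the multiply would wrap */
    }

    size_to_unbind = no_of_scm_blocks_to_unbind * SPAPR_MINIMUM_SCM_BLOCK_SIZE;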

> +
> +
> +    range_init_nofail(&blockrange, starting_scm_logical_addr, size_to_unbind);
> +
> +    if (!range_contains_range(&nvdimmrange, &blockrange)) {
> +        return H_P3;
> +    }
> +
> +    /* continue_token should be zero as this hcall doesn't return H_BUSY. */
> +    if (continue_token > 0) {
> +        return H_P3;
> +    }
> +
> +    args[1] = no_of_scm_blocks_to_unbind;
> +
> +    /* let unplug take care of actual unbind */
> +    return H_SUCCESS;
> +}
> +
> +#define H_UNBIND_SCOPE_ALL 0x1
> +#define H_UNBIND_SCOPE_DRC 0x2
> +
> +static target_ulong h_scm_unbind_all(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                     target_ulong opcode, target_ulong *args)
> +{
> +    uint64_t target_scope = args[0];
> +    uint32_t drc_index = args[1];
> +    uint64_t continue_token = args[2];
> +    NVDIMMDevice *nvdimm;
> +    uint64_t size;
> +    uint64_t no_of_scm_blocks_unbound = 0;
> +
> +    if (target_scope == H_UNBIND_SCOPE_DRC) {
> +        DeviceState *dev;
> +        SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +
> +        if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> +            return H_P2;
> +        }
> +
> +        dev = drc->dev;
> +        nvdimm = NVDIMM(dev);
> +        size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP, NULL);
> +
> +        no_of_scm_blocks_unbound = size / SPAPR_MINIMUM_SCM_BLOCK_SIZE;
> +    } else if (target_scope ==  H_UNBIND_SCOPE_ALL) {
> +        GSList *list, *dimms;
> +
> +        dimms = nvdimm_get_device_list();
> +        for (list = dimms; list; list = list->next) {
> +            nvdimm = list->data;
> +            size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
> +                                           NULL);
> +
> +            no_of_scm_blocks_unbound += size / SPAPR_MINIMUM_SCM_BLOCK_SIZE;
> +        }
> +        g_slist_free(dimms);
> +    } else {
> +        return H_PARAMETER;
> +    }
> +
> +    /* continue_token should be zero as this hcall doesn't return H_BUSY. */
> +    if (continue_token > 0) {
> +        return H_P4;
> +    }

Usually better to do simple checks like this before more complex logic
like you have above.
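
i.e. hoist the cheap validation to the top of the hcall (a sketch; the
branch bodies stand in for the existing accounting code):

    /* continue_token should be zero as this hcall doesn't return H_BUSY */
    if (continue_token > 0) {
        return H_P4;
    }

    if (target_scope == H_UNBIND_SCOPE_DRC) {
        /* ... existing per-DRC block accounting ... */
    } else if (target_scope == H_UNBIND_SCOPE_ALL) {
        /* ... existing whole-machine accounting ... */
    } else {
        return H_PARAMETER;
    }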

> +
> +    args[1] = no_of_scm_blocks_unbound;
> +
> +    /* let unplug take care of actual unbind */
> +    return H_SUCCESS;
> +}
> +
>  static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
>  static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - KVMPPC_HCALL_BASE + 1];
>  static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 4 + 1];
> @@ -2079,6 +2372,13 @@ static void hypercall_register_types(void)
>      /* qemu/KVM-PPC specific hcalls */
>      spapr_register_hypercall(KVMPPC_H_RTAS, h_rtas);
>  
> +    /* qemu/scm specific hcalls */
> +    spapr_register_hypercall(H_SCM_READ_METADATA, h_scm_read_metadata);
> +    spapr_register_hypercall(H_SCM_WRITE_METADATA, h_scm_write_metadata);
> +    spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
> +    spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
> +    spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
> +
>      /* ibm,client-architecture-support support */
>      spapr_register_hypercall(KVMPPC_H_CAS, h_client_architecture_support);
>  
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index a8cb3513d0..e1933e877d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -286,6 +286,7 @@ struct SpaprMachineState {
>  #define H_P7              -60
>  #define H_P8              -61
>  #define H_P9              -62
> +#define H_OVERLAP         -68
>  #define H_UNSUPPORTED_FLAG -256
>  #define H_MULTI_THREADS_ACTIVE -9005
>  
> @@ -493,8 +494,13 @@ struct SpaprMachineState {
>  #define H_INT_ESB               0x3C8
>  #define H_INT_SYNC              0x3CC
>  #define H_INT_RESET             0x3D0
> +#define H_SCM_READ_METADATA     0x3E4
> +#define H_SCM_WRITE_METADATA    0x3E8
> +#define H_SCM_BIND_MEM          0x3EC
> +#define H_SCM_UNBIND_MEM        0x3F0
> +#define H_SCM_UNBIND_ALL        0x3FC
>  
> -#define MAX_HCALL_OPCODE        H_INT_RESET
> +#define MAX_HCALL_OPCODE        H_SCM_UNBIND_ALL
>  
>  /* The hcalls above are standardized in PAPR and implemented by pHyp
>   * as well.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] spapr: Add NVDIMM device support
  2019-11-22  4:30   ` David Gibson
@ 2019-11-27  4:20     ` Bharata B Rao
  2019-12-06  1:52       ` David Gibson
  2019-12-16 11:15     ` Shivaprasad G Bhat
  1 sibling, 1 reply; 18+ messages in thread
From: Bharata B Rao @ 2019-11-27  4:20 UTC (permalink / raw)
  To: David Gibson
  Cc: xiaoguangrong.eric, Shivaprasad G Bhat, mst, qemu-devel,
	qemu-ppc, Igor Mammedov, Shivaprasad G Bhat

On Fri, Nov 22, 2019 at 10:42 AM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> Ok.  A number of queries about this.
>
> 1) The PAPR spec for ibm,dynamic-memory-v2 says that the first word in
> each entry is the number of LMBs, but for NVDIMMs you use the
> not-necessarily-equal scm_block_size instead.  Does the NVDIMM
> amendment for PAPR really specify to use different block sizes for
> these cases?  (In which case that's a really stupid spec decision, but
> that wouldn't surprise me at this point).

SCM block sizes can be different from LMB sizes, but here we enforce
that the SCM device size (excluding metadata) be a multiple of the LMB
size, so that we don't end up with a memory range that is not aligned
to the LMB size.

>
> 2) Similarly, the ibm,dynamic-memory-v2 description says that the
> memory block described by the entry has a whole batch of contiguous
> DRCs starting at the DRC index given and continuing for #LMBs DRCs.
> For NVDIMMs it appears that you just have one DRC for the whole
> NVDIMM.  Is that right?

One NVDIMM has one DRC. In our case, we need to mark the LMBs
corresponding to that address range in ibm,dynamic-memory-v2 as
reserved and invalid.

>
> 3) You're not setting *any* extra flags on the entry.  How is the
> guest supposed to know which are NVDIMM entries and which are regular
> DIMM entries?  AFAICT in this version the NVDIMM slots are
> indistinguishable from the unassigned hotplug memory (which makes the
> difference in LMB and DRC numbering even more troubling).

For the NVDIMM case, this patch should populate the LMB set in
ibm,dynamic-memory-v2 something like below:

            elem = spapr_get_drconf_cell(size / lmb_size, addr, 0, -1,
                                         SPAPR_LMB_FLAGS_RESERVED |
                                         SPAPR_LMB_FLAGS_DRC_INVALID);

This will ensure that the NVDIMM range is never considered a valid
memory range for memory hotplug.

>
> 4) AFAICT these are _present_ NVDIMMs, so why is
> SPAPR_LMB_FLAGS_ASSIGNED not set for them?  (and why is the node
> forced to -1, regardless of di->node).
>
> >          QSIMPLEQ_INSERT_TAIL(&drconf_queue, elem, entry);
> >          nr_entries++;
> >          cur_addr = addr + size;
> > @@ -1261,6 +1273,85 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
> >      }
> >  }
> >

> > +static void spapr_create_nvdimm_dr_connectors(SpaprMachineState *spapr)
> > +{
> > +    MachineState *machine = MACHINE(spapr);
> > +    int i;
> > +
> > +    for (i = 0; i < machine->ram_slots; i++) {
> > +        spapr_dr_connector_new(OBJECT(spapr), TYPE_SPAPR_DRC_PMEM, i);
>
> What happens if you try to plug an NVDIMM to one of these slots, but a
> regular DIMM has already taken it?

NVDIMM hotplug won't get that occupied slot.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] spapr: Add NVDIMM device support
  2019-11-27  4:20     ` Bharata B Rao
@ 2019-12-06  1:52       ` David Gibson
  2019-12-11  4:14         ` Shivaprasad G Bhat
  0 siblings, 1 reply; 18+ messages in thread
From: David Gibson @ 2019-12-06  1:52 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: xiaoguangrong.eric, Shivaprasad G Bhat, mst, qemu-devel,
	qemu-ppc, Igor Mammedov, Shivaprasad G Bhat

[-- Attachment #1: Type: text/plain, Size: 4474 bytes --]

On Wed, Nov 27, 2019 at 09:50:54AM +0530, Bharata B Rao wrote:
> On Fri, Nov 22, 2019 at 10:42 AM David Gibson
> <david@gibson.dropbear.id.au> wrote:
> >
> > Ok.  A number of queries about this.
> >
> > 1) The PAPR spec for ibm,dynamic-memory-v2 says that the first word in
> > each entry is the number of LMBs, but for NVDIMMs you use the
> > not-necessarily-equal scm_block_size instead.  Does the NVDIMM
> > amendment for PAPR really specify to use different block sizes for
> > these cases?  (In which case that's a really stupid spec decision, but
> > that wouldn't surprise me at this point).
> 
> SCM block sizes can be different from LMB sizes, but here we enforce
> that the SCM device size (excluding metadata) be a multiple of the LMB
> size, so that we don't end up with a memory range that is not aligned
> to the LMB size.

Right, but it still doesn't make sense to use scm_block_size when you
create the dynamic-memory-v2 property.  As far as the thing
interpreting that goes, it *must* be LMB size, not SCM block size.  If
those are required to be the same at this point, you should use an
assert().

> > 2) Similarly, the ibm,dynamic-memory-v2 description says that the
> > memory block described by the entry has a whole batch of contiguous
> > DRCs starting at the DRC index given and continuing for #LMBs DRCs.
> > For NVDIMMs it appears that you just have one DRC for the whole
> > NVDIMM.  Is that right?
> 
> One NVDIMM has one DRC. In our case, we need to mark the LMBs
> corresponding to that address range in ibm,dynamic-memory-v2 as
> reserved and invalid.

Ok, that fits very weirdly with the DRC allocation for the rest of
pluggable memory, but I suppose that's PAPR for you.

Having these in together is very inscrutable though, and relies on a
heap of non-obvious constraints about placement of DIMMs and NVDIMMs
relative to each other.  I really wonder if it would be better to have
a completely different address range for the NVDIMMs.

> > 3) You're not setting *any* extra flags on the entry.  How is the
> > guest supposed to know which are NVDIMM entries and which are regular
> > DIMM entries?  AFAICT in this version the NVDIMM slots are
> > indistinguishable from the unassigned hotplug memory (which makes the
> > difference in LMB and DRC numbering even more troubling).
> 
> For the NVDIMM case, this patch should populate the LMB set in
> ibm,dynamic-memory-v2 something like below:
>
>             elem = spapr_get_drconf_cell(size / lmb_size, addr, 0, -1,
>                                          SPAPR_LMB_FLAGS_RESERVED |
>                                          SPAPR_LMB_FLAGS_DRC_INVALID);
>
> This will ensure that the NVDIMM range is never considered a valid
> memory range for memory hotplug.

Hrm.  Ok so we already have code that does that for any gaps between
DIMMs.  I don't think there's actually anything that that code will do
differently than the code you have for NVDIMMs, so you could just skip
over the NVDIMMs here and it should do the right thing.
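
In spapr_populate_drmem_v2() that skip could look like this (a sketch;
it assumes the existing gap-handling entry then covers the NVDIMM
range, since cur_addr is left behind the skipped device):

        if (info->value->type == MEMORY_DEVICE_INFO_KIND_NVDIMM) {
            continue;   /* let the gap entry describe this range */
        }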

The *interpretation* of those entries will become different: for space
into which a regular DIMM is later inserted, we'll assume the DRC
index given is a base and there are more DRCs following it, but for
NVDIMMs we'll assume the same DRC throughout.  This is nuts, but IIUC
that's what PAPR says and we can't do much about it.

> > 4) AFAICT these are _present_ NVDIMMs, so why is
> > SPAPR_LMB_FLAGS_ASSIGNED not set for them?  (and why is the node
> > forced to -1, regardless of di->node).
> >
> > >          QSIMPLEQ_INSERT_TAIL(&drconf_queue, elem, entry);
> > >          nr_entries++;
> > >          cur_addr = addr + size;
> > > @@ -1261,6 +1273,85 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
> > >      }
> > >  }
> > >
> 
> > > +static void spapr_create_nvdimm_dr_connectors(SpaprMachineState *spapr)
> > > +{
> > > +    MachineState *machine = MACHINE(spapr);
> > > +    int i;
> > > +
> > > +    for (i = 0; i < machine->ram_slots; i++) {
> > > +        spapr_dr_connector_new(OBJECT(spapr), TYPE_SPAPR_DRC_PMEM, i);
> >
> > What happens if you try to plug an NVDIMM to one of these slots, but a
> > regular DIMM has already taken it?
> 
> NVDIMM hotplug won't get that occupied slot.

Ok.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] spapr: Add NVDIMM device support
  2019-12-06  1:52       ` David Gibson
@ 2019-12-11  4:14         ` Shivaprasad G Bhat
  2019-12-11  8:05           ` Igor Mammedov
  0 siblings, 1 reply; 18+ messages in thread
From: Shivaprasad G Bhat @ 2019-12-11  4:14 UTC (permalink / raw)
  To: David Gibson, Bharata B Rao
  Cc: xiaoguangrong.eric, mst, Shivaprasad G Bhat, qemu-devel,
	qemu-ppc, Igor Mammedov


On 12/06/2019 07:22 AM, David Gibson wrote:
> On Wed, Nov 27, 2019 at 09:50:54AM +0530, Bharata B Rao wrote:
>> On Fri, Nov 22, 2019 at 10:42 AM David Gibson
>> <david@gibson.dropbear.id.au> wrote:
>>> Ok.  A number of queries about this.
>>>
>>> 1) The PAPR spec for ibm,dynamic-memory-v2 says that the first word in
>>> each entry is the number of LMBs, but for NVDIMMs you use the
>>> not-necessarily-equal scm_block_size instead.  Does the NVDIMM
>>> amendment for PAPR really specify to use different block sizes for
>>> these cases?  (In which case that's a really stupid spec decision, but
>>> that wouldn't surprise me at this point).
>> SCM block sizes can be different from LMB sizes, but here we enforce
>> that the SCM device size (excluding metadata) be a multiple of the LMB
>> size, so that we don't end up with a memory range that is not aligned
>> to the LMB size.
> Right, but it still doesn't make sense to use scm_block_size when you
> create the dynamic-memory-v2 property.

Right, I should use the LMB size here, as I will be creating holes to
disallow DIMMs from claiming those LMBs, marking them INVALID as
Bharata suggested before.

>   As far as the thing
> interpreting that goes, it *must* be LMB size, not SCM block size.  If
> those are required to be the same at this point, you should use an
> assert().

SCM block size should be a multiple of the LMB size, not necessarily
equal to it. I'll add an assert for that, checking if they are equal.
I see no benefit as of now in having a higher SCM block size, as the
bind/unbind are already done before the bind hcall.
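
A sketch of that assert (the placement, e.g. alongside the machine
init code, is illustrative; both macros already exist in the tree):

    /* the drmem/DT code currently assumes SCM block size == LMB size */
    g_assert(SPAPR_MINIMUM_SCM_BLOCK_SIZE == SPAPR_MEMORY_BLOCK_SIZE);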

>>> 2) Similarly, the ibm,dynamic-memory-v2 description says that the
>>> memory block described by the entry has a whole batch of contiguous
>>> DRCs starting at the DRC index given and continuing for #LMBs DRCs.
>>> For NVDIMMs it appears that you just have one DRC for the whole
>>> NVDIMM.  Is that right?
>> One NVDIMM has one DRC. In our case, we need to mark the LMBs
>> corresponding to that address range in ibm,dynamic-memory-v2 as
>> reserved and invalid.
> Ok, that fits very weirdly with the DRC allocation for the rest of
> pluggable memory, but I suppose that's PAPR for you.
>
> Having these in together is very inscrutable though, and relies on a
> heap of non-obvious constraints about placement of DIMMs and NVDIMMs
> relative to each other.  I really wonder if it would be better to have
> a completely different address range for the NVDIMMs.

The backend objects for both DIMM and NVDIMM are memory-backend-*,
and they take their addresses from the same space. Separating them
would mean introducing a different backend object. I don't think we
have a choice here.

>
>>> 3) You're not setting *any* extra flags on the entry.  How is the
>>> guest supposed to know which are NVDIMM entries and which are regular
>>> DIMM entries?  AFAICT in this version the NVDIMM slots are
>>> indistinguishable from the unassigned hotplug memory (which makes the
>>> difference in LMB and DRC numbering even more troubling).
>> For the NVDIMM case, this patch should populate the LMB set in
>> ibm,dynamic-memory-v2 something like below:
>>
>>             elem = spapr_get_drconf_cell(size / lmb_size, addr, 0, -1,
>>                                          SPAPR_LMB_FLAGS_RESERVED |
>>                                          SPAPR_LMB_FLAGS_DRC_INVALID);
>>
>> This will ensure that the NVDIMM range is never considered a valid
>> memory range for memory hotplug.
> Hrm.  Ok so we already have code that does that for any gaps between
> DIMMs.  I don't think there's actually anything that that code will do
> differently than the code you have for NVDIMMs, so you could just skip
> over the NVDIMMs here and it should do the right thing.
>
> The *interpretation* of those entries will become different: for space
> into which a regular DIMM is later inserted, we'll assume the DRC
> index given is a base and there are more DRCs following it, but for
> NVDIMMs we'll assume the same DRC throughout.  This is nuts, but IIUC
> that's what PAPR says and we can't do much about it.

My current patch is buggy as Bharata pointed out. The NVDIMM DRCs
are not to be populated here, but mark the LMB DRCs as RESERVED and INVALID
so that no malicious attempts to online those LMBs at those NVDIMM address
ranges are attempted.

>
>>> 4) AFAICT these are _present_ NVDIMMs, so why is
>>> SPAPR_LMB_FLAGS_ASSIGNED not set for them?  (and why is the node
>>> forced to -1, regardless of di->node).
>>>
>>>>           QSIMPLEQ_INSERT_TAIL(&drconf_queue, elem, entry);
>>>>           nr_entries++;
>>>>           cur_addr = addr + size;
>>>> @@ -1261,6 +1273,85 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
>>>>       }
>>>>   }
>>>>
>>>> +static void spapr_create_nvdimm_dr_connectors(SpaprMachineState *spapr)
>>>> +{
>>>> +    MachineState *machine = MACHINE(spapr);
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < machine->ram_slots; i++) {
>>>> +        spapr_dr_connector_new(OBJECT(spapr), TYPE_SPAPR_DRC_PMEM, i);
>>> What happens if you try to plug an NVDIMM to one of these slots, but a
>>> regular DIMM has already taken it?
>> NVDIMM hotplug won't get that occupied slot.
> Ok.
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] spapr: Add NVDIMM device support
  2019-12-11  4:14         ` Shivaprasad G Bhat
@ 2019-12-11  8:05           ` Igor Mammedov
  2019-12-12  8:52             ` Shivaprasad G Bhat
  0 siblings, 1 reply; 18+ messages in thread
From: Igor Mammedov @ 2019-12-11  8:05 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: xiaoguangrong.eric, mst, qemu-devel, Shivaprasad G Bhat,
	Bharata B Rao, qemu-ppc, David Gibson

On Wed, 11 Dec 2019 09:44:11 +0530
Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:

> On 12/06/2019 07:22 AM, David Gibson wrote:
> > On Wed, Nov 27, 2019 at 09:50:54AM +0530, Bharata B Rao wrote:  
> >> On Fri, Nov 22, 2019 at 10:42 AM David Gibson
> >> <david@gibson.dropbear.id.au> wrote:  
> >>> Ok.  A number of queries about this.
> >>>
> >>> 1) The PAPR spec for ibm,dynamic-memory-v2 says that the first word in
> >>> each entry is the number of LMBs, but for NVDIMMs you use the
> >>> not-necessarily-equal scm_block_size instead.  Does the NVDIMM
> >>> amendment for PAPR really specify to use different block sizes for
> >>> these cases?  (In which case that's a really stupid spec decision, but
> >>> that wouldn't surprise me at this point).  
> >> SCM block sizes can be different from LMB sizes, but here we enforce
> >> that SCM device size (excluding metadata) to multiple of LMB size so
> >> that we don't end up memory range that is not aligned to LMB size.  
> > Right, but it still doesn't make sense to use scm_block_size when you
> > create the dynamic-memory-v2 property.  
> 
> Right, I should use LMB size here as I will be creating holes here to 
> disallow DIMMs
> to claim those LMBs marking them INVALID as Bharata Suggested before.
> 
> >   As far as the thing
> > interpreting that goes, it *must* be LMB size, not SCM block size.  If
> > those are required to be the same at this point, you should use an
> > assert().  
> 
> SCM block size should be a multiple for LMB size, need not be equal. 
> I'll add an assert
> for that, checking if equal. There is no benefit I see as of now having 
> higher
> SCM block size as the bind/unbind are already done before the bind hcall.
> 
> >>> 2) Similarly, the ibm,dynamic-memory-v2 description says that the
> >>> memory block described by the entry has a whole batch of contiguous
> >>> DRCs starting at the DRC index given and continuing for #LMBs DRCs.
> >>> For NVDIMMs it appears that you just have one DRC for the whole
> >>> NVDIMM.  Is that right?  
> >> One NVDIMM has one DRC, In our case, we need to mark the LMBs
> >> corresponding to that address range in ibm,dynamic-memory-v2 as
> >> reserved and invalid.  
> > Ok, that fits very weirdly with the DRC allocation for the rest of
> > pluggable memory, but I suppose that's PAPR for you.
> >
> > Having these in together is very inscrutable though, and relies on a
> > heap of non-obvious constraints about placement of DIMMs and NVDIMMs
> > relative to each other.  I really wonder if it would be better to have
> > a completely different address range for the NVDIMMs.  
> 
> The backend object for both DIMM and NVDIMM are memory-backend-*
> and they use the address from the same space. Separating it would mean
> using/introducing different backend object. I dont think we have a 
> choice here.

What address-space(s) are are talking about here exactly?
From my point of view memory-backend-* provides RAM block at
some HVA, which shouldn't not have anything to do with how NVDIMM
partitions and maps it to GPA.


> >>> 3) You're not setting *any* extra flags on the entry.  How is the
> >>> guest supposed to know which are NVDIMM entries and which are regular
> >>> DIMM entries?  AFAICT in this version the NVDIMM slots are
> >>> indistinguishable from the unassigned hotplug memory (which makes the
> >>> difference in LMB and DRC numbering even more troubling).  
> >> For NVDIMM case, this patch should populate the LMB set in
> >> ibm,dynamic-memory-v2 something like below:
> >>              elem = spapr_get_drconf_cell(size /lmb_size, addr,
> >>                                           0, -1,
> >> SPAPR_LMB_FLAGS_RESERVED | SPAPR_LMB_FLAGS_DRC_INVALID);
> >>
> >> This will ensure that the NVDIMM range will never be considered as
> >> valid memory range for memory hotplug.  
> > Hrm.  Ok so we already have code that does that for any gaps between
> > DIMMs.  I don't think there's actually anything that that code will do
> > differently than the code you have for NVDIMMs, so you could just skip
> > over the NVDIMMs here and it should do the right thing.
> >
> > The *interpretation* of those entries will become different: for space
> > into which a regular DIMM is later inserted, we'll assume the DRC
> > index given is a base and there are more DRCs following it, but for
> > NVDIMMs we'll assume the same DRC throughout.  This is nuts, but IIUC
> > that's what PAPR says and we can't do much about it.  
> 
> My current patch is buggy as Bharata pointed out. The NVDIMM DRCs
> are not to be populated here, but mark the LMB DRCs as RESERVED and INVALID
> so that no malicious attempts to online those LMBs at those NVDIMM address
> ranges are attempted.
> 
> >  
> >>> 4) AFAICT these are _present_ NVDIMMs, so why is
> >>> SPAPR_LMB_FLAGS_ASSIGNED not set for them?  (and why is the node
> >>> forced to -1, regardless of di->node).
> >>>  
> >>>>           QSIMPLEQ_INSERT_TAIL(&drconf_queue, elem, entry);
> >>>>           nr_entries++;
> >>>>           cur_addr = addr + size;
> >>>> @@ -1261,6 +1273,85 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
> >>>>       }
> >>>>   }
> >>>>
> >>>> +static void spapr_create_nvdimm_dr_connectors(SpaprMachineState *spapr)
> >>>> +{
> >>>> +    MachineState *machine = MACHINE(spapr);
> >>>> +    int i;
> >>>> +
> >>>> +    for (i = 0; i < machine->ram_slots; i++) {
> >>>> +        spapr_dr_connector_new(OBJECT(spapr), TYPE_SPAPR_DRC_PMEM, i);  
> >>> What happens if you try to plug an NVDIMM to one of these slots, but a
> >>> regular DIMM has already taken it?  
> >> NVDIMM hotplug won't get that occupied slot.  
> > Ok.
> >  
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] spapr: Add NVDIMM device support
  2019-12-11  8:05           ` Igor Mammedov
@ 2019-12-12  8:52             ` Shivaprasad G Bhat
  2020-01-03  0:45               ` David Gibson
  0 siblings, 1 reply; 18+ messages in thread
From: Shivaprasad G Bhat @ 2019-12-12  8:52 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: xiaoguangrong.eric, mst, qemu-devel, Shivaprasad G Bhat,
	Bharata B Rao, qemu-ppc, David Gibson



On 12/11/2019 01:35 PM, Igor Mammedov wrote:
> On Wed, 11 Dec 2019 09:44:11 +0530
> Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:
>
>> On 12/06/2019 07:22 AM, David Gibson wrote:
>>> On Wed, Nov 27, 2019 at 09:50:54AM +0530, Bharata B Rao wrote:
>>>> On Fri, Nov 22, 2019 at 10:42 AM David Gibson
>>>> <david@gibson.dropbear.id.au> wrote:
>>>>> Ok.  A number of queries about this.
>>>>>
>>>>> 1) The PAPR spec for ibm,dynamic-memory-v2 says that the first word in
>>>>> each entry is the number of LMBs, but for NVDIMMs you use the
>>>>> not-necessarily-equal scm_block_size instead.  Does the NVDIMM
>>>>> amendment for PAPR really specify to use different block sizes for
>>>>> these cases?  (In which case that's a really stupid spec decision, but
>>>>> that wouldn't surprise me at this point).
>>>> SCM block sizes can be different from LMB sizes, but here we enforce
>>>> that SCM device size (excluding metadata) to multiple of LMB size so
>>>> that we don't end up memory range that is not aligned to LMB size.
>>> Right, but it still doesn't make sense to use scm_block_size when you
>>> create the dynamic-memory-v2 property.
>> Right, I should use LMB size here as I will be creating holes here to
>> disallow DIMMs
>> to claim those LMBs marking them INVALID as Bharata Suggested before.
>>
>>>    As far as the thing
>>> interpreting that goes, it *must* be LMB size, not SCM block size.  If
>>> those are required to be the same at this point, you should use an
>>> assert().
>> SCM block size should be a multiple for LMB size, need not be equal.
>> I'll add an assert
>> for that, checking if equal. There is no benefit I see as of now having
>> higher
>> SCM block size as the bind/unbind are already done before the bind hcall.
>>
>>>>> 2) Similarly, the ibm,dynamic-memory-v2 description says that the
>>>>> memory block described by the entry has a whole batch of contiguous
>>>>> DRCs starting at the DRC index given and continuing for #LMBs DRCs.
>>>>> For NVDIMMs it appears that you just have one DRC for the whole
>>>>> NVDIMM.  Is that right?
>>>> One NVDIMM has one DRC, In our case, we need to mark the LMBs
>>>> corresponding to that address range in ibm,dynamic-memory-v2 as
>>>> reserved and invalid.
>>> Ok, that fits very weirdly with the DRC allocation for the rest of
>>> pluggable memory, but I suppose that's PAPR for you.
>>>
>>> Having these in together is very inscrutable though, and relies on a
>>> heap of non-obvious constraints about placement of DIMMs and NVDIMMs
>>> relative to each other.  I really wonder if it would be better to have
>>> a completely different address range for the NVDIMMs.
>> The backend object for both DIMM and NVDIMM are memory-backend-*
>> and they use the address from the same space. Separating it would mean
>> using/introducing different backend object. I dont think we have a
>> choice here.
> What address-space(s) are are talking about here exactly?
>  From my point of view memory-backend-* provides RAM block at
> some HVA, which shouldn't not have anything to do with how NVDIMM
> partitions and maps it to GPA.

Ah, you are right! I got confused with the HVA.

Nonetheless, I don't see a need for having vNVDIMM in different
guest physical address range as the existing code has support for marking
memory ranges distinctly for DIMM/NVDIMM.

On another note, the x86 too does it the same way. There is no separate
range defined there.

>
>
>>>>> 3) You're not setting *any* extra flags on the entry.  How is the
>>>>> guest supposed to know which are NVDIMM entries and which are regular
>>>>> DIMM entries?  AFAICT in this version the NVDIMM slots are
>>>>> indistinguishable from the unassigned hotplug memory (which makes the
>>>>> difference in LMB and DRC numbering even more troubling).
>>>> For NVDIMM case, this patch should populate the LMB set in
>>>> ibm,dynamic-memory-v2 something like below:
>>>>               elem = spapr_get_drconf_cell(size /lmb_size, addr,
>>>>                                            0, -1,
>>>> SPAPR_LMB_FLAGS_RESERVED | SPAPR_LMB_FLAGS_DRC_INVALID);
>>>>
>>>> This will ensure that the NVDIMM range will never be considered as
>>>> valid memory range for memory hotplug.
>>> Hrm.  Ok so we already have code that does that for any gaps between
>>> DIMMs.  I don't think there's actually anything that that code will do
>>> differently than the code you have for NVDIMMs, so you could just skip
>>> over the NVDIMMs here and it should do the right thing.
>>>
>>> The *interpretation* of those entries will become different: for space
>>> into which a regular DIMM is later inserted, we'll assume the DRC
>>> index given is a base and there are more DRCs following it, but for
>>> NVDIMMs we'll assume the same DRC throughout.  This is nuts, but IIUC
>>> that's what PAPR says and we can't do much about it.
>> My current patch is buggy as Bharata pointed out. The NVDIMM DRCs
>> are not to be populated here, but mark the LMB DRCs as RESERVED and INVALID
>> so that no malicious attempts to online those LMBs at those NVDIMM address
>> ranges are attempted.
>>
>>>   
>>>>> 4) AFAICT these are _present_ NVDIMMs, so why is
>>>>> SPAPR_LMB_FLAGS_ASSIGNED not set for them?  (and why is the node
>>>>> forced to -1, regardless of di->node).
>>>>>   
>>>>>>            QSIMPLEQ_INSERT_TAIL(&drconf_queue, elem, entry);
>>>>>>            nr_entries++;
>>>>>>            cur_addr = addr + size;
>>>>>> @@ -1261,6 +1273,85 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
>>>>>>        }
>>>>>>    }
>>>>>>
>>>>>> +static void spapr_create_nvdimm_dr_connectors(SpaprMachineState *spapr)
>>>>>> +{
>>>>>> +    MachineState *machine = MACHINE(spapr);
>>>>>> +    int i;
>>>>>> +
>>>>>> +    for (i = 0; i < machine->ram_slots; i++) {
>>>>>> +        spapr_dr_connector_new(OBJECT(spapr), TYPE_SPAPR_DRC_PMEM, i);
>>>>> What happens if you try to plug an NVDIMM to one of these slots, but a
>>>>> regular DIMM has already taken it?
>>>> NVDIMM hotplug won't get that occupied slot.
>>> Ok.
>>>   



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] spapr: Add NVDIMM device support
  2019-11-22  4:30   ` David Gibson
  2019-11-27  4:20     ` Bharata B Rao
@ 2019-12-16 11:15     ` Shivaprasad G Bhat
  1 sibling, 0 replies; 18+ messages in thread
From: Shivaprasad G Bhat @ 2019-12-16 11:15 UTC (permalink / raw)
  To: David Gibson
  Cc: xiaoguangrong.eric, mst, sbhat, qemu-devel, qemu-ppc, imammedo

Hi David,

On 11/22/2019 10:00 AM, David Gibson wrote:
> On Mon, Oct 14, 2019 at 01:37:50PM -0500, Shivaprasad G Bhat wrote:
> ---
>> index 62f1a42592..815167e42f 100644
>> --- a/hw/ppc/spapr_drc.c
>> +++ b/hw/ppc/spapr_drc.c
>> @@ -708,6 +708,17 @@ static void spapr_drc_phb_class_init(ObjectClass *k, void *data)
>>       drck->dt_populate = spapr_phb_dt_populate;
>>   }
>>   
>> +static void spapr_drc_pmem_class_init(ObjectClass *k, void *data)
>> +{
>> +    SpaprDrcClass *drck = SPAPR_DR_CONNECTOR_CLASS(k);
>> +
>> +    drck->typeshift = SPAPR_DR_CONNECTOR_TYPE_SHIFT_PMEM;
>> +    drck->typename = "MEM";
> This is the same as the typename for LMB DRCs.  Doesn't that mean that
> ibm,drc-types will end up with a duplicate in it?

Correct, this has to be "PMEM" instead of "MEM". Fixing it in next version.

Thanks,
Shivaprasad

>> +    drck->drc_name_prefix = "PMEM ";
>>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device
  2019-11-22  5:11   ` David Gibson
@ 2019-12-17  6:10     ` Shivaprasad G Bhat
  0 siblings, 0 replies; 18+ messages in thread
From: Shivaprasad G Bhat @ 2019-12-17  6:10 UTC (permalink / raw)
  To: David Gibson
  Cc: xiaoguangrong.eric, mst, sbhat, qemu-devel, qemu-ppc, imammedo

Hi David,


On 11/22/2019 10:41 AM, David Gibson wrote:
> On Mon, Oct 14, 2019 at 01:38:16PM -0500, Shivaprasad G Bhat wrote:
>> device_add/del phase itself instead.
>>
>> The guest kernel makes bind/unbind requests for the virtual NVDIMM device
>> at the region level granularity. Without interleaving, each virtual NVDIMM
> It's not clear to me what a "region" means in this context.

That is PMEM terminology. "region" in this context is guest physical
address range.

Fixing all the rest of the things you pointed out.

Thanks,
Shivaprasad

>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] spapr: Add NVDIMM device support
  2019-12-12  8:52             ` Shivaprasad G Bhat
@ 2020-01-03  0:45               ` David Gibson
  0 siblings, 0 replies; 18+ messages in thread
From: David Gibson @ 2020-01-03  0:45 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: qemu-ppc, xiaoguangrong.eric, mst, qemu-devel,
	Shivaprasad G Bhat, Bharata B Rao, Igor Mammedov

[-- Attachment #1: Type: text/plain, Size: 8022 bytes --]

On Thu, Dec 12, 2019 at 02:22:56PM +0530, Shivaprasad G Bhat wrote:
> 
> 
> On 12/11/2019 01:35 PM, Igor Mammedov wrote:
> > On Wed, 11 Dec 2019 09:44:11 +0530
> > Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:
> > 
> > > On 12/06/2019 07:22 AM, David Gibson wrote:
> > > > On Wed, Nov 27, 2019 at 09:50:54AM +0530, Bharata B Rao wrote:
> > > > > On Fri, Nov 22, 2019 at 10:42 AM David Gibson
> > > > > <david@gibson.dropbear.id.au> wrote:
> > > > > > Ok.  A number of queries about this.
> > > > > > 
> > > > > > 1) The PAPR spec for ibm,dynamic-memory-v2 says that the first word in
> > > > > > each entry is the number of LMBs, but for NVDIMMs you use the
> > > > > > not-necessarily-equal scm_block_size instead.  Does the NVDIMM
> > > > > > amendment for PAPR really specify to use different block sizes for
> > > > > > these cases?  (In which case that's a really stupid spec decision, but
> > > > > > that wouldn't surprise me at this point).
> > > > > SCM block sizes can be different from LMB sizes, but here we enforce
> > > > > that SCM device size (excluding metadata) to multiple of LMB size so
> > > > > that we don't end up memory range that is not aligned to LMB size.
> > > > Right, but it still doesn't make sense to use scm_block_size when you
> > > > create the dynamic-memory-v2 property.
> > > Right, I should use LMB size here as I will be creating holes here to
> > > disallow DIMMs
> > > to claim those LMBs marking them INVALID as Bharata Suggested before.
> > > 
> > > >    As far as the thing
> > > > interpreting that goes, it *must* be LMB size, not SCM block size.  If
> > > > those are required to be the same at this point, you should use an
> > > > assert().
> > > SCM block size should be a multiple for LMB size, need not be equal.
> > > I'll add an assert
> > > for that, checking if equal. There is no benefit I see as of now having
> > > higher
> > > SCM block size as the bind/unbind are already done before the bind hcall.
> > > 
> > > > > > 2) Similarly, the ibm,dynamic-memory-v2 description says that the
> > > > > > memory block described by the entry has a whole batch of contiguous
> > > > > > DRCs starting at the DRC index given and continuing for #LMBs DRCs.
> > > > > > For NVDIMMs it appears that you just have one DRC for the whole
> > > > > > NVDIMM.  Is that right?
> > > > > One NVDIMM has one DRC, In our case, we need to mark the LMBs
> > > > > corresponding to that address range in ibm,dynamic-memory-v2 as
> > > > > reserved and invalid.
> > > > Ok, that fits very weirdly with the DRC allocation for the rest of
> > > > pluggable memory, but I suppose that's PAPR for you.
> > > > 
> > > > Having these in together is very inscrutable though, and relies on a
> > > > heap of non-obvious constraints about placement of DIMMs and NVDIMMs
> > > > relative to each other.  I really wonder if it would be better to have
> > > > a completely different address range for the NVDIMMs.
> > > The backend object for both DIMM and NVDIMM are memory-backend-*
> > > and they use the address from the same space. Separating it would mean
> > > using/introducing different backend object. I dont think we have a
> > > choice here.
> > What address-space(s) are are talking about here exactly?
> >  From my point of view memory-backend-* provides RAM block at
> > some HVA, which shouldn't not have anything to do with how NVDIMM
> > partitions and maps it to GPA.
> 
> Ah, you are right! I got confused with the HVA.
> 
> Nonetheless, I don't see a need for having vNVDIMM in different
> guest physical address range as the existing code has support for marking
> memory ranges distinctly for DIMM/NVDIMM.

The problem is that the way you create the dynamic-memory-v2 property
relies on knowing whether a GPA is DIMM or NVDIMM -but you can't in
the presence of hotplug.  Using the default address allocation, DIMMs
and NVDIMMs are dynamically assigned addresses in the hotplug memory
area.

So, if you have an NVDIMM plugged at boot time, you'll mark that range
of LMBs as invalid.  If you then unplug it, and instead plug in a
regular DIMM (before the guest has even tried to online the NVDIMM),
it will probably use the same GPA range.  You therefore need regular
LMB DRCs for that range, and we have no way to communicate that to the
guest.

Similar problems if you go the other way (unplug DIMM, plug NVDIMM).

> On another note, the x86 too does it the same way. There is no separate
> range defined there.

Yes, but AIUI, the way PC describes DIMMs and NVDIMMs in ACPI are
pretty similar.  In PAPR they are gratuitously different - we don't
even have the concept of a DIMM, only the individual LMBs that go into
it.  The match between the DIMM backend and the PAPR LMB frontend is
already pretty poor, covering NVDIMMs in the same range pushes it past
breaking point.

> > > > > > 3) You're not setting *any* extra flags on the entry.  How is the
> > > > > > guest supposed to know which are NVDIMM entries and which are regular
> > > > > > DIMM entries?  AFAICT in this version the NVDIMM slots are
> > > > > > indistinguishable from the unassigned hotplug memory (which makes the
> > > > > > difference in LMB and DRC numbering even more troubling).
> > > > > For NVDIMM case, this patch should populate the LMB set in
> > > > > ibm,dynamic-memory-v2 something like below:
> > > > >               elem = spapr_get_drconf_cell(size /lmb_size, addr,
> > > > >                                            0, -1,
> > > > > SPAPR_LMB_FLAGS_RESERVED | SPAPR_LMB_FLAGS_DRC_INVALID);
> > > > > 
> > > > > This will ensure that the NVDIMM range will never be considered as
> > > > > valid memory range for memory hotplug.
> > > > Hrm.  Ok so we already have code that does that for any gaps between
> > > > DIMMs.  I don't think there's actually anything that that code will do
> > > > differently than the code you have for NVDIMMs, so you could just skip
> > > > over the NVDIMMs here and it should do the right thing.
> > > > 
> > > > The *interpretation* of those entries will become different: for space
> > > > into which a regular DIMM is later inserted, we'll assume the DRC
> > > > index given is a base and there are more DRCs following it, but for
> > > > NVDIMMs we'll assume the same DRC throughout.  This is nuts, but IIUC
> > > > that's what PAPR says and we can't do much about it.
> > > My current patch is buggy as Bharata pointed out. The NVDIMM DRCs
> > > are not to be populated here, but mark the LMB DRCs as RESERVED and INVALID
> > > so that no malicious attempts to online those LMBs at those NVDIMM address
> > > ranges are attempted.
> > > 
> > > > > > 4) AFAICT these are _present_ NVDIMMs, so why is
> > > > > > SPAPR_LMB_FLAGS_ASSIGNED not set for them?  (and why is the node
> > > > > > forced to -1, regardless of di->node).
> > > > > > >            QSIMPLEQ_INSERT_TAIL(&drconf_queue, elem, entry);
> > > > > > >            nr_entries++;
> > > > > > >            cur_addr = addr + size;
> > > > > > > @@ -1261,6 +1273,85 @@ static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
> > > > > > >        }
> > > > > > >    }
> > > > > > > 
> > > > > > > +static void spapr_create_nvdimm_dr_connectors(SpaprMachineState *spapr)
> > > > > > > +{
> > > > > > > +    MachineState *machine = MACHINE(spapr);
> > > > > > > +    int i;
> > > > > > > +
> > > > > > > +    for (i = 0; i < machine->ram_slots; i++) {
> > > > > > > +        spapr_dr_connector_new(OBJECT(spapr), TYPE_SPAPR_DRC_PMEM, i);
> > > > > > What happens if you try to plug an NVDIMM to one of these slots, but a
> > > > > > regular DIMM has already taken it?
> > > > > NVDIMM hotplug won't get that occupied slot.
> > > > Ok.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-01-03  0:46 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-14 18:37 [PATCH v3 0/3] ppc: spapr: virtual NVDIMM support Shivaprasad G Bhat
2019-10-14 18:37 ` [PATCH v3 1/3] mem: move nvdimm_device_list to utilities Shivaprasad G Bhat
2019-11-19  2:58   ` David Gibson
2019-11-19  7:13   ` Igor Mammedov
2019-11-20  8:01     ` Shivaprasad G Bhat
2019-11-20  9:35       ` Igor Mammedov
2019-10-14 18:37 ` [PATCH v3 2/3] spapr: Add NVDIMM device support Shivaprasad G Bhat
2019-11-22  4:30   ` David Gibson
2019-11-27  4:20     ` Bharata B Rao
2019-12-06  1:52       ` David Gibson
2019-12-11  4:14         ` Shivaprasad G Bhat
2019-12-11  8:05           ` Igor Mammedov
2019-12-12  8:52             ` Shivaprasad G Bhat
2020-01-03  0:45               ` David Gibson
2019-12-16 11:15     ` Shivaprasad G Bhat
2019-10-14 18:38 ` [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device Shivaprasad G Bhat
2019-11-22  5:11   ` David Gibson
2019-12-17  6:10     ` Shivaprasad G Bhat

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).