* [PATCH REBASED v5 0/2] spapr: nvdimm: Introduce spapr-nvdimm device
@ 2021-07-08  2:57 Shivaprasad G Bhat
  2021-07-08  2:57 ` [PATCH REBASED v5 1/2] spapr: nvdimm: Implement H_SCM_FLUSH hcall Shivaprasad G Bhat
  2021-07-08  2:57 ` [PATCH REBASED v5 2/2] spapr: nvdimm: Introduce spapr-nvdimm device Shivaprasad G Bhat
  0 siblings, 2 replies; 8+ messages in thread
From: Shivaprasad G Bhat @ 2021-07-08  2:57 UTC (permalink / raw)
  To: david, groug, qemu-ppc; +Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc, bharata

If the nvdimm device's backend is not persistent memory, explicit IO
flushes are needed to ensure persistence.

On SPAPR, the issue is addressed by adding a new hcall with which the
guest requests an explicit flush when the backend is not pmem. So, the
approach here is to convey whether the hcall flush is required through
a device tree property. The guest, once it knows the device needs
explicit flushes, makes the hcall as and when required.

It was suggested to create a new device type to address the explicit
flush for such backends on PPC, instead of extending the generic
nvdimm device with a new property. So, this patchset introduces the
spapr-nvdimm device. The new device inherits from the nvdimm device,
with the new behaviour that the device tree property is set when the
backend has pmem=no.
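
For example, the device would be used along these lines (an
illustrative command line, not taken from this series; the option
spellings follow the usual nvdimm usage):

    qemu-system-ppc64 -machine pseries,nvdimm=on \
        -m 4G,slots=4,maxmem=40G \
        -object memory-backend-file,id=mem1,share=on,mem-path=backing.img,size=32G \
        -device spapr-nvdimm,id=nvdimm1,memdev=mem1

With a plain file backend like this (no pmem=on), the guest sees
"ibm,hcall-flush-required" and uses H_SCM_FLUSH.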

The demonstration below shows the map_sync behavior for non-pmem
backends.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

pmem0 is from a spapr-nvdimm with backend pmem=yes, and pmem1 is from
a spapr-nvdimm with pmem=no, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

[root@atest-guest ~]# ./mapsync /mnt1/newfile ----> when pmem=yes
[root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when pmem=no
Failed to mmap  with Operation not supported
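
For reference, the core of the linked map_sync test is just an mmap
with MAP_SYNC; a minimal self-contained sketch of the same check
follows (the linked avocado test is the authoritative version):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

int main(int argc, char **argv)
{
    int fd;
    void *p;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file-on-dax-fs>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, 4096) < 0) {
        perror("open/ftruncate");
        return 1;
    }

    /*
     * MAP_SYNC only succeeds when the kernel can guarantee that CPU
     * cache flushes alone make stores durable, i.e. on real pmem.
     */
    p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
             MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED) {
        printf("Failed to mmap with %s\n", strerror(errno));
        return 1;
    }

    munmap(p, 4096);
    close(fd);
    return 0;
}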

The first patch implements the hcall and adds the necessary vmstate
properties to the spapr machine structure for carrying the hcall
status across save-restore. The hcall being asynchronous in nature,
the patch uses the aio utilities to offload the flush. The second
patch introduces the spapr-nvdimm device and adds the device tree
property for the guest when spapr-nvdimm is used with pmem="no" on
the backend.

The kernel changes to exploit this hcall are at
https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch

---
v4 - https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg05982.html
Changes from v4:
      - Introduce spapr-nvdimm device with the nvdimm device as the parent.
      - The new spapr-nvdimm has no new properties. As this is a new
        device and there are no migration-related dependencies to be
        taken care of, the device behaviour is to set the device tree
        property and enable the hcall when the device type spapr-nvdimm
        is used with pmem="no"
      - Fixed commit messages
      - Added checks to ensure the backend is actually a file and not memory
      - Addressed things pointed out by Eric

v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
Changes from v3:
      - Fixed the forward declaration coding guideline violations in 1st patch.
      - Removed the code waiting for the flushes to complete during migration;
        instead, restart the flush worker on the destination qemu in post_load.
      - Got rid of the randomization of the flush tokens, using a simple
        counter.
      - Got rid of the redundant flush state lock, relying on the BQL now.
      - Handled the memory-backend-ram usage
      - Changed the sync-dax semantics from on/off to 'unsafe', 'writeback'
        and 'direct'. Added prevention code using 'writeback' on arm and
        x86_64.
      - Fixed all the miscellaneous comments.

v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
Changes from v2:
      - Using the thread pool based approach as suggested
      - Moved the async hcall handling code to spapr_nvdimm.c along
        with some simplifications
      - Added vmstate to preserve the hcall status during save-restore
        along with pre_save handler code to complete all ongoing flushes.
      - Added hw_compat magic for sync-dax 'on' on previous machines.
      - Miscellaneous minor fixes.

v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1:
      - Fixed a missed-out unlock
      - Using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating the token

Shivaprasad G Bhat (2):
      spapr: nvdimm: Implement H_SCM_FLUSH hcall
      spapr: nvdimm: Introduce spapr-nvdimm device


 hw/ppc/spapr.c                |    6 +
 hw/ppc/spapr_nvdimm.c         |  286 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h        |   11 +-
 include/hw/ppc/spapr_nvdimm.h |   17 ++
 4 files changed, 319 insertions(+), 1 deletion(-)

--



* [PATCH REBASED v5 1/2] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-07-08  2:57 [PATCH REBASED v5 0/2] spapr: nvdimm: Introduce spapr-nvdimm device Shivaprasad G Bhat
@ 2021-07-08  2:57 ` Shivaprasad G Bhat
  2021-07-08  6:12   ` David Gibson
  2021-09-21  6:23   ` David Gibson
  2021-07-08  2:57 ` [PATCH REBASED v5 2/2] spapr: nvdimm: Introduce spapr-nvdimm device Shivaprasad G Bhat
  1 sibling, 2 replies; 8+ messages in thread
From: Shivaprasad G Bhat @ 2021-07-08  2:57 UTC (permalink / raw)
  To: david, groug, qemu-ppc; +Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc, bharata

The patch adds support for the SCM flush hcall for nvdimm devices,
to be made available to the guest by the next patch.

The expected semantics are that the flush returns one of the
H_LONG_BUSY codes, along with a continue_token, when the operation is
going to take a long time; the hcall is then called again with that
continue_token to get the status. So, all fresh requests are put into
a 'pending' list and a flush worker is submitted to the thread pool.
The thread pool completion callbacks move the requests to a
'completed' list, from which they are cleaned up after the status is
reported to the guest in a subsequent hcall.

These semantics make it necessary to preserve the continue_tokens and
their return status across migration. So, the completed flush states
are forwarded to the destination, and the pending ones are restarted
at the destination in post_load. The necessary nvdimm-flush-specific
vmstate structures are added to the spapr machine vmstate.
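
From the guest's side, the protocol is a simple poll loop, roughly (a
sketch modeled on the linux papr_scm driver changes linked in the
cover letter; names and details may differ from the final kernel
code):

    long papr_scm_flush(uint32_t drc_index)
    {
        unsigned long ret_buf[PLPAR_HCALL_BUFSIZE];
        uint64_t token = 0;
        long rc;

        do {
            rc = plpar_hcall(H_SCM_FLUSH, ret_buf, drc_index, token);
            token = ret_buf[0];    /* continue-token for the retry */
            if (H_IS_LONG_BUSY(rc)) {
                msleep(get_longbusy_msecs(rc));
                rc = H_BUSY;
            }
        } while (rc == H_BUSY);

        return rc == H_SUCCESS ? 0 : -EIO;
    }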

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr.c                |    6 +
 hw/ppc/spapr_nvdimm.c         |  240 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h        |   11 ++
 include/hw/ppc/spapr_nvdimm.h |   13 ++
 4 files changed, 269 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 4dd90b75cc..546d825dde 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1622,6 +1622,8 @@ static void spapr_machine_reset(MachineState *machine)
         spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
     }
 
+    spapr_nvdimm_finish_flushes(spapr);
+
     /* DRC reset may cause a device to be unplugged. This will cause troubles
      * if this device is used by another device (eg, a running vhost backend
      * will crash QEMU if the DIMM holding the vring goes away). To avoid such
@@ -2018,6 +2020,7 @@ static const VMStateDescription vmstate_spapr = {
         &vmstate_spapr_cap_ccf_assist,
         &vmstate_spapr_cap_fwnmi,
         &vmstate_spapr_fwnmi,
+        &vmstate_spapr_nvdimm_states,
         NULL
     }
 };
@@ -3014,6 +3017,9 @@ static void spapr_machine_init(MachineState *machine)
     }
 
     qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
+
+    QLIST_INIT(&spapr->pending_flush_states);
+    QLIST_INIT(&spapr->completed_flush_states);
 }
 
 #define DEFAULT_KVM_TYPE "auto"
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 91de1052f2..4f8931ab15 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -22,6 +22,7 @@
  * THE SOFTWARE.
  */
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr_drc.h"
 #include "hw/ppc/spapr_nvdimm.h"
@@ -30,6 +31,7 @@
 #include "hw/ppc/fdt.h"
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
+#include "block/thread-pool.h"
 
 /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
 /* SCM device is unable to persist memory contents */
@@ -375,6 +377,243 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
     return H_SUCCESS;
 }
 
+static uint64_t flush_token;
+
+static int flush_worker_cb(void *opaque)
+{
+    int ret = H_SUCCESS;
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+
+    /* flush raw backing image */
+    if (qemu_fdatasync(state->backend_fd) < 0) {
+        error_report("papr_scm: Could not sync nvdimm to backend file: %s",
+                     strerror(errno));
+        ret = H_HARDWARE;
+    }
+
+    return ret;
+}
+
+static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+
+    state->hcall_ret = hcall_ret;
+    QLIST_REMOVE(state, node);
+    QLIST_INSERT_HEAD(&spapr->completed_flush_states, state, node);
+}
+
+static const VMStateDescription vmstate_spapr_nvdimm_flush_state = {
+     .name = "spapr_nvdimm_flush_state",
+     .version_id = 1,
+     .minimum_version_id = 1,
+     .fields = (VMStateField[]) {
+         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_UINT32(drcidx, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_END_OF_LIST()
+     },
+};
+
+static bool spapr_nvdimm_states_needed(void *opaque)
+{
+     SpaprMachineState *spapr = (SpaprMachineState *)opaque;
+
+     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
+             !QLIST_EMPTY(&spapr->completed_flush_states));
+}
+
+static int spapr_nvdimm_post_load(void *opaque, int version_id)
+{
+    SpaprMachineState *spapr = (SpaprMachineState *)opaque;
+    SpaprNVDIMMDeviceFlushState *state, *next;
+    PCDIMMDevice *dimm;
+    HostMemoryBackend *backend = NULL;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+    SpaprDrc *drc;
+
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
+        if (flush_token < state->continue_token) {
+            flush_token = state->continue_token;
+        }
+    }
+
+    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, next) {
+        if (flush_token < state->continue_token) {
+            flush_token = state->continue_token;
+        }
+
+        drc = spapr_drc_by_index(state->drcidx);
+        dimm = PC_DIMM(drc->dev);
+        backend = MEMORY_BACKEND(dimm->hostmem);
+        state->backend_fd = memory_region_get_fd(&backend->mr);
+
+        thread_pool_submit_aio(pool, flush_worker_cb, state,
+                               spapr_nvdimm_flush_completion_cb, state);
+    }
+
+    return 0;
+}
+
+const VMStateDescription vmstate_spapr_nvdimm_states = {
+    .name = "spapr_nvdimm_states",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = spapr_nvdimm_states_needed,
+    .post_load = spapr_nvdimm_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
+                        vmstate_spapr_nvdimm_flush_state,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_QLIST_V(pending_flush_states, SpaprMachineState, 1,
+                        vmstate_spapr_nvdimm_flush_state,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+/*
+ * Assign a token and reserve it for the new flush state.
+ */
+static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(
+                                                      SpaprMachineState *spapr)
+{
+    SpaprNVDIMMDeviceFlushState *state;
+
+    state = g_malloc0(sizeof(*state));
+
+    flush_token++;
+    /* Token zero means no job pending; handle the overflow to zero */
+    if (flush_token == 0) {
+        flush_token++;
+    }
+    state->continue_token = flush_token;
+
+    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);
+
+    return state;
+}
+
+/*
+ * spapr_nvdimm_finish_flushes
+ *      Waits for all pending flush requests to complete
+ *      their execution and free the states
+ */
+void spapr_nvdimm_finish_flushes(SpaprMachineState *spapr)
+{
+    SpaprNVDIMMDeviceFlushState *state, *next;
+
+    /*
+     * Called on the reset path: the main loop thread, which runs the
+     * pending BHs, is itself executing the reset path and has reached
+     * here, so poll the AioContext by hand. The other path here is the
+     * guest's h_client_architecture_support, i.e. early boot.
+     */
+    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
+        aio_poll(qemu_get_aio_context(), true);
+    }
+
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
+        QLIST_REMOVE(state, node);
+        g_free(state);
+    }
+}
+
+/*
+ * spapr_nvdimm_get_flush_status
+ *      Fetches the status of the hcall worker and returns
+ *      H_LONG_BUSY_XYZ if the worker is still running.
+ */
+static int spapr_nvdimm_get_flush_status(SpaprMachineState *spapr,
+                                         uint64_t token)
+{
+    int ret = H_LONG_BUSY_ORDER_10_MSEC;
+    SpaprNVDIMMDeviceFlushState *state, *node;
+
+    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, node) {
+        if (state->continue_token == token) {
+            goto exit;
+        }
+    }
+    ret = H_P2; /* If not found in complete list too, invalid token */
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, node) {
+        if (state->continue_token == token) {
+            ret = state->hcall_ret;
+            QLIST_REMOVE(state, node);
+            g_free(state);
+            break;
+        }
+    }
+exit:
+    return ret;
+}
+
+/*
+ * H_SCM_FLUSH
+ * Input: drc_index, continue-token
+ * Out: continue-token
+ * Return Value: H_SUCCESS, H_PARAMETER, H_P2, H_LONG_BUSY
+ *
+ * Given a DRC index, flush the data to the backend NVDIMM device.
+ * The hcall returns H_LONG_BUSY_XX when the flush takes a long time
+ * and needs to be issued multiple times in order to be completely
+ * serviced. The continue-token from the output is to be passed in the
+ * argument list of subsequent hcalls until the hcall is completely
+ * serviced, at which point H_SUCCESS or another error is returned.
+ */
+static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                                target_ulong opcode, target_ulong *args)
+{
+    int ret;
+    uint32_t drc_index = args[0];
+    uint64_t continue_token = args[1];
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    PCDIMMDevice *dimm;
+    HostMemoryBackend *backend = NULL;
+    SpaprNVDIMMDeviceFlushState *state;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+    int fd;
+
+    if (!drc || !drc->dev ||
+        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    if (continue_token != 0) {
+        goto get_status;
+    }
+
+    dimm = PC_DIMM(drc->dev);
+    backend = MEMORY_BACKEND(dimm->hostmem);
+    fd = memory_region_get_fd(&backend->mr);
+
+    if (fd < 0) {
+        return H_UNSUPPORTED;
+    }
+
+    state = spapr_nvdimm_init_new_flush_state(spapr);
+    if (!state) {
+        return H_HARDWARE;
+    }
+
+    state->drcidx = drc_index;
+    state->backend_fd = fd;
+
+    thread_pool_submit_aio(pool, flush_worker_cb, state,
+                           spapr_nvdimm_flush_completion_cb, state);
+
+    continue_token = state->continue_token;
+
+get_status:
+    ret = spapr_nvdimm_get_flush_status(spapr, continue_token);
+    if (H_IS_LONG_BUSY(ret)) {
+        args[0] = continue_token;
+    }
+
+    return ret;
+}
+
 static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                      target_ulong opcode, target_ulong *args)
 {
@@ -523,6 +762,7 @@ static void spapr_scm_register_types(void)
     spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
     spapr_register_hypercall(H_SCM_HEALTH, h_scm_health);
+    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
 }
 
 type_init(spapr_scm_register_types)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index f05219f75e..1684d72546 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -12,10 +12,12 @@
 #include "hw/ppc/spapr_xive.h"  /* For SpaprXive */
 #include "hw/ppc/xics.h"        /* For ICSState */
 #include "hw/ppc/spapr_tpm_proxy.h"
+#include "hw/ppc/spapr_nvdimm.h"
 
 struct SpaprVioBus;
 struct SpaprPhbState;
 struct SpaprNvram;
+struct SpaprNVDIMMDeviceFlushState;
 
 typedef struct SpaprEventLogEntry SpaprEventLogEntry;
 typedef struct SpaprEventSource SpaprEventSource;
@@ -248,6 +250,11 @@ struct SpaprMachineState {
     uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
 
     Error *fwnmi_migration_blocker;
+
+    /* nvdimm flush states */
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_flush_states;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_flush_states;
+
 };
 
 #define H_SUCCESS         0
@@ -328,6 +335,7 @@ struct SpaprMachineState {
 #define H_P7              -60
 #define H_P8              -61
 #define H_P9              -62
+#define H_UNSUPPORTED     -67
 #define H_OVERLAP         -68
 #define H_UNSUPPORTED_FLAG -256
 #define H_MULTI_THREADS_ACTIVE -9005
@@ -542,8 +550,9 @@ struct SpaprMachineState {
 #define H_SCM_UNBIND_MEM        0x3F0
 #define H_SCM_UNBIND_ALL        0x3FC
 #define H_SCM_HEALTH            0x400
+#define H_SCM_FLUSH             0x44C
 
-#define MAX_HCALL_OPCODE        H_SCM_HEALTH
+#define MAX_HCALL_OPCODE        H_SCM_FLUSH
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index 764f999f54..24d8e37b33 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -11,6 +11,7 @@
 #define HW_SPAPR_NVDIMM_H
 
 #include "hw/mem/nvdimm.h"
+#include "migration/vmstate.h"
 
 typedef struct SpaprDrc SpaprDrc;
 typedef struct SpaprMachineState SpaprMachineState;
@@ -21,5 +22,17 @@ void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
+void spapr_nvdimm_finish_flushes(SpaprMachineState *spapr);
+
+typedef struct SpaprNVDIMMDeviceFlushState {
+    uint64_t continue_token;
+    int64_t hcall_ret;
+    int backend_fd;
+    uint32_t drcidx;
+
+    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
+} SpaprNVDIMMDeviceFlushState;
+
+extern const VMStateDescription vmstate_spapr_nvdimm_states;
 
 #endif




* [PATCH REBASED v5 2/2] spapr: nvdimm: Introduce spapr-nvdimm device
  2021-07-08  2:57 [PATCH REBASED v5 0/2] spapr: nvdimm: Introduce spapr-nvdimm device Shivaprasad G Bhat
  2021-07-08  2:57 ` [PATCH REBASED v5 1/2] spapr: nvdimm: Implement H_SCM_FLUSH hcall Shivaprasad G Bhat
@ 2021-07-08  2:57 ` Shivaprasad G Bhat
  2021-09-21  6:32   ` David Gibson
  1 sibling, 1 reply; 8+ messages in thread
From: Shivaprasad G Bhat @ 2021-07-08  2:57 UTC (permalink / raw)
  To: david, groug, qemu-ppc; +Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc, bharata

If the nvdimm device's backend is not persistent memory, explicit IO
flushes on the backend are needed to ensure persistence.

On SPAPR, the issue is addressed by adding a new hcall with which the
guest requests an explicit flush when the backend is not pmem. So, the
approach here is to convey whether the hcall flush is required through
a device tree property. The guest, once it knows the device backend is
not pmem, makes the hcall whenever a flush is required.

To set the device tree property, the patch introduces a new PAPR-specific
device type inheriting from the nvdimm device. When the backend doesn't
have pmem="yes", the device tree property "ibm,hcall-flush-required" is
set, and the guest makes the H_SCM_FLUSH hcall to request an explicit
flush.
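
The resulting NVDIMM node then looks roughly like this (a hand-written
illustration, not dumped from a guest; unit address picked arbitrarily):

    ibm,pmemory@44100003 {
            compatible = "ibm,pmemory";
            ibm,cache-flush-required;
            /* present only when the backend is not pmem: */
            ibm,hcall-flush-required;
            /* ibm,my-drc-index, uuid and the other properties elided */
    };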

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr_nvdimm.c         |   46 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr_nvdimm.h |    4 ++++
 2 files changed, 50 insertions(+)

diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 4f8931ab15..4dc7c3f147 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -54,6 +54,8 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
 {
     const MachineClass *mc = MACHINE_GET_CLASS(hotplug_dev);
     const MachineState *ms = MACHINE(hotplug_dev);
+    PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+    MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem);
     g_autofree char *uuidstr = NULL;
     QemuUUID uuid;
     int ret;
@@ -91,6 +93,14 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
         return false;
     }
 
+    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM) &&
+        (memory_region_get_fd(mr) < 0)) {
+        error_setg(errp, "spapr-nvdimm device requires the "
+                   "memdev %s to be of memory-backend-file type",
+                   object_get_canonical_path_component(OBJECT(dimm->hostmem)));
+        return false;
+    }
+
     return true;
 }
 
@@ -162,6 +172,21 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
                              "operating-system")));
     _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
 
+    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
+        bool is_pmem = false;
+#ifdef CONFIG_LIBPMEM
+        PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+        HostMemoryBackend *hostmem = dimm->hostmem;
+
+        is_pmem = object_property_get_bool(OBJECT(hostmem), "pmem",
+                                           &error_abort);
+#endif
+        if (!is_pmem) {
+            _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
+                             NULL, 0));
+        }
+    }
+
     return child_offset;
 }
 
@@ -585,7 +610,16 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
     }
 
     dimm = PC_DIMM(drc->dev);
+    if (!object_dynamic_cast(OBJECT(dimm), TYPE_SPAPR_NVDIMM)) {
+        return H_PARAMETER;
+    }
+
     backend = MEMORY_BACKEND(dimm->hostmem);
+#ifdef CONFIG_LIBPMEM
+    if (object_property_get_bool(OBJECT(backend), "pmem", &error_abort)) {
+        return H_UNSUPPORTED;
+    }
+#endif
     fd = memory_region_get_fd(&backend->mr);
 
     if (fd < 0) {
@@ -766,3 +800,15 @@ static void spapr_scm_register_types(void)
 }
 
 type_init(spapr_scm_register_types)
+
+static TypeInfo spapr_nvdimm_info = {
+    .name          = TYPE_SPAPR_NVDIMM,
+    .parent        = TYPE_NVDIMM,
+};
+
+static void spapr_nvdimm_register_types(void)
+{
+    type_register_static(&spapr_nvdimm_info);
+}
+
+type_init(spapr_nvdimm_register_types)
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index 24d8e37b33..fb4e56418e 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -13,6 +13,10 @@
 #include "hw/mem/nvdimm.h"
 #include "migration/vmstate.h"
 
+#define TYPE_SPAPR_NVDIMM "spapr-nvdimm"
+OBJECT_DECLARE_SIMPLE_TYPE(SpaprNVDIMMDevice, SPAPR_NVDIMM)
+
+typedef struct SpaprNVDIMMDevice  SpaprNVDIMMDevice;
 typedef struct SpaprDrc SpaprDrc;
 typedef struct SpaprMachineState SpaprMachineState;
 




* Re: [PATCH REBASED v5 1/2] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-07-08  2:57 ` [PATCH REBASED v5 1/2] spapr: nvdimm: Implement H_SCM_FLUSH hcall Shivaprasad G Bhat
@ 2021-07-08  6:12   ` David Gibson
  2021-09-21  6:23   ` David Gibson
  1 sibling, 0 replies; 8+ messages in thread
From: David Gibson @ 2021-07-08  6:12 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: groug, qemu-ppc, qemu-devel, aneesh.kumar, nvdimm, kvm-ppc, bharata


On Wed, Jul 07, 2021 at 09:57:21PM -0500, Shivaprasad G Bhat wrote:
> The patch adds support for the SCM flush hcall for nvdimm devices,
> to be made available to the guest by the next patch.
> 
> The expected semantics are that the flush returns one of the
> H_LONG_BUSY codes, along with a continue_token, when the operation is
> going to take a long time; the hcall is then called again with that
> continue_token to get the status. So, all fresh requests are put into
> a 'pending' list and a flush worker is submitted to the thread pool.
> The thread pool completion callbacks move the requests to a
> 'completed' list, from which they are cleaned up after the status is
> reported to the guest in a subsequent hcall.
> 
> These semantics make it necessary to preserve the continue_tokens and
> their return status across migration. So, the completed flush states
> are forwarded to the destination, and the pending ones are restarted
> at the destination in post_load. The necessary nvdimm-flush-specific
> vmstate structures are added to the spapr machine vmstate.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
>  hw/ppc/spapr.c                |    6 +
>  hw/ppc/spapr_nvdimm.c         |  240 +++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr.h        |   11 ++
>  include/hw/ppc/spapr_nvdimm.h |   13 ++
>  4 files changed, 269 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 4dd90b75cc..546d825dde 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1622,6 +1622,8 @@ static void spapr_machine_reset(MachineState *machine)
>          spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
>      }
>  
> +    spapr_nvdimm_finish_flushes(spapr);
> +
>      /* DRC reset may cause a device to be unplugged. This will cause troubles
>       * if this device is used by another device (eg, a running vhost backend
>       * will crash QEMU if the DIMM holding the vring goes away). To avoid such
> @@ -2018,6 +2020,7 @@ static const VMStateDescription vmstate_spapr = {
>          &vmstate_spapr_cap_ccf_assist,
>          &vmstate_spapr_cap_fwnmi,
>          &vmstate_spapr_fwnmi,
> +        &vmstate_spapr_nvdimm_states,
>          NULL
>      }
>  };
> @@ -3014,6 +3017,9 @@ static void spapr_machine_init(MachineState *machine)
>      }
>  
>      qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
> +
> +    QLIST_INIT(&spapr->pending_flush_states);
> +    QLIST_INIT(&spapr->completed_flush_states);

These need nvdimm in the variable names.  There are any number of
things in the machine that could be the subject of some sort of flush.
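
e.g. something like:

    QLIST_INIT(&spapr->pending_nvdimm_flush_states);
    QLIST_INIT(&spapr->completed_nvdimm_flush_states);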

>  }
>  
>  #define DEFAULT_KVM_TYPE "auto"
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 91de1052f2..4f8931ab15 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -22,6 +22,7 @@
>   * THE SOFTWARE.
>   */
>  #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
>  #include "qapi/error.h"
>  #include "hw/ppc/spapr_drc.h"
>  #include "hw/ppc/spapr_nvdimm.h"
> @@ -30,6 +31,7 @@
>  #include "hw/ppc/fdt.h"
>  #include "qemu/range.h"
>  #include "hw/ppc/spapr_numa.h"
> +#include "block/thread-pool.h"
>  
>  /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
>  /* SCM device is unable to persist memory contents */
> @@ -375,6 +377,243 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +static uint64_t flush_token;
> +
> +static int flush_worker_cb(void *opaque)
> +{
> +    int ret = H_SUCCESS;
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +
> +    /* flush raw backing image */
> +    if (qemu_fdatasync(state->backend_fd) < 0) {
> +        error_report("papr_scm: Could not sync nvdimm to backend file: %s",
> +                     strerror(errno));
> +        ret = H_HARDWARE;
> +    }
> +
> +    return ret;
> +}
> +
> +static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +
> +    state->hcall_ret = hcall_ret;
> +    QLIST_REMOVE(state, node);
> +    QLIST_INSERT_HEAD(&spapr->completed_flush_states, state, node);
> +}
> +
> +static const VMStateDescription vmstate_spapr_nvdimm_flush_state = {
> +     .name = "spapr_nvdimm_flush_state",
> +     .version_id = 1,
> +     .minimum_version_id = 1,
> +     .fields = (VMStateField[]) {
> +         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_UINT32(drcidx, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_END_OF_LIST()
> +     },
> +};
> +
> +static bool spapr_nvdimm_states_needed(void *opaque)
> +{
> +     SpaprMachineState *spapr = (SpaprMachineState *)opaque;
> +
> +     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
> +             !QLIST_EMPTY(&spapr->completed_flush_states));
> +}
> +
> +static int spapr_nvdimm_post_load(void *opaque, int version_id)
> +{
> +    SpaprMachineState *spapr = (SpaprMachineState *)opaque;
> +    SpaprNVDIMMDeviceFlushState *state, *next;
> +    PCDIMMDevice *dimm;
> +    HostMemoryBackend *backend = NULL;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +    SpaprDrc *drc;
> +
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
> +        if (flush_token < state->continue_token) {
> +            flush_token = state->continue_token;
> +        }
> +    }
> +
> +    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, next) {
> +        if (flush_token < state->continue_token) {
> +            flush_token = state->continue_token;
> +        }
> +
> +        drc = spapr_drc_by_index(state->drcidx);
> +        dimm = PC_DIMM(drc->dev);
> +        backend = MEMORY_BACKEND(dimm->hostmem);
> +        state->backend_fd = memory_region_get_fd(&backend->mr);
> +
> +        thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                               spapr_nvdimm_flush_completion_cb, state);
> +    }
> +
> +    return 0;
> +}
> +
> +const VMStateDescription vmstate_spapr_nvdimm_states = {
> +    .name = "spapr_nvdimm_states",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = spapr_nvdimm_states_needed,
> +    .post_load = spapr_nvdimm_post_load,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
> +                        vmstate_spapr_nvdimm_flush_state,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_QLIST_V(pending_flush_states, SpaprMachineState, 1,
> +                        vmstate_spapr_nvdimm_flush_state,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +/*
> + * Assign a token and reserve it for the new flush state.
> + */
> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(
> +                                                      SpaprMachineState *spapr)
> +{
> +    SpaprNVDIMMDeviceFlushState *state;
> +
> +    state = g_malloc0(sizeof(*state));
> +
> +    flush_token++;
> +    /* Token zero means no job pending; handle the overflow to zero */
> +    if (flush_token == 0) {
> +        flush_token++;
> +    }
> +    state->continue_token = flush_token;
> +
> +    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);
> +
> +    return state;
> +}
> +
> +/*
> + * spapr_nvdimm_finish_flushes
> + *      Waits for all pending flush requests to complete
> + *      their execution and free the states
> + */
> +void spapr_nvdimm_finish_flushes(SpaprMachineState *spapr)
> +{
> +    SpaprNVDIMMDeviceFlushState *state, *next;
> +
> +    /*
> +     * Called on the reset path: the main loop thread, which runs the
> +     * pending BHs, is itself executing the reset path and has reached
> +     * here, so poll the AioContext by hand. The other path here is the
> +     * guest's h_client_architecture_support, i.e. early boot.
> +     */
> +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> +        aio_poll(qemu_get_aio_context(), true);
> +    }
> +
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
> +        QLIST_REMOVE(state, node);
> +        g_free(state);
> +    }
> +}
> +
> +/*
> + * spapr_nvdimm_get_flush_status
> + *      Fetches the status of the hcall worker and returns
> + *      H_LONG_BUSY_XYZ if the worker is still running.
> + */
> +static int spapr_nvdimm_get_flush_status(SpaprMachineState *spapr,
> +                                         uint64_t token)
> +{
> +    int ret = H_LONG_BUSY_ORDER_10_MSEC;
> +    SpaprNVDIMMDeviceFlushState *state, *node;
> +
> +    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, node) {
> +        if (state->continue_token == token) {
> +            goto exit;

There's no need for an ugly goto here.  Just return.

> +        }
> +    }
> +    ret = H_P2; /* If not found in complete list too, invalid token */
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, node) {
> +        if (state->continue_token == token) {
> +            ret = state->hcall_ret;
> +            QLIST_REMOVE(state, node);
> +            g_free(state);

Likewise you can return here.

> +            break;
> +        }
> +    }
> +exit:
> +    return ret;

And here, and you won't need the 'ret' variable any more.
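
Taken together, something like this (untested sketch of the shape I
mean):

    static int spapr_nvdimm_get_flush_status(SpaprMachineState *spapr,
                                             uint64_t token)
    {
        SpaprNVDIMMDeviceFlushState *state;

        QLIST_FOREACH(state, &spapr->pending_flush_states, node) {
            if (state->continue_token == token) {
                return H_LONG_BUSY_ORDER_10_MSEC;
            }
        }

        QLIST_FOREACH(state, &spapr->completed_flush_states, node) {
            if (state->continue_token == token) {
                int ret = state->hcall_ret;

                QLIST_REMOVE(state, node);
                g_free(state);
                return ret;
            }
        }

        /* Not found in either list: invalid token */
        return H_P2;
    }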

> +}
> +
> +/*
> + * H_SCM_FLUSH
> + * Input: drc_index, continue-token
> + * Out: continue-token
> + * Return Value: H_SUCCESS, H_PARAMETER, H_P2, H_LONG_BUSY
> + *
> + * Given a DRC index, flush the data to the backend NVDIMM device.
> + * The hcall returns H_LONG_BUSY_XX when the flush takes a long time
> + * and needs to be issued multiple times in order to be completely
> + * serviced. The continue-token from the output is to be passed in the
> + * argument list of subsequent hcalls until the hcall is completely
> + * serviced, at which point H_SUCCESS or another error is returned.
> + */
> +static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                target_ulong opcode, target_ulong *args)
> +{
> +    int ret;
> +    uint32_t drc_index = args[0];
> +    uint64_t continue_token = args[1];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    PCDIMMDevice *dimm;
> +    HostMemoryBackend *backend = NULL;
> +    SpaprNVDIMMDeviceFlushState *state;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +    int fd;
> +
> +    if (!drc || !drc->dev ||
> +        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (continue_token != 0) {
> +        goto get_status;

Again, not really an idiomatically justified use of goto.  Just put
the body in an if block - it's pretty simple so that's not excessively
indented.
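
i.e. roughly:

    if (continue_token == 0) {
        dimm = PC_DIMM(drc->dev);
        backend = MEMORY_BACKEND(dimm->hostmem);
        fd = memory_region_get_fd(&backend->mr);
        if (fd < 0) {
            return H_UNSUPPORTED;
        }

        state = spapr_nvdimm_init_new_flush_state(spapr);
        if (!state) {
            return H_HARDWARE;
        }

        state->drcidx = drc_index;
        state->backend_fd = fd;

        thread_pool_submit_aio(pool, flush_worker_cb, state,
                               spapr_nvdimm_flush_completion_cb, state);

        continue_token = state->continue_token;
    }

    ret = spapr_nvdimm_get_flush_status(spapr, continue_token);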

> +    }
> +
> +    dimm = PC_DIMM(drc->dev);
> +    backend = MEMORY_BACKEND(dimm->hostmem);
> +    fd = memory_region_get_fd(&backend->mr);
> +
> +    if (fd < 0) {
> +        return H_UNSUPPORTED;
> +    }
> +
> +    state = spapr_nvdimm_init_new_flush_state(spapr);
> +    if (!state) {
> +        return H_HARDWARE;
> +    }
> +
> +    state->drcidx = drc_index;
> +    state->backend_fd = fd;
> +
> +    thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                           spapr_nvdimm_flush_completion_cb, state);
> +
> +    continue_token = state->continue_token;
> +
> +get_status:
> +    ret = spapr_nvdimm_get_flush_status(spapr, continue_token);
> +    if (H_IS_LONG_BUSY(ret)) {
> +        args[0] = continue_token;
> +    }
> +
> +    return ret;
> +}
> +
>  static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>                                       target_ulong opcode, target_ulong *args)
>  {
> @@ -523,6 +762,7 @@ static void spapr_scm_register_types(void)
>      spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
>      spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
>      spapr_register_hypercall(H_SCM_HEALTH, h_scm_health);
> +    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
>  }
>  
>  type_init(spapr_scm_register_types)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index f05219f75e..1684d72546 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -12,10 +12,12 @@
>  #include "hw/ppc/spapr_xive.h"  /* For SpaprXive */
>  #include "hw/ppc/xics.h"        /* For ICSState */
>  #include "hw/ppc/spapr_tpm_proxy.h"
> +#include "hw/ppc/spapr_nvdimm.h"
>  
>  struct SpaprVioBus;
>  struct SpaprPhbState;
>  struct SpaprNvram;
> +struct SpaprNVDIMMDeviceFlushState;
>  
>  typedef struct SpaprEventLogEntry SpaprEventLogEntry;
>  typedef struct SpaprEventSource SpaprEventSource;
> @@ -248,6 +250,11 @@ struct SpaprMachineState {
>      uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
>  
>      Error *fwnmi_migration_blocker;
> +
> +    /* nvdimm flush states */
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_flush_states;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_flush_states;
> +
>  };
>  
>  #define H_SUCCESS         0
> @@ -328,6 +335,7 @@ struct SpaprMachineState {
>  #define H_P7              -60
>  #define H_P8              -61
>  #define H_P9              -62
> +#define H_UNSUPPORTED     -67
>  #define H_OVERLAP         -68
>  #define H_UNSUPPORTED_FLAG -256
>  #define H_MULTI_THREADS_ACTIVE -9005
> @@ -542,8 +550,9 @@ struct SpaprMachineState {
>  #define H_SCM_UNBIND_MEM        0x3F0
>  #define H_SCM_UNBIND_ALL        0x3FC
>  #define H_SCM_HEALTH            0x400
> +#define H_SCM_FLUSH             0x44C
>  
> -#define MAX_HCALL_OPCODE        H_SCM_HEALTH
> +#define MAX_HCALL_OPCODE        H_SCM_FLUSH
>  
>  /* The hcalls above are standardized in PAPR and implemented by pHyp
>   * as well.
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index 764f999f54..24d8e37b33 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -11,6 +11,7 @@
>  #define HW_SPAPR_NVDIMM_H
>  
>  #include "hw/mem/nvdimm.h"
> +#include "migration/vmstate.h"
>  
>  typedef struct SpaprDrc SpaprDrc;
>  typedef struct SpaprMachineState SpaprMachineState;
> @@ -21,5 +22,17 @@ void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
>  bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                             uint64_t size, Error **errp);
>  void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
> +void spapr_nvdimm_finish_flushes(SpaprMachineState *spapr);
> +
> +typedef struct SpaprNVDIMMDeviceFlushState {
> +    uint64_t continue_token;
> +    int64_t hcall_ret;
> +    int backend_fd;
> +    uint32_t drcidx;
> +
> +    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
> +} SpaprNVDIMMDeviceFlushState;
> +
> +extern const VMStateDescription vmstate_spapr_nvdimm_states;
>  
>  #endif
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH REBASED v5 1/2] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-07-08  2:57 ` [PATCH REBASED v5 1/2] spapr: nvdimm: Implement H_SCM_FLUSH hcall Shivaprasad G Bhat
  2021-07-08  6:12   ` David Gibson
@ 2021-09-21  6:23   ` David Gibson
  2022-02-01 21:41     ` Shivaprasad G Bhat
  1 sibling, 1 reply; 8+ messages in thread
From: David Gibson @ 2021-09-21  6:23 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: groug, qemu-ppc, qemu-devel, aneesh.kumar, nvdimm, kvm-ppc, bharata


On Wed, Jul 07, 2021 at 09:57:21PM -0500, Shivaprasad G Bhat wrote:
> The patch adds support for the SCM flush hcall for nvdimm devices,
> to be made available to the guest by the next patch.
> 
> The expected semantics are that the flush returns one of the
> H_LONG_BUSY codes, along with a continue_token, when the operation is
> going to take a long time; the hcall is then called again with that
> continue_token to get the status. So, all fresh requests are put into
> a 'pending' list and a flush worker is submitted to the thread pool.
> The thread pool completion callbacks move the requests to a
> 'completed' list, from which they are cleaned up after the status is
> reported to the guest in a subsequent hcall.
> 
> These semantics make it necessary to preserve the continue_tokens and
> their return status across migration. So, the completed flush states
> are forwarded to the destination, and the pending ones are restarted
> at the destination in post_load. The necessary nvdimm-flush-specific
> vmstate structures are added to the spapr machine vmstate.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
>  hw/ppc/spapr.c                |    6 +
>  hw/ppc/spapr_nvdimm.c         |  240 +++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr.h        |   11 ++
>  include/hw/ppc/spapr_nvdimm.h |   13 ++
>  4 files changed, 269 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 4dd90b75cc..546d825dde 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1622,6 +1622,8 @@ static void spapr_machine_reset(MachineState *machine)
>          spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
>      }
>  
> +    spapr_nvdimm_finish_flushes(spapr);
> +
>      /* DRC reset may cause a device to be unplugged. This will cause troubles
>       * if this device is used by another device (eg, a running vhost backend
>       * will crash QEMU if the DIMM holding the vring goes away). To avoid such
> @@ -2018,6 +2020,7 @@ static const VMStateDescription vmstate_spapr = {
>          &vmstate_spapr_cap_ccf_assist,
>          &vmstate_spapr_cap_fwnmi,
>          &vmstate_spapr_fwnmi,
> +        &vmstate_spapr_nvdimm_states,
>          NULL
>      }
>  };
> @@ -3014,6 +3017,9 @@ static void spapr_machine_init(MachineState *machine)
>      }
>  
>      qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
> +
> +    QLIST_INIT(&spapr->pending_flush_states);
> +    QLIST_INIT(&spapr->completed_flush_states);
>  }
>  
>  #define DEFAULT_KVM_TYPE "auto"
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 91de1052f2..4f8931ab15 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -22,6 +22,7 @@
>   * THE SOFTWARE.
>   */
>  #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
>  #include "qapi/error.h"
>  #include "hw/ppc/spapr_drc.h"
>  #include "hw/ppc/spapr_nvdimm.h"
> @@ -30,6 +31,7 @@
>  #include "hw/ppc/fdt.h"
>  #include "qemu/range.h"
>  #include "hw/ppc/spapr_numa.h"
> +#include "block/thread-pool.h"
>  
>  /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
>  /* SCM device is unable to persist memory contents */
> @@ -375,6 +377,243 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +static uint64_t flush_token;

Better to put this in the machine state structure than a global.
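
i.e. (sketch):

    /* in struct SpaprMachineState, next to the flush lists: */
    uint64_t nvdimm_flush_token;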

> +static int flush_worker_cb(void *opaque)
> +{
> +    int ret = H_SUCCESS;
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +
> +    /* flush raw backing image */
> +    if (qemu_fdatasync(state->backend_fd) < 0) {
> +        error_report("papr_scm: Could not sync nvdimm to backend file: %s",
> +                     strerror(errno));
> +        ret = H_HARDWARE;
> +    }
> +
> +    return ret;
> +}
> +
> +static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +
> +    state->hcall_ret = hcall_ret;
> +    QLIST_REMOVE(state, node);
> +    QLIST_INSERT_HEAD(&spapr->completed_flush_states, state, node);
> +}
> +
> +static const VMStateDescription vmstate_spapr_nvdimm_flush_state = {
> +     .name = "spapr_nvdimm_flush_state",
> +     .version_id = 1,
> +     .minimum_version_id = 1,
> +     .fields = (VMStateField[]) {
> +         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_UINT32(drcidx, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_END_OF_LIST()
> +     },
> +};
> +
> +static bool spapr_nvdimm_states_needed(void *opaque)
> +{
> +     SpaprMachineState *spapr = (SpaprMachineState *)opaque;
> +
> +     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
> +             !QLIST_EMPTY(&spapr->completed_flush_states));
> +}
> +
> +static int spapr_nvdimm_post_load(void *opaque, int version_id)
> +{
> +    SpaprMachineState *spapr = (SpaprMachineState *)opaque;
> +    SpaprNVDIMMDeviceFlushState *state, *next;
> +    PCDIMMDevice *dimm;
> +    HostMemoryBackend *backend = NULL;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +    SpaprDrc *drc;
> +
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {

I don't think you need FOREACH_SAFE here.  You're not removing entries
from the loop body.  If you're trying to protect against concurrent
removals, I don't think FOREACH_SAFE is sufficient, you'll need an
actual lock (but I think it's already protected by the BQL).

> +        if (flush_token < state->continue_token) {
> +            flush_token = state->continue_token;
> +        }
> +    }
> +
> +    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, next) {

Same comments here.

> +        if (flush_token < state->continue_token) {
> +            flush_token = state->continue_token;
> +        }
> +
> +        drc = spapr_drc_by_index(state->drcidx);
> +        dimm = PC_DIMM(drc->dev);
> +        backend = MEMORY_BACKEND(dimm->hostmem);
> +        state->backend_fd = memory_region_get_fd(&backend->mr);
> +
> +        thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                               spapr_nvdimm_flush_completion_cb, state);
> +    }
> +
> +    return 0;
> +}
> +
> +const VMStateDescription vmstate_spapr_nvdimm_states = {
> +    .name = "spapr_nvdimm_states",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = spapr_nvdimm_states_needed,
> +    .post_load = spapr_nvdimm_post_load,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
> +                        vmstate_spapr_nvdimm_flush_state,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_QLIST_V(pending_flush_states, SpaprMachineState, 1,
> +                        vmstate_spapr_nvdimm_flush_state,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +/*
> + * Assign a token and reserve it for the new flush state.
> + */
> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(
> +                                                      SpaprMachineState *spapr)
> +{
> +    SpaprNVDIMMDeviceFlushState *state;
> +
> +    state = g_malloc0(sizeof(*state));
> +
> +    flush_token++;
> +    /* Token zero means no job pending; handle the overflow to zero */
> +    if (flush_token == 0) {
> +        flush_token++;

Hmm... strictly speaking, this isn't safe.  It's basically never going
to happen in practice, but in theory there's nothing preventing
continue_token 1 still being outstanding when the flush_token counter
overflows.

Come to think of it, since it's a uint64_t, I think an actual overflow
is also never going to happen in practice.  Maybe we should just
assert() on overflow, and fix it in the unlikely event that we ever
discover a case where it could happen.
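
Something like (sketch):

    flush_token++;
    /* A 64-bit counter won't realistically wrap; catch it if it does */
    g_assert(flush_token != 0);
    state->continue_token = flush_token;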

> +    }
> +    state->continue_token = flush_token;
> +
> +    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);
> +
> +    return state;
> +}
> +
> +/*
> + * spapr_nvdimm_finish_flushes
> + *      Waits for all pending flush requests to complete
> + *      their execution and free the states
> + */
> +void spapr_nvdimm_finish_flushes(SpaprMachineState *spapr)
> +{
> +    SpaprNVDIMMDeviceFlushState *state, *next;
> +
> +    /*
> +     * Called on the reset path: the main loop thread, which runs the
> +     * pending BHs, is itself executing the reset path and has reached
> +     * here, so poll the AioContext by hand. The other path here is the
> +     * guest's h_client_architecture_support, i.e. early boot.
> +     */
> +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> +        aio_poll(qemu_get_aio_context(), true);
> +    }
> +
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
> +        QLIST_REMOVE(state, node);
> +        g_free(state);
> +    }
> +}
> +
> +/*
> + * spapr_nvdimm_get_flush_status
> + *      Fetches the status of the hcall worker and returns
> + *      H_LONG_BUSY_XYZ if the worker is still running.
> + */
> +static int spapr_nvdimm_get_flush_status(SpaprMachineState *spapr,
> +                                         uint64_t token)
> +{
> +    int ret = H_LONG_BUSY_ORDER_10_MSEC;
> +    SpaprNVDIMMDeviceFlushState *state, *node;
> +
> +    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, node) {
> +        if (state->continue_token == token) {
> +            goto exit;
> +        }
> +    }
> +    ret = H_P2; /* If not found in complete list too, invalid token */
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, node) {
> +        if (state->continue_token == token) {
> +            ret = state->hcall_ret;
> +            QLIST_REMOVE(state, node);
> +            g_free(state);
> +            break;
> +        }
> +    }
> +exit:
> +    return ret;
> +}
> +
> +/*
> + * H_SCM_FLUSH
> + * Input: drc_index, continue-token
> + * Out: continue-token
> + * Return Value: H_SUCCESS, H_PARAMETER, H_P2, H_LONG_BUSY
> + *
> + * Given a DRC index, flush the data to the backend NVDIMM device.
> + * The hcall returns H_LONG_BUSY_XX when the flush takes a long time
> + * and needs to be issued multiple times in order to be completely
> + * serviced. The continue-token from the output is to be passed in the
> + * argument list of subsequent hcalls until the hcall is completely
> + * serviced, at which point H_SUCCESS or another error is returned.
> + */
> +static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                target_ulong opcode, target_ulong *args)
> +{
> +    int ret;
> +    uint32_t drc_index = args[0];
> +    uint64_t continue_token = args[1];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    PCDIMMDevice *dimm;
> +    HostMemoryBackend *backend = NULL;
> +    SpaprNVDIMMDeviceFlushState *state;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +    int fd;
> +
> +    if (!drc || !drc->dev ||
> +        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (continue_token != 0) {
> +        goto get_status;

Ugly goto, just reverse the sense and put the next chunk in the if body.

> +    }
> +
> +    dimm = PC_DIMM(drc->dev);
> +    backend = MEMORY_BACKEND(dimm->hostmem);
> +    fd = memory_region_get_fd(&backend->mr);
> +
> +    if (fd < 0) {
> +        return H_UNSUPPORTED;
> +    }
> +
> +    state = spapr_nvdimm_init_new_flush_state(spapr);
> +    if (!state) {
> +        return H_HARDWARE;
> +    }
> +
> +    state->drcidx = drc_index;
> +    state->backend_fd = fd;
> +
> +    thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                           spapr_nvdimm_flush_completion_cb, state);
> +
> +    continue_token = state->continue_token;
> +
> +get_status:
> +    ret = spapr_nvdimm_get_flush_status(spapr, continue_token);
> +    if (H_IS_LONG_BUSY(ret)) {
> +        args[0] = continue_token;
> +    }
> +
> +    return ret;
> +}
> +
>  static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>                                       target_ulong opcode, target_ulong *args)
>  {
> @@ -523,6 +762,7 @@ static void spapr_scm_register_types(void)
>      spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
>      spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
>      spapr_register_hypercall(H_SCM_HEALTH, h_scm_health);
> +    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
>  }
>  
>  type_init(spapr_scm_register_types)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index f05219f75e..1684d72546 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -12,10 +12,12 @@
>  #include "hw/ppc/spapr_xive.h"  /* For SpaprXive */
>  #include "hw/ppc/xics.h"        /* For ICSState */
>  #include "hw/ppc/spapr_tpm_proxy.h"
> +#include "hw/ppc/spapr_nvdimm.h"
>  
>  struct SpaprVioBus;
>  struct SpaprPhbState;
>  struct SpaprNvram;
> +struct SpaprNVDIMMDeviceFlushState;
>  
>  typedef struct SpaprEventLogEntry SpaprEventLogEntry;
>  typedef struct SpaprEventSource SpaprEventSource;
> @@ -248,6 +250,11 @@ struct SpaprMachineState {
>      uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
>  
>      Error *fwnmi_migration_blocker;
> +
> +    /* nvdimm flush states */
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_flush_states;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_flush_states;
> +
>  };
>  
>  #define H_SUCCESS         0
> @@ -328,6 +335,7 @@ struct SpaprMachineState {
>  #define H_P7              -60
>  #define H_P8              -61
>  #define H_P9              -62
> +#define H_UNSUPPORTED     -67
>  #define H_OVERLAP         -68
>  #define H_UNSUPPORTED_FLAG -256
>  #define H_MULTI_THREADS_ACTIVE -9005
> @@ -542,8 +550,9 @@ struct SpaprMachineState {
>  #define H_SCM_UNBIND_MEM        0x3F0
>  #define H_SCM_UNBIND_ALL        0x3FC
>  #define H_SCM_HEALTH            0x400
> +#define H_SCM_FLUSH             0x44C
>  
> -#define MAX_HCALL_OPCODE        H_SCM_HEALTH
> +#define MAX_HCALL_OPCODE        H_SCM_FLUSH
>  
>  /* The hcalls above are standardized in PAPR and implemented by pHyp
>   * as well.
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index 764f999f54..24d8e37b33 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -11,6 +11,7 @@
>  #define HW_SPAPR_NVDIMM_H
>  
>  #include "hw/mem/nvdimm.h"
> +#include "migration/vmstate.h"
>  
>  typedef struct SpaprDrc SpaprDrc;
>  typedef struct SpaprMachineState SpaprMachineState;
> @@ -21,5 +22,17 @@ void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
>  bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                             uint64_t size, Error **errp);
>  void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
> +void spapr_nvdimm_finish_flushes(SpaprMachineState *spapr);
> +
> +typedef struct SpaprNVDIMMDeviceFlushState {
> +    uint64_t continue_token;
> +    int64_t hcall_ret;
> +    int backend_fd;
> +    uint32_t drcidx;
> +
> +    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
> +} SpaprNVDIMMDeviceFlushState;
> +
> +extern const VMStateDescription vmstate_spapr_nvdimm_states;
>  
>  #endif
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH REBASED v5 2/2] spapr: nvdimm: Introduce spapr-nvdimm device
  2021-07-08  2:57 ` [PATCH REBASED v5 2/2] spapr: nvdimm: Introduce spapr-nvdimm device Shivaprasad G Bhat
@ 2021-09-21  6:32   ` David Gibson
  2022-02-01 21:41     ` Shivaprasad G Bhat
  0 siblings, 1 reply; 8+ messages in thread
From: David Gibson @ 2021-09-21  6:32 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: groug, qemu-ppc, qemu-devel, aneesh.kumar, nvdimm, kvm-ppc, bharata

[-- Attachment #1: Type: text/plain, Size: 6040 bytes --]

On Wed, Jul 07, 2021 at 09:57:31PM -0500, Shivaprasad G Bhat wrote:
> If the nvdimm device's backend is not persistent memory, explicit IO
> flushes on the backend are needed to ensure persistence.
> 
> On SPAPR, the issue is addressed by adding a new hcall with which the
> guest requests an explicit flush when the backend is not pmem. So, the
> approach here is to convey, via a device tree property, when the hcall
> flush is required. Once the guest knows the device backend is not pmem,
> it makes the hcall whenever a flush is required.
> 
> To set the device tree property, the patch introduces a new
> PAPR-specific device type inheriting from the nvdimm device. When the
> backend doesn't have pmem="yes", the device tree property
> "ibm,hcall-flush-required" is set, and the guest makes the H_SCM_FLUSH
> hcall to request an explicit flush.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
>  hw/ppc/spapr_nvdimm.c         |   46 +++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr_nvdimm.h |    4 ++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 4f8931ab15..4dc7c3f147 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -54,6 +54,8 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>  {
>      const MachineClass *mc = MACHINE_GET_CLASS(hotplug_dev);
>      const MachineState *ms = MACHINE(hotplug_dev);
> +    PCDIMMDevice *dimm = PC_DIMM(nvdimm);
> +    MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem);
>      g_autofree char *uuidstr = NULL;
>      QemuUUID uuid;
>      int ret;
> @@ -91,6 +93,14 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>          return false;
>      }
>  
> +    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM) &&
> +        (memory_region_get_fd(mr) < 0)) {
> +        error_setg(errp, "spapr-nvdimm device requires the "
> +                   "memdev %s to be of memory-backend-file type",
> +                   object_get_canonical_path_component(OBJECT(dimm->hostmem)));

It's not obvious to me why the spapr nvdimm device has an additional
restriction here over the regular nvdimm device.

> +        return false;
> +    }
> +
>      return true;
>  }
>  
> @@ -162,6 +172,21 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>                               "operating-system")));
>      _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
>  
> +    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
> +        bool is_pmem = false;
> +#ifdef CONFIG_LIBPMEM
> +        PCDIMMDevice *dimm = PC_DIMM(nvdimm);
> +        HostMemoryBackend *hostmem = dimm->hostmem;
> +
> +        is_pmem = object_property_get_bool(OBJECT(hostmem), "pmem",
> +                                           &error_abort);

Presenting to the guest a property of the backend worries me
slightly.  How the backends are synchronized between the source and
destination is out of scope for qemu: is there any possibility that we
could migrate from a host where the backend is pmem to one where it is
not (or the reverse)?

I think at the least we want a property on the spapr-nvdimm object
which will override what's presented to the guest (which, yes, might
mean lying to the guest).  I think that could be important for
testing, if nothing else.

> +#endif
> +        if (!is_pmem) {
> +            _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
> +                             NULL, 0));
> +        }
> +    }
> +
>      return child_offset;
>  }
>  
> @@ -585,7 +610,16 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>      }
>  
>      dimm = PC_DIMM(drc->dev);
> +    if (!object_dynamic_cast(OBJECT(dimm), TYPE_SPAPR_NVDIMM)) {
> +        return H_PARAMETER;
> +    }

Hmm.  If you're going to make flushes specific to spapr nvdimms, you
could put the queue of pending flushes into the spapr-nvdimm object,
rather than having a global list in the machine.

> +
>      backend = MEMORY_BACKEND(dimm->hostmem);
> +#ifdef CONFIG_LIBPMEM
> +    if (object_property_get_bool(OBJECT(backend), "pmem", &error_abort)) {
> +        return H_UNSUPPORTED;

Could you make this not be UNSUPPORTED, but instead fake the flush for
the pmem device?  Either as a no-op, or simulating the guest invoking
the right cpu cache flushes?  That seems like it would be more useful:
that way users who don't care too much about performance could just
always do a flush hcall and not have to have another path for the
"real" pmem case.

> +    }
> +#endif
>      fd = memory_region_get_fd(&backend->mr);
>  
>      if (fd < 0) {
> @@ -766,3 +800,15 @@ static void spapr_scm_register_types(void)
>  }
>  
>  type_init(spapr_scm_register_types)
> +
> +static TypeInfo spapr_nvdimm_info = {
> +    .name          = TYPE_SPAPR_NVDIMM,
> +    .parent        = TYPE_NVDIMM,
> +};
> +
> +static void spapr_nvdimm_register_types(void)
> +{
> +    type_register_static(&spapr_nvdimm_info);
> +}
> +
> +type_init(spapr_nvdimm_register_types)
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index 24d8e37b33..fb4e56418e 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -13,6 +13,10 @@
>  #include "hw/mem/nvdimm.h"
>  #include "migration/vmstate.h"
>  
> +#define TYPE_SPAPR_NVDIMM "spapr-nvdimm"
> +OBJECT_DECLARE_SIMPLE_TYPE(SpaprNVDIMMDevice, SPAPR_NVDIMM)
> +
> +typedef struct SpaprNVDIMMDevice  SpaprNVDIMMDevice;
>  typedef struct SpaprDrc SpaprDrc;
>  typedef struct SpaprMachineState SpaprMachineState;
>  
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH REBASED v5 1/2] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-09-21  6:23   ` David Gibson
@ 2022-02-01 21:41     ` Shivaprasad G Bhat
  0 siblings, 0 replies; 8+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-01 21:41 UTC (permalink / raw)
  To: David Gibson
  Cc: groug, qemu-ppc, qemu-devel, aneesh.kumar, nvdimm, kvm-ppc, bharata

Hi David,

Thanks for the comments. Sorry about the delay. Replies inline.

On 9/21/21 11:53, David Gibson wrote:
> On Wed, Jul 07, 2021 at 09:57:21PM -0500, Shivaprasad G Bhat wrote:
>> The patch adds support for the SCM flush hcall for nvdimm devices,
>> to be made available to the guest through the next patch.
>>
>> The hcall semantics are such that the flush returns one of the
>> H_LONG_BUSY codes, along with a continue_token, when the operation is
>> expected to take longer. The hcall is then made again with the
>> continue_token to get the status. So, all fresh requests are put into
>> a 'pending' list and a flush worker is submitted to the thread pool.
>> The thread pool completion callbacks move the requests to a
>> 'completed' list, which is cleaned up after reporting to the guest in
>> subsequent hcalls to get the status.

<snip>

>> @@ -30,6 +31,7 @@
>>   #include "hw/ppc/fdt.h"
>>   #include "qemu/range.h"
>>   #include "hw/ppc/spapr_numa.h"
>> +#include "block/thread-pool.h"
>>   
>>   /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
>>   /* SCM device is unable to persist memory contents */
>> @@ -375,6 +377,243 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>       return H_SUCCESS;
>>   }
>>   
>> +static uint64_t flush_token;
> 
> Better to put this in the machine state structure than a global.

Moved it into the device state itself as suggested; the states lists
are per-device now, roughly as sketched below.
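
Roughly, the rework looks like this (a sketch, not the final patch;
the flush state moves from the machine into the device, and field
names may differ):

    typedef struct SpaprNVDIMMDeviceFlushState {
        uint64_t continue_token;
        int64_t hcall_ret;
        uint32_t drcidx;

        QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
    } SpaprNVDIMMDeviceFlushState;

    struct SpaprNVDIMMDevice {
        NVDIMMDevice parent_obj;

        /* Last token handed out; replaces the global flush_token */
        uint64_t nvdimm_flush_token;

        /* Flush requests are now tracked per device, not per machine */
        QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_nvdimm_flush_states;
        QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_nvdimm_flush_states;
    };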

> 
>> +static int flush_worker_cb(void *opaque)
>> +{
>> +    int ret = H_SUCCESS;
>> +    SpaprNVDIMMDeviceFlushState *state = opaque;
>> +
>> +    /* flush raw backing image */
>> +  

<snip>

>> +             !QLIST_EMPTY(&spapr->completed_flush_states));
>> +}
>> +
>> +static int spapr_nvdimm_post_load(void *opaque, int version_id)
>> +{
>> +    SpaprMachineState *spapr = (SpaprMachineState *)opaque;
>> +    SpaprNVDIMMDeviceFlushState *state, *next;
>> +    PCDIMMDevice *dimm;
>> +    HostMemoryBackend *backend = NULL;
>> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
>> +    SpaprDrc *drc;
>> +
>> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
> 
> I don't think you need FOREACH_SAFE here.  You're not removing entries
> from the loop body.  If you're trying to protect against concurrent
> removals, I don't think FOREACH_SAFE is sufficient, you'll need an
> actual lock (but I think it's already protected by the BQL).

Changed it here, below, and also in spapr_nvdimm_get_flush_status()
while traversing the pending list. I verified that all these
invocations are made with the BQL held, so the plain iterator is
sufficient, as below.
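
With the BQL held there are no concurrent removals, so the non-_SAFE
iterator is enough (illustrative, using the names from this patch):

    SpaprNVDIMMDeviceFlushState *state;

    /* Called with the BQL held: no list mutation can race with us,
     * so QLIST_FOREACH (without _SAFE) is sufficient. */
    QLIST_FOREACH(state, &spapr->completed_flush_states, node) {
        if (flush_token < state->continue_token) {
            flush_token = state->continue_token;
        }
    }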

> 
>> +        if (flush_token < state->continue_token) {
>> +            flush_token = state->continue_token;
>> +        }
>> +    }
>> +
>> +    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, next) {
> 
> Same comments here.
> 
>> +        if (flush_token < state->continue_token) {
>> +            flush_token = state->continue_token;
>> +        }
>> +
>> +        drc = spapr_drc_by_index(state->drcidx);
>> +        dimm = PC_DIMM(drc->dev);
>> +        backend = MEMORY_BACKEND(dimm->hostmem);
>> +        state->backend_fd = memory_region_get_fd(&backend->mr);
>> +
>> +        thread_pool_submit_aio(pool, flush_worker_cb, state,
>> +                               spapr_nvdimm_flush_completion_cb, state);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +const VMStateDescription vmstate_spapr_nvdimm_states = {
>> +    .name = "spapr_nvdimm_states",
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .needed = spapr_nvdimm_states_needed,
>> +    .post_load = spapr_nvdimm_post_load,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
>> +                        vmstate_spapr_nvdimm_flush_state,
>> +                        SpaprNVDIMMDeviceFlushState, node),
>> +        VMSTATE_QLIST_V(pending_flush_states, SpaprMachineState, 1,
>> +                        vmstate_spapr_nvdimm_flush_state,
>> +                        SpaprNVDIMMDeviceFlushState, node),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +/*
>> + * Assign a token and reserve it for the new flush state.
>> + */
>> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(
>> +                                                      SpaprMachineState *spapr)
>> +{
>> +    SpaprNVDIMMDeviceFlushState *state;
>> +
>> +    state = g_malloc0(sizeof(*state));
>> +
>> +    flush_token++;
>> +    /* Token zero is presumed as no job pending. Handle the overflow to zero */
>> +    if (flush_token == 0) {
>> +        flush_token++;
> 
> Hmm... strictly speaking, this isn't safe.  It's basically never going
> to happen in practice, but in theory there's nothing preventing
> continue_token 1 still being outstanding when the flush_token counter
> overflows.
> 
> Come to think of it, since it's a uint64_t, I think an actual overflow
> is also never going to happen in practice.  Maybe we should just
> assert() on overflow, and fix it in the unlikely event that we ever
> discover a case where it could happen.

I have added the assert on overflow, roughly as below.
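
A sketch of the reworked token assignment:

    state = g_malloc0(sizeof(*state));

    flush_token++;
    /*
     * A 64-bit counter cannot realistically wrap. If it ever does,
     * fail loudly instead of silently reusing token 0, which means
     * "no job pending".
     */
    g_assert(flush_token != 0);
    state->continue_token = flush_token;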

> 
>> +    }
>> +    state->continue_token = flush_token;
>> +
>> +    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);
>> +
>> +    return state;
>> +}
>> +
>> +/*
>> + *

Thanks!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH REBASED v5 2/2] spapr: nvdimm: Introduce spapr-nvdimm device
  2021-09-21  6:32   ` David Gibson
@ 2022-02-01 21:41     ` Shivaprasad G Bhat
  0 siblings, 0 replies; 8+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-01 21:41 UTC (permalink / raw)
  To: David Gibson; +Cc: groug, qemu-ppc, qemu-devel, aneesh.kumar, nvdimm, kvm-ppc


On 9/21/21 12:02, David Gibson wrote:
> On Wed, Jul 07, 2021 at 09:57:31PM -0500, Shivaprasad G Bhat wrote:
>> If the nvdimm device's backend is not persistent memory, explicit IO
>> flushes on the backend are needed to ensure persistence.
>>
>> On SPAPR, the issue is addressed by adding a new hcall with which the
>> guest requests an explicit flush when the backend is not pmem. So, the
>> approach here is to convey, via a device tree property, when the hcall
>> flush is required. Once the guest knows the device backend is not pmem,
>> it makes the hcall whenever a flush is required.
>>
>> To set the device tree property, the patch introduces a new
>> PAPR-specific device type inheriting from the nvdimm device. When the
>> backend doesn't have pmem="yes", the device tree property
>> "ibm,hcall-flush-required" is set, and the guest makes the H_SCM_FLUSH
>> hcall to request an explicit flush.
>>
>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
<snip>
>> @@ -91,6 +93,14 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>>           return false;
>>       }
>>   
>> +    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM) &&
>> +        (memory_region_get_fd(mr) < 0)) {
>> +        error_setg(errp, "spapr-nvdimm device requires the "
>> +                   "memdev %s to be of memory-backend-file type",
>> +                   object_get_canonical_path_component(OBJECT(dimm->hostmem)));
> 
> It's not obvious to me why the spapr nvdimm device has an additional
> restriction here over the regular nvdimm device.

For memory-backend-ram, the fd is set to -1, so the fdatasync() would
fail later. This restriction prevents that hcall failure up front.
Maybe it is intentionally allowed with regular nvdimms for testing
purposes. Let me know if you want me to allow it here too, with a
dummy success return for the hcall.
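
For reference, a valid configuration pairs the device with a
file-backed memdev along these lines (an illustrative command-line
fragment; path and size are placeholders):

    -object memory-backend-file,id=mem1,share=on,mem-path=/path/to/backing-file,size=1G \
    -device spapr-nvdimm,id=nv1,memdev=mem1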

> 
>> +        return false;
>> +    }
>> +
>>       return true;
>>   }
>>   
>> @@ -162,6 +172,21 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>>                                "operating-system")));
>>       _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
>>   
>> +    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
>> +        bool is_pmem = false;
>> +#ifdef CONFIG_LIBPMEM
>> +        PCDIMMDevice *dimm = PC_DIMM(nvdimm);
>> +        HostMemoryBackend *hostmem = dimm->hostmem;
>> +
>> +        is_pmem = object_property_get_bool(OBJECT(hostmem), "pmem",
>> +                                           &error_abort);
> 
> Presenting to the guest a property of the backend worries me
> slightly.  How the backends are synchronized between the source and
> destination is out of scope for qemu: is there any possibility that we
> could migrate from a host where the backend is pmem to one where it is
> not (or the reverse).
> 
> I think at the least we want a property on the spapr-nvdimm object
> which will override what's presented to the guest (which, yes, might
> mean lying to the guest).  I think that could be important for
> testing, if nothing else.

Mixed configurations can be attempted on a nested setup itself.

On a side note, using pmem=on with a non-pmem backend file is
deprecated as of commit cdcf766d0b0, since that amounts to unsafely
pretending the backend is persistent.

I see your point; adding "pmem-override" (suggest a better name if you
have one) to spapr-nvdimm can be helpful, so I am adding it to the
spapr-nvdimm device. With pmem-override=on, the device tree property
is added, allowing the flush hcall even when the backend has pmem=on.
This also works for migration compatibility in such a setup. A sketch
follows.
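
A sketch of how the property could be wired up (the name and the
pmem_override field are the proposal above, not final code; it assumes
SpaprNVDIMMDevice gains a bool pmem_override member):

    static Property spapr_nvdimm_properties[] = {
        /*
         * When true, expose "ibm,hcall-flush-required" and accept
         * H_SCM_FLUSH even if the backend has pmem=on.
         */
        DEFINE_PROP_BOOL("pmem-override", SpaprNVDIMMDevice,
                         pmem_override, false),
        DEFINE_PROP_END_OF_LIST(),
    };

    static void spapr_nvdimm_class_init(ObjectClass *oc, void *data)
    {
        DeviceClass *dc = DEVICE_CLASS(oc);

        device_class_set_props(dc, spapr_nvdimm_properties);
    }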

> 
>> +#endif
>> +        if (!is_pmem) {
>> +            _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
>> +                             NULL, 0));
>> +        }
>> +    }
>> +
>>       return child_offset;
>>   }
>>   
>> @@ -585,7 +610,16 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>       }
>>   
>>       dimm = PC_DIMM(drc->dev);
>> +    if (!object_dynamic_cast(OBJECT(dimm), TYPE_SPAPR_NVDIMM)) {
>> +        return H_PARAMETER;
>> +    }
> 
> Hmm.  If you're going to make flushes specific to spapr nvdimms, you
> could put the queue of pending flushes into the spapr-nvdimm object,
> rather than having a global list in the machine.

Yes. I have changed the patches to move all the flush-specific data
structures into the spapr-nvdimm object (as sketched in my reply to
patch 1).

> 
>> +
>>       backend = MEMORY_BACKEND(dimm->hostmem);
>> +#ifdef CONFIG_LIBPMEM
>> +    if (object_property_get_bool(OBJECT(backend), "pmem", &error_abort)) {
>> +        return H_UNSUPPORTED;
> 
> Could you make this not be UNSUPPORTED, but instead fake the flush for
> the pmem device?  Either as a no-op, or simulating the guest invoking
> the right cpu cache flushes?  That seems like it would be more useful:
> that way users who don't care too much about performance could just
> always do a flush hcall and not have to have another path for the
> "real" pmem case.
> 

It would actually be wrong usage for the kernel to attempt that. The
kernel checks the device tree property before setting up its flush
callback; if someone makes the hcall without the device tree property
being set, that is mistaken usage.

For pmem-override=on, it's better to allow this as you suggested,
along with exposing the device tree property. I will call
pmem_persist() for pmem-backed devices, switching between
pmem_persist() and fdatasync() based on the backend type while
flushing, roughly as sketched below.
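
A sketch of the reworked flush worker (illustrative only: how the
worker reaches the backend, via the DRC index here, may differ in the
final patch):

    static int flush_worker_cb(void *opaque)
    {
        SpaprNVDIMMDeviceFlushState *state = opaque;
        SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
        HostMemoryBackend *backend = MEMORY_BACKEND(PC_DIMM(drc->dev)->hostmem);

    #ifdef CONFIG_LIBPMEM
        if (object_property_get_bool(OBJECT(backend), "pmem", NULL)) {
            /* pmem backend: flush CPU caches over the whole region */
            MemoryRegion *mr = host_memory_backend_get_memory(backend);

            pmem_persist(memory_region_get_ram_ptr(mr),
                         memory_region_size(mr));
            return H_SUCCESS;
        }
    #endif
        /* Non-pmem file backend: flush the raw backing image */
        if (qemu_fdatasync(state->backend_fd) < 0) {
            error_report("papr_scm: Could not sync nvdimm to backend file: %s",
                         strerror(errno));
            return H_HARDWARE;
        }

        return H_SUCCESS;
    }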


>> +    }
>> +#endif
>>       fd = memory_region_get_fd(&backend->mr);
>>   
>>       if (fd < 0) {
>> @@ -766,3 +800,15 @@ static void spapr_scm_register_types(void)

Thanks!

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-02-01 21:41 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-08  2:57 [PATCH REBASED v5 0/3] spapr: nvdimm: Introduce spapr-nvdimm device Shivaprasad G Bhat
2021-07-08  2:57 ` [PATCH REBASED v5 1/2] spapr: nvdimm: Implement H_SCM_FLUSH hcall Shivaprasad G Bhat
2021-07-08  6:12   ` David Gibson
2021-09-21  6:23   ` David Gibson
2022-02-01 21:41     ` Shivaprasad G Bhat
2021-07-08  2:57 ` [PATCH REBASED v5 2/2] spapr: nvdimm: Introduce spapr-nvdimm device Shivaprasad G Bhat
2021-09-21  6:32   ` David Gibson
2022-02-01 21:41     ` Shivaprasad G Bhat

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).