qemu-devel.nongnu.org archive mirror
* [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
@ 2021-04-29  3:48 Shivaprasad G Bhat
  2021-04-29  3:48 ` [PATCH v4 1/3] spapr: nvdimm: Forward declare and move the definitions Shivaprasad G Bhat
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Shivaprasad G Bhat @ 2021-04-29  3:48 UTC (permalink / raw)
  To: david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, peter.maydell, eblake, qemu-arm,
	richard.henderson, pbonzini, marcel.apfelbaum, stefanha,
	haozhong.zhang, shameerali.kolothum.thodi, kwangwoo.lee, armbru
  Cc: linux-nvdimm, aneesh.kumar, qemu-devel, kvm-ppc, shivaprasadbhat,
	bharata

The nvdimm devices are expected to ensure write persistence across
power-failure scenarios.

libpmem uses architecture-specific instructions, like dcbf on POWER, to
flush cached data to the backend nvdimm device during normal writes,
followed by explicit flushes if the backend device is not synchronous
DAX capable.
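
For illustration, such a flush on POWER boils down to a dcbf loop with a
trailing ordering barrier. A minimal sketch, assuming 128-byte cachelines
and sync as the drain barrier (not the literal libpmem implementation):

#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 128 /* POWER data cache block size */

/* Write back the cachelines covering [addr, addr + len) with dcbf,
 * then order the flushes with sync. POWER-only inline assembly. */
static void flush_range(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE_SIZE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE_SIZE) {
        asm volatile("dcbf 0,%0" : : "r"(p) : "memory");
    }
    asm volatile("sync" : : : "memory");
}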

QEMU's virtual nvdimm devices are memory mapped. With file-backed
v-nvdimms, the dcbf in the guest and the subsequent flush do not
translate to an actual flush of the backend file on the host. On x86_64
this is addressed by virtio-pmem, which translates explicit guest
flushes into fsync calls in QEMU.

On SPAPR, the issue is addressed by adding a new hcall through which the
guest driver requests an explicit flush when the backend nvdimm cannot
ensure write persistence with dcbf alone. So, the approach here is to
convey in a device tree property when the hcall flush is required. The
guest makes the hcall when the property is found, instead of relying on
dcbf.
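
On the guest side, that roughly amounts to the fragment below (a
simplified sketch of the kernel patch linked at the end of this letter;
papr_scm_pmem_flush is the flush callback name used there):

    /* Simplified sketch of the guest papr_scm probe path, not verbatim */
    if (of_property_read_bool(dn, "ibm,hcall-flush-required")) {
        /* The backend needs explicit flushes: mark the region async
         * and install a flush callback that issues H_SCM_FLUSH,
         * instead of relying on dcbf alone. */
        set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
        ndr_desc.flush = papr_scm_pmem_flush;
    }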

A new device property, sync-dax, is added to the nvdimm device. When
sync-dax is 'writeback' (the default for PPC), the device tree property
"hcall-flush-required" is set, and the guest makes the H_SCM_FLUSH hcall
requesting an explicit flush.

sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines
prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented
now as the flush semantics are unimplemented.

When the backend file is actually synchronous DAX capable and no explicit
flushes are required, the sync-dax mode 'direct' is to be used.
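
For example (illustrative command-line fragments; paths and sizes are
placeholders):

# backend not synchronous DAX capable: guest is told to flush via hcall
-object memory-backend-file,id=mem1,share=on,mem-path=/path/backing.img,size=1G \
-device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K,sync-dax=writeback

# backing file on a real DAX filesystem; pmem=on requires CONFIG_LIBPMEM
-object memory-backend-file,id=mem1,share=on,mem-path=/mnt/pmem0/backing.img,size=1G,pmem=on \
-device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K,sync-dax=direct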

The demonstration below shows the map_sync behavior with
sync-dax=writeback and sync-dax=direct.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

pmem0 is from an nvdimm with sync-dax=direct, and pmem1 is from an
nvdimm with sync-dax=writeback, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

[root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When sync-dax=unsafe/direct
[root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when sync-dax=writeback
Failed to mmap  with Operation not supported
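
The probe itself is essentially an mmap with MAP_SYNC; a minimal
equivalent of the linked map_sync.c (a sketch, not the verbatim test):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDWR | O_CREAT, 0644);

    if (fd < 0 || ftruncate(fd, 4096) < 0) {
        perror("open/ftruncate");
        return 1;
    }
    /* MAP_SYNC is accepted only when CPU cache flushes alone make the
     * mapping durable, i.e. the region is synchronous DAX. */
    if (mmap(NULL, 4096, PROT_READ | PROT_WRITE,
             MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0) == MAP_FAILED) {
        perror("Failed to mmap");
        close(fd);
        return 1;
    }
    return 0;
}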

The first patch does the header file cleanup necessary for the
subsequent ones. The second patch implements the hcall and adds the
necessary vmstate properties to the spapr machine structure for carrying
the hcall status across save-restore. The hcall being asynchronous in
nature, the patch uses AIO utilities to offload the flush. The third
patch adds the 'sync-dax' device property and enables the device tree
property for the guest to utilise the hcall.

The kernel changes to exploit this hcall are at
https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch

---
v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
Changes from v3:
      - Fixed the forward declaration coding guideline violations in 1st patch.
      - Removed the code waiting for the flushes to complete during migration,
        instead restart the flush worker on destination qemu in post load.
      - Got rid of the randomization of the flush tokens, using simple
        counter.
      - Got rid of the redundant flush state lock, relying on the BQL now.
      - Handling the memory-backend-ram usage
      - Changed the sync-dax semantics from on/off to 'unsafe', 'writeback' and 'direct'.
	Added prevention code using 'writeback' on arm and x86_64.
      - Fixed all the miscellaneous comments.

v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
Changes from v2:
      - Using the thread pool based approach as suggested
      - Moved the async hcall handling code to spapr_nvdimm.c along
        with some simplifications
      - Added vmstate to preserve the hcall status during save-restore
        along with pre_save handler code to complete all ongoing flushes.
      - Added hw_compat magic for sync-dax 'on' on previous machines.
      - Miscellaneous minor fixes.

v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1:
      - Fixed a missed-out unlock
      - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token

Shivaprasad G Bhat (3):
      spapr: nvdimm: Forward declare and move the definitions
      spapr: nvdimm: Implement H_SCM_FLUSH hcall
      nvdimm: Enable sync-dax device property for nvdimm


 hw/arm/virt.c                 |   28 ++++
 hw/i386/pc.c                  |   28 ++++
 hw/mem/nvdimm.c               |   52 +++++++
 hw/ppc/spapr.c                |   16 ++
 hw/ppc/spapr_nvdimm.c         |  285 +++++++++++++++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h       |   11 ++
 include/hw/ppc/spapr.h        |   11 +-
 include/hw/ppc/spapr_nvdimm.h |   27 ++--
 qapi/common.json              |   20 +++
 9 files changed, 455 insertions(+), 23 deletions(-)

--
Signature




* [PATCH v4 1/3] spapr: nvdimm: Forward declare and move the definitions
  2021-04-29  3:48 [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm Shivaprasad G Bhat
@ 2021-04-29  3:48 ` Shivaprasad G Bhat
  2021-05-03 18:23   ` Eric Blake
  2021-04-29  3:48 ` [PATCH v4 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall Shivaprasad G Bhat
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Shivaprasad G Bhat @ 2021-04-29  3:48 UTC (permalink / raw)
  To: david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, peter.maydell, eblake, qemu-arm,
	richard.henderson, pbonzini, marcel.apfelbaum, stefanha,
	haozhong.zhang, shameerali.kolothum.thodi, kwangwoo.lee, armbru
  Cc: linux-nvdimm, aneesh.kumar, qemu-devel, kvm-ppc, shivaprasadbhat,
	bharata

The subsequent patches add definitions which tend to get the
compilation into a cyclic dependency. So, prepare with forward
declarations, move the definitions and clean up.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr_nvdimm.c         |   12 ++++++++++++
 include/hw/ppc/spapr_nvdimm.h |   14 ++------------
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index b46c36917c..8cf3fb2ffb 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -31,6 +31,18 @@
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
 
+/*
+ * The nvdimm size should be aligned to SCM block size.
+ * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
+ * in order to have SCM regions not overlap with dimm memory regions.
+ * The SCM devices can have variable block sizes. For now, fixing the
+ * block size to the minimum value.
+ */
+#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
+
+/* Have an explicit check for alignment */
+QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
+
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp)
 {
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index 73be250e2a..764f999f54 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -11,19 +11,9 @@
 #define HW_SPAPR_NVDIMM_H
 
 #include "hw/mem/nvdimm.h"
-#include "hw/ppc/spapr.h"
 
-/*
- * The nvdimm size should be aligned to SCM block size.
- * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
- * inorder to have SCM regions not to overlap with dimm memory regions.
- * The SCM devices can have variable block sizes. For now, fixing the
- * block size to the minimum value.
- */
-#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
-
-/* Have an explicit check for alignment */
-QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
+typedef struct SpaprDrc SpaprDrc;
+typedef struct SpaprMachineState SpaprMachineState;
 
 int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
                            void *fdt, int *fdt_start_offset, Error **errp);





* [PATCH v4 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-04-29  3:48 [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm Shivaprasad G Bhat
  2021-04-29  3:48 ` [PATCH v4 1/3] spapr: nvdimm: Forward declare and move the definitions Shivaprasad G Bhat
@ 2021-04-29  3:48 ` Shivaprasad G Bhat
  2021-04-29  3:49 ` [PATCH v4 3/3] nvdimm: Enable sync-dax device property for nvdimm Shivaprasad G Bhat
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Shivaprasad G Bhat @ 2021-04-29  3:48 UTC (permalink / raw)
  To: david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, peter.maydell, eblake, qemu-arm,
	richard.henderson, pbonzini, marcel.apfelbaum, stefanha,
	haozhong.zhang, shameerali.kolothum.thodi, kwangwoo.lee, armbru
  Cc: linux-nvdimm, aneesh.kumar, qemu-devel, kvm-ppc, shivaprasadbhat,
	bharata

The patch adds support for the SCM flush hcall for the nvdimm devices,
to be made available to the guest through the next patch.

The hcall semantics expect the flush to return H_BUSY, along with a
continue_token, when the operation is expected to take longer. The hcall
is then to be called again with the continue_token to get the status.
So, all fresh requests are put into a 'pending' list and a flush worker
is submitted to the thread pool. The thread pool completion callback
moves the request to a 'completed' list, from which it is cleaned up
after its status is reported to the guest in a subsequent hcall.

These semantics make it necessary to preserve the continue_tokens and
their return status across migrations. So, the completed flush states
are forwarded to the destination, and the pending ones are restarted at
the destination in post_load. The necessary nvdimm-flush-specific
vmstate structures are added to the spapr machine vmstate.
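
From the guest's side, the protocol reduces to a retry loop like the
sketch below (kernel-flavoured and simplified; plpar_hcall,
H_IS_LONG_BUSY and get_longbusy_msecs are from the powerpc hcall
infrastructure, while the function name here is illustrative):

static int scm_flush(u32 drc_index)
{
    unsigned long ret_buf[PLPAR_HCALL_BUFSIZE];
    u64 token = 0;
    long rc;

    do {
        rc = plpar_hcall(H_SCM_FLUSH, ret_buf, drc_index, token);
        token = ret_buf[0]; /* continue-token for the next attempt */
        if (H_IS_LONG_BUSY(rc)) {
            msleep(get_longbusy_msecs(rc));
            rc = H_BUSY;
        }
    } while (rc == H_BUSY);

    return rc == H_SUCCESS ? 0 : -EIO;
}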

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr.c                |    6 +
 hw/ppc/spapr_nvdimm.c         |  234 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h        |   10 ++
 include/hw/ppc/spapr_nvdimm.h |   13 ++
 4 files changed, 262 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index e4be00b732..80957f9188 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1607,6 +1607,8 @@ static void spapr_machine_reset(MachineState *machine)
         spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
     }
 
+    spapr_nvdimm_finish_flushes(spapr);
+
     /* DRC reset may cause a device to be unplugged. This will cause troubles
      * if this device is used by another device (eg, a running vhost backend
      * will crash QEMU if the DIMM holding the vring goes away). To avoid such
@@ -2003,6 +2005,7 @@ static const VMStateDescription vmstate_spapr = {
         &vmstate_spapr_cap_ccf_assist,
         &vmstate_spapr_cap_fwnmi,
         &vmstate_spapr_fwnmi,
+        &vmstate_spapr_nvdimm_states,
         NULL
     }
 };
@@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
     }
 
     qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
+
+    QLIST_INIT(&spapr->pending_flush_states);
+    QLIST_INIT(&spapr->completed_flush_states);
 }
 
 #define DEFAULT_KVM_TYPE "auto"
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 8cf3fb2ffb..77eb7e1293 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -22,6 +22,7 @@
  * THE SOFTWARE.
  */
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr_drc.h"
 #include "hw/ppc/spapr_nvdimm.h"
@@ -30,6 +31,7 @@
 #include "hw/ppc/fdt.h"
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
+#include "block/thread-pool.h"
 
 /*
  * The nvdimm size should be aligned to SCM block size.
@@ -371,6 +373,237 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
     return H_SUCCESS;
 }
 
+static uint64_t flush_token;
+
+static int flush_worker_cb(void *opaque)
+{
+    int ret = H_SUCCESS;
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+
+    /* flush raw backing image */
+    if (qemu_fdatasync(state->backend_fd) < 0) {
+        error_report("papr_scm: Could not sync nvdimm to backend file: %s",
+                     strerror(errno));
+        ret = H_HARDWARE;
+    }
+
+    return ret;
+}
+
+static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+
+    state->hcall_ret = hcall_ret;
+    QLIST_REMOVE(state, node);
+    QLIST_INSERT_HEAD(&spapr->completed_flush_states, state, node);
+}
+
+static const VMStateDescription vmstate_spapr_nvdimm_flush_state = {
+     .name = "spapr_nvdimm_flush_state",
+     .version_id = 1,
+     .minimum_version_id = 1,
+     .fields = (VMStateField[]) {
+         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_UINT32(drcidx, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_END_OF_LIST()
+     },
+};
+
+static bool spapr_nvdimm_states_needed(void *opaque)
+{
+     SpaprMachineState *spapr = (SpaprMachineState *)opaque;
+
+     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
+             !QLIST_EMPTY(&spapr->completed_flush_states));
+}
+
+static int spapr_nvdimm_post_load(void *opaque, int version_id)
+{
+    SpaprMachineState *spapr = (SpaprMachineState *)opaque;
+    SpaprNVDIMMDeviceFlushState *state, *next;
+    PCDIMMDevice *dimm;
+    HostMemoryBackend *backend = NULL;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+    SpaprDrc *drc;
+
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
+        if (flush_token < state->continue_token) {
+            flush_token = state->continue_token;
+        }
+    }
+
+    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, next) {
+        if (flush_token < state->continue_token) {
+            flush_token = state->continue_token;
+        }
+
+        drc = spapr_drc_by_index(state->drcidx);
+        dimm = PC_DIMM(drc->dev);
+        backend = MEMORY_BACKEND(dimm->hostmem);
+        state->backend_fd = memory_region_get_fd(&backend->mr);
+
+        thread_pool_submit_aio(pool, flush_worker_cb, state,
+                               spapr_nvdimm_flush_completion_cb, state);
+    }
+
+    return 0;
+}
+
+const VMStateDescription vmstate_spapr_nvdimm_states = {
+    .name = "spapr_nvdimm_states",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = spapr_nvdimm_states_needed,
+    .post_load = spapr_nvdimm_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
+                        vmstate_spapr_nvdimm_flush_state,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_QLIST_V(pending_flush_states, SpaprMachineState, 1,
+                        vmstate_spapr_nvdimm_flush_state,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+/*
+ * Assign a token and reserve it for the new flush state.
+ */
+static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(
+                                                      SpaprMachineState *spapr)
+{
+    SpaprNVDIMMDeviceFlushState *state;
+
+    state = g_malloc0(sizeof(*state));
+
+    flush_token++;
+    /* Token zero is presumed as no job pending. Handle the overflow to zero */
+    if (flush_token == 0) {
+        flush_token++;
+    }
+    state->continue_token = flush_token;
+
+    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);
+
+    return state;
+}
+
+/*
+ * spapr_nvdimm_finish_flushes
+ *      Waits for all pending flush requests to complete
+ *      their execution and free the states
+ */
+void spapr_nvdimm_finish_flushes(SpaprMachineState *spapr)
+{
+    SpaprNVDIMMDeviceFlushState *state, *next;
+
+    /*
+     * Called on the reset path: the main loop thread, which runs the
+     * pending BHs, is itself executing the reset path and finally
+     * reaches here. The other code path is the guest's
+     * h_client_architecture_support, i.e. early boot.
+     */
+    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
+        aio_poll(qemu_get_aio_context(), true);
+    }
+
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
+        QLIST_REMOVE(state, node);
+        g_free(state);
+    }
+}
+
+/*
+ * spapr_nvdimm_get_flush_status
+ *      Fetches the status of the hcall worker and returns
+ *      H_LONG_BUSY_XYZ if the worker is still running.
+ */
+static int spapr_nvdimm_get_flush_status(SpaprMachineState *spapr,
+                                         uint64_t token)
+{
+    int ret = H_LONG_BUSY_ORDER_10_MSEC;
+    SpaprNVDIMMDeviceFlushState *state, *node;
+
+    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, node) {
+        if (state->continue_token == token) {
+            goto exit;
+        }
+    }
+    ret = H_P2; /* If not found in complete list too, invalid token */
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, node) {
+        if (state->continue_token == token) {
+            ret = state->hcall_ret;
+            QLIST_REMOVE(state, node);
+            g_free(state);
+            break;
+        }
+    }
+exit:
+    return ret;
+}
+
+/*
+ * H_SCM_FLUSH
+ * Input: drc_index, continue-token
+ * Out: continue-token
+ * Return Value: H_SUCCESS, H_PARAMETER, H_P2, H_LONG_BUSY
+ *
+ * Given a DRC Index Flush the data to backend NVDIMM device.
+ * The hcall returns H_LONG_BUSY_XX when the flush takes longer time and
+ * the hcall needs to be issued multiple times in order to be completely
+ * serviced. The continue-token from the output to be passed in the
+ * argument list of subsequent hcalls until the hcall is completely serviced
+ * at which point H_SUCCESS or other error is returned.
+ */
+static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                                      target_ulong opcode, target_ulong *args)
+{
+    int ret;
+    uint32_t drc_index = args[0];
+    uint64_t continue_token = args[1];
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    PCDIMMDevice *dimm;
+    HostMemoryBackend *backend = NULL;
+    SpaprNVDIMMDeviceFlushState *state;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+
+    if (!drc || !drc->dev ||
+        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    if (continue_token != 0) {
+        goto get_status;
+    }
+
+    dimm = PC_DIMM(drc->dev);
+    backend = MEMORY_BACKEND(dimm->hostmem);
+
+    state = spapr_nvdimm_init_new_flush_state(spapr);
+    if (!state) {
+        return H_HARDWARE;
+    }
+
+    state->drcidx = drc_index;
+    state->backend_fd = memory_region_get_fd(&backend->mr);
+
+    thread_pool_submit_aio(pool, flush_worker_cb, state,
+                           spapr_nvdimm_flush_completion_cb, state);
+
+    continue_token = state->continue_token;
+
+get_status:
+    ret = spapr_nvdimm_get_flush_status(spapr, continue_token);
+    if (H_IS_LONG_BUSY(ret)) {
+        args[0] = continue_token;
+    }
+
+    return ret;
+}
+
 static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                      target_ulong opcode, target_ulong *args)
 {
@@ -487,6 +720,7 @@ static void spapr_scm_register_types(void)
     spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
+    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
 }
 
 type_init(spapr_scm_register_types)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index bf7cab7a2c..478c031396 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -12,10 +12,12 @@
 #include "hw/ppc/spapr_xive.h"  /* For SpaprXive */
 #include "hw/ppc/xics.h"        /* For ICSState */
 #include "hw/ppc/spapr_tpm_proxy.h"
+#include "hw/ppc/spapr_nvdimm.h"
 
 struct SpaprVioBus;
 struct SpaprPhbState;
 struct SpaprNvram;
+struct SpaprNVDIMMDeviceFlushState;
 
 typedef struct SpaprEventLogEntry SpaprEventLogEntry;
 typedef struct SpaprEventSource SpaprEventSource;
@@ -245,6 +247,11 @@ struct SpaprMachineState {
     uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
 
     Error *fwnmi_migration_blocker;
+
+    /* nvdimm flush states */
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_flush_states;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_flush_states;
+
 };
 
 #define H_SUCCESS         0
@@ -538,8 +545,9 @@ struct SpaprMachineState {
 #define H_SCM_BIND_MEM          0x3EC
 #define H_SCM_UNBIND_MEM        0x3F0
 #define H_SCM_UNBIND_ALL        0x3FC
+#define H_SCM_FLUSH             0x44C
 
-#define MAX_HCALL_OPCODE        H_SCM_UNBIND_ALL
+#define MAX_HCALL_OPCODE        H_SCM_FLUSH
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index 764f999f54..24d8e37b33 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -11,6 +11,7 @@
 #define HW_SPAPR_NVDIMM_H
 
 #include "hw/mem/nvdimm.h"
+#include "migration/vmstate.h"
 
 typedef struct SpaprDrc SpaprDrc;
 typedef struct SpaprMachineState SpaprMachineState;
@@ -21,5 +22,17 @@ void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
+void spapr_nvdimm_finish_flushes(SpaprMachineState *spapr);
+
+typedef struct SpaprNVDIMMDeviceFlushState {
+    uint64_t continue_token;
+    int64_t hcall_ret;
+    int backend_fd;
+    uint32_t drcidx;
+
+    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
+} SpaprNVDIMMDeviceFlushState;
+
+extern const VMStateDescription vmstate_spapr_nvdimm_states;
 
 #endif





* [PATCH v4 3/3] nvdimm: Enable sync-dax device property for nvdimm
  2021-04-29  3:48 [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm Shivaprasad G Bhat
  2021-04-29  3:48 ` [PATCH v4 1/3] spapr: nvdimm: Forward declare and move the definitions Shivaprasad G Bhat
  2021-04-29  3:48 ` [PATCH v4 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall Shivaprasad G Bhat
@ 2021-04-29  3:49 ` Shivaprasad G Bhat
  2021-05-03 18:27   ` Eric Blake
  2021-04-29 15:55 ` [PATCH v4 0/3] nvdimm: Enable sync-dax " Stefan Hajnoczi
  2021-04-30 19:14 ` Dan Williams
  4 siblings, 1 reply; 19+ messages in thread
From: Shivaprasad G Bhat @ 2021-04-29  3:49 UTC (permalink / raw)
  To: david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, peter.maydell, eblake, qemu-arm,
	richard.henderson, pbonzini, marcel.apfelbaum, stefanha,
	haozhong.zhang, shameerali.kolothum.thodi, kwangwoo.lee, armbru
  Cc: linux-nvdimm, aneesh.kumar, qemu-devel, kvm-ppc, shivaprasadbhat,
	bharata

The patch adds the 'sync-dax' property to the nvdimm device.

When sync-dax is 'direct', it indicates that the backend is synchronous
DAX capable and no explicit flush requests are required. When the mode
is set to 'writeback', it indicates that the backend is not synchronous
DAX capable and explicit flushes to the hypervisor are required.

On PPC, where flush requests from the guest can be honoured by QEMU, the
'writeback' mode is supported and set as the default. The device tree
property "hcall-flush-required" is added to the nvdimm node, which makes
the guest issue H_SCM_FLUSH hcalls to request flushes explicitly. This
is the default behaviour when the sync-dax property is not set on the
nvdimm device. For old pSeries machines, the default is 'unsafe'.

For non-PPC platforms, the mode is set to 'unsafe' by default.
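
The per-machine defaults can be exercised with invocations like these
(illustrative):

# pseries-6.0 and newer: nvdimm devices default to sync-dax=writeback,
# so the guest sees "ibm,hcall-flush-required"
qemu-system-ppc64 -machine pseries-6.0,nvdimm=on ...

# pseries-5.2 and older: the default is sync-dax=unsafe and the
# property is not exposed
qemu-system-ppc64 -machine pseries-5.2,nvdimm=on ...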

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/arm/virt.c           |   28 +++++++++++++++++++++++--
 hw/i386/pc.c            |   28 +++++++++++++++++++++++--
 hw/mem/nvdimm.c         |   52 +++++++++++++++++++++++++++++++++++++++++++----
 hw/ppc/spapr.c          |   10 +++++++++
 hw/ppc/spapr_nvdimm.c   |   39 +++++++++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h |   11 ++++++++++
 include/hw/ppc/spapr.h  |    1 +
 qapi/common.json        |   20 ++++++++++++++++++
 8 files changed, 179 insertions(+), 10 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 9f01d9041b..f32e3e4010 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2358,6 +2358,27 @@ static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
     return ms->possible_cpus;
 }
 
+static bool virt_nvdimm_validate(const MachineState *ms, NVDIMMDevice *nvdimm,
+                                 Error **errp)
+{
+    NvdimmSyncModes sync;
+
+    if (!ms->nvdimms_state->is_enabled) {
+        error_setg(errp, "nvdimm is not enabled: add 'nvdimm=on' to '-M'");
+        return false;
+    }
+
+    sync = object_property_get_enum(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
+                                    "NvdimmSyncModes", &error_abort);
+    if (sync == NVDIMM_SYNC_MODES_WRITEBACK) {
+        error_setg(errp, "NVDIMM device " NVDIMM_SYNC_DAX_PROP
+                         "=%s mode unsupported", NvdimmSyncModes_str(sync));
+        return false;
+    }
+
+    return true;
+}
+
 static void virt_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
                                  Error **errp)
 {
@@ -2376,9 +2397,10 @@ static void virt_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
         return;
     }
 
-    if (is_nvdimm && !ms->nvdimms_state->is_enabled) {
-        error_setg(errp, "nvdimm is not enabled: add 'nvdimm=on' to '-M'");
-        return;
+    if (is_nvdimm) {
+        if (!virt_nvdimm_validate(ms, NVDIMM(dev), errp)) {
+            return;
+        }
     }
 
     pc_dimm_pre_plug(PC_DIMM(dev), MACHINE(hotplug_dev), NULL, errp);
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 8a84b25a03..2d5151462c 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1211,6 +1211,27 @@ void pc_i8259_create(ISABus *isa_bus, qemu_irq *i8259_irqs)
     g_free(i8259);
 }
 
+static bool pc_nvdimm_validate(const MachineState *ms, NVDIMMDevice *nvdimm,
+                               Error **errp)
+{
+    NvdimmSyncModes sync;
+
+    if (!ms->nvdimms_state->is_enabled) {
+        error_setg(errp, "nvdimm is not enabled: add 'nvdimm=on' to '-M'");
+        return false;
+    }
+
+    sync = object_property_get_enum(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
+                                    "NvdimmSyncModes", &error_abort);
+    if (sync == NVDIMM_SYNC_MODES_WRITEBACK) {
+        error_setg(errp, "NVDIMM device " NVDIMM_SYNC_DAX_PROP
+                   "=%s mode unsupported", NvdimmSyncModes_str(sync));
+        return false;
+    }
+
+    return true;
+}
+
 static void pc_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
                                Error **errp)
 {
@@ -1233,9 +1254,10 @@ static void pc_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
         return;
     }
 
-    if (is_nvdimm && !ms->nvdimms_state->is_enabled) {
-        error_setg(errp, "nvdimm is not enabled: missing 'nvdimm' in '-M'");
-        return;
+    if (is_nvdimm) {
+        if (!pc_nvdimm_validate(ms, NVDIMM(dev), errp)) {
+            return;
+        }
     }
 
     hotplug_handler_pre_plug(x86ms->acpi_dev, dev, &local_err);
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 7397b67156..56b4527362 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -96,6 +96,19 @@ static void nvdimm_set_uuid(Object *obj, Visitor *v, const char *name,
     g_free(value);
 }
 
+static int nvdimm_get_sync_mode(Object *obj, Error **errp G_GNUC_UNUSED)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+    return nvdimm->sync_dax;
+}
+
+static void nvdimm_set_sync_mode(Object *obj, int mode, Error **errp)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+    nvdimm->sync_dax = mode;
+}
 
 static void nvdimm_init(Object *obj)
 {
@@ -105,6 +118,13 @@ static void nvdimm_init(Object *obj)
 
     object_property_add(obj, NVDIMM_UUID_PROP, "QemuUUID", nvdimm_get_uuid,
                         nvdimm_set_uuid, NULL, NULL);
+
+    object_property_add_enum(obj, NVDIMM_SYNC_DAX_PROP, "NvdimmSyncModes",
+                             &NvdimmSyncModes_lookup, nvdimm_get_sync_mode,
+                             nvdimm_set_sync_mode);
+    object_property_set_description(obj, NVDIMM_SYNC_DAX_PROP,
+                                    "Set the Synchronus DAX mode");
+
 }
 
 static void nvdimm_finalize(Object *obj)
@@ -119,6 +139,9 @@ static void nvdimm_prepare_memory_region(NVDIMMDevice *nvdimm, Error **errp)
     PCDIMMDevice *dimm = PC_DIMM(nvdimm);
     uint64_t align, pmem_size, size;
     MemoryRegion *mr;
+    HostMemoryBackend *hostmem;
+    bool is_file_backed;
+    bool __attribute__((unused)) is_pmem = false;
 
     g_assert(!nvdimm->nvdimm_mr);
 
@@ -135,9 +158,8 @@ static void nvdimm_prepare_memory_region(NVDIMMDevice *nvdimm, Error **errp)
     nvdimm->label_data = memory_region_get_ram_ptr(mr) + pmem_size;
     pmem_size = QEMU_ALIGN_DOWN(pmem_size, align);
 
+    hostmem = dimm->hostmem;
     if (size <= nvdimm->label_size || !pmem_size) {
-        HostMemoryBackend *hostmem = dimm->hostmem;
-
         error_setg(errp, "the size of memdev %s (0x%" PRIx64 ") is too "
                    "small to contain nvdimm label (0x%" PRIx64 ") and "
                    "aligned PMEM (0x%" PRIx64 ")",
@@ -147,14 +169,36 @@ static void nvdimm_prepare_memory_region(NVDIMMDevice *nvdimm, Error **errp)
     }
 
     if (!nvdimm->unarmed && memory_region_is_rom(mr)) {
-        HostMemoryBackend *hostmem = dimm->hostmem;
-
         error_setg(errp, "'unarmed' property must be off since memdev %s "
                    "is read-only",
                    object_get_canonical_path_component(OBJECT(hostmem)));
         return;
     }
 
+    is_file_backed = (memory_region_get_fd(mr) > 0);
+    if (nvdimm->sync_dax == NVDIMM_SYNC_MODES_WRITEBACK && !is_file_backed) {
+        error_setg(errp, NVDIMM_SYNC_DAX_PROP"='%s' mode requires the "
+                   "memdev %s to be file backed",
+                   NvdimmSyncModes_str(nvdimm->sync_dax),
+                   object_get_canonical_path_component(OBJECT(hostmem)));
+        return;
+    }
+
+#ifdef CONFIG_LIBPMEM
+    if (is_file_backed) {
+        is_pmem = object_property_get_bool(OBJECT(hostmem), "pmem",
+                                           &error_abort);
+    }
+
+    if (nvdimm->sync_dax == NVDIMM_SYNC_MODES_DIRECT && !is_pmem) {
+        error_setg(errp, "NVDIMM device "NVDIMM_SYNC_DAX_PROP"=%s mode requires"
+                   " the memory backend device to be synchronous DAX capable. "
+                   "Indicate it so with pmem=yes for the corresponding "
+                   "memory-backend-file.",
+                   NvdimmSyncModes_str(nvdimm->sync_dax));
+    }
+#endif
+
     nvdimm->nvdimm_mr = g_new(MemoryRegion, 1);
     memory_region_init_alias(nvdimm->nvdimm_mr, OBJECT(dimm),
                              "nvdimm-memory", mr, 0, pmem_size);
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 80957f9188..d0058bc13b 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -4616,6 +4616,11 @@ static void spapr_machine_latest_class_options(MachineClass *mc)
 static void spapr_machine_6_0_class_options(MachineClass *mc)
 {
     /* Defaults for the latest behaviour inherited from the base class */
+    static GlobalProperty compat[] = {
+        { "nvdimm", "sync-dax", "writeback" },
+    };
+
+    compat_props_add(mc->compat_props, compat, G_N_ELEMENTS(compat));
 }
 
 DEFINE_SPAPR_MACHINE(6_0, "6.0", true);
@@ -4625,8 +4630,13 @@ DEFINE_SPAPR_MACHINE(6_0, "6.0", true);
  */
 static void spapr_machine_5_2_class_options(MachineClass *mc)
 {
+    static GlobalProperty compat[] = {
+        { "nvdimm", "sync-dax", "unsafe" },
+    };
+
     spapr_machine_6_0_class_options(mc);
     compat_props_add(mc->compat_props, hw_compat_5_2, hw_compat_5_2_len);
+    compat_props_add(mc->compat_props, compat, G_N_ELEMENTS(compat));
 }
 
 DEFINE_SPAPR_MACHINE(5_2, "5.2", false);
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 77eb7e1293..615439391c 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -50,6 +50,10 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
 {
     const MachineClass *mc = MACHINE_GET_CLASS(hotplug_dev);
     const MachineState *ms = MACHINE(hotplug_dev);
+    PCDIMMDevice __attribute__((unused)) *dimm = PC_DIMM(nvdimm);
+    MemoryRegion __attribute__((unused)) *mr;
+    bool __attribute__((unused)) is_pmem = false;
+    NvdimmSyncModes __attribute__((unused)) sync;
     g_autofree char *uuidstr = NULL;
     QemuUUID uuid;
     int ret;
@@ -77,6 +81,24 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
         return false;
     }
 
+#ifdef CONFIG_LIBPMEM
+    sync = object_property_get_enum(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
+                                    "NvdimmSyncModes", &error_abort);
+
+    mr = host_memory_backend_get_memory(dimm->hostmem);
+    if (memory_region_get_fd(mr) > 0) { /* memory-backend-file */
+        HostMemoryBackend *backend = MEMORY_BACKEND(dimm->hostmem);
+        is_pmem = object_property_get_bool(OBJECT(backend), "pmem",
+                                           &error_abort);
+    }
+
+    if (sync == NVDIMM_SYNC_MODES_WRITEBACK && is_pmem) {
+        warn_report("The NVDIMM backing device being Synchronous DAX capable, "
+                    NVDIMM_SYNC_DAX_PROP"='%s' is unnecessary as the backend "
+                    "ensures the safety already.", NvdimmSyncModes_str(sync));
+    }
+#endif
+
     uuidstr = object_property_get_str(OBJECT(nvdimm), NVDIMM_UUID_PROP,
                                       &error_abort);
     ret = qemu_uuid_parse(uuidstr, &uuid);
@@ -124,6 +146,9 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
     uint64_t lsize = nvdimm->label_size;
     uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
                                             NULL);
+    NvdimmSyncModes sync_dax = object_property_get_enum(OBJECT(nvdimm),
+                                         NVDIMM_SYNC_DAX_PROP,
+                                         "NvdimmSyncModes", &error_abort);
 
     drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
     g_assert(drc);
@@ -158,6 +183,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
                              "operating-system")));
     _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
 
+    if (sync_dax == NVDIMM_SYNC_MODES_WRITEBACK) {
+        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
+                         NULL, 0));
+    }
+
     return child_offset;
 }
 
@@ -566,6 +596,8 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
     uint64_t continue_token = args[1];
     SpaprDrc *drc = spapr_drc_by_index(drc_index);
     PCDIMMDevice *dimm;
+    NVDIMMDevice *nvdimm;
+    NvdimmSyncModes sync_dax;
     HostMemoryBackend *backend = NULL;
     SpaprNVDIMMDeviceFlushState *state;
     ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
@@ -575,6 +607,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
         return H_PARAMETER;
     }
 
+    nvdimm = NVDIMM(drc->dev);
+    sync_dax = object_property_get_enum(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
+                                    "NvdimmSyncModes", &error_abort);
+    if (sync_dax != NVDIMM_SYNC_MODES_WRITEBACK) {
+        return H_UNSUPPORTED;
+    }
+
     if (continue_token != 0) {
         goto get_status;
     }
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index bcf62f825c..ef30bdeca4 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -28,6 +28,7 @@
 #include "qemu/uuid.h"
 #include "hw/acpi/aml-build.h"
 #include "qom/object.h"
+#include "qapi/qapi-types-machine.h"
 
 #define NVDIMM_DEBUG 0
 #define nvdimm_debug(fmt, ...)                                \
@@ -51,6 +52,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
 #define NVDIMM_LABEL_SIZE_PROP "label-size"
 #define NVDIMM_UUID_PROP       "uuid"
 #define NVDIMM_UNARMED_PROP    "unarmed"
+#define NVDIMM_SYNC_DAX_PROP   "sync-dax"
 
 struct NVDIMMDevice {
     /* private */
@@ -85,6 +87,15 @@ struct NVDIMMDevice {
      */
     bool unarmed;
 
+    /*
+     * The 'writeback' value indicates that the guest must make explicit
+     * flush requests to the hypervisor. When 'direct', the device is
+     * assumed to be synchronous DAX capable and no explicit flush
+     * is required. 'unsafe' indicates that flush semantics are
+     * unimplemented and data persistence is not guaranteed across
+     * power failures.
+     */
+    NvdimmSyncModes sync_dax;
+
     /*
      * The PPC64 - spapr requires each nvdimm device have a uuid.
      */
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 478c031396..ddde87e2b6 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -332,6 +332,7 @@ struct SpaprMachineState {
 #define H_P7              -60
 #define H_P8              -61
 #define H_P9              -62
+#define H_UNSUPPORTED     -67
 #define H_OVERLAP         -68
 #define H_UNSUPPORTED_FLAG -256
 #define H_MULTI_THREADS_ACTIVE -9005
diff --git a/qapi/common.json b/qapi/common.json
index 7c976296f0..bec1b45b09 100644
--- a/qapi/common.json
+++ b/qapi/common.json
@@ -197,3 +197,23 @@
 { 'enum': 'GrabToggleKeys',
   'data': [ 'ctrl-ctrl', 'alt-alt', 'shift-shift','meta-meta', 'scrolllock',
             'ctrl-scrolllock' ] }
+
+##
+# @NvdimmSyncModes:
+#
+# Indicates the mode of flush to be used to ensure persistence in case
+# of power failures.
+#
+# @unsafe: Indicates that the data on the backend device may not be
+#          consistent in power failure scenarios.
+# @direct: Indicates that the backend device supports synchronous DAX
+#          and no explicit flush requests from the guest are required.
+# @writeback: To be used when the backend device doesn't support synchronous
+#             DAX. The hypervisor issues flushes to the disk when requested
+#             by the guest.
+#
+# Since: 6.0
+##
+{ 'enum': 'NvdimmSyncModes',
+  'data': [ 'unsafe', 'writeback',
+            { 'name': 'direct', 'if': 'defined(CONFIG_LIBPMEM)' } ] }





* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-04-29  3:48 [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm Shivaprasad G Bhat
                   ` (2 preceding siblings ...)
  2021-04-29  3:49 ` [PATCH v4 3/3] nvdimm: Enable sync-dax device property for nvdimm Shivaprasad G Bhat
@ 2021-04-29 15:55 ` Stefan Hajnoczi
  2021-04-29 16:32   ` Aneesh Kumar K.V
  2021-04-30 19:14 ` Dan Williams
  4 siblings, 1 reply; 19+ messages in thread
From: Stefan Hajnoczi @ 2021-04-29 15:55 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: peter.maydell, mst, qemu-devel, linux-nvdimm, aneesh.kumar,
	armbru, bharata, haozhong.zhang, ehabkost, richard.henderson,
	groug, kvm-ppc, qemu-arm, imammedo, kwangwoo.lee, david,
	xiaoguangrong.eric, shameerali.kolothum.thodi, shivaprasadbhat,
	qemu-ppc, pbonzini


On Wed, Apr 28, 2021 at 11:48:21PM -0400, Shivaprasad G Bhat wrote:
> The nvdimm devices are expected to ensure write persistence during power
> failure kind of scenarios.
> 
> The libpmem has architecture specific instructions like dcbf on POWER
> to flush the cache data to backend nvdimm device during normal writes
> followed by explicit flushes if the backend devices are not synchronous
> DAX capable.
> 
> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
> and the subsequent flush doesn't traslate to actual flush to the backend
> file on the host in case of file backed v-nvdimms. This is addressed by
> virtio-pmem in case of x86_64 by making explicit flushes translating to
> fsync at qemu.
> 
> On SPAPR, the issue is addressed by adding a new hcall to
> request for an explicit flush from the guest ndctl driver when the backend
> nvdimm cannot ensure write persistence with dcbf alone. So, the approach
> here is to convey when the hcall flush is required in a device tree
> property. The guest makes the hcall when the property is found, instead
> of relying on dcbf.

Sorry, I'm not very familiar with SPAPR. Why add a hypercall when the
virtio-nvdimm device already exists?

Stefan



* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-04-29 15:55 ` [PATCH v4 0/3] nvdimm: Enable sync-dax " Stefan Hajnoczi
@ 2021-04-29 16:32   ` Aneesh Kumar K.V
  2021-04-30  4:27     ` David Gibson
  0 siblings, 1 reply; 19+ messages in thread
From: Aneesh Kumar K.V @ 2021-04-29 16:32 UTC (permalink / raw)
  To: Stefan Hajnoczi, Shivaprasad G Bhat
  Cc: peter.maydell, mst, qemu-devel, linux-nvdimm, armbru, bharata,
	haozhong.zhang, ehabkost, richard.henderson, groug, kvm-ppc,
	qemu-arm, imammedo, kwangwoo.lee, david, xiaoguangrong.eric,
	shameerali.kolothum.thodi, shivaprasadbhat, qemu-ppc, pbonzini

On 4/29/21 9:25 PM, Stefan Hajnoczi wrote:
> On Wed, Apr 28, 2021 at 11:48:21PM -0400, Shivaprasad G Bhat wrote:
>> The nvdimm devices are expected to ensure write persistence during power
>> failure kind of scenarios.
>>
>> The libpmem has architecture specific instructions like dcbf on POWER
>> to flush the cache data to backend nvdimm device during normal writes
>> followed by explicit flushes if the backend devices are not synchronous
>> DAX capable.
>>
>> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
>> and the subsequent flush doesn't traslate to actual flush to the backend
>> file on the host in case of file backed v-nvdimms. This is addressed by
>> virtio-pmem in case of x86_64 by making explicit flushes translating to
>> fsync at qemu.
>>
>> On SPAPR, the issue is addressed by adding a new hcall to
>> request for an explicit flush from the guest ndctl driver when the backend
>> nvdimm cannot ensure write persistence with dcbf alone. So, the approach
>> here is to convey when the hcall flush is required in a device tree
>> property. The guest makes the hcall when the property is found, instead
>> of relying on dcbf.
> 
> Sorry, I'm not very familiar with SPAPR. Why add a hypercall when the
> virtio-nvdimm device already exists?
> 

On virtualized ppc64 platforms, guests use the papr_scm.ko kernel driver for
persistent memory support. This was done such that we can use one kernel 
driver to support persistent memory with multiple hypervisors. To avoid 
supporting multiple drivers in the guest, -device nvdimm Qemu 
command-line results in Qemu using PAPR SCM backend. What this patch 
series does is to make sure we expose the correct synchronous fault 
support, when we back such nvdimm device with a file.

The existing PAPR SCM backend enables persistent memory support with the 
help of multiple hypercalls.

#define H_SCM_READ_METADATA     0x3E4
#define H_SCM_WRITE_METADATA    0x3E8
#define H_SCM_BIND_MEM          0x3EC
#define H_SCM_UNBIND_MEM        0x3F0
#define H_SCM_UNBIND_ALL        0x3FC

Most of them are already implemented in Qemu. This patch series 
implements the H_SCM_FLUSH hypercall.




-aneesh




* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-04-29 16:32   ` Aneesh Kumar K.V
@ 2021-04-30  4:27     ` David Gibson
  2021-04-30 15:08       ` Stefan Hajnoczi
  0 siblings, 1 reply; 19+ messages in thread
From: David Gibson @ 2021-04-30  4:27 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: peter.maydell, Shivaprasad G Bhat, mst, qemu-devel, linux-nvdimm,
	armbru, bharata, haozhong.zhang, ehabkost, richard.henderson,
	groug, kvm-ppc, qemu-arm, Stefan Hajnoczi, imammedo,
	kwangwoo.lee, xiaoguangrong.eric, shameerali.kolothum.thodi,
	shivaprasadbhat, qemu-ppc, pbonzini


On Thu, Apr 29, 2021 at 10:02:23PM +0530, Aneesh Kumar K.V wrote:
> On 4/29/21 9:25 PM, Stefan Hajnoczi wrote:
> > On Wed, Apr 28, 2021 at 11:48:21PM -0400, Shivaprasad G Bhat wrote:
> > > The nvdimm devices are expected to ensure write persistence during power
> > > failure kind of scenarios.
> > > 
> > > The libpmem has architecture specific instructions like dcbf on POWER
> > > to flush the cache data to backend nvdimm device during normal writes
> > > followed by explicit flushes if the backend devices are not synchronous
> > > DAX capable.
> > > 
> > > Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
> > > and the subsequent flush doesn't traslate to actual flush to the backend
> > > file on the host in case of file backed v-nvdimms. This is addressed by
> > > virtio-pmem in case of x86_64 by making explicit flushes translating to
> > > fsync at qemu.
> > > 
> > > On SPAPR, the issue is addressed by adding a new hcall to
> > > request for an explicit flush from the guest ndctl driver when the backend
> > > nvdimm cannot ensure write persistence with dcbf alone. So, the approach
> > > here is to convey when the hcall flush is required in a device tree
> > > property. The guest makes the hcall when the property is found, instead
> > > of relying on dcbf.
> > 
> > Sorry, I'm not very familiar with SPAPR. Why add a hypercall when the
> > virtio-nvdimm device already exists?
> > 
> 
> On virtualized ppc64 platforms, guests use papr_scm.ko kernel drive for
> persistent memory support. This was done such that we can use one kernel
> driver to support persistent memory with multiple hypervisors. To avoid
> supporting multiple drivers in the guest, -device nvdimm Qemu command-line
> results in Qemu using PAPR SCM backend. What this patch series does is to
> make sure we expose the correct synchronous fault support, when we back such
> nvdimm device with a file.
> 
> The existing PAPR SCM backend enables persistent memory support with the
> help of multiple hypercall.
> 
> #define H_SCM_READ_METADATA     0x3E4
> #define H_SCM_WRITE_METADATA    0x3E8
> #define H_SCM_BIND_MEM          0x3EC
> #define H_SCM_UNBIND_MEM        0x3F0
> #define H_SCM_UNBIND_ALL        0x3FC
> 
> Most of them are already implemented in Qemu. This patch series implements
> H_SCM_FLUSH hypercall.

The overall point here is that we didn't define the hypercall.  It was
defined in order to support NVDIMM/pmem devices under PowerVM.  For
uniformity between PowerVM and KVM guests, we want to support the same
hypercall interface on KVM/qemu as well.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-04-30  4:27     ` David Gibson
@ 2021-04-30 15:08       ` Stefan Hajnoczi
  0 siblings, 0 replies; 19+ messages in thread
From: Stefan Hajnoczi @ 2021-04-30 15:08 UTC (permalink / raw)
  To: David Gibson
  Cc: peter.maydell, Shivaprasad G Bhat, mst, qemu-devel, linux-nvdimm,
	Aneesh Kumar K.V, armbru, bharata, haozhong.zhang, ehabkost,
	richard.henderson, groug, kvm-ppc, qemu-arm, imammedo,
	kwangwoo.lee, xiaoguangrong.eric, shameerali.kolothum.thodi,
	shivaprasadbhat, qemu-ppc, pbonzini


On Fri, Apr 30, 2021 at 02:27:18PM +1000, David Gibson wrote:
> On Thu, Apr 29, 2021 at 10:02:23PM +0530, Aneesh Kumar K.V wrote:
> > On 4/29/21 9:25 PM, Stefan Hajnoczi wrote:
> > > On Wed, Apr 28, 2021 at 11:48:21PM -0400, Shivaprasad G Bhat wrote:
> > > > The nvdimm devices are expected to ensure write persistence during power
> > > > failure kind of scenarios.
> > > > 
> > > > The libpmem has architecture specific instructions like dcbf on POWER
> > > > to flush the cache data to backend nvdimm device during normal writes
> > > > followed by explicit flushes if the backend devices are not synchronous
> > > > DAX capable.
> > > > 
> > > > Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
> > > > and the subsequent flush doesn't traslate to actual flush to the backend
> > > > file on the host in case of file backed v-nvdimms. This is addressed by
> > > > virtio-pmem in case of x86_64 by making explicit flushes translating to
> > > > fsync at qemu.
> > > > 
> > > > On SPAPR, the issue is addressed by adding a new hcall to
> > > > request for an explicit flush from the guest ndctl driver when the backend
> > > > nvdimm cannot ensure write persistence with dcbf alone. So, the approach
> > > > here is to convey when the hcall flush is required in a device tree
> > > > property. The guest makes the hcall when the property is found, instead
> > > > of relying on dcbf.
> > > 
> > > Sorry, I'm not very familiar with SPAPR. Why add a hypercall when the
> > > virtio-nvdimm device already exists?
> > > 
> > 
> > On virtualized ppc64 platforms, guests use papr_scm.ko kernel drive for
> > persistent memory support. This was done such that we can use one kernel
> > driver to support persistent memory with multiple hypervisors. To avoid
> > supporting multiple drivers in the guest, -device nvdimm Qemu command-line
> > results in Qemu using PAPR SCM backend. What this patch series does is to
> > make sure we expose the correct synchronous fault support, when we back such
> > nvdimm device with a file.
> > 
> > The existing PAPR SCM backend enables persistent memory support with the
> > help of multiple hypercall.
> > 
> > #define H_SCM_READ_METADATA     0x3E4
> > #define H_SCM_WRITE_METADATA    0x3E8
> > #define H_SCM_BIND_MEM          0x3EC
> > #define H_SCM_UNBIND_MEM        0x3F0
> > #define H_SCM_UNBIND_ALL        0x3FC
> > 
> > Most of them are already implemented in Qemu. This patch series implements
> > H_SCM_FLUSH hypercall.
> 
> The overall point here is that we didn't define the hypercall.  It was
> defined in order to support NVDIMM/pmem devices under PowerVM.  For
> uniformity between PowerVM and KVM guests, we want to support the same
> hypercall interface on KVM/qemu as well.

Okay, that's fine. Now Linux and QEMU have multiple ways of doing this,
but it's fair enough if it's an existing platform hypercall.

Stefan



* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-04-29  3:48 [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm Shivaprasad G Bhat
                   ` (3 preceding siblings ...)
  2021-04-29 15:55 ` [PATCH v4 0/3] nvdimm: Enable sync-dax " Stefan Hajnoczi
@ 2021-04-30 19:14 ` Dan Williams
  2021-05-01 13:55   ` Aneesh Kumar K.V
  2021-05-03 14:05   ` Shivaprasad G Bhat
  4 siblings, 2 replies; 19+ messages in thread
From: Dan Williams @ 2021-04-30 19:14 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: peter.maydell, Michael S. Tsirkin, Qemu Developers, linux-nvdimm,
	Aneesh Kumar K.V, Markus Armbruster, bharata, Haozhong Zhang,
	Eduardo Habkost, richard.henderson, Greg Kurz, kvm-ppc, qemu-arm,
	Stefan Hajnoczi, Igor Mammedov, kwangwoo.lee, David Gibson,
	Xiao Guangrong, shameerali.kolothum.thodi, shivaprasadbhat,
	qemu-ppc, Paolo Bonzini

Some corrections to terminology confusion below...


On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:
>
> The nvdimm devices are expected to ensure write persistence during power
> failure kind of scenarios.

No, QEMU is not expected to make that guarantee. QEMU is free to lie
to the guest about the persistence guarantees of the guest PMEM
ranges. It's more accurate to say that QEMU nvdimm devices can emulate
persistent memory and optionally pass through host power-fail
persistence guarantees to the guest. The power-fail persistence domain
can be one of "cpu_cache", or "memory_controller" if the persistent
memory region is "synchronous". If the persistent range is not
synchronous, it really isn't "persistent memory"; it's memory mapped
storage that needs I/O commands to flush.

> The libpmem has architecture specific instructions like dcbf on POWER

Which "libpmem" is this? PMDK is a reference library not a PMEM
interface... maybe I'm missing what libpmem has to do with QEMU?

> to flush the cache data to backend nvdimm device during normal writes
> followed by explicit flushes if the backend devices are not synchronous
> DAX capable.
>
> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
> and the subsequent flush doesn't traslate to actual flush to the backend

s/traslate/translate/

> file on the host in case of file backed v-nvdimms. This is addressed by
> virtio-pmem in case of x86_64 by making explicit flushes translating to
> fsync at qemu.

Note that virtio-pmem was a proposal for a specific optimization of
allowing guests to share page cache. The virtio-pmem approach is not
to be confused with actual persistent memory.

> On SPAPR, the issue is addressed by adding a new hcall to
> request for an explicit flush from the guest ndctl driver when the backend

What is an "ndctl" driver? ndctl is userspace tooling, do you mean the
guest pmem driver?

> nvdimm cannot ensure write persistence with dcbf alone. So, the approach
> here is to convey when the hcall flush is required in a device tree
> property. The guest makes the hcall when the property is found, instead
> of relying on dcbf.
>
> A new device property sync-dax is added to the nvdimm device. When the
> sync-dax is 'writeback'(default for PPC), device property
> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH
> requesting for an explicit flush.

I'm not sure "sync-dax" is a suitable name for the property of the
guest persistent memory. There is no requirement that the
memory-backend file for a guest be a dax-capable file. It's also
implementation specific what hypercall needs to be invoked for a given
occurrence of "sync-dax". What does that map to on non-PPC platforms
for example? It seems to me that an "nvdimm" device presents the
synchronous usage model and a whole other device type implements an
async-hypercall setup that the guest happens to service with its
nvdimm stack, but it's not an "nvdimm" anymore at that point.

> sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines
> prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented
> now as the flush semantics are unimplemented.

"sync-dax" has no meaning on its own, I think this needs an explicit
mechanism to convey both the "not-sync" property *and* the callback
method, it shouldn't be inferred by arch type.

> When the backend file is actually synchronous DAX capable and no explicit
> flushes are required, the sync-dax mode 'direct' is to be used.
>
> The below demonstration shows the map_sync behavior with sync-dax writeback &
> direct.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
>
> The pmem0 is from nvdimm with With sync-dax=direct, and pmem1 is from
> nvdimm with syn-dax=writeback, mounted as
> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
>
> [root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When sync-dax=unsafe/direct
> [root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when sync-dax=writeback
> Failed to mmap  with Operation not supported
>
> The first patch does the header file cleanup necessary for the
> subsequent ones. Second patch implements the hcall, adds the necessary
> vmstate properties to spapr machine structure for carrying the hcall
> status during save-restore. The nature of the hcall being asynchronus,
> the patch uses aio utilities to offload the flush. The third patch adds
> the 'sync-dax' device property and enables the device tree property
> for the guest to utilise the hcall.
>
> The kernel changes to exploit this hcall is at
> https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch
>
> ---
> v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
> Changes from v3:
>       - Fixed the forward declaration coding guideline violations in 1st patch.
>       - Removed the code waiting for the flushes to complete during migration,
>         instead restart the flush worker on destination qemu in post load.
>       - Got rid of the randomization of the flush tokens, using simple
>         counter.
>       - Got rid of the redundant flush state lock, relying on the BQL now.
>       - Handling the memory-backend-ram usage
>       - Changed the sync-dax symantics from on/off to 'unsafe','writeback' and 'direct'.
>         Added prevention code using 'writeback' on arm and x86_64.
>       - Fixed all the miscellaneous comments.
>
> v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
> Changes from v2:
>       - Using the thread pool based approach as suggested
>       - Moved the async hcall handling code to spapr_nvdimm.c along
>         with some simplifications
>       - Added vmstate to preserve the hcall status during save-restore
>         along with pre_save handler code to complete all ongoning flushes.
>       - Added hw_compat magic for sync-dax 'on' on previous machines.
>       - Miscellanious minor fixes.
>
> v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
> Changes from v1
>       - Fixed a missed-out unlock
>       - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token
>
> Shivaprasad G Bhat (3):
>       spapr: nvdimm: Forward declare and move the definitions
>       spapr: nvdimm: Implement H_SCM_FLUSH hcall
>       nvdimm: Enable sync-dax device property for nvdimm
>
>
>  hw/arm/virt.c                 |   28 ++++
>  hw/i386/pc.c                  |   28 ++++
>  hw/mem/nvdimm.c               |   52 +++++++
>  hw/ppc/spapr.c                |   16 ++
>  hw/ppc/spapr_nvdimm.c         |  285 +++++++++++++++++++++++++++++++++++++++++
>  include/hw/mem/nvdimm.h       |   11 ++
>  include/hw/ppc/spapr.h        |   11 +-
>  include/hw/ppc/spapr_nvdimm.h |   27 ++--
>  qapi/common.json              |   20 +++
>  9 files changed, 455 insertions(+), 23 deletions(-)
>
> --
> Signature
> _______________________________________________
> Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
> To unsubscribe send an email to linux-nvdimm-leave@lists.01.org


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-04-30 19:14 ` Dan Williams
@ 2021-05-01 13:55   ` Aneesh Kumar K.V
  2021-05-03 14:05   ` Shivaprasad G Bhat
  1 sibling, 0 replies; 19+ messages in thread
From: Aneesh Kumar K.V @ 2021-05-01 13:55 UTC (permalink / raw)
  To: Dan Williams, Shivaprasad G Bhat
  Cc: peter.maydell, Michael S. Tsirkin, Qemu Developers, linux-nvdimm,
	Markus Armbruster, bharata, Haozhong Zhang, Eduardo Habkost,
	richard.henderson, Greg Kurz, kvm-ppc, qemu-arm, Stefan Hajnoczi,
	Igor Mammedov, kwangwoo.lee, David Gibson, Xiao Guangrong,
	shameerali.kolothum.thodi, shivaprasadbhat, qemu-ppc,
	Paolo Bonzini

On 5/1/21 12:44 AM, Dan Williams wrote:
> Some corrections to terminology confusion below...

.......

> 
>> file on the host in case of file backed v-nvdimms. This is addressed by
>> virtio-pmem in case of x86_64 by making explicit flushes translating to
>> fsync at qemu.
> 
> Note that virtio-pmem was a proposal for a specific optimization of
> allowing guests to share page cache. The virtio-pmem approach is not
> to be confused with actual persistent memory.
> 

.....

>>
>> A new device property sync-dax is added to the nvdimm device. When the
>> sync-dax is 'writeback'(default for PPC), device property
>> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH
>> requesting for an explicit flush.
> 
> I'm not sure "sync-dax" is a suitable name for the property of the
> guest persistent memory. There is no requirement that the
> memory-backend file for a guest be a dax-capable file. It's also
> implementation specific what hypercall needs to be invoked for a given
> occurrence of "sync-dax". What does that map to on non-PPC platforms
> for example? It seems to me that an "nvdimm" device presents the
> synchronous usage model and a whole other device type implements an
> async-hypercall setup that the guest happens to service with its
> nvdimm stack, but it's not an "nvdimm" anymore at that point.
> 

What is attempted here is to use the same guest driver, papr_scm.ko, to
support the use case of sharing the host page cache, instead of depending
on a new guest driver, virtio-pmem. This also tries to correctly indicate
to the guest that a usage like

-object memory-backend-file,id=memnvdimm1,mem-path=file_name
-device nvdimm,memdev=memnvdimm1

is indeed sharing the host page cache and not really emulating persistent
memory.

W.r.t. non-PPC platforms, this was discussed earlier and one of the
suggestions there was to mark that case as "unsafe".

Any suggestions for an alternative property name to "sync-dax"?

>> sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines
>> prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented
>> now as the flush semantics are unimplemented.
> 
> "sync-dax" has no meaning on its own, I think this needs an explicit
> mechanism to convey both the "not-sync" property *and* the callback
> method, it shouldn't be inferred by arch type.

Won't a non-sync property imply that the guest needs to do a callback to 
ensure persistence? Hence the two are related?


-aneesh


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-04-30 19:14 ` Dan Williams
  2021-05-01 13:55   ` Aneesh Kumar K.V
@ 2021-05-03 14:05   ` Shivaprasad G Bhat
  2021-05-03 19:41     ` Dan Williams
  1 sibling, 1 reply; 19+ messages in thread
From: Shivaprasad G Bhat @ 2021-05-03 14:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: peter.maydell, Michael S. Tsirkin, Qemu Developers, linux-nvdimm,
	Aneesh Kumar K.V, Markus Armbruster, bharata, Haozhong Zhang,
	Eduardo Habkost, richard.henderson, Greg Kurz, kvm-ppc, qemu-arm,
	Stefan Hajnoczi, Igor Mammedov, kwangwoo.lee, David Gibson,
	Xiao Guangrong, shameerali.kolothum.thodi, shivaprasadbhat,
	qemu-ppc, Paolo Bonzini


On 5/1/21 12:44 AM, Dan Williams wrote:
> Some corrections to terminology confusion below...
>
>
> On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:
>> The nvdimm devices are expected to ensure write persistence during power
>> failure kind of scenarios.
> No, QEMU is not expected to make that guarantee. QEMU is free to lie
> to the guest about the persistence guarantees of the guest PMEM
> ranges. It's more accurate to say that QEMU nvdimm devices can emulate
> persistent memory and optionally pass through host power-fail
> persistence guarantees to the guest. The power-fail persistence domain
> can be one of "cpu_cache", or "memory_controller" if the persistent
> memory region is "synchronous". If the persistent range is not
> synchronous, it really isn't "persistent memory"; it's memory mapped
> storage that needs I/O commands to flush.

Since this is a virtual nvdimm (v-nvdimm) backed by a file, the data is
completely in the host pagecache, and we need a way to ensure that the
host pagecache is flushed to the backend. This is analogous to the WPQ
flush being offloaded to the hypervisor.

Ref: https://github.com/dgibson/qemu/blob/main/docs/nvdimm.txt
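
For reference, a minimal sketch of what servicing such a guest flush
request amounts to on the host side for a file-backed v-nvdimm
(hypothetical helper, not the actual QEMU code):

    /* Flush dirty host pagecache pages of the backing file, so that
     * data written by the guest survives a host power failure. */
    #include <unistd.h>

    static int flush_backing_file(int backend_fd)
    {
        /* fsync() writes back dirty pages and metadata of the file */
        return fsync(backend_fd);
    }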


>
>> The libpmem has architecture specific instructions like dcbf on POWER
> Which "libpmem" is this? PMDK is a reference library not a PMEM
> interface... maybe I'm missing what libpmem has to do with QEMU?


I was referring to the semantics of flushing pmem cache lines, as in
PMDK/libpmem.


>
>> to flush the cache data to backend nvdimm device during normal writes
>> followed by explicit flushes if the backend devices are not synchronous
>> DAX capable.
>>
>> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
>> and the subsequent flush doesn't traslate to actual flush to the backend
> s/traslate/translate/
>
>> file on the host in case of file backed v-nvdimms. This is addressed by
>> virtio-pmem in case of x86_64 by making explicit flushes translating to
>> fsync at qemu.
> Note that virtio-pmem was a proposal for a specific optimization of
> allowing guests to share page cache. The virtio-pmem approach is not
> to be confused with actual persistent memory.
>
>> On SPAPR, the issue is addressed by adding a new hcall to
>> request for an explicit flush from the guest ndctl driver when the backend
> What is an "ndctl" driver? ndctl is userspace tooling, do you mean the
> guest pmem driver?


Oops, wrong terminology. I was referring to the guest libnvdimm and
papr_scm kernel modules.


>
>> nvdimm cannot ensure write persistence with dcbf alone. So, the approach
>> here is to convey when the hcall flush is required in a device tree
>> property. The guest makes the hcall when the property is found, instead
>> of relying on dcbf.
>>
>> A new device property sync-dax is added to the nvdimm device. When the
>> sync-dax is 'writeback'(default for PPC), device property
>> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH
>> requesting for an explicit flush.
> I'm not sure "sync-dax" is a suitable name for the property of the
> guest persistent memory.


The sync-dax property translates to the ND_REGION_ASYNC flag being set/unset
for the pmem region, and also to whether the nvdimm_flush callback is
provided in papr_scm or not. As everything boils down to the synchronous
nature of the device, I chose sync-dax for the name.


>   There is no requirement that the
> memory-backend file for a guest be a dax-capable file. It's also
> implementation specific what hypercall needs to be invoked for a given
> occurrence of "sync-dax". What does that map to on non-PPC platforms
> for example?


The backend file can be dax-capable, which is hinted using
"sync-dax=direct". When the backend is not dax-capable,
"sync-dax=writeback" is to be used, so that the guest makes the hcall.
On all non-PPC archs, qemu errors out on "sync-dax=writeback", stating
the lack of support.


>   It seems to me that an "nvdimm" device presents the
> synchronous usage model and a whole other device type implements an
> async-hypercall setup that the guest happens to service with its
> nvdimm stack, but it's not an "nvdimm" anymore at that point.


In case the file backing the v-nvdimm is not dax-capable, we need the
flush semantics on the guest to be mapped to a pagecache flush on the
host side.


>
>> sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines
>> prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented
>> now as the flush semantics are unimplemented.
> "sync-dax" has no meaning on its own, I think this needs an explicit
> mechanism to convey both the "not-sync" property *and* the callback
> method, it shouldn't be inferred by arch type.


Yes. On all platforms, "sync-dax=unsafe" means that with a host power
failure the host pagecache is lost, and subsequently the data written
by the guest will also be gone. This is the default for non-PPC.

On PPC, the default is "sync-dax=writeback" - so ND_REGION_ASYNC is set
for the region and the guest makes hcalls to issue fsync on the host.

Are you suggesting that I keep "unsafe" as the default for all
architectures, including PPC, and a user can set it to "writeback" if
desired?


>
>> When the backend file is actually synchronous DAX capable and no explicit
>> flushes are required, the sync-dax mode 'direct' is to be used.
>>
>> The below demonstration shows the map_sync behavior with sync-dax writeback &
>> direct.
>> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
>>
>> The pmem0 is from nvdimm with With sync-dax=direct, and pmem1 is from
>> nvdimm with syn-dax=writeback, mounted as
>> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
>> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
>>
>> [root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When sync-dax=unsafe/direct
>> [root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when sync-dax=writeback
>> Failed to mmap  with Operation not supported
>>
>> The first patch does the header file cleanup necessary for the
>> subsequent ones. Second patch implements the hcall, adds the necessary
>> vmstate properties to spapr machine structure for carrying the hcall
>> status during save-restore. The nature of the hcall being asynchronus,
>> the patch uses aio utilities to offload the flush. The third patch adds
>> the 'sync-dax' device property and enables the device tree property
>> for the guest to utilise the hcall.
>>
>> The kernel changes to exploit this hcall is at
>> https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch
>>
>> ---
>> v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
>> Changes from v3:
>>        - Fixed the forward declaration coding guideline violations in 1st patch.
>>        - Removed the code waiting for the flushes to complete during migration,
>>          instead restart the flush worker on destination qemu in post load.
>>        - Got rid of the randomization of the flush tokens, using simple
>>          counter.
>>        - Got rid of the redundant flush state lock, relying on the BQL now.
>>        - Handling the memory-backend-ram usage
>>        - Changed the sync-dax symantics from on/off to 'unsafe','writeback' and 'direct'.
>>          Added prevention code using 'writeback' on arm and x86_64.
>>        - Fixed all the miscellaneous comments.
>>
>> v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
>> Changes from v2:
>>        - Using the thread pool based approach as suggested
>>        - Moved the async hcall handling code to spapr_nvdimm.c along
>>          with some simplifications
>>        - Added vmstate to preserve the hcall status during save-restore
>>          along with pre_save handler code to complete all ongoning flushes.
>>        - Added hw_compat magic for sync-dax 'on' on previous machines.
>>        - Miscellanious minor fixes.
>>
>> v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
>> Changes from v1
>>        - Fixed a missed-out unlock
>>        - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token
>>
>> Shivaprasad G Bhat (3):
>>        spapr: nvdimm: Forward declare and move the definitions
>>        spapr: nvdimm: Implement H_SCM_FLUSH hcall
>>        nvdimm: Enable sync-dax device property for nvdimm
>>
>>
>>   hw/arm/virt.c                 |   28 ++++
>>   hw/i386/pc.c                  |   28 ++++
>>   hw/mem/nvdimm.c               |   52 +++++++
>>   hw/ppc/spapr.c                |   16 ++
>>   hw/ppc/spapr_nvdimm.c         |  285 +++++++++++++++++++++++++++++++++++++++++
>>   include/hw/mem/nvdimm.h       |   11 ++
>>   include/hw/ppc/spapr.h        |   11 +-
>>   include/hw/ppc/spapr_nvdimm.h |   27 ++--
>>   qapi/common.json              |   20 +++
>>   9 files changed, 455 insertions(+), 23 deletions(-)
>>
>> --
>> Signature
>> _______________________________________________
>> Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
>> To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
> _______________________________________________
> Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
> To unsubscribe send an email to linux-nvdimm-leave@lists.01.org


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/3] spapr: nvdimm: Forward declare and move the definitions
  2021-04-29  3:48 ` [PATCH v4 1/3] spapr: nvdimm: Forward declare and move the definitions Shivaprasad G Bhat
@ 2021-05-03 18:23   ` Eric Blake
  2021-05-04  1:21     ` David Gibson
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Blake @ 2021-05-03 18:23 UTC (permalink / raw)
  To: Shivaprasad G Bhat, david, groug, qemu-ppc, ehabkost,
	marcel.apfelbaum, mst, imammedo, xiaoguangrong.eric,
	peter.maydell, qemu-arm, richard.henderson, pbonzini, stefanha,
	haozhong.zhang, shameerali.kolothum.thodi, kwangwoo.lee, armbru
  Cc: linux-nvdimm, aneesh.kumar, qemu-devel, kvm-ppc, shivaprasadbhat,
	bharata

On 4/28/21 10:48 PM, Shivaprasad G Bhat wrote:
> The subsequent patches add definitions which tend to
> get the compilation to cyclic dependency. So, prepare
> with forward declarations, move the defitions and clean up.

definitions

> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
>  hw/ppc/spapr_nvdimm.c         |   12 ++++++++++++
>  include/hw/ppc/spapr_nvdimm.h |   14 ++------------
>  2 files changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index b46c36917c..8cf3fb2ffb 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -31,6 +31,18 @@
>  #include "qemu/range.h"
>  #include "hw/ppc/spapr_numa.h"
>  
> +/*
> + * The nvdimm size should be aligned to SCM block size.
> + * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
> + * inorder to have SCM regions not to overlap with dimm memory regions.

And while at it, even though it is code motion...

> + * The SCM devices can have variable block sizes. For now, fixing the
> + * block size to the minimum value.
> + */
> +#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE

> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -11,19 +11,9 @@
>  #define HW_SPAPR_NVDIMM_H
>  
>  #include "hw/mem/nvdimm.h"
> -#include "hw/ppc/spapr.h"
>  
> -/*
> - * The nvdimm size should be aligned to SCM block size.
> - * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
> - * inorder to have SCM regions not to overlap with dimm memory regions.

... this should be "in order"

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] nvdimm: Enable sync-dax device property for nvdimm
  2021-04-29  3:49 ` [PATCH v4 3/3] nvdimm: Enable sync-dax device property for nvdimm Shivaprasad G Bhat
@ 2021-05-03 18:27   ` Eric Blake
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Blake @ 2021-05-03 18:27 UTC (permalink / raw)
  To: Shivaprasad G Bhat, david, groug, qemu-ppc, ehabkost,
	marcel.apfelbaum, mst, imammedo, xiaoguangrong.eric,
	peter.maydell, qemu-arm, richard.henderson, pbonzini, stefanha,
	haozhong.zhang, shameerali.kolothum.thodi, kwangwoo.lee, armbru
  Cc: linux-nvdimm, aneesh.kumar, qemu-devel, kvm-ppc, shivaprasadbhat,
	bharata

On 4/28/21 10:49 PM, Shivaprasad G Bhat wrote:
> The patch adds the 'sync-dax' property to the nvdimm device.
> 
> When the sync-dax is 'direct' indicates the backend is synchronous DAX
> capable and no explicit flush requests are required. When the mode is
> set to 'writeback' it indicates the backend is not synhronous DAX

synchronous

> capable and explicit flushes to Hypervisor are required.
> 
> On PPC where the flush requests from guest can be honoured by the qemu,

s/the qemu/qemu/

> the 'writeback' mode is supported and set as the default. The device
> tree property "hcall-flush-required" is added to the nvdimm node which
> makes the guest to issue H_SCM_FLUSH hcalls to request for flushes

s/to issue/issue/
s/request for/request/

> explicitly. This would be the default behaviour without sync-dax
> property set for the nvdimm device. For old pSeries machine, the
> default is 'unsafe'.
> 
> For non-PPC platforms, the mode is set to 'unsafe' as the default.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---

> +++ b/qapi/common.json
> @@ -197,3 +197,23 @@
>  { 'enum': 'GrabToggleKeys',
>    'data': [ 'ctrl-ctrl', 'alt-alt', 'shift-shift','meta-meta', 'scrolllock',
>              'ctrl-scrolllock' ] }
> +
> +##
> +# @NvdimmSyncModes:
> +#
> +# Indicates the mode of flush to be used to ensure persistence in case
> +# of power failures.
> +#
> +# @unsafe: This is to indicate, the data on the backend device not be
> +#          consistent in power failure scenarios.

s/This is to indicate, the/This indicates that/
s/device not/device might not/

> +# @direct: This is to indicate the backend device supports synchronous DAX
> +#          and no explicit flush requests from the guest is required.

This indicates the backend device supports synchronous DAX, and no
explicit flush requests from the guest are required.

> +# @writeback: To be used when the backend device doesn't support synchronous
> +#             DAX. The hypervisor issues flushes to the disk when requested
> +#             by the guest.
> +# Since: 6.0

6.1

> +#
> +##
> +{ 'enum': 'NvdimmSyncModes',
> +  'data': [ 'unsafe', 'writeback',
> +            { 'name': 'direct', 'if': 'defined(CONFIG_LIBPMEM)' } ] }
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-05-03 14:05   ` Shivaprasad G Bhat
@ 2021-05-03 19:41     ` Dan Williams
  2021-05-04  4:59       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2021-05-03 19:41 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: Peter Maydell, Michael S. Tsirkin, Qemu Developers, linux-nvdimm,
	Aneesh Kumar K.V, Markus Armbruster, bharata, Haozhong Zhang,
	Eduardo Habkost, richard.henderson, Greg Kurz, kvm-ppc, qemu-arm,
	Stefan Hajnoczi, Igor Mammedov, kwangwoo.lee, David Gibson,
	Xiao Guangrong, shameerali.kolothum.thodi, shivaprasadbhat,
	qemu-ppc, Paolo Bonzini

On Mon, May 3, 2021 at 7:06 AM Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:
>
>
> On 5/1/21 12:44 AM, Dan Williams wrote:
> > Some corrections to terminology confusion below...
> >
> >
> > On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:
> >> The nvdimm devices are expected to ensure write persistence during power
> >> failure kind of scenarios.
> > No, QEMU is not expected to make that guarantee. QEMU is free to lie
> > to the guest about the persistence guarantees of the guest PMEM
> > ranges. It's more accurate to say that QEMU nvdimm devices can emulate
> > persistent memory and optionally pass through host power-fail
> > persistence guarantees to the guest. The power-fail persistence domain
> > can be one of "cpu_cache", or "memory_controller" if the persistent
> > memory region is "synchronous". If the persistent range is not
> > synchronous, it really isn't "persistent memory"; it's memory mapped
> > storage that needs I/O commands to flush.
>
> Since this is a virtual nvdimm (v-nvdimm) backed by a file, the data is
> completely in the host pagecache, and we need a way to ensure that the
> host pagecache is flushed to the backend. This is analogous to the WPQ
> flush being offloaded to the hypervisor.

No, it isn't analogous. WPQ flush is an optional mechanism to force
data to a higher durability domain. The flush in this interface is
mandatory. It's a different class of device.

The proposal that "sync-dax=unsafe" for non-PPC architectures is a
fundamental misrepresentation of how this is supposed to work. Rather
than make "sync-dax" a first class citizen of the device-description
interface I'm proposing that you make this a separate device-type.
This also solves the problem that "sync-dax" with an implicit
architecture backend assumption is not precise, but a new "non-nvdimm"
device type would make it explicit what the host is advertising to the
guest.

>
>
> Ref: https://github.com/dgibson/qemu/blob/main/docs/nvdimm.txt
>
>
>
> >
> >> The libpmem has architecture specific instructions like dcbf on POWER
> > Which "libpmem" is this? PMDK is a reference library not a PMEM
> > interface... maybe I'm missing what libpmem has to do with QEMU?
>
>
> I was referring to the semantics of flushing pmem cache lines, as in
> PMDK/libpmem.
>
>
> >
> >> to flush the cache data to backend nvdimm device during normal writes
> >> followed by explicit flushes if the backend devices are not synchronous
> >> DAX capable.
> >>
> >> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
> >> and the subsequent flush doesn't traslate to actual flush to the backend
> > s/traslate/translate/
> >
> >> file on the host in case of file backed v-nvdimms. This is addressed by
> >> virtio-pmem in case of x86_64 by making explicit flushes translating to
> >> fsync at qemu.
> > Note that virtio-pmem was a proposal for a specific optimization of
> > allowing guests to share page cache. The virtio-pmem approach is not
> > to be confused with actual persistent memory.
> >
> >> On SPAPR, the issue is addressed by adding a new hcall to
> >> request for an explicit flush from the guest ndctl driver when the backend
> > What is an "ndctl" driver? ndctl is userspace tooling, do you mean the
> > guest pmem driver?
>
>
> Oops, wrong terminology. I was referring to the guest libnvdimm and
> papr_scm kernel modules.
>
>
> >
> >> nvdimm cannot ensure write persistence with dcbf alone. So, the approach
> >> here is to convey when the hcall flush is required in a device tree
> >> property. The guest makes the hcall when the property is found, instead
> >> of relying on dcbf.
> >>
> >> A new device property sync-dax is added to the nvdimm device. When the
> >> sync-dax is 'writeback'(default for PPC), device property
> >> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH
> >> requesting for an explicit flush.
> > I'm not sure "sync-dax" is a suitable name for the property of the
> > guest persistent memory.
>
>
> The sync-dax property translates to the ND_REGION_ASYNC flag being set/unset

Yes, I am aware, but that property alone is not sufficient to identify
the flush mechanism.

>
> for the pmem region, and also to whether the nvdimm_flush callback is
> provided in papr_scm or not. As everything boils down to the synchronous
> nature of the device, I chose sync-dax for the name.
>
>
> >   There is no requirement that the
> > memory-backend file for a guest be a dax-capable file. It's also
> > implementation specific what hypercall needs to be invoked for a given
> > occurrence of "sync-dax". What does that map to on non-PPC platforms
> > for example?
>
>
> The backend file can be dax-capable, which is hinted using "sync-dax=direct".

All memory-mapped files are "dax-capable". "DAX" is an access
mechanism, not a data-integrity contract.

> When the backend is not dax-capable, "sync-dax=writeback" is to be used,

No, the qemu property for this should be a separate device-type.

> so that the guest makes the hcall. On all non-PPC archs, qemu errors
> out on "sync-dax=writeback", stating the lack of support.

There is no "lack of support" to be worried about on other archs if
the interface is explicit about the atypical flush arrangement.

>
>
> >   It seems to me that an "nvdimm" device presents the
> > synchronous usage model and a whole other device type implements an
> > async-hypercall setup that the guest happens to service with its
> > nvdimm stack, but it's not an "nvdimm" anymore at that point.
>
>
> In case the file backing the v-nvdimm is not dax-capable, we need the
> flush semantics on the guest to be mapped to a pagecache flush on the
> host side.
>
>
> >
> >> sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines
> >> prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented
> >> now as the flush semantics are unimplemented.
> > "sync-dax" has no meaning on its own, I think this needs an explicit
> > mechanism to convey both the "not-sync" property *and* the callback
> > method, it shouldn't be inferred by arch type.
>
>
> Yes. On all platforms, "sync-dax=unsafe" means that with a host power
> failure the host pagecache is lost, and subsequently the data written
> by the guest will also be gone. This is the default for non-PPC.

The default to date has been for the guest to trust that an nvdimm is
an nvdimm with no explicit flushing required. It's too late now to
introduce an "unsafe" default.

>
>
> On PPC, the default is "sync-dax=writeback" - so ND_REGION_ASYNC is set
> for the region and the guest makes hcalls to issue fsync on the host.
>
> Are you suggesting that I keep "unsafe" as the default for all
> architectures, including PPC, and a user can set it to "writeback" if
> desired?

No, I am suggesting that "sync-dax" is insufficient to convey this
property. This behavior warrants its own device type, not an ambiguous
property of the memory-backend-file with implicit architecture
assumptions attached.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/3] spapr: nvdimm: Forward declare and move the definitions
  2021-05-03 18:23   ` Eric Blake
@ 2021-05-04  1:21     ` David Gibson
  0 siblings, 0 replies; 19+ messages in thread
From: David Gibson @ 2021-05-04  1:21 UTC (permalink / raw)
  To: Eric Blake
  Cc: peter.maydell, Shivaprasad G Bhat, mst, qemu-devel, linux-nvdimm,
	aneesh.kumar, armbru, bharata, haozhong.zhang, ehabkost,
	richard.henderson, groug, kvm-ppc, qemu-arm, stefanha, imammedo,
	kwangwoo.lee, xiaoguangrong.eric, shameerali.kolothum.thodi,
	shivaprasadbhat, qemu-ppc, pbonzini

On Mon, May 03, 2021 at 01:23:47PM -0500, Eric Blake wrote:
> On 4/28/21 10:48 PM, Shivaprasad G Bhat wrote:
> > The subsequent patches add definitions which tend to
> > get the compilation to cyclic dependency. So, prepare
> > with forward declarations, move the defitions and clean up.
> 
> definitions
> 
> > 
> > Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> > ---
> >  hw/ppc/spapr_nvdimm.c         |   12 ++++++++++++
> >  include/hw/ppc/spapr_nvdimm.h |   14 ++------------
> >  2 files changed, 14 insertions(+), 12 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> > index b46c36917c..8cf3fb2ffb 100644
> > --- a/hw/ppc/spapr_nvdimm.c
> > +++ b/hw/ppc/spapr_nvdimm.c
> > @@ -31,6 +31,18 @@
> >  #include "qemu/range.h"
> >  #include "hw/ppc/spapr_numa.h"
> >  
> > +/*
> > + * The nvdimm size should be aligned to SCM block size.
> > + * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
> > + * inorder to have SCM regions not to overlap with dimm memory regions.
> 
> And while at it, even though it is code motion...

It looks like the patch no longer applies cleanly to ppc-for-6.1, so can
you rebase and fix up Eric's nitpicks at the same time?

> 
> > + * The SCM devices can have variable block sizes. For now, fixing the
> > + * block size to the minimum value.
> > + */
> > +#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
> 
> > +++ b/include/hw/ppc/spapr_nvdimm.h
> > @@ -11,19 +11,9 @@
> >  #define HW_SPAPR_NVDIMM_H
> >  
> >  #include "hw/mem/nvdimm.h"
> > -#include "hw/ppc/spapr.h"
> >  
> > -/*
> > - * The nvdimm size should be aligned to SCM block size.
> > - * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
> > - * inorder to have SCM regions not to overlap with dimm memory regions.
> 
> ... this should be "in order"
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-05-03 19:41     ` Dan Williams
@ 2021-05-04  4:59       ` Aneesh Kumar K.V
  2021-05-04  5:43         ` Pankaj Gupta
  0 siblings, 1 reply; 19+ messages in thread
From: Aneesh Kumar K.V @ 2021-05-04  4:59 UTC (permalink / raw)
  To: Dan Williams, Shivaprasad G Bhat
  Cc: Peter Maydell, Michael S. Tsirkin, Qemu Developers, linux-nvdimm,
	Markus Armbruster, bharata, Haozhong Zhang, Eduardo Habkost,
	richard.henderson, Greg Kurz, kvm-ppc, qemu-arm, Stefan Hajnoczi,
	Igor Mammedov, kwangwoo.lee, David Gibson, Xiao Guangrong,
	shameerali.kolothum.thodi, shivaprasadbhat, qemu-ppc,
	Paolo Bonzini

On 5/4/21 1:11 AM, Dan Williams wrote:
> On Mon, May 3, 2021 at 7:06 AM Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:
>>


.....

> 
> The proposal that "sync-dax=unsafe" for non-PPC architectures is a
> fundamental misrepresentation of how this is supposed to work. Rather
> than make "sync-dax" a first class citizen of the device-description
> interface I'm proposing that you make this a separate device-type.
> This also solves the problem that "sync-dax" with an implicit
> architecture backend assumption is not precise, but a new "non-nvdimm"
> device type would make it explicit what the host is advertising to the
> guest.
> 

Currently, users can use the virtualized nvdimm support in Qemu to share
the host page cache with the guest via the below command:

-object memory-backend-file,id=memnvdimm1,mem-path=file_name_in_host_fs
-device nvdimm,memdev=memnvdimm1

Such usage can result in wrong application behavior because there is no
hint to the application/guest OS that a cpu cache flush is not
sufficient to ensure persistence.
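
As an illustration, an application that wants true synchronous-fault
semantics has to ask for them explicitly; a minimal sketch (hypothetical
snippet, not the actual test program):

    /* Request synchronous page faults: a successful MAP_SYNC mapping
     * guarantees that flushing CPU caches is enough for persistence.
     * Where the kernel cannot honour that, mmap() fails with
     * EOPNOTSUPP ("Operation not supported"). */
    #define _GNU_SOURCE
    #include <sys/mman.h>

    static void *map_sync(int fd, size_t len)
    {
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    }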

I understand that virtio-pmem is suggested as an alternative for that.
But why not fix the virtualized nvdimm if platforms can express the details?

i.e., can ACPI indicate to the guest OS that the device needs a flush
mechanism to ensure persistence in the above case?

What this patch series did was to express that property via a device
tree node, with the guest driver enabling a hypercall-based flush
mechanism to ensure persistence.
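
For illustration, on the QEMU side such a hint boils down to one extra
property in the NVDIMM's device tree node; a minimal sketch (property
name as per the cover letter, the surrounding code is illustrative):

    /* Advertise that CPU cache flushes alone are insufficient and the
     * guest must issue H_SCM_FLUSH hcalls to make writes persistent. */
    _FDT(fdt_setprop(fdt, child_offset, "hcall-flush-required", NULL, 0));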


....

>>
>> On PPC, the default is "sync-dax=writeback" - so ND_REGION_ASYNC is set
>> for the region and the guest makes hcalls to issue fsync on the host.
>>
>> Are you suggesting that I keep "unsafe" as the default for all
>> architectures, including PPC, and a user can set it to "writeback" if
>> desired?
> 
> No, I am suggesting that "sync-dax" is insufficient to convey this
> property. This behavior warrants its own device type, not an ambiguous
> property of the memory-backend-file with implicit architecture
> assumptions attached.
> 

Why is it insufficient? Is it because other architectures don't have the
ability to express this detail to the guest OS? Isn't that an arch limitation?

-aneesh


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-05-04  4:59       ` Aneesh Kumar K.V
@ 2021-05-04  5:43         ` Pankaj Gupta
  2021-05-04  9:02           ` Aneesh Kumar K.V
  0 siblings, 1 reply; 19+ messages in thread
From: Pankaj Gupta @ 2021-05-04  5:43 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Peter Maydell, Shivaprasad G Bhat, Michael S. Tsirkin,
	Qemu Developers, linux-nvdimm, Markus Armbruster, bharata,
	Haozhong Zhang, Eduardo Habkost, richard.henderson, Greg Kurz,
	kvm-ppc, qemu-arm, Stefan Hajnoczi, Igor Mammedov, Dan Williams,
	kwangwoo.lee, David Gibson, Xiao Guangrong,
	shameerali.kolothum.thodi, shivaprasadbhat, qemu-ppc,
	Paolo Bonzini

> > The proposal that "sync-dax=unsafe" for non-PPC architectures is a
> > fundamental misrepresentation of how this is supposed to work. Rather
> > than make "sync-dax" a first class citizen of the device-description
> > interface I'm proposing that you make this a separate device-type.
> > This also solves the problem that "sync-dax" with an implicit
> > architecture backend assumption is not precise, but a new "non-nvdimm"
> > device type would make it explicit what the host is advertising to the
> > guest.
> >
>
> Currently, users can use the virtualized nvdimm support in Qemu to share
> the host page cache with the guest via the below command:
>
> -object memory-backend-file,id=memnvdimm1,mem-path=file_name_in_host_fs
> -device nvdimm,memdev=memnvdimm1
>
> Such usage can result in wrong application behavior because there is no
> hint to the application/guest OS that a cpu cache flush is not
> sufficient to ensure persistence.
>
> I understand that virtio-pmem is suggested as an alternative for that.
> But why not fix the virtualized nvdimm if platforms can express the details?
>
> i.e., can ACPI indicate to the guest OS that the device needs a flush
> mechanism to ensure persistence in the above case?
>
> What this patch series did was to express that property via a device
> tree node, with the guest driver enabling a hypercall-based flush
> mechanism to ensure persistence.

Would a VIRTIO-based mechanism (entirely asynchronous, no trap at the
host side) be better than a hypercall-based one? Registering memory can
be done either way. We implemented the virtio-pmem flush mechanism with
the below considerations:

- Proper semantics for guest flush requests.
- An efficient mechanism from a performance point of view.

I am just asking myself if we have a platform-agnostic mechanism already
there; maybe we can extend it to suit our needs? Maybe I am missing some
points here.
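
For reference, the existing virtio-pmem arrangement looks like the
below (a sketch; share=on so that the guest mapping and the host
pagecache are the same pages):

-object memory-backend-file,id=mem1,share=on,mem-path=/path/to/file,size=4G
-device virtio-pmem-pci,memdev=mem1,id=nv1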

> >> On PPC, the default is "sync-dax=writeback" - so ND_REGION_ASYNC is set
> >> for the region and the guest makes hcalls to issue fsync on the host.
> >>
> >> Are you suggesting that I keep "unsafe" as the default for all
> >> architectures, including PPC, and a user can set it to "writeback" if
> >> desired?
> >
> > No, I am suggesting that "sync-dax" is insufficient to convey this
> > property. This behavior warrants its own device type, not an ambiguous
> > property of the memory-backend-file with implicit architecture
> > assumptions attached.
> >
>
> Why is it insufficient? Is it because other architectures don't have the
> ability to express this detail to the guest OS? Isn't that an arch limitation?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-05-04  5:43         ` Pankaj Gupta
@ 2021-05-04  9:02           ` Aneesh Kumar K.V
  2021-05-05  0:12             ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Aneesh Kumar K.V @ 2021-05-04  9:02 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Peter Maydell, Shivaprasad G Bhat, Michael S. Tsirkin,
	Qemu Developers, linux-nvdimm, Markus Armbruster, bharata,
	Haozhong Zhang, Eduardo Habkost, richard.henderson, Greg Kurz,
	kvm-ppc, qemu-arm, Stefan Hajnoczi, Igor Mammedov, Dan Williams,
	kwangwoo.lee, David Gibson, Xiao Guangrong,
	shameerali.kolothum.thodi, shivaprasadbhat, qemu-ppc,
	Paolo Bonzini

On 5/4/21 11:13 AM, Pankaj Gupta wrote:
....

>>
>> What this patch series did was to express that property via a device
>> tree node, with the guest driver enabling a hypercall-based flush
>> mechanism to ensure persistence.
> 
> Would a VIRTIO-based mechanism (entirely asynchronous, no trap at the
> host side) be better than a hypercall-based one? Registering memory can
> be done either way. We implemented the virtio-pmem flush mechanism with
> the below considerations:
>
> - Proper semantics for guest flush requests.
> - An efficient mechanism from a performance point of view.
> 

Sure, virtio-pmem can be used as an alternative.

> I am just asking myself if we have a platform-agnostic mechanism already
> there; maybe we can extend it to suit our needs? Maybe I am missing some
> points here.
> 

What is being attempted in this series is to indicate to the guest OS
that the backing device/file used for the emulated nvdimm device cannot
guarantee persistence via cpu cache flush instructions.


>>>> On PPC, the default is "sync-dax=writeback" - so ND_REGION_ASYNC is set
>>>> for the region and the guest makes hcalls to issue fsync on the host.
>>>>
>>>> Are you suggesting that I keep "unsafe" as the default for all
>>>> architectures, including PPC, and a user can set it to "writeback" if
>>>> desired?
>>>
>>> No, I am suggesting that "sync-dax" is insufficient to convey this
>>> property. This behavior warrants its own device type, not an ambiguous
>>> property of the memory-backend-file with implicit architecture
>>> assumptions attached.
>>>
>>
>> Why is it insufficient? Is it because other architectures don't have the
>> ability to express this detail to the guest OS? Isn't that an arch limitation?

-aneesh


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
  2021-05-04  9:02           ` Aneesh Kumar K.V
@ 2021-05-05  0:12             ` Dan Williams
  0 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2021-05-05  0:12 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Peter Maydell, Shivaprasad G Bhat, Michael S. Tsirkin,
	Qemu Developers, linux-nvdimm, Markus Armbruster, bharata,
	Haozhong Zhang, Eduardo Habkost, richard.henderson, Greg Kurz,
	kvm-ppc, qemu-arm, Stefan Hajnoczi, Igor Mammedov, kwangwoo.lee,
	David Gibson, Pankaj Gupta, Xiao Guangrong,
	shameerali.kolothum.thodi, shivaprasadbhat, qemu-ppc,
	Paolo Bonzini

On Tue, May 4, 2021 at 2:03 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 5/4/21 11:13 AM, Pankaj Gupta wrote:
> ....
>
> >>
> >> What this patch series did was to express that property via a device
> >> tree node, with the guest driver enabling a hypercall-based flush
> >> mechanism to ensure persistence.
> >
> > Would a VIRTIO-based mechanism (entirely asynchronous, no trap at the
> > host side) be better than a hypercall-based one? Registering memory can
> > be done either way. We implemented the virtio-pmem flush mechanism with
> > the below considerations:
> >
> > - Proper semantics for guest flush requests.
> > - An efficient mechanism from a performance point of view.
> >
>
> Sure, virtio-pmem can be used as an alternative.
>
> > I am just asking myself if we have a platform-agnostic mechanism already
> > there; maybe we can extend it to suit our needs? Maybe I am missing some
> > points here.
> >
>
> What is being attempted in this series is to indicate to the guest OS
> that the backing device/file used for the emulated nvdimm device cannot
> guarantee persistence via cpu cache flush instructions.

Right, the detail I am arguing is that it should be a device
description, not a backend-file property. In other words, this proposal
is advocating this:

-object memory-backend-file,id=mem1,share,sync-dax=$sync-dax,mem-path=$path
-device nvdimm,memdev=mem1

...and I am advocating for reusing or duplicating the virtio-pmem
model like this:

-object memory-backend-file,id=mem1,share,mem-path=$path
-device spapr-hcall,memdev=mem1

...because the interface to the guest is the device, not the
memory-backend-file.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-05-05  0:14 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-29  3:48 [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm Shivaprasad G Bhat
2021-04-29  3:48 ` [PATCH v4 1/3] spapr: nvdimm: Forward declare and move the definitions Shivaprasad G Bhat
2021-05-03 18:23   ` Eric Blake
2021-05-04  1:21     ` David Gibson
2021-04-29  3:48 ` [PATCH v4 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall Shivaprasad G Bhat
2021-04-29  3:49 ` [PATCH v4 3/3] nvdimm: Enable sync-dax device property for nvdimm Shivaprasad G Bhat
2021-05-03 18:27   ` Eric Blake
2021-04-29 15:55 ` [PATCH v4 0/3] nvdimm: Enable sync-dax " Stefan Hajnoczi
2021-04-29 16:32   ` Aneesh Kumar K.V
2021-04-30  4:27     ` David Gibson
2021-04-30 15:08       ` Stefan Hajnoczi
2021-04-30 19:14 ` Dan Williams
2021-05-01 13:55   ` Aneesh Kumar K.V
2021-05-03 14:05   ` Shivaprasad G Bhat
2021-05-03 19:41     ` Dan Williams
2021-05-04  4:59       ` Aneesh Kumar K.V
2021-05-04  5:43         ` Pankaj Gupta
2021-05-04  9:02           ` Aneesh Kumar K.V
2021-05-05  0:12             ` Dan Williams
