* [Qemu-devel] [PATCH v2 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory
@ 2018-02-07  7:33 Haozhong Zhang
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 1/8] memory, exec: switch file ram allocation functions to 'flags' parameters Haozhong Zhang
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07  7:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Stefan Hajnoczi,
	Dan Williams, Haozhong Zhang

This v2 patch series extends v1 [1] by covering the migration path as
well.

QEMU writes to vNVDIMM backends during vNVDIMM label emulation and
live migration. If the backend is on persistent memory, QEMU needs
to take proper operations to ensure that its writes persist on the
persistent memory. Otherwise, a host power failure may result in the
loss of guest data on the persistent memory.

This patch series is based on Marcel's patch "mem: add share parameter
to memory-backend-ram" [2] because of the changes in patch 1.

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg05040.html
[2] http://lists.gnu.org/archive/html/qemu-devel/2018-02/msg00768.html

Changes in v2:
 * (Patch 1) Use a flags parameter in file ram allocation functions.
 * (Patch 2) Add a new option 'pmem' to hostmem-file.
 * (Patch 3) Use libpmem to operate on the persistent memory, rather
   than re-implementing those operations in QEMU.
 * (Patch 5-8) Consider the write persistence in the migration path.

Haozhong Zhang (8):
 [1/8] memory, exec: switch file ram allocation functions to 'flags' parameters
 [2/8] hostmem-file: add the 'pmem' option
 [3/8] configure: add libpmem support
 [4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation
 [5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
 [6/8] migration/ram: ensure write persistence on loading normal pages to PMEM
 [7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
 [8/8] migration/ram: ensure write persistence on loading xbzrle pages to PMEM

 backends/hostmem-file.c             | 27 +++++++++++++-
 configure                           | 35 ++++++++++++++++++
 docs/nvdimm.txt                     | 14 ++++++++
 exec.c                              | 23 +++++++++---
 hw/mem/nvdimm.c                     |  9 ++++-
 include/exec/memory.h               | 12 +++++--
 include/exec/ram_addr.h             | 28 +++++++++++++--
 include/migration/qemu-file-types.h |  1 +
 include/qemu/pmem.h                 | 50 ++++++++++++++++++++++++++
 memory.c                            |  8 +++--
 migration/qemu-file.c               | 41 +++++++++++++++------
 migration/ram.c                     | 71 ++++++++++++++++++++++++++++---------
 migration/xbzrle.c                  | 20 +++++++++--
 migration/xbzrle.h                  |  1 +
 numa.c                              |  2 +-
 qemu-options.hx                     |  9 ++++-
 16 files changed, 308 insertions(+), 43 deletions(-)
 create mode 100644 include/qemu/pmem.h

-- 
2.14.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v2 1/8] memory, exec: switch file ram allocation functions to 'flags' parameters
  2018-02-07  7:33 [Qemu-devel] [PATCH v2 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory Haozhong Zhang
@ 2018-02-07  7:33 ` Haozhong Zhang
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 2/8] hostmem-file: add the 'pmem' option Haozhong Zhang
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07  7:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Stefan Hajnoczi,
	Dan Williams, Haozhong Zhang

As more flag parameters besides the existing 'share' are going to be
added to the following functions
    memory_region_init_ram_from_file
    qemu_ram_alloc_from_fd
    qemu_ram_alloc_from_file
let's switch them to a 'flags' parameter so as to ease future
flag additions.

The existing 'share' flag is converted to the QEMU_RAM_SHARE bit in
flags; other flag bits are ignored by the above functions for now.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 backends/hostmem-file.c |  3 ++-
 exec.c                  |  7 ++++---
 include/exec/memory.h   | 10 ++++++++--
 include/exec/ram_addr.h | 25 +++++++++++++++++++++++--
 memory.c                |  8 +++++---
 numa.c                  |  2 +-
 6 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 134b08d63a..30df843d90 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -58,7 +58,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
         path = object_get_canonical_path(OBJECT(backend));
         memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
                                  path,
-                                 backend->size, fb->align, backend->share,
+                                 backend->size, fb->align,
+                                 backend->share ? QEMU_RAM_SHARE : 0,
                                  fb->mem_path, errp);
         g_free(path);
     }
diff --git a/exec.c b/exec.c
index 5e56efefeb..16b373a86b 100644
--- a/exec.c
+++ b/exec.c
@@ -2000,12 +2000,13 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
 
 #ifdef __linux__
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
-                                 bool share, int fd,
+                                 uint64_t flags, int fd,
                                  Error **errp)
 {
     RAMBlock *new_block;
     Error *local_err = NULL;
     int64_t file_size;
+    bool share = flags & QEMU_RAM_SHARE;
 
     if (xen_enabled()) {
         error_setg(errp, "-mem-path not supported with Xen");
@@ -2061,7 +2062,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
 
 
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
-                                   bool share, const char *mem_path,
+                                   uint64_t flags, const char *mem_path,
                                    Error **errp)
 {
     int fd;
@@ -2073,7 +2074,7 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
-    block = qemu_ram_alloc_from_fd(size, mr, share, fd, errp);
+    block = qemu_ram_alloc_from_fd(size, mr, flags, fd, errp);
     if (!block) {
         if (created) {
             unlink(mem_path);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 1b02bbd334..d87258b6ae 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -479,6 +479,9 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
                                                        void *host),
                                        Error **errp);
 #ifdef __linux__
+
+#define QEMU_RAM_SHARE      (1UL << 0)
+
 /**
  * memory_region_init_ram_from_file:  Initialize RAM memory region with a
  *                                    mmap-ed backend.
@@ -490,7 +493,10 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  *         (getpagesize()) will be used.
- * @share: %true if memory must be mmaped with the MAP_SHARED flag
+ * @flags: specify the properties of this memory region, which can be a bit-or
+ *         of the following values:
+ *         - QEMU_RAM_SHARE: memory must be mmaped with the MAP_SHARED flag
+ *         Other bits are ignored.
  * @path: the path in which to allocate the RAM.
  * @errp: pointer to Error*, to store an error if it happens.
  *
@@ -502,7 +508,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
                                       const char *name,
                                       uint64_t size,
                                       uint64_t align,
-                                      bool share,
+                                      uint64_t flags,
                                       const char *path,
                                       Error **errp);
 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index cf2446a176..b8b01d1eb9 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -72,12 +72,33 @@ static inline unsigned long int ramblock_recv_bitmap_offset(void *host_addr,
 
 long qemu_getrampagesize(void);
 unsigned long last_ram_page(void);
+
+/**
+ * qemu_ram_alloc_from_file,
+ * qemu_ram_alloc_from_fd:  Allocate a ram block from the specified backing
+ *                          file or device.
+ *
+ * Parameters:
+ *  @size: the size in bytes of the ram block
+ *  @mr: the memory region where the ram block is
+ *  @flags: specify the properties of the ram block, which can be a
+ *          bit-or of the following values
+ *          - QEMU_RAM_SHARE: mmap the backing file or device with MAP_SHARED
+ *          Other bits are ignored.
+ *  @mem_path or @fd: specify the backing file or device
+ *  @errp: pointer to Error*, to store an error if it happens
+ *
+ * Return:
+ *  On success, return a pointer to the ram block.
+ *  On failure, return NULL.
+ */
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
-                                   bool share, const char *mem_path,
+                                   uint64_t flags, const char *mem_path,
                                    Error **errp);
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
-                                 bool share, int fd,
+                                 uint64_t flags, int fd,
                                  Error **errp);
+
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                   MemoryRegion *mr, Error **errp);
 RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share, MemoryRegion *mr,
diff --git a/memory.c b/memory.c
index 3211fdc15f..229f64b24b 100644
--- a/memory.c
+++ b/memory.c
@@ -1581,7 +1581,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
                                       const char *name,
                                       uint64_t size,
                                       uint64_t align,
-                                      bool share,
+                                      uint64_t flags,
                                       const char *path,
                                       Error **errp)
 {
@@ -1590,7 +1590,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
     mr->align = align;
-    mr->ram_block = qemu_ram_alloc_from_file(size, mr, share, path, errp);
+    mr->ram_block = qemu_ram_alloc_from_file(size, mr, flags, path, errp);
     mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
 }
 
@@ -1606,7 +1606,9 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc_from_fd(size, mr, share, fd, errp);
+    mr->ram_block = qemu_ram_alloc_from_fd(size, mr,
+                                           share ? QEMU_RAM_SHARE : 0,
+                                           fd, errp);
     mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
 }
 #endif
diff --git a/numa.c b/numa.c
index 83675a03f3..fa202a376d 100644
--- a/numa.c
+++ b/numa.c
@@ -456,7 +456,7 @@ static void allocate_system_memory_nonnuma(MemoryRegion *mr, Object *owner,
     if (mem_path) {
 #ifdef __linux__
         Error *err = NULL;
-        memory_region_init_ram_from_file(mr, owner, name, ram_size, 0, false,
+        memory_region_init_ram_from_file(mr, owner, name, ram_size, 0, 0,
                                          mem_path, &err);
         if (err) {
             error_report_err(err);
-- 
2.14.1


* [Qemu-devel] [PATCH v2 2/8] hostmem-file: add the 'pmem' option
  2018-02-07  7:33 [Qemu-devel] [PATCH v2 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory Haozhong Zhang
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 1/8] memory, exec: switch file ram allocation functions to 'flags' parameters Haozhong Zhang
@ 2018-02-07  7:33 ` Haozhong Zhang
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 3/8] configure: add libpmem support Haozhong Zhang
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07  7:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Stefan Hajnoczi,
	Dan Williams, Haozhong Zhang

When QEMU emulates vNVDIMM labels and migrates vNVDIMM devices, it
needs to know whether the backend storage is real persistent memory,
in order to decide whether special operations should be performed to
ensure data persistence.

The new boolean option 'pmem' lets users specify whether the backend
storage of memory-backend-file is real persistent memory. If
'pmem=on', QEMU sets the RAM_PMEM flag in the RAM block of the
corresponding memory region.
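For reference, a guest vNVDIMM backed by a host DAX file could then be
configured as below (paths, IDs, and sizes are hypothetical; apart
from 'pmem=on', the syntax follows docs/nvdimm.txt):

```shell
qemu-system-x86_64 -machine pc,nvdimm -m 4G,slots=2,maxmem=8G \
    -object memory-backend-file,id=mem1,share=on,mem-path=/mnt/pmem0/guest.img,size=2G,pmem=on \
    -device nvdimm,id=nvdimm1,memdev=mem1
```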

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 backends/hostmem-file.c | 26 +++++++++++++++++++++++++-
 docs/nvdimm.txt         | 14 ++++++++++++++
 exec.c                  | 16 +++++++++++++++-
 include/exec/memory.h   |  2 ++
 include/exec/ram_addr.h |  3 +++
 qemu-options.hx         |  9 ++++++++-
 6 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 30df843d90..5d706d471f 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -34,6 +34,7 @@ struct HostMemoryBackendFile {
     bool discard_data;
     char *mem_path;
     uint64_t align;
+    bool is_pmem;
 };
 
 static void
@@ -59,7 +60,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
         memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
                                  path,
                                  backend->size, fb->align,
-                                 backend->share ? QEMU_RAM_SHARE : 0,
+                                 (backend->share ? QEMU_RAM_SHARE : 0) |
+                                 (fb->is_pmem ? QEMU_RAM_PMEM : 0),
                                  fb->mem_path, errp);
         g_free(path);
     }
@@ -131,6 +133,25 @@ static void file_memory_backend_set_align(Object *o, Visitor *v,
     error_propagate(errp, local_err);
 }
 
+static bool file_memory_backend_get_pmem(Object *o, Error **errp)
+{
+    return MEMORY_BACKEND_FILE(o)->is_pmem;
+}
+
+static void file_memory_backend_set_pmem(Object *o, bool value, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
+
+    if (host_memory_backend_mr_inited(backend)) {
+        error_setg(errp, "cannot change property 'pmem' of %s '%s'",
+                   object_get_typename(o), backend->id);
+        return;
+    }
+
+    fb->is_pmem = value;
+}
+
 static void file_backend_unparent(Object *obj)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -162,6 +183,9 @@ file_backend_class_init(ObjectClass *oc, void *data)
         file_memory_backend_get_align,
         file_memory_backend_set_align,
         NULL, NULL, &error_abort);
+    object_class_property_add_bool(oc, "pmem",
+        file_memory_backend_get_pmem, file_memory_backend_set_pmem,
+        &error_abort);
 }
 
 static void file_backend_instance_finalize(Object *o)
diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
index e903d8bb09..bcb2032672 100644
--- a/docs/nvdimm.txt
+++ b/docs/nvdimm.txt
 guest NVDIMM region mapping structure.  This unarmed flag indicates to
 guest software that this vNVDIMM device contains a region that cannot
 accept persistent writes. As a result, the guest Linux NVDIMM
 driver, for example, marks such a vNVDIMM device as read-only.
+
+If the vNVDIMM backend is on host persistent memory that can be
+accessed in the SNIA NVM Programming Model [1] (e.g., Intel NVDIMM),
+it is recommended to set the 'pmem' option of memory-backend-file to
+'on'. When 'pmem=on' and QEMU is built with libpmem [2] support
+(configured with --enable-libpmem), QEMU will take the operations
+necessary to guarantee the persistence of its own writes to the
+vNVDIMM backend (e.g., in vNVDIMM label emulation and live migration).
+
+References
+----------
+
+[1] SNIA NVM Programming Model: https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
+[2] PMDK: http://pmem.io/pmdk/
diff --git a/exec.c b/exec.c
index 16b373a86b..1d83441afe 100644
--- a/exec.c
+++ b/exec.c
@@ -99,6 +99,9 @@ static MemoryRegion io_mem_unassigned;
  */
 #define RAM_RESIZEABLE (1 << 2)
 
+/* RAM is backed by the persistent memory. */
+#define RAM_PMEM       (1 << 3)
+
 #endif
 
 #ifdef TARGET_PAGE_BITS_VARY
@@ -2007,6 +2010,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     Error *local_err = NULL;
     int64_t file_size;
     bool share = flags & QEMU_RAM_SHARE;
+    bool is_pmem = flags & QEMU_RAM_PMEM;
 
     if (xen_enabled()) {
         error_setg(errp, "-mem-path not supported with Xen");
@@ -2043,7 +2047,8 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     new_block->mr = mr;
     new_block->used_length = size;
     new_block->max_length = size;
-    new_block->flags = share ? RAM_SHARED : 0;
+    new_block->flags = (share ? RAM_SHARED : 0) |
+                       (is_pmem ? RAM_PMEM : 0);
     new_block->host = file_ram_alloc(new_block, size, fd, !file_size, errp);
     if (!new_block->host) {
         g_free(new_block);
@@ -3847,3 +3852,12 @@ void mtree_print_dispatch(fprintf_function mon, void *f,
 }
 
 #endif
+
+bool ramblock_is_pmem(RAMBlock *rb)
+{
+#if !defined(CONFIG_USER_ONLY)
+    return rb->flags & RAM_PMEM;
+#else
+    return false;
+#endif
+}
diff --git a/include/exec/memory.h b/include/exec/memory.h
index d87258b6ae..018334312a 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -481,6 +481,7 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
 #ifdef __linux__
 
 #define QEMU_RAM_SHARE      (1UL << 0)
+#define QEMU_RAM_PMEM       (1UL << 1)
 
 /**
  * memory_region_init_ram_from_file:  Initialize RAM memory region with a
@@ -496,6 +497,7 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
 * @flags: specify the properties of this memory region, which can be a bit-or
 *         of the following values:
  *         - QEMU_RAM_SHARE: memory must be mmaped with the MAP_SHARED flag
+ *         - QEMU_RAM_PMEM: the backend @path is persistent memory
  *         Other bits are ignored.
  * @path: the path in which to allocate the RAM.
  * @errp: pointer to Error*, to store an error if it happens.
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index b8b01d1eb9..f8d8614e4d 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -70,6 +70,8 @@ static inline unsigned long int ramblock_recv_bitmap_offset(void *host_addr,
     return host_addr_offset >> TARGET_PAGE_BITS;
 }
 
+bool ramblock_is_pmem(RAMBlock *rb);
+
 long qemu_getrampagesize(void);
 unsigned long last_ram_page(void);
 
@@ -84,6 +86,7 @@ unsigned long last_ram_page(void);
 *  @flags: specify the properties of the ram block, which can be a
 *          bit-or of the following values
 *          - QEMU_RAM_SHARE: mmap the backing file or device with MAP_SHARED
+ *          - QEMU_RAM_PMEM: the backend @mem_path or @fd is persistent memory
 *          Other bits are ignored.
 *  @mem_path or @fd: specify the backing file or device
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/qemu-options.hx b/qemu-options.hx
index 61497ce136..12aa842ab9 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -3957,7 +3957,7 @@ property must be set.  These objects are placed in the
 
 @table @option
 
-@item -object memory-backend-file,id=@var{id},size=@var{size},mem-path=@var{dir},share=@var{on|off},discard-data=@var{on|off},merge=@var{on|off},dump=@var{on|off},prealloc=@var{on|off},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave},align=@var{align}
+@item -object memory-backend-file,id=@var{id},size=@var{size},mem-path=@var{dir},share=@var{on|off},discard-data=@var{on|off},merge=@var{on|off},dump=@var{on|off},prealloc=@var{on|off},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave},align=@var{align},pmem=@var{on|off}
 
 Creates a memory file backend object, which can be used to back
 the guest RAM with huge pages.
@@ -4025,6 +4025,13 @@ requires an alignment different than the default one used by QEMU, eg
 the device DAX /dev/dax0.0 requires 2M alignment rather than 4K. In
 such cases, users can specify the required alignment via this option.
 
+The @option{pmem} option specifies whether the backing store specified
+by @option{mem-path} is on persistent memory that can be accessed
+in the SNIA NVM Programming Model (e.g. Intel NVDIMM).
+If @option{pmem}=@var{on}, QEMU will take the operations necessary to
+guarantee the persistence of its own writes to @option{mem-path}
+(e.g. in vNVDIMM label emulation and live migration).
+
 @item -object memory-backend-ram,id=@var{id},merge=@var{on|off},dump=@var{on|off},share=@var{on|off},prealloc=@var{on|off},size=@var{size},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave}
 
 Creates a memory backend object, which can be used to back the guest RAM.
-- 
2.14.1


* [Qemu-devel] [PATCH v2 3/8] configure: add libpmem support
  2018-02-07  7:33 [Qemu-devel] [PATCH v2 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory Haozhong Zhang
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 1/8] memory, exec: switch file ram allocation functions to 'flags' parameters Haozhong Zhang
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 2/8] hostmem-file: add the 'pmem' option Haozhong Zhang
@ 2018-02-07  7:33 ` Haozhong Zhang
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation Haozhong Zhang
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07  7:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Stefan Hajnoczi,
	Dan Williams, Haozhong Zhang

Add a pair of configure options --{enable,disable}-libpmem to control
whether QEMU is compiled with PMDK libpmem [1].

QEMU may write to the host persistent memory (e.g. in vNVDIMM label
emulation and live migration), so it must take the proper operations
to ensure the persistence of its own writes. Depending on the CPU
models and available instructions, the optimal operation can vary [2].
PMDK libpmem has already implemented those operations for multiple CPU
models (x86 and ARM), along with the logic to select the optimal ones,
so QEMU can simply use libpmem rather than re-implementing them.

[1] PMDK (formerly known as NVML), https://github.com/pmem/pmdk/
[2] https://github.com/pmem/pmdk/blob/38bfa652721a37fd94c0130ce0e3f5d8baa3ed40/src/libpmem/pmem.c#L33

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 configure | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/configure b/configure
index 302fdc92ff..595967e5df 100755
--- a/configure
+++ b/configure
@@ -436,6 +436,7 @@ jemalloc="no"
 replication="yes"
 vxhs=""
 libxml2=""
+libpmem=""
 
 supported_cpu="no"
 supported_os="no"
@@ -1341,6 +1342,10 @@ for opt do
   ;;
   --disable-git-update) git_update=no
   ;;
+  --enable-libpmem) libpmem=yes
+  ;;
+  --disable-libpmem) libpmem=no
+  ;;
   *)
       echo "ERROR: unknown option $opt"
       echo "Try '$0 --help' for more information"
@@ -1592,6 +1597,7 @@ disabled with --disable-FEATURE, default is enabled if available:
   crypto-afalg    Linux AF_ALG crypto backend driver
   vhost-user      vhost-user support
   capstone        capstone disassembler support
+  libpmem         libpmem support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -5205,6 +5211,30 @@ if compile_prog "" "" ; then
     have_utmpx=yes
 fi
 
+##########################################
+# check for libpmem
+
+if test "$libpmem" != "no"; then
+  cat > $TMPC <<EOF
+#include <libpmem.h>
+int main(void)
+{
+  pmem_is_pmem(0, 0);
+  return 0;
+}
+EOF
+  libpmem_libs="-lpmem"
+  if compile_prog "" "$libpmem_libs" ; then
+    libs_softmmu="$libpmem_libs $libs_softmmu"
+    libpmem="yes"
+  else
+    if test "$libpmem" = "yes" ; then
+      feature_not_found "libpmem" "Install nvml or pmdk"
+    fi
+    libpmem="no"
+  fi
+fi
+
 ##########################################
 # End of CC checks
 # After here, no more $cc or $ld runs
@@ -5657,6 +5687,7 @@ echo "avx2 optimization $avx2_opt"
 echo "replication support $replication"
 echo "VxHS block device $vxhs"
 echo "capstone          $capstone"
+echo "libpmem support   $libpmem"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -6374,6 +6405,10 @@ if test "$vxhs" = "yes" ; then
   echo "VXHS_LIBS=$vxhs_libs" >> $config_host_mak
 fi
 
+if test "$libpmem" = "yes" ; then
+  echo "CONFIG_LIBPMEM=y" >> $config_host_mak
+fi
+
 if test "$tcg_interpreter" = "yes"; then
   QEMU_INCLUDES="-I\$(SRC_PATH)/tcg/tci $QEMU_INCLUDES"
 elif test "$ARCH" = "sparc64" ; then
-- 
2.14.1


* [Qemu-devel] [PATCH v2 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation
  2018-02-07  7:33 [Qemu-devel] [PATCH v2 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory Haozhong Zhang
                   ` (2 preceding siblings ...)
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 3/8] configure: add libpmem support Haozhong Zhang
@ 2018-02-07  7:33 ` Haozhong Zhang
  2018-02-09 14:27   ` Stefan Hajnoczi
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM Haozhong Zhang
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07  7:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Stefan Hajnoczi,
	Dan Williams, Haozhong Zhang

Guest writes to vNVDIMM labels are intercepted and performed on the
backend by QEMU. When the backend is real persistent memory, QEMU
needs to take proper operations to ensure its writes persist on the
persistent memory. Otherwise, a host power failure may result in the
loss of the guest label configurations.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 hw/mem/nvdimm.c     |  9 ++++++++-
 include/qemu/pmem.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 1 deletion(-)
 create mode 100644 include/qemu/pmem.h

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 61e677f92f..18861d1a7a 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -23,6 +23,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/pmem.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "qapi-visit.h"
@@ -156,11 +157,17 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
 {
     MemoryRegion *mr;
     PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+    bool is_pmem = object_property_get_bool(OBJECT(dimm->hostmem),
+                                            "pmem", NULL);
     uint64_t backend_offset;
 
     nvdimm_validate_rw_label_data(nvdimm, size, offset);
 
-    memcpy(nvdimm->label_data + offset, buf, size);
+    if (!is_pmem) {
+        memcpy(nvdimm->label_data + offset, buf, size);
+    } else {
+        pmem_memcpy_persist(nvdimm->label_data + offset, buf, size);
+    }
 
     mr = host_memory_backend_get_memory(dimm->hostmem, &error_abort);
     backend_offset = memory_region_size(mr) - nvdimm->label_size + offset;
diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
new file mode 100644
index 0000000000..9017596ff0
--- /dev/null
+++ b/include/qemu/pmem.h
@@ -0,0 +1,31 @@
+/*
+ * Stub functions for libpmem.
+ *
+ * Copyright (c) 2018 Intel Corporation.
+ *
+ * Author: Haozhong Zhang <haozhong.zhang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_PMEM_H
+#define QEMU_PMEM_H
+
+#ifdef CONFIG_LIBPMEM
+#include <libpmem.h>
+#else  /* !CONFIG_LIBPMEM */
+
+#include <string.h>
+
+/* Stubs */
+
+static inline void *
+pmem_memcpy_persist(void *pmemdest, const void *src, size_t len)
+{
+    return memcpy(pmemdest, src, len);
+}
+
+#endif /* CONFIG_LIBPMEM */
+
+#endif /* !QEMU_PMEM_H */
-- 
2.14.1


* [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  2018-02-07  7:33 [Qemu-devel] [PATCH v2 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory Haozhong Zhang
                   ` (3 preceding siblings ...)
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation Haozhong Zhang
@ 2018-02-07  7:33 ` Haozhong Zhang
  2018-02-07 10:17   ` Pankaj Gupta
  2018-02-07 11:38   ` Dr. David Alan Gilbert
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 6/8] migration/ram: ensure write persistence on loading normal " Haozhong Zhang
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07  7:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Stefan Hajnoczi,
	Dan Williams, Haozhong Zhang

When loading a zero page, check whether it will be loaded to
persistent memory. If so, load it with the libpmem function
pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
end of RAM loading, this guarantees that all those zero pages are
persistently loaded.

Depending on the host HW/SW configuration, pmem_drain() can be an
"sfence" instruction.  Therefore, we do not call pmem_drain() after
each pmem_memset_nodrain(), nor use pmem_memset_persist() (equivalent
to pmem_memset_nodrain() + pmem_drain()), in order to avoid
unnecessary overhead.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 include/qemu/pmem.h |  9 +++++++++
 migration/ram.c     | 34 +++++++++++++++++++++++++++++-----
 2 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index 9017596ff0..861d8ecc21 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -26,6 +26,15 @@ pmem_memcpy_persist(void *pmemdest, const void *src, size_t len)
     return memcpy(pmemdest, src, len);
 }
 
+static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
+{
+    return memset(pmemdest, c, len);
+}
+
+static inline void pmem_drain(void)
+{
+}
+
 #endif /* CONFIG_LIBPMEM */
 
 #endif /* !QEMU_PMEM_H */
diff --git a/migration/ram.c b/migration/ram.c
index cb1950f3eb..5a0e503818 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -49,6 +49,7 @@
 #include "qemu/rcu_queue.h"
 #include "migration/colo.h"
 #include "migration/block.h"
+#include "qemu/pmem.h"
 
 /***********************************************************/
 /* ram save/restore */
@@ -2467,6 +2468,20 @@ static inline void *host_from_ram_block_offset(RAMBlock *block,
     return block->host + offset;
 }
 
+static void ram_handle_compressed_common(void *host, uint8_t ch, uint64_t size,
+                                         bool is_pmem)
+{
+    if (!ch && is_zero_range(host, size)) {
+        return;
+    }
+
+    if (!is_pmem) {
+        memset(host, ch, size);
+    } else {
+        pmem_memset_nodrain(host, ch, size);
+    }
+}
+
 /**
  * ram_handle_compressed: handle the zero page case
  *
@@ -2479,9 +2494,7 @@ static inline void *host_from_ram_block_offset(RAMBlock *block,
  */
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
 {
-    if (ch != 0 || !is_zero_range(host, size)) {
-        memset(host, ch, size);
-    }
+    return ram_handle_compressed_common(host, ch, size, false);
 }
 
 static void *do_data_decompress(void *opaque)
@@ -2823,6 +2836,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
     bool postcopy_running = postcopy_is_running();
     /* ADVISE is earlier, it shows the source has the postcopy capability on */
     bool postcopy_advised = postcopy_is_advised();
+    bool need_pmem_drain = false;
 
     seq_iter++;
 
@@ -2848,6 +2862,8 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
         ram_addr_t addr, total_ram_bytes;
         void *host = NULL;
         uint8_t ch;
+        RAMBlock *block = NULL;
+        bool is_pmem = false;
 
         addr = qemu_get_be64(f);
         flags = addr & ~TARGET_PAGE_MASK;
@@ -2864,7 +2880,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
 
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
-            RAMBlock *block = ram_block_from_stream(f, flags);
+            block = ram_block_from_stream(f, flags);
 
             host = host_from_ram_block_offset(block, addr);
             if (!host) {
@@ -2874,6 +2890,9 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
             }
             ramblock_recv_bitmap_set(block, host);
             trace_ram_load_loop(block->idstr, (uint64_t)addr, flags, host);
+
+            is_pmem = ramblock_is_pmem(block);
+            need_pmem_drain = need_pmem_drain || is_pmem;
         }
 
         switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
@@ -2927,7 +2946,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
 
         case RAM_SAVE_FLAG_ZERO:
             ch = qemu_get_byte(f);
-            ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
+            ram_handle_compressed_common(host, ch, TARGET_PAGE_SIZE, is_pmem);
             break;
 
         case RAM_SAVE_FLAG_PAGE:
@@ -2970,6 +2989,11 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
     }
 
     wait_for_decompress_done();
+
+    if (need_pmem_drain) {
+        pmem_drain();
+    }
+
     rcu_read_unlock();
     trace_ram_load_complete(ret, seq_iter);
     return ret;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v2 6/8] migration/ram: ensure write persistence on loading normal pages to PMEM
  2018-02-07  7:33 [Qemu-devel] [PATCH v2 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory Haozhong Zhang
                   ` (4 preceding siblings ...)
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM Haozhong Zhang
@ 2018-02-07  7:33 ` Haozhong Zhang
  2018-02-07 11:49   ` Dr. David Alan Gilbert
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed " Haozhong Zhang
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 8/8] migration/ram: ensure write persistence on loading xbzrle " Haozhong Zhang
  7 siblings, 1 reply; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07  7:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Stefan Hajnoczi,
	Dan Williams, Haozhong Zhang

When loading a normal page to persistent memory, load its data via
the libpmem function pmem_memcpy_nodrain() instead of memcpy().
Combined with a call to pmem_drain() at the end of memory loading, we
can guarantee all those normal pages are persistently loaded to PMEM.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 include/migration/qemu-file-types.h |  1 +
 include/qemu/pmem.h                 |  6 ++++++
 migration/qemu-file.c               | 41 ++++++++++++++++++++++++++++---------
 migration/ram.c                     |  6 +++++-
 4 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/include/migration/qemu-file-types.h b/include/migration/qemu-file-types.h
index bd6d7dd7f9..bb5c547498 100644
--- a/include/migration/qemu-file-types.h
+++ b/include/migration/qemu-file-types.h
@@ -34,6 +34,7 @@ void qemu_put_be16(QEMUFile *f, unsigned int v);
 void qemu_put_be32(QEMUFile *f, unsigned int v);
 void qemu_put_be64(QEMUFile *f, uint64_t v);
 size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size);
+size_t qemu_get_buffer_to_pmem(QEMUFile *f, uint8_t *buf, size_t size);
 
 int qemu_get_byte(QEMUFile *f);
 
diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index 861d8ecc21..77ee1fc4eb 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -26,6 +26,12 @@ pmem_memcpy_persist(void *pmemdest, const void *src, size_t len)
     return memcpy(pmemdest, src, len);
 }
 
+static inline void *
+pmem_memcpy_nodrain(void *pmemdest, const void *src, size_t len)
+{
+    return memcpy(pmemdest, src, len);
+}
+
 static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
 {
     return memset(pmemdest, c, len);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 2ab2bf362d..7e573010d9 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -26,6 +26,7 @@
 #include "qemu-common.h"
 #include "qemu/error-report.h"
 #include "qemu/iov.h"
+#include "qemu/pmem.h"
 #include "migration.h"
 #include "qemu-file.h"
 #include "trace.h"
@@ -471,15 +472,8 @@ size_t qemu_peek_buffer(QEMUFile *f, uint8_t **buf, size_t size, size_t offset)
     return size;
 }
 
-/*
- * Read 'size' bytes of data from the file into buf.
- * 'size' can be larger than the internal buffer.
- *
- * It will return size bytes unless there was an error, in which case it will
- * return as many as it managed to read (assuming blocking fd's which
- * all current QEMUFile are)
- */
-size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
+static size_t
+qemu_get_buffer_common(QEMUFile *f, uint8_t *buf, size_t size, bool is_pmem)
 {
     size_t pending = size;
     size_t done = 0;
@@ -492,7 +486,11 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
         if (res == 0) {
             return done;
         }
-        memcpy(buf, src, res);
+        if (!is_pmem) {
+            memcpy(buf, src, res);
+        } else {
+            pmem_memcpy_nodrain(buf, src, res);
+        }
         qemu_file_skip(f, res);
         buf += res;
         pending -= res;
@@ -501,6 +499,29 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
     return done;
 }
 
+/*
+ * Read 'size' bytes of data from the file into buf.
+ * 'size' can be larger than the internal buffer.
+ *
+ * It will return size bytes unless there was an error, in which case it will
+ * return as many as it managed to read (assuming blocking fd's which
+ * all current QEMUFile are)
+ */
+size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
+{
+    return qemu_get_buffer_common(f, buf, size, false);
+}
+
+/*
+ * Mostly the same as qemu_get_buffer(), except that
+ * 1) it's for the case that 'buf' is in the persistent memory, and
+ * 2) it takes necessary operations to ensure the data persistence in 'buf'.
+ */
+size_t qemu_get_buffer_to_pmem(QEMUFile *f, uint8_t *buf, size_t size)
+{
+    return qemu_get_buffer_common(f, buf, size, true);
+}
+
 /*
  * Read 'size' bytes of data from the file.
  * 'size' can be larger than the internal buffer.
diff --git a/migration/ram.c b/migration/ram.c
index 5a0e503818..5a79bbff64 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2950,7 +2950,11 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
             break;
 
         case RAM_SAVE_FLAG_PAGE:
-            qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
+            if (!is_pmem) {
+                qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
+            } else {
+                qemu_get_buffer_to_pmem(f, host, TARGET_PAGE_SIZE);
+            }
             break;
 
         case RAM_SAVE_FLAG_COMPRESS_PAGE:
-- 
2.14.1

* [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
  2018-02-07  7:33 [Qemu-devel] [PATCH v2 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory Haozhong Zhang
                   ` (5 preceding siblings ...)
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 6/8] migration/ram: ensure write persistence on loading normal " Haozhong Zhang
@ 2018-02-07  7:33 ` Haozhong Zhang
  2018-02-07 11:54   ` Dr. David Alan Gilbert
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 8/8] migration/ram: ensure write persistence on loading xbzrle " Haozhong Zhang
  7 siblings, 1 reply; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07  7:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Stefan Hajnoczi,
	Dan Williams, Haozhong Zhang

When loading a compressed page to persistent memory, flush the CPU
cache after the data is decompressed.  Combined with a call to
pmem_drain() at the end of memory loading, we can guarantee those
compressed pages are persistently loaded to PMEM.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 include/qemu/pmem.h |  4 ++++
 migration/ram.c     | 16 +++++++++++-----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index 77ee1fc4eb..20e3f6e71d 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -37,6 +37,10 @@ static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
     return memset(pmemdest, c, len);
 }
 
+static inline void pmem_flush(const void *addr, size_t len)
+{
+}
+
 static inline void pmem_drain(void)
 {
 }
diff --git a/migration/ram.c b/migration/ram.c
index 5a79bbff64..924d2b9537 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -274,6 +274,7 @@ struct DecompressParam {
     void *des;
     uint8_t *compbuf;
     int len;
+    bool is_pmem;
 };
 typedef struct DecompressParam DecompressParam;
 
@@ -2502,7 +2503,7 @@ static void *do_data_decompress(void *opaque)
     DecompressParam *param = opaque;
     unsigned long pagesize;
     uint8_t *des;
-    int len;
+    int len, rc;
 
     qemu_mutex_lock(&param->mutex);
     while (!param->quit) {
@@ -2518,8 +2519,11 @@ static void *do_data_decompress(void *opaque)
              * not a problem because the dirty page will be retransferred
              * and uncompress() won't break the data in other pages.
              */
-            uncompress((Bytef *)des, &pagesize,
-                       (const Bytef *)param->compbuf, len);
+            rc = uncompress((Bytef *)des, &pagesize,
+                            (const Bytef *)param->compbuf, len);
+            if (rc == Z_OK && param->is_pmem) {
+                pmem_flush(des, len);
+            }
 
             qemu_mutex_lock(&decomp_done_lock);
             param->done = true;
@@ -2605,7 +2609,8 @@ static void compress_threads_load_cleanup(void)
 }
 
 static void decompress_data_with_multi_threads(QEMUFile *f,
-                                               void *host, int len)
+                                               void *host, int len,
+                                               bool is_pmem)
 {
     int idx, thread_count;
 
@@ -2619,6 +2624,7 @@ static void decompress_data_with_multi_threads(QEMUFile *f,
                 qemu_get_buffer(f, decomp_param[idx].compbuf, len);
                 decomp_param[idx].des = host;
                 decomp_param[idx].len = len;
+                decomp_param[idx].is_pmem = is_pmem;
                 qemu_cond_signal(&decomp_param[idx].cond);
                 qemu_mutex_unlock(&decomp_param[idx].mutex);
                 break;
@@ -2964,7 +2970,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
                 ret = -EINVAL;
                 break;
             }
-            decompress_data_with_multi_threads(f, host, len);
+            decompress_data_with_multi_threads(f, host, len, is_pmem);
             break;
 
         case RAM_SAVE_FLAG_XBZRLE:
-- 
2.14.1

* [Qemu-devel] [PATCH v2 8/8] migration/ram: ensure write persistence on loading xbzrle pages to PMEM
  2018-02-07  7:33 [Qemu-devel] [PATCH v2 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory Haozhong Zhang
                   ` (6 preceding siblings ...)
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed " Haozhong Zhang
@ 2018-02-07  7:33 ` Haozhong Zhang
  2018-02-07 13:08   ` Dr. David Alan Gilbert
  7 siblings, 1 reply; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07  7:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Stefan Hajnoczi,
	Dan Williams, Haozhong Zhang

When loading an xbzrle-encoded page to persistent memory, load the
data via the libpmem function pmem_memcpy_nodrain() instead of
memcpy().  Combined with a call to pmem_drain() at the end of memory
loading, we can guarantee those xbzrle-encoded pages are persistently
loaded to PMEM.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 migration/ram.c    | 15 ++++++++++-----
 migration/xbzrle.c | 20 ++++++++++++++++++--
 migration/xbzrle.h |  1 +
 3 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 924d2b9537..87f977617d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2388,10 +2388,10 @@ static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
     }
 }
 
-static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
+static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host, bool is_pmem)
 {
     unsigned int xh_len;
-    int xh_flags;
+    int xh_flags, rc;
     uint8_t *loaded_data;
 
     /* extract RLE header */
@@ -2413,8 +2413,13 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
     qemu_get_buffer_in_place(f, &loaded_data, xh_len);
 
     /* decode RLE */
-    if (xbzrle_decode_buffer(loaded_data, xh_len, host,
-                             TARGET_PAGE_SIZE) == -1) {
+    if (!is_pmem) {
+        rc = xbzrle_decode_buffer(loaded_data, xh_len, host, TARGET_PAGE_SIZE);
+    } else {
+        rc = xbzrle_decode_buffer_to_pmem(loaded_data, xh_len, host,
+                                          TARGET_PAGE_SIZE);
+    }
+    if (rc == -1) {
         error_report("Failed to load XBZRLE page - decode error!");
         return -1;
     }
@@ -2974,7 +2979,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
             break;
 
         case RAM_SAVE_FLAG_XBZRLE:
-            if (load_xbzrle(f, addr, host) < 0) {
+            if (load_xbzrle(f, addr, host, is_pmem) < 0) {
                 error_report("Failed to decompress XBZRLE page at "
                              RAM_ADDR_FMT, addr);
                 ret = -EINVAL;
diff --git a/migration/xbzrle.c b/migration/xbzrle.c
index 1ba482ded9..499d8e1bfb 100644
--- a/migration/xbzrle.c
+++ b/migration/xbzrle.c
@@ -12,6 +12,7 @@
  */
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/pmem.h"
 #include "xbzrle.h"
 
 /*
@@ -126,7 +127,8 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, int slen,
     return d;
 }
 
-int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
+static int xbzrle_decode_buffer_common(uint8_t *src, int slen, uint8_t *dst,
+                                       int dlen, bool is_pmem)
 {
     int i = 0, d = 0;
     int ret;
@@ -167,10 +169,24 @@ int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
             return -1;
         }
 
-        memcpy(dst + d, src + i, count);
+        if (!is_pmem) {
+            memcpy(dst + d, src + i, count);
+        } else {
+            pmem_memcpy_nodrain(dst + d, src + i, count);
+        }
         d += count;
         i += count;
     }
 
     return d;
 }
+
+int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
+{
+    return xbzrle_decode_buffer_common(src, slen, dst, dlen, false);
+}
+
+int xbzrle_decode_buffer_to_pmem(uint8_t *src, int slen, uint8_t *dst, int dlen)
+{
+    return xbzrle_decode_buffer_common(src, slen, dst, dlen, true);
+}
diff --git a/migration/xbzrle.h b/migration/xbzrle.h
index a0db507b9c..ac5ae32666 100644
--- a/migration/xbzrle.h
+++ b/migration/xbzrle.h
@@ -18,4 +18,5 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, int slen,
                          uint8_t *dst, int dlen);
 
 int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen);
+int xbzrle_decode_buffer_to_pmem(uint8_t *src, int slen, uint8_t *dst, int dlen);
 #endif
-- 
2.14.1

* Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM Haozhong Zhang
@ 2018-02-07 10:17   ` Pankaj Gupta
  2018-02-07 11:18     ` Haozhong Zhang
  2018-02-07 11:38   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 34+ messages in thread
From: Pankaj Gupta @ 2018-02-07 10:17 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Xiao Guangrong, mst, Juan Quintela, dgilbert,
	Stefan Hajnoczi, Paolo Bonzini, Igor Mammedov, Dan Williams,
	Eduardo Habkost


> 
> When loading a zero page, check whether it will be loaded to
> persistent memory If yes, load it by libpmem function
> pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> end of RAM loading, we can guarantee all those zero pages are
> persistently loaded.
> 
> Depending on the host HW/SW configurations, pmem_drain() can be
> "sfence".  Therefore, we do not call pmem_drain() after each
> pmem_memset_nodrain(), or use pmem_memset_persist() (equally
> pmem_memset_nodrain() + pmem_drain()), in order to avoid unnecessary
> overhead.

Are you saying we don't need 'pmem_drain()'?

I can see it's called in patch 5 in ram_load(), but I also
see an empty definition. Is there anything I am missing here?

Thanks,
Pankaj

* Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  2018-02-07 10:17   ` Pankaj Gupta
@ 2018-02-07 11:18     ` Haozhong Zhang
  2018-02-07 11:30       ` Pankaj Gupta
  0 siblings, 1 reply; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07 11:18 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: qemu-devel, Xiao Guangrong, mst, Juan Quintela, dgilbert,
	Stefan Hajnoczi, Paolo Bonzini, Igor Mammedov, Dan Williams,
	Eduardo Habkost

On 02/07/18 05:17 -0500, Pankaj Gupta wrote:
> 
> > 
> > When loading a zero page, check whether it will be loaded to
> > persistent memory If yes, load it by libpmem function
> > pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> > end of RAM loading, we can guarantee all those zero pages are
> > persistently loaded.
> > 
> > Depending on the host HW/SW configurations, pmem_drain() can be
> > "sfence".  Therefore, we do not call pmem_drain() after each
> > pmem_memset_nodrain(), or use pmem_memset_persist() (equally
> > pmem_memset_nodrain() + pmem_drain()), in order to avoid unnecessary
> > overhead.
> 
> Are you saying we don't need 'pmem_drain()'?

pmem_drain() is called only once after all ram blocks are loaded.

> 
> I can see its called in patch 5 in ram_load? But
> I also see empty definition. Anything I am missing here?

Functions defined in include/qemu/pmem.h are stubs and are used only
when QEMU is not compiled with libpmem. When QEMU is compiled with
--enable-libpmem, the implementations in libpmem are used instead.

Haozhong

* Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  2018-02-07 11:18     ` Haozhong Zhang
@ 2018-02-07 11:30       ` Pankaj Gupta
  0 siblings, 0 replies; 34+ messages in thread
From: Pankaj Gupta @ 2018-02-07 11:30 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Xiao Guangrong, mst, Juan Quintela, dgilbert,
	Stefan Hajnoczi, Paolo Bonzini, Igor Mammedov, Dan Williams,
	Eduardo Habkost


> 
> On 02/07/18 05:17 -0500, Pankaj Gupta wrote:
> > 
> > > 
> > > When loading a zero page, check whether it will be loaded to
> > > persistent memory If yes, load it by libpmem function
> > > pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> > > end of RAM loading, we can guarantee all those zero pages are
> > > persistently loaded.
> > > 
> > > Depending on the host HW/SW configurations, pmem_drain() can be
> > > "sfence".  Therefore, we do not call pmem_drain() after each
> > > pmem_memset_nodrain(), or use pmem_memset_persist() (equally
> > > pmem_memset_nodrain() + pmem_drain()), in order to avoid unnecessary
> > > overhead.
> > 
> > Are you saying we don't need 'pmem_drain()'?
> 
> pmem_drain() is called only once after all ram blocks are loaded.

o.k

> 
> > 
> > I can see its called in patch 5 in ram_load? But
> > I also see empty definition. Anything I am missing here?
> 
> Functions defined in include/qemu/pmem.h are stubs and used only when
> QEMU is not compiled with libpmem. When QEMU is compiled with
> --enabled-libpmem, the one in libpmem is used.

p.k.

Thanks,
Pankaj

> 
> Haozhong
> 
> > 
> > Thanks,
> > Pankaj
> > 
> > > 
> > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > > ---
> > >  include/qemu/pmem.h |  9 +++++++++
> > >  migration/ram.c     | 34 +++++++++++++++++++++++++++++-----
> > >  2 files changed, 38 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > > index 9017596ff0..861d8ecc21 100644
> > > --- a/include/qemu/pmem.h
> > > +++ b/include/qemu/pmem.h
> > > @@ -26,6 +26,15 @@ pmem_memcpy_persist(void *pmemdest, const void *src,
> > > size_t len)
> > >      return memcpy(pmemdest, src, len);
> > >  }
> > >  
> > > +static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t
> > > len)
> > > +{
> > > +    return memset(pmemdest, c, len);
> > > +}
> > > +
> > > +static inline void pmem_drain(void)
> > > +{
> > > +}
> > > +
> > >  #endif /* CONFIG_LIBPMEM */
> > >  
> > >  #endif /* !QEMU_PMEM_H */
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index cb1950f3eb..5a0e503818 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -49,6 +49,7 @@
> > >  #include "qemu/rcu_queue.h"
> > >  #include "migration/colo.h"
> > >  #include "migration/block.h"
> > > +#include "qemu/pmem.h"
> > >  
> > >  /***********************************************************/
> > >  /* ram save/restore */
> > > @@ -2467,6 +2468,20 @@ static inline void *host_from_ram_block_offset(RAMBlock *block,
> > >      return block->host + offset;
> > >  }
> > >  
> > > +static void ram_handle_compressed_common(void *host, uint8_t ch, uint64_t size,
> > > +                                         bool is_pmem)
> > > +{
> > > +    if (!ch && is_zero_range(host, size)) {
> > > +        return;
> > > +    }
> > > +
> > > +    if (!is_pmem) {
> > > +        memset(host, ch, size);
> > > +    } else {
> > > +        pmem_memset_nodrain(host, ch, size);
> > > +    }
> > > +}
> > > +
> > >  /**
> > >   * ram_handle_compressed: handle the zero page case
> > >   *
> > > @@ -2479,9 +2494,7 @@ static inline void *host_from_ram_block_offset(RAMBlock *block,
> > >   */
> > >  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
> > >  {
> > > -    if (ch != 0 || !is_zero_range(host, size)) {
> > > -        memset(host, ch, size);
> > > -    }
> > > +    return ram_handle_compressed_common(host, ch, size, false);
> > >  }
> > >  
> > >  static void *do_data_decompress(void *opaque)
> > > @@ -2823,6 +2836,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >      bool postcopy_running = postcopy_is_running();
> > >      /* ADVISE is earlier, it shows the source has the postcopy capability on */
> > >      bool postcopy_advised = postcopy_is_advised();
> > > +    bool need_pmem_drain = false;
> > >  
> > >      seq_iter++;
> > >  
> > > @@ -2848,6 +2862,8 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >          ram_addr_t addr, total_ram_bytes;
> > >          void *host = NULL;
> > >          uint8_t ch;
> > > +        RAMBlock *block = NULL;
> > > +        bool is_pmem = false;
> > >  
> > >          addr = qemu_get_be64(f);
> > >          flags = addr & ~TARGET_PAGE_MASK;
> > > @@ -2864,7 +2880,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >  
> > >          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
> > >                       RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
> > > -            RAMBlock *block = ram_block_from_stream(f, flags);
> > > +            block = ram_block_from_stream(f, flags);
> > >  
> > >              host = host_from_ram_block_offset(block, addr);
> > >              if (!host) {
> > > @@ -2874,6 +2890,9 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >              }
> > >              ramblock_recv_bitmap_set(block, host);
> > >              trace_ram_load_loop(block->idstr, (uint64_t)addr, flags, host);
> > > +
> > > +            is_pmem = ramblock_is_pmem(block);
> > > +            need_pmem_drain = need_pmem_drain || is_pmem;
> > >          }
> > >  
> > >          switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
> > > @@ -2927,7 +2946,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >  
> > >          case RAM_SAVE_FLAG_ZERO:
> > >              ch = qemu_get_byte(f);
> > > -            ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
> > > +            ram_handle_compressed_common(host, ch, TARGET_PAGE_SIZE, is_pmem);
> > >              break;
> > >  
> > >          case RAM_SAVE_FLAG_PAGE:
> > > @@ -2970,6 +2989,11 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >      }
> > >  
> > >      wait_for_decompress_done();
> > > +
> > > +    if (need_pmem_drain) {
> > > +        pmem_drain();
> > > +    }
> > > +
> > >      rcu_read_unlock();
> > >      trace_ram_load_complete(ret, seq_iter);
> > >      return ret;
> > > --
> > > 2.14.1
> > > 
> > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM Haozhong Zhang
  2018-02-07 10:17   ` Pankaj Gupta
@ 2018-02-07 11:38   ` Dr. David Alan Gilbert
  2018-02-07 11:52     ` Haozhong Zhang
  1 sibling, 1 reply; 34+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-07 11:38 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

* Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> When loading a zero page, check whether it will be loaded to
> persistent memory. If yes, load it by the libpmem function
> pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> end of RAM loading, we can guarantee all those zero pages are
> persistently loaded.

I'm surprised pmem is this invasive, to be honest; I hadn't expected
the need for special memcpys etc. everywhere.  We're bound to miss some.
I assume there's separate kernel work needed to make postcopy work;
certainly the patches here don't seem to touch the postcopy path.

> Depending on the host HW/SW configurations, pmem_drain() can be
> "sfence".  Therefore, we do not call pmem_drain() after each
> pmem_memset_nodrain(), or use pmem_memset_persist() (equivalently
> pmem_memset_nodrain() + pmem_drain()), in order to avoid unnecessary
> overhead.
> 
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  include/qemu/pmem.h |  9 +++++++++
>  migration/ram.c     | 34 +++++++++++++++++++++++++++++-----
>  2 files changed, 38 insertions(+), 5 deletions(-)
> 
> diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> index 9017596ff0..861d8ecc21 100644
> --- a/include/qemu/pmem.h
> +++ b/include/qemu/pmem.h
> @@ -26,6 +26,15 @@ pmem_memcpy_persist(void *pmemdest, const void *src, size_t len)
>      return memcpy(pmemdest, src, len);
>  }
>  
> +static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> +{
> +    return memset(pmemdest, c, len);
> +}
> +
> +static inline void pmem_drain(void)
> +{
> +}
> +
>  #endif /* CONFIG_LIBPMEM */
>  
>  #endif /* !QEMU_PMEM_H */
> diff --git a/migration/ram.c b/migration/ram.c
> index cb1950f3eb..5a0e503818 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -49,6 +49,7 @@
>  #include "qemu/rcu_queue.h"
>  #include "migration/colo.h"
>  #include "migration/block.h"
> +#include "qemu/pmem.h"
>  
>  /***********************************************************/
>  /* ram save/restore */
> @@ -2467,6 +2468,20 @@ static inline void *host_from_ram_block_offset(RAMBlock *block,
>      return block->host + offset;
>  }
>  
> +static void ram_handle_compressed_common(void *host, uint8_t ch, uint64_t size,
> +                                         bool is_pmem)

I don't think there's any advantage of splitting out this _common
routine; lets just add the parameter to ram_handle_compressed.

> +{
> +    if (!ch && is_zero_range(host, size)) {
> +        return;
> +    }
> +
> +    if (!is_pmem) {
> +        memset(host, ch, size);
> +    } else {
> +        pmem_memset_nodrain(host, ch, size);
> +    }

I'm wondering if it would be easier to pass in a memsetfunc ptr and call
that (defaulting to memset if it's NULL).
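For illustration, that could look roughly like the sketch below (hypothetical names, not against the actual tree; plain memset stands in for pmem_memset_nodrain):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the memsetfunc idea: callers pass a memset-style
 * function pointer, NULL meaning plain memset; the pmem path would pass
 * pmem_memset_nodrain instead. */
typedef void *(*memsetfunc)(void *s, int c, size_t n);

static void ram_handle_compressed(void *host, uint8_t ch, uint64_t size,
                                  memsetfunc f)
{
    /* the real code first skips ranges that are already zero */
    (f ? f : memset)(host, ch, (size_t)size);
}
```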

> +}
> +
>  /**
>   * ram_handle_compressed: handle the zero page case
>   *
> @@ -2479,9 +2494,7 @@ static inline void *host_from_ram_block_offset(RAMBlock *block,
>   */
>  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
>  {
> -    if (ch != 0 || !is_zero_range(host, size)) {
> -        memset(host, ch, size);
> -    }
> +    return ram_handle_compressed_common(host, ch, size, false);
>  }
>  
>  static void *do_data_decompress(void *opaque)
> @@ -2823,6 +2836,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>      bool postcopy_running = postcopy_is_running();
>      /* ADVISE is earlier, it shows the source has the postcopy capability on */
>      bool postcopy_advised = postcopy_is_advised();
> +    bool need_pmem_drain = false;
>  
>      seq_iter++;
>  
> @@ -2848,6 +2862,8 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>          ram_addr_t addr, total_ram_bytes;
>          void *host = NULL;
>          uint8_t ch;
> +        RAMBlock *block = NULL;
> +        bool is_pmem = false;
>  
>          addr = qemu_get_be64(f);
>          flags = addr & ~TARGET_PAGE_MASK;
> @@ -2864,7 +2880,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>  
>          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
>                       RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
> -            RAMBlock *block = ram_block_from_stream(f, flags);
> +            block = ram_block_from_stream(f, flags);
>  
>              host = host_from_ram_block_offset(block, addr);
>              if (!host) {
> @@ -2874,6 +2890,9 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>              }
>              ramblock_recv_bitmap_set(block, host);
>              trace_ram_load_loop(block->idstr, (uint64_t)addr, flags, host);
> +
> +            is_pmem = ramblock_is_pmem(block);
> +            need_pmem_drain = need_pmem_drain || is_pmem;
>          }
>  
>          switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
> @@ -2927,7 +2946,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>  
>          case RAM_SAVE_FLAG_ZERO:
>              ch = qemu_get_byte(f);
> -            ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
> +            ram_handle_compressed_common(host, ch, TARGET_PAGE_SIZE, is_pmem);
>              break;
>  
>          case RAM_SAVE_FLAG_PAGE:
> @@ -2970,6 +2989,11 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>      }
>  
>      wait_for_decompress_done();
> +
> +    if (need_pmem_drain) {
> +        pmem_drain();
> +    }
> +
>      rcu_read_unlock();
>      trace_ram_load_complete(ret, seq_iter);
>      return ret;
> -- 
> 2.14.1

Dave

> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v2 6/8] migration/ram: ensure write persistence on loading normal pages to PMEM
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 6/8] migration/ram: ensure write persistence on loading normal " Haozhong Zhang
@ 2018-02-07 11:49   ` Dr. David Alan Gilbert
  2018-02-07 12:02     ` Haozhong Zhang
  0 siblings, 1 reply; 34+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-07 11:49 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

* Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> When loading a normal page to persistent memory, load its data by
> libpmem function pmem_memcpy_nodrain() instead of memcpy(). Combined
> with a call to pmem_drain() at the end of memory loading, we can
> guarantee all those normal pages are persistently loaded to PMEM.
> 
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  include/migration/qemu-file-types.h |  1 +
>  include/qemu/pmem.h                 |  6 ++++++
>  migration/qemu-file.c               | 41 ++++++++++++++++++++++++++++---------
>  migration/ram.c                     |  6 +++++-
>  4 files changed, 43 insertions(+), 11 deletions(-)
> 
> diff --git a/include/migration/qemu-file-types.h b/include/migration/qemu-file-types.h
> index bd6d7dd7f9..bb5c547498 100644
> --- a/include/migration/qemu-file-types.h
> +++ b/include/migration/qemu-file-types.h
> @@ -34,6 +34,7 @@ void qemu_put_be16(QEMUFile *f, unsigned int v);
>  void qemu_put_be32(QEMUFile *f, unsigned int v);
>  void qemu_put_be64(QEMUFile *f, uint64_t v);
>  size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size);
> +size_t qemu_get_buffer_to_pmem(QEMUFile *f, uint8_t *buf, size_t size);
>  
>  int qemu_get_byte(QEMUFile *f);
>  
> diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> index 861d8ecc21..77ee1fc4eb 100644
> --- a/include/qemu/pmem.h
> +++ b/include/qemu/pmem.h
> @@ -26,6 +26,12 @@ pmem_memcpy_persist(void *pmemdest, const void *src, size_t len)
>      return memcpy(pmemdest, src, len);
>  }
>  
> +static inline void *
> +pmem_memcpy_nodrain(void *pmemdest, const void *src, size_t len)
> +{
> +    return memcpy(pmemdest, src, len);
> +}
> +
>  static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
>  {
>      return memset(pmemdest, c, len);
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index 2ab2bf362d..7e573010d9 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -26,6 +26,7 @@
>  #include "qemu-common.h"
>  #include "qemu/error-report.h"
>  #include "qemu/iov.h"
> +#include "qemu/pmem.h"
>  #include "migration.h"
>  #include "qemu-file.h"
>  #include "trace.h"
> @@ -471,15 +472,8 @@ size_t qemu_peek_buffer(QEMUFile *f, uint8_t **buf, size_t size, size_t offset)
>      return size;
>  }
>  
> -/*
> - * Read 'size' bytes of data from the file into buf.
> - * 'size' can be larger than the internal buffer.
> - *
> - * It will return size bytes unless there was an error, in which case it will
> - * return as many as it managed to read (assuming blocking fd's which
> - * all current QEMUFile are)
> - */
> -size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
> +static size_t
> +qemu_get_buffer_common(QEMUFile *f, uint8_t *buf, size_t size, bool is_pmem)
>  {
>      size_t pending = size;
>      size_t done = 0;
> @@ -492,7 +486,11 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
>          if (res == 0) {
>              return done;
>          }
> -        memcpy(buf, src, res);
> +        if (!is_pmem) {
> +            memcpy(buf, src, res);
> +        } else {
> +            pmem_memcpy_nodrain(buf, src, res);
> +        }

I see why you're doing it, but again I'm surprised it's ended up having
to modify qemu-file.

>          qemu_file_skip(f, res);
>          buf += res;
>          pending -= res;
> @@ -501,6 +499,29 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
>      return done;
>  }
>  
> +/*
> + * Read 'size' bytes of data from the file into buf.
> + * 'size' can be larger than the internal buffer.
> + *
> + * It will return size bytes unless there was an error, in which case it will
> + * return as many as it managed to read (assuming blocking fd's which
> + * all current QEMUFile are)
> + */
> +size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
> +{
> +    return qemu_get_buffer_common(f, buf, size, false);
> +}
> +
> +/*
> + * Mostly the same as qemu_get_buffer(), except that
> + * 1) it's for the case that 'buf' is in the persistent memory, and
> + * 2) it takes necessary operations to ensure the data persistence in 'buf'.
> + */
> +size_t qemu_get_buffer_to_pmem(QEMUFile *f, uint8_t *buf, size_t size)
> +{
> +    return qemu_get_buffer_common(f, buf, size, true);
> +}
> +
>  /*
>   * Read 'size' bytes of data from the file.
>   * 'size' can be larger than the internal buffer.
> diff --git a/migration/ram.c b/migration/ram.c
> index 5a0e503818..5a79bbff64 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2950,7 +2950,11 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>              break;
>  
>          case RAM_SAVE_FLAG_PAGE:
> -            qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
> +            if (!is_pmem) {
> +                qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
> +            } else {
> +                qemu_get_buffer_to_pmem(f, host, TARGET_PAGE_SIZE);
> +            }

Can you just expose qemu_get_buffer_common instead of making it static,
and pass in the is_pmem flag?
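i.e. something along these lines (a hypothetical sketch of the shape only; memcpy stands in for both the QEMUFile internals and pmem_memcpy_nodrain):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical shape of the suggestion: export the common routine with an
 * is_pmem flag rather than adding a qemu_get_buffer_to_pmem() wrapper;
 * the RAM_SAVE_FLAG_PAGE case then becomes one call with is_pmem passed
 * straight through. */
size_t qemu_get_buffer_common(const uint8_t *src, uint8_t *buf, size_t size,
                              bool is_pmem)
{
    if (is_pmem) {
        memcpy(buf, src, size);  /* real code: pmem_memcpy_nodrain() */
    } else {
        memcpy(buf, src, size);
    }
    return size;
}
```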

Dave

>              break;
>  
>          case RAM_SAVE_FLAG_COMPRESS_PAGE:
> -- 
> 2.14.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  2018-02-07 11:38   ` Dr. David Alan Gilbert
@ 2018-02-07 11:52     ` Haozhong Zhang
  2018-02-07 12:51       ` Haozhong Zhang
  2018-02-07 12:56       ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07 11:52 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

On 02/07/18 11:38 +0000, Dr. David Alan Gilbert wrote:
> * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > When loading a zero page, check whether it will be loaded to
> > persistent memory. If yes, load it by the libpmem function
> > pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> > end of RAM loading, we can guarantee all those zero pages are
> > persistently loaded.
> 
> I'm surprised pmem is this invasive, to be honest; I hadn't expected
> the need for special memcpys etc. everywhere.  We're bound to miss some.
> I assume there's separate kernel work needed to make postcopy work;
> certainly the patches here don't seem to touch the postcopy path.

This link at
https://wiki.qemu.org/Features/PostCopyLiveMigration#Conflicts shows
that postcopy with memory-backend-file requires kernel support. Can
> you point me to the details of the required kernel support, so that I can
> understand what would be needed for NVDIMM postcopy?

> 
> > Depending on the host HW/SW configurations, pmem_drain() can be
> > "sfence".  Therefore, we do not call pmem_drain() after each
> > pmem_memset_nodrain(), or use pmem_memset_persist() (equivalently
> > pmem_memset_nodrain() + pmem_drain()), in order to avoid unnecessary
> > overhead.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > ---
> >  include/qemu/pmem.h |  9 +++++++++
> >  migration/ram.c     | 34 +++++++++++++++++++++++++++++-----
> >  2 files changed, 38 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > index 9017596ff0..861d8ecc21 100644
> > --- a/include/qemu/pmem.h
> > +++ b/include/qemu/pmem.h
> > @@ -26,6 +26,15 @@ pmem_memcpy_persist(void *pmemdest, const void *src, size_t len)
> >      return memcpy(pmemdest, src, len);
> >  }
> >  
> > +static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> > +{
> > +    return memset(pmemdest, c, len);
> > +}
> > +
> > +static inline void pmem_drain(void)
> > +{
> > +}
> > +
> >  #endif /* CONFIG_LIBPMEM */
> >  
> >  #endif /* !QEMU_PMEM_H */
> > diff --git a/migration/ram.c b/migration/ram.c
> > index cb1950f3eb..5a0e503818 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -49,6 +49,7 @@
> >  #include "qemu/rcu_queue.h"
> >  #include "migration/colo.h"
> >  #include "migration/block.h"
> > +#include "qemu/pmem.h"
> >  
> >  /***********************************************************/
> >  /* ram save/restore */
> > @@ -2467,6 +2468,20 @@ static inline void *host_from_ram_block_offset(RAMBlock *block,
> >      return block->host + offset;
> >  }
> >  
> > +static void ram_handle_compressed_common(void *host, uint8_t ch, uint64_t size,
> > +                                         bool is_pmem)
> 
> I don't think there's any advantage of splitting out this _common
> routine; lets just add the parameter to ram_handle_compressed.
> 
> > +{
> > +    if (!ch && is_zero_range(host, size)) {
> > +        return;
> > +    }
> > +
> > +    if (!is_pmem) {
> > +        memset(host, ch, size);
> > +    } else {
> > +        pmem_memset_nodrain(host, ch, size);
> > +    }
> 
> I'm wondering if it would be easier to pass in a memsetfunc ptr and call
> that (defaulting to memset if it's NULL).

Yes, it would be more extensible if we have other special memory
devices in the future.

Thanks,
Haozhong

> 
> > +}
> > +
> >  /**
> >   * ram_handle_compressed: handle the zero page case
> >   *
> > @@ -2479,9 +2494,7 @@ static inline void *host_from_ram_block_offset(RAMBlock *block,
> >   */
> >  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
> >  {
> > -    if (ch != 0 || !is_zero_range(host, size)) {
> > -        memset(host, ch, size);
> > -    }
> > +    return ram_handle_compressed_common(host, ch, size, false);
> >  }
> >  
> >  static void *do_data_decompress(void *opaque)
> > @@ -2823,6 +2836,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> >      bool postcopy_running = postcopy_is_running();
> >      /* ADVISE is earlier, it shows the source has the postcopy capability on */
> >      bool postcopy_advised = postcopy_is_advised();
> > +    bool need_pmem_drain = false;
> >  
> >      seq_iter++;
> >  
> > @@ -2848,6 +2862,8 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> >          ram_addr_t addr, total_ram_bytes;
> >          void *host = NULL;
> >          uint8_t ch;
> > +        RAMBlock *block = NULL;
> > +        bool is_pmem = false;
> >  
> >          addr = qemu_get_be64(f);
> >          flags = addr & ~TARGET_PAGE_MASK;
> > @@ -2864,7 +2880,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> >  
> >          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
> >                       RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
> > -            RAMBlock *block = ram_block_from_stream(f, flags);
> > +            block = ram_block_from_stream(f, flags);
> >  
> >              host = host_from_ram_block_offset(block, addr);
> >              if (!host) {
> > @@ -2874,6 +2890,9 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> >              }
> >              ramblock_recv_bitmap_set(block, host);
> >              trace_ram_load_loop(block->idstr, (uint64_t)addr, flags, host);
> > +
> > +            is_pmem = ramblock_is_pmem(block);
> > +            need_pmem_drain = need_pmem_drain || is_pmem;
> >          }
> >  
> >          switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
> > @@ -2927,7 +2946,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> >  
> >          case RAM_SAVE_FLAG_ZERO:
> >              ch = qemu_get_byte(f);
> > -            ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
> > +            ram_handle_compressed_common(host, ch, TARGET_PAGE_SIZE, is_pmem);
> >              break;
> >  
> >          case RAM_SAVE_FLAG_PAGE:
> > @@ -2970,6 +2989,11 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> >      }
> >  
> >      wait_for_decompress_done();
> > +
> > +    if (need_pmem_drain) {
> > +        pmem_drain();
> > +    }
> > +
> >      rcu_read_unlock();
> >      trace_ram_load_complete(ret, seq_iter);
> >      return ret;
> > -- 
> > 2.14.1
> 
> Dave
> 
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed " Haozhong Zhang
@ 2018-02-07 11:54   ` Dr. David Alan Gilbert
  2018-02-07 12:15     ` Haozhong Zhang
  0 siblings, 1 reply; 34+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-07 11:54 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

* Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> When loading a compressed page to persistent memory, flush CPU cache
> after the data is decompressed. Combined with a call to pmem_drain()
> at the end of memory loading, we can guarantee those compressed pages
> are persistently loaded to PMEM.

Can you explain why this can use the flush and doesn't need the special
memset?

Dave

> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  include/qemu/pmem.h |  4 ++++
>  migration/ram.c     | 16 +++++++++++-----
>  2 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> index 77ee1fc4eb..20e3f6e71d 100644
> --- a/include/qemu/pmem.h
> +++ b/include/qemu/pmem.h
> @@ -37,6 +37,10 @@ static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
>      return memset(pmemdest, c, len);
>  }
>  
> +static inline void pmem_flush(const void *addr, size_t len)
> +{
> +}
> +
>  static inline void pmem_drain(void)
>  {
>  }
> diff --git a/migration/ram.c b/migration/ram.c
> index 5a79bbff64..924d2b9537 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -274,6 +274,7 @@ struct DecompressParam {
>      void *des;
>      uint8_t *compbuf;
>      int len;
> +    bool is_pmem;
>  };
>  typedef struct DecompressParam DecompressParam;
>  
> @@ -2502,7 +2503,7 @@ static void *do_data_decompress(void *opaque)
>      DecompressParam *param = opaque;
>      unsigned long pagesize;
>      uint8_t *des;
> -    int len;
> +    int len, rc;
>  
>      qemu_mutex_lock(&param->mutex);
>      while (!param->quit) {
> @@ -2518,8 +2519,11 @@ static void *do_data_decompress(void *opaque)
>               * not a problem because the dirty page will be retransferred
>               * and uncompress() won't break the data in other pages.
>               */
> -            uncompress((Bytef *)des, &pagesize,
> -                       (const Bytef *)param->compbuf, len);
> +            rc = uncompress((Bytef *)des, &pagesize,
> +                            (const Bytef *)param->compbuf, len);
> +            if (rc == Z_OK && param->is_pmem) {
> +                pmem_flush(des, len);
> +            }
>  
>              qemu_mutex_lock(&decomp_done_lock);
>              param->done = true;
> @@ -2605,7 +2609,8 @@ static void compress_threads_load_cleanup(void)
>  }
>  
>  static void decompress_data_with_multi_threads(QEMUFile *f,
> -                                               void *host, int len)
> +                                               void *host, int len,
> +                                               bool is_pmem)
>  {
>      int idx, thread_count;
>  
> @@ -2619,6 +2624,7 @@ static void decompress_data_with_multi_threads(QEMUFile *f,
>                  qemu_get_buffer(f, decomp_param[idx].compbuf, len);
>                  decomp_param[idx].des = host;
>                  decomp_param[idx].len = len;
> +                decomp_param[idx].is_pmem = is_pmem;
>                  qemu_cond_signal(&decomp_param[idx].cond);
>                  qemu_mutex_unlock(&decomp_param[idx].mutex);
>                  break;
> @@ -2964,7 +2970,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>                  ret = -EINVAL;
>                  break;
>              }
> -            decompress_data_with_multi_threads(f, host, len);
> +            decompress_data_with_multi_threads(f, host, len, is_pmem);
>              break;
>  
>          case RAM_SAVE_FLAG_XBZRLE:
> -- 
> 2.14.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v2 6/8] migration/ram: ensure write persistence on loading normal pages to PMEM
  2018-02-07 11:49   ` Dr. David Alan Gilbert
@ 2018-02-07 12:02     ` Haozhong Zhang
  0 siblings, 0 replies; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07 12:02 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

On 02/07/18 11:49 +0000, Dr. David Alan Gilbert wrote:
> * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > When loading a normal page to persistent memory, load its data by
> > libpmem function pmem_memcpy_nodrain() instead of memcpy(). Combined
> > with a call to pmem_drain() at the end of memory loading, we can
> > guarantee all those normal pages are persistently loaded to PMEM.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > ---
> >  include/migration/qemu-file-types.h |  1 +
> >  include/qemu/pmem.h                 |  6 ++++++
> >  migration/qemu-file.c               | 41 ++++++++++++++++++++++++++++---------
> >  migration/ram.c                     |  6 +++++-
> >  4 files changed, 43 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/migration/qemu-file-types.h b/include/migration/qemu-file-types.h
> > index bd6d7dd7f9..bb5c547498 100644
> > --- a/include/migration/qemu-file-types.h
> > +++ b/include/migration/qemu-file-types.h
> > @@ -34,6 +34,7 @@ void qemu_put_be16(QEMUFile *f, unsigned int v);
> >  void qemu_put_be32(QEMUFile *f, unsigned int v);
> >  void qemu_put_be64(QEMUFile *f, uint64_t v);
> >  size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size);
> > +size_t qemu_get_buffer_to_pmem(QEMUFile *f, uint8_t *buf, size_t size);
> >  
> >  int qemu_get_byte(QEMUFile *f);
> >  
> > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > index 861d8ecc21..77ee1fc4eb 100644
> > --- a/include/qemu/pmem.h
> > +++ b/include/qemu/pmem.h
> > @@ -26,6 +26,12 @@ pmem_memcpy_persist(void *pmemdest, const void *src, size_t len)
> >      return memcpy(pmemdest, src, len);
> >  }
> >  
> > +static inline void *
> > +pmem_memcpy_nodrain(void *pmemdest, const void *src, size_t len)
> > +{
> > +    return memcpy(pmemdest, src, len);
> > +}
> > +
> >  static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> >  {
> >      return memset(pmemdest, c, len);
> > diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> > index 2ab2bf362d..7e573010d9 100644
> > --- a/migration/qemu-file.c
> > +++ b/migration/qemu-file.c
> > @@ -26,6 +26,7 @@
> >  #include "qemu-common.h"
> >  #include "qemu/error-report.h"
> >  #include "qemu/iov.h"
> > +#include "qemu/pmem.h"
> >  #include "migration.h"
> >  #include "qemu-file.h"
> >  #include "trace.h"
> > @@ -471,15 +472,8 @@ size_t qemu_peek_buffer(QEMUFile *f, uint8_t **buf, size_t size, size_t offset)
> >      return size;
> >  }
> >  
> > -/*
> > - * Read 'size' bytes of data from the file into buf.
> > - * 'size' can be larger than the internal buffer.
> > - *
> > - * It will return size bytes unless there was an error, in which case it will
> > - * return as many as it managed to read (assuming blocking fd's which
> > - * all current QEMUFile are)
> > - */
> > -size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
> > +static size_t
> > +qemu_get_buffer_common(QEMUFile *f, uint8_t *buf, size_t size, bool is_pmem)
> >  {
> >      size_t pending = size;
> >      size_t done = 0;
> > @@ -492,7 +486,11 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
> >          if (res == 0) {
> >              return done;
> >          }
> > -        memcpy(buf, src, res);
> > +        if (!is_pmem) {
> > +            memcpy(buf, src, res);
> > +        } else {
> > +            pmem_memcpy_nodrain(buf, src, res);
> > +        }
> 
> I see why you're doing it, but again I'm surprised it's ended up having
> to modify qemu-file.

Well, the current programming model of pmem requires users to take
special care of data persistence, so QEMU (as a user of pmem) has
to do that as well.
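The batching pattern the series relies on can be sketched like this (using the no-op fallbacks from include/qemu/pmem.h as stand-ins; with libpmem, the real calls use flushing stores plus one trailing fence):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins mirroring the stubs in include/qemu/pmem.h; the libpmem
 * versions of pmem_memcpy_nodrain()/pmem_drain() do the actual
 * persistence work. */
static inline void *pmem_memcpy_nodrain(void *dest, const void *src, size_t n)
{
    return memcpy(dest, src, n);
}

static inline void pmem_drain(void)
{
}

/* Many nodrain copies, one drain at the very end of the load: the single
 * final fence covers all preceding copies, avoiding a per-page sfence. */
static void load_pages_to_pmem(uint8_t *dst, const uint8_t *pages,
                               size_t npages, size_t page_size)
{
    for (size_t i = 0; i < npages; i++) {
        pmem_memcpy_nodrain(dst + i * page_size,
                            pages + i * page_size, page_size);
    }
    pmem_drain();
}
```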

> 
> >          qemu_file_skip(f, res);
> >          buf += res;
> >          pending -= res;
> > @@ -501,6 +499,29 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
> >      return done;
> >  }
> >  
> > +/*
> > + * Read 'size' bytes of data from the file into buf.
> > + * 'size' can be larger than the internal buffer.
> > + *
> > + * It will return size bytes unless there was an error, in which case it will
> > + * return as many as it managed to read (assuming blocking fd's which
> > + * all current QEMUFile are)
> > + */
> > +size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
> > +{
> > +    return qemu_get_buffer_common(f, buf, size, false);
> > +}
> > +
> > +/*
> > + * Mostly the same as qemu_get_buffer(), except that
> > + * 1) it's for the case that 'buf' is in the persistent memory, and
> > + * 2) it takes necessary operations to ensure the data persistence in 'buf'.
> > + */
> > +size_t qemu_get_buffer_to_pmem(QEMUFile *f, uint8_t *buf, size_t size)
> > +{
> > +    return qemu_get_buffer_common(f, buf, size, true);
> > +}
> > +
> >  /*
> >   * Read 'size' bytes of data from the file.
> >   * 'size' can be larger than the internal buffer.
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 5a0e503818..5a79bbff64 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -2950,7 +2950,11 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> >              break;
> >  
> >          case RAM_SAVE_FLAG_PAGE:
> > -            qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
> > +            if (!is_pmem) {
> > +                qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
> > +            } else {
> > +                qemu_get_buffer_to_pmem(f, host, TARGET_PAGE_SIZE);
> > +            }
> 
> can you just expose qemu_get_buffer_common instead of making it static
> and just pass in the is_pmem flag.

Or perhaps pass a memcpy-style function pointer, similar to your
comment on patch 5.
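A minimal sketch of what such a callback-based interface could look like (the names MemCpyFunc and get_buffer_with_copy are illustrative, not from the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical copy-callback type; NULL means plain memcpy(). */
typedef void *(*MemCpyFunc)(void *dest, const void *src, size_t len);

/* Copy 'size' bytes from 'src' into 'buf' using the caller's copy
 * routine, e.g. pmem_memcpy_nodrain() when 'buf' is on PMEM. */
static size_t get_buffer_with_copy(const void *src, void *buf, size_t size,
                                   MemCpyFunc copy)
{
    if (!copy) {
        copy = memcpy;  /* default path for ordinary RAM */
    }
    copy(buf, src, size);
    return size;
}
```

This keeps the qemu-file code agnostic of pmem: callers pick the copy routine instead of threading an is_pmem flag through every helper.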

Thanks,
Haozhong

> 
> Dave
> 
> >              break;
> >  
> >          case RAM_SAVE_FLAG_COMPRESS_PAGE:
> > -- 
> > 2.14.1
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
  2018-02-07 11:54   ` Dr. David Alan Gilbert
@ 2018-02-07 12:15     ` Haozhong Zhang
  2018-02-07 13:03       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07 12:15 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

On 02/07/18 11:54 +0000, Dr. David Alan Gilbert wrote:
> * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > When loading a compressed page to persistent memory, flush CPU cache
> > after the data is decompressed. Combined with a call to pmem_drain()
> > at the end of memory loading, we can guarantee those compressed pages
> > are persistently loaded to PMEM.
> 
> Can you explain why this can use the flush and doesn't need the special
> memset?

The best approach to ensure write persistence is to operate on pmem
entirely via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain().
However, the write to pmem in this case is performed by uncompress(),
which is implemented outside of QEMU and libpmem. It may or may not
use libpmem internally, and that is not under QEMU's control.
Therefore, we have to use the less optimal approach: flush the CPU
cache for all pmem addresses that uncompress() may have written
(e.g., via memcpy() and/or memset() inside uncompress()), i.e.,
pmem_flush() + pmem_drain() in QEMU.
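The shape of that flush-after-the-fact pattern is sketched below. It is an illustration only: third_party_write() stands in for zlib's uncompress(), and pmem_flush()/pmem_drain() are the patch's no-op fallbacks for builds without libpmem (the real versions flush cache lines and fence).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* No-op fallbacks mirroring the patch's qemu/pmem.h without
 * CONFIG_LIBPMEM. */
static inline void pmem_flush(const void *addr, size_t len)
{
    (void)addr; (void)len;
}
static inline void pmem_drain(void) { }

/* Stand-in for a third-party writer (like zlib's uncompress()) that
 * dirties the destination with ordinary stores, outside QEMU's control. */
static void third_party_write(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

static void load_page(void *host, const void *data, size_t len, int is_pmem)
{
    third_party_write(host, data, len);
    if (is_pmem) {
        pmem_flush(host, len);  /* flush whatever the writer dirtied */
    }
}
/* ...and once, after all pages are loaded: if any went to pmem,
 * pmem_drain(); */
```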

Haozhong

> 
> Dave
> 
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > ---
> >  include/qemu/pmem.h |  4 ++++
> >  migration/ram.c     | 16 +++++++++++-----
> >  2 files changed, 15 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > index 77ee1fc4eb..20e3f6e71d 100644
> > --- a/include/qemu/pmem.h
> > +++ b/include/qemu/pmem.h
> > @@ -37,6 +37,10 @@ static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> >      return memset(pmemdest, c, len);
> >  }
> >  
> > +static inline void pmem_flush(const void *addr, size_t len)
> > +{
> > +}
> > +
> >  static inline void pmem_drain(void)
> >  {
> >  }
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 5a79bbff64..924d2b9537 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -274,6 +274,7 @@ struct DecompressParam {
> >      void *des;
> >      uint8_t *compbuf;
> >      int len;
> > +    bool is_pmem;
> >  };
> >  typedef struct DecompressParam DecompressParam;
> >  
> > @@ -2502,7 +2503,7 @@ static void *do_data_decompress(void *opaque)
> >      DecompressParam *param = opaque;
> >      unsigned long pagesize;
> >      uint8_t *des;
> > -    int len;
> > +    int len, rc;
> >  
> >      qemu_mutex_lock(&param->mutex);
> >      while (!param->quit) {
> > @@ -2518,8 +2519,11 @@ static void *do_data_decompress(void *opaque)
> >               * not a problem because the dirty page will be retransferred
> >               * and uncompress() won't break the data in other pages.
> >               */
> > -            uncompress((Bytef *)des, &pagesize,
> > -                       (const Bytef *)param->compbuf, len);
> > +            rc = uncompress((Bytef *)des, &pagesize,
> > +                            (const Bytef *)param->compbuf, len);
> > +            if (rc == Z_OK && param->is_pmem) {
> > +                pmem_flush(des, len);
> > +            }
> >  
> >              qemu_mutex_lock(&decomp_done_lock);
> >              param->done = true;
> > @@ -2605,7 +2609,8 @@ static void compress_threads_load_cleanup(void)
> >  }
> >  
> >  static void decompress_data_with_multi_threads(QEMUFile *f,
> > -                                               void *host, int len)
> > +                                               void *host, int len,
> > +                                               bool is_pmem)
> >  {
> >      int idx, thread_count;
> >  
> > @@ -2619,6 +2624,7 @@ static void decompress_data_with_multi_threads(QEMUFile *f,
> >                  qemu_get_buffer(f, decomp_param[idx].compbuf, len);
> >                  decomp_param[idx].des = host;
> >                  decomp_param[idx].len = len;
> > +                decomp_param[idx].is_pmem = is_pmem;
> >                  qemu_cond_signal(&decomp_param[idx].cond);
> >                  qemu_mutex_unlock(&decomp_param[idx].mutex);
> >                  break;
> > @@ -2964,7 +2970,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> >                  ret = -EINVAL;
> >                  break;
> >              }
> > -            decompress_data_with_multi_threads(f, host, len);
> > +            decompress_data_with_multi_threads(f, host, len, is_pmem);
> >              break;
> >  
> >          case RAM_SAVE_FLAG_XBZRLE:
> > -- 
> > 2.14.1
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  2018-02-07 11:52     ` Haozhong Zhang
@ 2018-02-07 12:51       ` Haozhong Zhang
  2018-02-07 12:59         ` Dr. David Alan Gilbert
  2018-02-07 14:10         ` Pankaj Gupta
  2018-02-07 12:56       ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07 12:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, qemu-devel, Eduardo Habkost,
	Igor Mammedov, Paolo Bonzini, mst, Xiao Guangrong, Juan Quintela,
	Stefan Hajnoczi, Dan Williams

On 02/07/18 19:52 +0800, Haozhong Zhang wrote:
> On 02/07/18 11:38 +0000, Dr. David Alan Gilbert wrote:
> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > > When loading a zero page, check whether it will be loaded to
> > > persistent memory If yes, load it by libpmem function
> > > pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> > > end of RAM loading, we can guarantee all those zero pages are
> > > persistently loaded.
> > 
> > I'm surprised pmem is this invasive to be honest;   I hadn't expected
> > the need for special memcpy's etc everywhere.  We're bound to miss some.
> > I assume there's separate kernel work needed to make postcopy work;
> > certainly the patches here don't seem to touch the postcopy path.
> 
> This link at
> https://wiki.qemu.org/Features/PostCopyLiveMigration#Conflicts shows
> that postcopy with memory-backend-file requires kernel support. Can
> you point me the details of the required kernel support, so that I can
> understand what would be needed to NVDIMM postcopy?

I saw that test_ramblock_postcopiable() requires the ram block not be
mmap'ed with MAP_SHARED. However, the only pmem device that can be
safely used as the backend of vNVDIMM (i.e., device DAX, e.g.,
/dev/dax0.0) must be mapped shared, which is required by the kernel,
so postcopy does not work with pmem right now. Even if private mmap
were supported for device DAX, it would still make little sense to map
it privately in QEMU, because vNVDIMM is intended to be non-volatile.
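A rough illustration of why device-DAX backends fall foul of that check. This is a simplified stand-in for what test_ramblock_postcopiable() rejects, not QEMU's actual code, which inspects the RAMBlock flags rather than raw mmap flags:

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Postcopy's userfaultfd path currently requires a private mapping,
 * but device DAX can only be mapped MAP_SHARED. */
static bool ramblock_postcopiable(int mmap_flags)
{
    return !(mmap_flags & MAP_SHARED);
}
```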

Haozhong

* Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  2018-02-07 11:52     ` Haozhong Zhang
  2018-02-07 12:51       ` Haozhong Zhang
@ 2018-02-07 12:56       ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 34+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-07 12:56 UTC (permalink / raw)
  To: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

* Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> On 02/07/18 11:38 +0000, Dr. David Alan Gilbert wrote:
> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > > When loading a zero page, check whether it will be loaded to
> > > persistent memory If yes, load it by libpmem function
> > > pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> > > end of RAM loading, we can guarantee all those zero pages are
> > > persistently loaded.
> > 
> > I'm surprised pmem is this invasive to be honest;   I hadn't expected
> > the need for special memcpy's etc everywhere.  We're bound to miss some.
> > I assume there's separate kernel work needed to make postcopy work;
> > certainly the patches here don't seem to touch the postcopy path.
> 
> This link at
> https://wiki.qemu.org/Features/PostCopyLiveMigration#Conflicts shows
> that postcopy with memory-backend-file requires kernel support. Can
> you point me the details of the required kernel support, so that I can
> understand what would be needed to NVDIMM postcopy?

I can't, but ask Andrea Arcangeli ( aarcange@redhat.com ); he wrote the
userfault kernel code.  Note that we have a mechanism for atomically
placing a page into memory, so that might also need modifications for
pmem; again check with Andrea.

Dave

> > 
> > > Depending on the host HW/SW configurations, pmem_drain() can be
> > > "sfence".  Therefore, we do not call pmem_drain() after each
> > > pmem_memset_nodrain(), or use pmem_memset_persist() (equally
> > > pmem_memset_nodrain() + pmem_drain()), in order to avoid unnecessary
> > > overhead.
> > > 
> > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > > ---
> > >  include/qemu/pmem.h |  9 +++++++++
> > >  migration/ram.c     | 34 +++++++++++++++++++++++++++++-----
> > >  2 files changed, 38 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > > index 9017596ff0..861d8ecc21 100644
> > > --- a/include/qemu/pmem.h
> > > +++ b/include/qemu/pmem.h
> > > @@ -26,6 +26,15 @@ pmem_memcpy_persist(void *pmemdest, const void *src, size_t len)
> > >      return memcpy(pmemdest, src, len);
> > >  }
> > >  
> > > +static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> > > +{
> > > +    return memset(pmemdest, c, len);
> > > +}
> > > +
> > > +static inline void pmem_drain(void)
> > > +{
> > > +}
> > > +
> > >  #endif /* CONFIG_LIBPMEM */
> > >  
> > >  #endif /* !QEMU_PMEM_H */
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index cb1950f3eb..5a0e503818 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -49,6 +49,7 @@
> > >  #include "qemu/rcu_queue.h"
> > >  #include "migration/colo.h"
> > >  #include "migration/block.h"
> > > +#include "qemu/pmem.h"
> > >  
> > >  /***********************************************************/
> > >  /* ram save/restore */
> > > @@ -2467,6 +2468,20 @@ static inline void *host_from_ram_block_offset(RAMBlock *block,
> > >      return block->host + offset;
> > >  }
> > >  
> > > +static void ram_handle_compressed_common(void *host, uint8_t ch, uint64_t size,
> > > +                                         bool is_pmem)
> > 
> > I don't think there's any advantage of splitting out this _common
> > routine; lets just add the parameter to ram_handle_compressed.
> > 
> > > +{
> > > +    if (!ch && is_zero_range(host, size)) {
> > > +        return;
> > > +    }
> > > +
> > > +    if (!is_pmem) {
> > > +        memset(host, ch, size);
> > > +    } else {
> > > +        pmem_memset_nodrain(host, ch, size);
> > > +    }
> > 
> > I'm wondering if it would be easier to pass in a memsetfunc ptr and call
> > that (defualting to memset if it's NULL).
> 
> Yes, it would be more extensible if we have other special memory
> devices in the future.
> 
> Thank,
> Haozhong
> 
> > 
> > > +}
> > > +
> > >  /**
> > >   * ram_handle_compressed: handle the zero page case
> > >   *
> > > @@ -2479,9 +2494,7 @@ static inline void *host_from_ram_block_offset(RAMBlock *block,
> > >   */
> > >  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
> > >  {
> > > -    if (ch != 0 || !is_zero_range(host, size)) {
> > > -        memset(host, ch, size);
> > > -    }
> > > +    return ram_handle_compressed_common(host, ch, size, false);
> > >  }
> > >  
> > >  static void *do_data_decompress(void *opaque)
> > > @@ -2823,6 +2836,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >      bool postcopy_running = postcopy_is_running();
> > >      /* ADVISE is earlier, it shows the source has the postcopy capability on */
> > >      bool postcopy_advised = postcopy_is_advised();
> > > +    bool need_pmem_drain = false;
> > >  
> > >      seq_iter++;
> > >  
> > > @@ -2848,6 +2862,8 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >          ram_addr_t addr, total_ram_bytes;
> > >          void *host = NULL;
> > >          uint8_t ch;
> > > +        RAMBlock *block = NULL;
> > > +        bool is_pmem = false;
> > >  
> > >          addr = qemu_get_be64(f);
> > >          flags = addr & ~TARGET_PAGE_MASK;
> > > @@ -2864,7 +2880,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >  
> > >          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
> > >                       RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
> > > -            RAMBlock *block = ram_block_from_stream(f, flags);
> > > +            block = ram_block_from_stream(f, flags);
> > >  
> > >              host = host_from_ram_block_offset(block, addr);
> > >              if (!host) {
> > > @@ -2874,6 +2890,9 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >              }
> > >              ramblock_recv_bitmap_set(block, host);
> > >              trace_ram_load_loop(block->idstr, (uint64_t)addr, flags, host);
> > > +
> > > +            is_pmem = ramblock_is_pmem(block);
> > > +            need_pmem_drain = need_pmem_drain || is_pmem;
> > >          }
> > >  
> > >          switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
> > > @@ -2927,7 +2946,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >  
> > >          case RAM_SAVE_FLAG_ZERO:
> > >              ch = qemu_get_byte(f);
> > > -            ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
> > > +            ram_handle_compressed_common(host, ch, TARGET_PAGE_SIZE, is_pmem);
> > >              break;
> > >  
> > >          case RAM_SAVE_FLAG_PAGE:
> > > @@ -2970,6 +2989,11 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >      }
> > >  
> > >      wait_for_decompress_done();
> > > +
> > > +    if (need_pmem_drain) {
> > > +        pmem_drain();
> > > +    }
> > > +
> > >      rcu_read_unlock();
> > >      trace_ram_load_complete(ret, seq_iter);
> > >      return ret;
> > > -- 
> > > 2.14.1
> > 
> > Dave
> > 
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  2018-02-07 12:51       ` Haozhong Zhang
@ 2018-02-07 12:59         ` Dr. David Alan Gilbert
  2018-02-07 14:10         ` Pankaj Gupta
  1 sibling, 0 replies; 34+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-07 12:59 UTC (permalink / raw)
  To: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

* Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> On 02/07/18 19:52 +0800, Haozhong Zhang wrote:
> > On 02/07/18 11:38 +0000, Dr. David Alan Gilbert wrote:
> > > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > > > When loading a zero page, check whether it will be loaded to
> > > > persistent memory If yes, load it by libpmem function
> > > > pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> > > > end of RAM loading, we can guarantee all those zero pages are
> > > > persistently loaded.
> > > 
> > > I'm surprised pmem is this invasive to be honest;   I hadn't expected
> > > the need for special memcpy's etc everywhere.  We're bound to miss some.
> > > I assume there's separate kernel work needed to make postcopy work;
> > > certainly the patches here don't seem to touch the postcopy path.
> > 
> > This link at
> > https://wiki.qemu.org/Features/PostCopyLiveMigration#Conflicts shows
> > that postcopy with memory-backend-file requires kernel support. Can
> > you point me the details of the required kernel support, so that I can
> > understand what would be needed to NVDIMM postcopy?
> 
> I saw test_ramblock_postcopiable() requires the ram block not be
> mmap'ed with MAP_SHARED. The only pmem device (i.e., device DAX e.g.,
> /dev/dax0.0) that can be safely used as the backend of vNVDIMM must be
> shared mapped which is required by kernel, so postcopy does not work
> with pmem right now. Even if the private mmap was supported for device
> dax, it would still make little sense to private mmap it in QEMU,
> because vNVDIMM intends to be non-volatile.

I've got patches which make RAM blocks shareable for vhost-user at the
moment (I've posted a few versions and I'm working on an updated set).
However, it's probably more than just shareability that needs thinking
about for a device like that.

Dave

> Haozhong
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
  2018-02-07 12:15     ` Haozhong Zhang
@ 2018-02-07 13:03       ` Dr. David Alan Gilbert
  2018-02-07 13:20         ` Haozhong Zhang
  0 siblings, 1 reply; 34+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-07 13:03 UTC (permalink / raw)
  To: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

* Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> On 02/07/18 11:54 +0000, Dr. David Alan Gilbert wrote:
> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > > When loading a compressed page to persistent memory, flush CPU cache
> > > after the data is decompressed. Combined with a call to pmem_drain()
> > > at the end of memory loading, we can guarantee those compressed pages
> > > are persistently loaded to PMEM.
> > 
> > Can you explain why this can use the flush and doesn't need the special
> > memset?
> 
> The best approach to ensure the write persistence is to operate pmem
> all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
> the write to pmem in this case is performed by uncompress() which is
> implemented out of QEMU and libpmem. It may or may not use libpmem,
> which is not controlled by QEMU. Therefore, we have to use the less
> optimal approach, that is to flush cache for all pmem addresses that
> uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in
> uncompress(), and pmem_flush() + pmem_drain() in QEMU.

In what way is it less optimal?
If that's a legal thing to do, then why not just do a pmem_flush +
pmem_drain right at the end of the ram loading and leave all the rest of
the code untouched?

Dave

> Haozhong
> 
> > 
> > Dave
> > 
> > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > > ---
> > >  include/qemu/pmem.h |  4 ++++
> > >  migration/ram.c     | 16 +++++++++++-----
> > >  2 files changed, 15 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > > index 77ee1fc4eb..20e3f6e71d 100644
> > > --- a/include/qemu/pmem.h
> > > +++ b/include/qemu/pmem.h
> > > @@ -37,6 +37,10 @@ static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> > >      return memset(pmemdest, c, len);
> > >  }
> > >  
> > > +static inline void pmem_flush(const void *addr, size_t len)
> > > +{
> > > +}
> > > +
> > >  static inline void pmem_drain(void)
> > >  {
> > >  }
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index 5a79bbff64..924d2b9537 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -274,6 +274,7 @@ struct DecompressParam {
> > >      void *des;
> > >      uint8_t *compbuf;
> > >      int len;
> > > +    bool is_pmem;
> > >  };
> > >  typedef struct DecompressParam DecompressParam;
> > >  
> > > @@ -2502,7 +2503,7 @@ static void *do_data_decompress(void *opaque)
> > >      DecompressParam *param = opaque;
> > >      unsigned long pagesize;
> > >      uint8_t *des;
> > > -    int len;
> > > +    int len, rc;
> > >  
> > >      qemu_mutex_lock(&param->mutex);
> > >      while (!param->quit) {
> > > @@ -2518,8 +2519,11 @@ static void *do_data_decompress(void *opaque)
> > >               * not a problem because the dirty page will be retransferred
> > >               * and uncompress() won't break the data in other pages.
> > >               */
> > > -            uncompress((Bytef *)des, &pagesize,
> > > -                       (const Bytef *)param->compbuf, len);
> > > +            rc = uncompress((Bytef *)des, &pagesize,
> > > +                            (const Bytef *)param->compbuf, len);
> > > +            if (rc == Z_OK && param->is_pmem) {
> > > +                pmem_flush(des, len);
> > > +            }
> > >  
> > >              qemu_mutex_lock(&decomp_done_lock);
> > >              param->done = true;
> > > @@ -2605,7 +2609,8 @@ static void compress_threads_load_cleanup(void)
> > >  }
> > >  
> > >  static void decompress_data_with_multi_threads(QEMUFile *f,
> > > -                                               void *host, int len)
> > > +                                               void *host, int len,
> > > +                                               bool is_pmem)
> > >  {
> > >      int idx, thread_count;
> > >  
> > > @@ -2619,6 +2624,7 @@ static void decompress_data_with_multi_threads(QEMUFile *f,
> > >                  qemu_get_buffer(f, decomp_param[idx].compbuf, len);
> > >                  decomp_param[idx].des = host;
> > >                  decomp_param[idx].len = len;
> > > +                decomp_param[idx].is_pmem = is_pmem;
> > >                  qemu_cond_signal(&decomp_param[idx].cond);
> > >                  qemu_mutex_unlock(&decomp_param[idx].mutex);
> > >                  break;
> > > @@ -2964,7 +2970,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > >                  ret = -EINVAL;
> > >                  break;
> > >              }
> > > -            decompress_data_with_multi_threads(f, host, len);
> > > +            decompress_data_with_multi_threads(f, host, len, is_pmem);
> > >              break;
> > >  
> > >          case RAM_SAVE_FLAG_XBZRLE:
> > > -- 
> > > 2.14.1
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH v2 8/8] migration/ram: ensure write persistence on loading xbzrle pages to PMEM
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 8/8] migration/ram: ensure write persistence on loading xbzrle " Haozhong Zhang
@ 2018-02-07 13:08   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 34+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-07 13:08 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

* Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> When loading a xbzrle encoded page to persistent memory, load the data
> via libpmem function pmem_memcpy_nodrain() instead of memcpy().
> Combined with a call to pmem_drain() at the end of memory loading, we
> can guarantee those xbzrle encoded pages are persistently loaded to PMEM.
> 
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  migration/ram.c    | 15 ++++++++++-----
>  migration/xbzrle.c | 20 ++++++++++++++++++--
>  migration/xbzrle.h |  1 +
>  3 files changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 924d2b9537..87f977617d 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2388,10 +2388,10 @@ static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
>      }
>  }
>  
> -static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
> +static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host, bool is_pmem)
>  {
>      unsigned int xh_len;
> -    int xh_flags;
> +    int xh_flags, rc;
>      uint8_t *loaded_data;
>  
>      /* extract RLE header */
> @@ -2413,8 +2413,13 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
>      qemu_get_buffer_in_place(f, &loaded_data, xh_len);
>  
>      /* decode RLE */
> -    if (xbzrle_decode_buffer(loaded_data, xh_len, host,
> -                             TARGET_PAGE_SIZE) == -1) {
> +    if (!is_pmem) {
> +        rc = xbzrle_decode_buffer(loaded_data, xh_len, host, TARGET_PAGE_SIZE);
> +    } else {
> +        rc = xbzrle_decode_buffer_to_pmem(loaded_data, xh_len, host,
> +                                          TARGET_PAGE_SIZE);
> +    }
> +    if (rc == -1) {
>          error_report("Failed to load XBZRLE page - decode error!");
>          return -1;
>      }
> @@ -2974,7 +2979,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>              break;
>  
>          case RAM_SAVE_FLAG_XBZRLE:
> -            if (load_xbzrle(f, addr, host) < 0) {
> +            if (load_xbzrle(f, addr, host, is_pmem) < 0) {
>                  error_report("Failed to decompress XBZRLE page at "
>                               RAM_ADDR_FMT, addr);
>                  ret = -EINVAL;
> diff --git a/migration/xbzrle.c b/migration/xbzrle.c
> index 1ba482ded9..499d8e1bfb 100644
> --- a/migration/xbzrle.c
> +++ b/migration/xbzrle.c
> @@ -12,6 +12,7 @@
>   */
>  #include "qemu/osdep.h"
>  #include "qemu/cutils.h"
> +#include "qemu/pmem.h"
>  #include "xbzrle.h"
>  
>  /*
> @@ -126,7 +127,8 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, int slen,
>      return d;
>  }
>  
> -int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
> +static int xbzrle_decode_buffer_common(uint8_t *src, int slen, uint8_t *dst,
> +                                       int dlen, bool is_pmem)
>  {
>      int i = 0, d = 0;
>      int ret;
> @@ -167,10 +169,24 @@ int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
>              return -1;
>          }
>  
> -        memcpy(dst + d, src + i, count);
> +        if (!is_pmem) {
> +            memcpy(dst + d, src + i, count);
> +        } else {
> +            pmem_memcpy_nodrain(dst + d, src + i, count);
> +        }
>          d += count;
>          i += count;
>      }
>  
>      return d;
>  }
> +
> +int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
> +{
> +    return xbzrle_decode_buffer_common(src, slen, dst, dlen, false);
> +}
> +
> +int xbzrle_decode_buffer_to_pmem(uint8_t *src, int slen, uint8_t *dst, int dlen)
> +{
> +    return xbzrle_decode_buffer_common(src, slen, dst, dlen, true);
> +}

Again I wouldn't bother with these two wrappers, just add the is_pmem
flag or pointer to xbzrle_decode_buffer.  Note that in xbzrle load this
is doing lots of small copies; I'm not sure if this changes your needs.

Dave

> diff --git a/migration/xbzrle.h b/migration/xbzrle.h
> index a0db507b9c..ac5ae32666 100644
> --- a/migration/xbzrle.h
> +++ b/migration/xbzrle.h
> @@ -18,4 +18,5 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, int slen,
>                           uint8_t *dst, int dlen);
>  
>  int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen);
> +int xbzrle_decode_buffer_to_pmem(uint8_t *src, int slen, uint8_t *dst, int dlen);
>  #endif
> -- 
> 2.14.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
  2018-02-07 13:03       ` Dr. David Alan Gilbert
@ 2018-02-07 13:20         ` Haozhong Zhang
  2018-02-07 13:24           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-07 13:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

On 02/07/18 13:03 +0000, Dr. David Alan Gilbert wrote:
> * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > On 02/07/18 11:54 +0000, Dr. David Alan Gilbert wrote:
> > > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > > > When loading a compressed page to persistent memory, flush CPU cache
> > > > after the data is decompressed. Combined with a call to pmem_drain()
> > > > at the end of memory loading, we can guarantee those compressed pages
> > > > are persistently loaded to PMEM.
> > > 
> > > Can you explain why this can use the flush and doesn't need the special
> > > memset?
> > 
> > The best approach to ensure the write persistence is to operate pmem
> > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
> > the write to pmem in this case is performed by uncompress() which is
> > implemented out of QEMU and libpmem. It may or may not use libpmem,
> > which is not controlled by QEMU. Therefore, we have to use the less
> > optimal approach, that is to flush cache for all pmem addresses that
> > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in
> > uncompress(), and pmem_flush() + pmem_drain() in QEMU.
> 
> In what way is it less optimal?
> If that's a legal thing to do, then why not just do a pmem_flush +
> pmem_drain right at the end of the ram loading and leave all the rest of
> the code untouched?

For example, the implementation of pmem_memcpy_nodrain() prefers to
use movnt (non-temporal store) instructions, which need no cache
flush, to write pmem when those instructions are available, and falls
back to memcpy() + flush when they are not, so I suppose the latter is
less optimal.
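Conceptually, the trade-off looks like the sketch below. This is an illustration of the two paths, not libpmem's actual implementation; plain memcpy() stands in for the movnt loop, and flush_cache_range() is a hypothetical stub for the clflush/clwb-style flushing the fallback path has to pay for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool cpu_has_movnt = false;  /* libpmem detects this at runtime */

/* Hypothetical stub; the real fallback flushes the destination's cache
 * lines (clflush/clflushopt/clwb). */
static void flush_cache_range(const void *addr, size_t len)
{
    (void)addr; (void)len;
}

static void *memcpy_nodrain_sketch(void *dst, const void *src, size_t len)
{
    if (cpu_has_movnt) {
        /* Non-temporal stores bypass the cache, so no per-line flush is
         * needed -- only one final drain (sfence) elsewhere.  (memcpy()
         * stands in for the movnt loop here.) */
        return memcpy(dst, src, len);
    }
    /* Fallback: ordinary stores plus an explicit cache flush of the
     * destination -- the extra flush is what makes this path less
     * optimal. */
    memcpy(dst, src, len);
    flush_cache_range(dst, len);
    return dst;
}
```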

Haozhong

> 
> Dave
> 
> > Haozhong
> > 
> > > 
> > > Dave
> > > 
> > > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > > > ---
> > > >  include/qemu/pmem.h |  4 ++++
> > > >  migration/ram.c     | 16 +++++++++++-----
> > > >  2 files changed, 15 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > > > index 77ee1fc4eb..20e3f6e71d 100644
> > > > --- a/include/qemu/pmem.h
> > > > +++ b/include/qemu/pmem.h
> > > > @@ -37,6 +37,10 @@ static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> > > >      return memset(pmemdest, c, len);
> > > >  }
> > > >  
> > > > +static inline void pmem_flush(const void *addr, size_t len)
> > > > +{
> > > > +}
> > > > +
> > > >  static inline void pmem_drain(void)
> > > >  {
> > > >  }
> > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > index 5a79bbff64..924d2b9537 100644
> > > > --- a/migration/ram.c
> > > > +++ b/migration/ram.c
> > > > @@ -274,6 +274,7 @@ struct DecompressParam {
> > > >      void *des;
> > > >      uint8_t *compbuf;
> > > >      int len;
> > > > +    bool is_pmem;
> > > >  };
> > > >  typedef struct DecompressParam DecompressParam;
> > > >  
> > > > @@ -2502,7 +2503,7 @@ static void *do_data_decompress(void *opaque)
> > > >      DecompressParam *param = opaque;
> > > >      unsigned long pagesize;
> > > >      uint8_t *des;
> > > > -    int len;
> > > > +    int len, rc;
> > > >  
> > > >      qemu_mutex_lock(&param->mutex);
> > > >      while (!param->quit) {
> > > > @@ -2518,8 +2519,11 @@ static void *do_data_decompress(void *opaque)
> > > >               * not a problem because the dirty page will be retransferred
> > > >               * and uncompress() won't break the data in other pages.
> > > >               */
> > > > -            uncompress((Bytef *)des, &pagesize,
> > > > -                       (const Bytef *)param->compbuf, len);
> > > > +            rc = uncompress((Bytef *)des, &pagesize,
> > > > +                            (const Bytef *)param->compbuf, len);
> > > > +            if (rc == Z_OK && param->is_pmem) {
> > > > +                pmem_flush(des, len);
> > > > +            }
> > > >  
> > > >              qemu_mutex_lock(&decomp_done_lock);
> > > >              param->done = true;
> > > > @@ -2605,7 +2609,8 @@ static void compress_threads_load_cleanup(void)
> > > >  }
> > > >  
> > > >  static void decompress_data_with_multi_threads(QEMUFile *f,
> > > > -                                               void *host, int len)
> > > > +                                               void *host, int len,
> > > > +                                               bool is_pmem)
> > > >  {
> > > >      int idx, thread_count;
> > > >  
> > > > @@ -2619,6 +2624,7 @@ static void decompress_data_with_multi_threads(QEMUFile *f,
> > > >                  qemu_get_buffer(f, decomp_param[idx].compbuf, len);
> > > >                  decomp_param[idx].des = host;
> > > >                  decomp_param[idx].len = len;
> > > > +                decomp_param[idx].is_pmem = is_pmem;
> > > >                  qemu_cond_signal(&decomp_param[idx].cond);
> > > >                  qemu_mutex_unlock(&decomp_param[idx].mutex);
> > > >                  break;
> > > > @@ -2964,7 +2970,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > > >                  ret = -EINVAL;
> > > >                  break;
> > > >              }
> > > > -            decompress_data_with_multi_threads(f, host, len);
> > > > +            decompress_data_with_multi_threads(f, host, len, is_pmem);
> > > >              break;
> > > >  
> > > >          case RAM_SAVE_FLAG_XBZRLE:
> > > > -- 
> > > > 2.14.1
> > > > 
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
  2018-02-07 13:20         ` Haozhong Zhang
@ 2018-02-07 13:24           ` Dr. David Alan Gilbert
  2018-02-07 18:05             ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-07 13:24 UTC (permalink / raw)
  To: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, Stefan Hajnoczi, Dan Williams

* Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> On 02/07/18 13:03 +0000, Dr. David Alan Gilbert wrote:
> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > > On 02/07/18 11:54 +0000, Dr. David Alan Gilbert wrote:
> > > > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > > > > When loading a compressed page to persistent memory, flush CPU cache
> > > > > after the data is decompressed. Combined with a call to pmem_drain()
> > > > > at the end of memory loading, we can guarantee those compressed pages
> > > > > are persistently loaded to PMEM.
> > > > 
> > > > Can you explain why this can use the flush and doesn't need the special
> > > > memset?
> > > 
> > > The best approach to ensure the write persistence is to operate pmem
> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
> > > the write to pmem in this case is performed by uncompress() which is
> > > implemented out of QEMU and libpmem. It may or may not use libpmem,
> > > which is not controlled by QEMU. Therefore, we have to use the less
> > > optimal approach, that is to flush cache for all pmem addresses that
> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in
> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU.
> > 
> > In what way is it less optimal?
> > If that's a legal thing to do, then why not just do a pmem_flush +
> > pmem_drain right at the end of the ram loading and leave all the rest of
> > the code untouched?
> 
> For example, the implementation pmem_memcpy_nodrain() prefers to use
> movnt instructions w/o flush to write pmem if those instructions are
> available, and falls back to memcpy() + flush if movnt are not
> available, so I suppose the latter is less optimal.

But if you use normal memcpy calls to copy a few GB of RAM in an
incoming migrate and then do a single flush at the end, isn't that
better?

Dave

> Haozhong
> 
> > 
> > Dave
> > 
> > > Haozhong
> > > 
> > > > 
> > > > Dave
> > > > 
> > > > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > > > > ---
> > > > >  include/qemu/pmem.h |  4 ++++
> > > > >  migration/ram.c     | 16 +++++++++++-----
> > > > >  2 files changed, 15 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > > > > index 77ee1fc4eb..20e3f6e71d 100644
> > > > > --- a/include/qemu/pmem.h
> > > > > +++ b/include/qemu/pmem.h
> > > > > @@ -37,6 +37,10 @@ static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> > > > >      return memset(pmemdest, c, len);
> > > > >  }
> > > > >  
> > > > > +static inline void pmem_flush(const void *addr, size_t len)
> > > > > +{
> > > > > +}
> > > > > +
> > > > >  static inline void pmem_drain(void)
> > > > >  {
> > > > >  }
> > > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > > index 5a79bbff64..924d2b9537 100644
> > > > > --- a/migration/ram.c
> > > > > +++ b/migration/ram.c
> > > > > @@ -274,6 +274,7 @@ struct DecompressParam {
> > > > >      void *des;
> > > > >      uint8_t *compbuf;
> > > > >      int len;
> > > > > +    bool is_pmem;
> > > > >  };
> > > > >  typedef struct DecompressParam DecompressParam;
> > > > >  
> > > > > @@ -2502,7 +2503,7 @@ static void *do_data_decompress(void *opaque)
> > > > >      DecompressParam *param = opaque;
> > > > >      unsigned long pagesize;
> > > > >      uint8_t *des;
> > > > > -    int len;
> > > > > +    int len, rc;
> > > > >  
> > > > >      qemu_mutex_lock(&param->mutex);
> > > > >      while (!param->quit) {
> > > > > @@ -2518,8 +2519,11 @@ static void *do_data_decompress(void *opaque)
> > > > >               * not a problem because the dirty page will be retransferred
> > > > >               * and uncompress() won't break the data in other pages.
> > > > >               */
> > > > > -            uncompress((Bytef *)des, &pagesize,
> > > > > -                       (const Bytef *)param->compbuf, len);
> > > > > +            rc = uncompress((Bytef *)des, &pagesize,
> > > > > +                            (const Bytef *)param->compbuf, len);
> > > > > +            if (rc == Z_OK && param->is_pmem) {
> > > > > +                pmem_flush(des, len);
> > > > > +            }
> > > > >  
> > > > >              qemu_mutex_lock(&decomp_done_lock);
> > > > >              param->done = true;
> > > > > @@ -2605,7 +2609,8 @@ static void compress_threads_load_cleanup(void)
> > > > >  }
> > > > >  
> > > > >  static void decompress_data_with_multi_threads(QEMUFile *f,
> > > > > -                                               void *host, int len)
> > > > > +                                               void *host, int len,
> > > > > +                                               bool is_pmem)
> > > > >  {
> > > > >      int idx, thread_count;
> > > > >  
> > > > > @@ -2619,6 +2624,7 @@ static void decompress_data_with_multi_threads(QEMUFile *f,
> > > > >                  qemu_get_buffer(f, decomp_param[idx].compbuf, len);
> > > > >                  decomp_param[idx].des = host;
> > > > >                  decomp_param[idx].len = len;
> > > > > +                decomp_param[idx].is_pmem = is_pmem;
> > > > >                  qemu_cond_signal(&decomp_param[idx].cond);
> > > > >                  qemu_mutex_unlock(&decomp_param[idx].mutex);
> > > > >                  break;
> > > > > @@ -2964,7 +2970,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> > > > >                  ret = -EINVAL;
> > > > >                  break;
> > > > >              }
> > > > > -            decompress_data_with_multi_threads(f, host, len);
> > > > > +            decompress_data_with_multi_threads(f, host, len, is_pmem);
> > > > >              break;
> > > > >  
> > > > >          case RAM_SAVE_FLAG_XBZRLE:
> > > > > -- 
> > > > > 2.14.1
> > > > > 
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  2018-02-07 12:51       ` Haozhong Zhang
  2018-02-07 12:59         ` Dr. David Alan Gilbert
@ 2018-02-07 14:10         ` Pankaj Gupta
  1 sibling, 0 replies; 34+ messages in thread
From: Pankaj Gupta @ 2018-02-07 14:10 UTC (permalink / raw)
  To: Haozhong Zhang, Andrea Arcangeli
  Cc: Dr. David Alan Gilbert, qemu-devel, Eduardo Habkost,
	Igor Mammedov, Paolo Bonzini, mst, Xiao Guangrong, Juan Quintela,
	Stefan Hajnoczi, Dan Williams


> 
> On 02/07/18 19:52 +0800, Haozhong Zhang wrote:
> > On 02/07/18 11:38 +0000, Dr. David Alan Gilbert wrote:
> > > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> > > > When loading a zero page, check whether it will be loaded to
> > > > persistent memory If yes, load it by libpmem function
> > > > pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> > > > end of RAM loading, we can guarantee all those zero pages are
> > > > persistently loaded.
> > > 
> > > I'm surprised pmem is this invasive to be honest;   I hadn't expected
> > > the need for special memcpy's etc everywhere.  We're bound to miss some.
> > > I assume there's separate kernel work needed to make postcopy work;
> > > certainly the patches here don't seem to touch the postcopy path.
> > 
> > This link at
> > https://wiki.qemu.org/Features/PostCopyLiveMigration#Conflicts shows
> > that postcopy with memory-backend-file requires kernel support. Can
> > you point me the details of the required kernel support, so that I can
> > understand what would be needed to NVDIMM postcopy?
> 
> I saw test_ramblock_postcopiable() requires the ram block not be
> mmap'ed with MAP_SHARED. The only pmem device (i.e., device DAX e.g.,
> /dev/dax0.0) that can be safely used as the backend of vNVDIMM must be
> shared mapped which is required by kernel, so postcopy does not work
> with pmem right now. Even if the private mmap was supported for device
> dax, it would still make little sense to private mmap it in QEMU,
> because vNVDIMM intends to be non-volatile.

OK, that's the reason postcopy does not work with THP pages.

If we want to support device DAX with postcopy, do we need to handle
this use case?

Adding Andrea to the thread for a clearer answer.


Thanks,
Pankaj


* Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
  2018-02-07 13:24           ` Dr. David Alan Gilbert
@ 2018-02-07 18:05             ` Dan Williams
  2018-02-07 18:08               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2018-02-07 18:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Qemu Developers, Eduardo Habkost, Igor Mammedov, Paolo Bonzini,
	Michael S. Tsirkin, Xiao Guangrong, Juan Quintela,
	Stefan Hajnoczi

On Wed, Feb 7, 2018 at 5:24 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
>> On 02/07/18 13:03 +0000, Dr. David Alan Gilbert wrote:
>> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
>> > > On 02/07/18 11:54 +0000, Dr. David Alan Gilbert wrote:
>> > > > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
>> > > > > When loading a compressed page to persistent memory, flush CPU cache
>> > > > > after the data is decompressed. Combined with a call to pmem_drain()
>> > > > > at the end of memory loading, we can guarantee those compressed pages
>> > > > > are persistently loaded to PMEM.
>> > > >
>> > > > Can you explain why this can use the flush and doesn't need the special
>> > > > memset?
>> > >
>> > > The best approach to ensure the write persistence is to operate pmem
>> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
>> > > the write to pmem in this case is performed by uncompress() which is
>> > > implemented out of QEMU and libpmem. It may or may not use libpmem,
>> > > which is not controlled by QEMU. Therefore, we have to use the less
>> > > optimal approach, that is to flush cache for all pmem addresses that
>> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in
>> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU.
>> >
>> > In what way is it less optimal?
>> > If that's a legal thing to do, then why not just do a pmem_flush +
>> > pmem_drain right at the end of the ram loading and leave all the rest of
>> > the code untouched?
>>
>> For example, the implementation pmem_memcpy_nodrain() prefers to use
>> movnt instructions w/o flush to write pmem if those instructions are
>> available, and falls back to memcpy() + flush if movnt are not
>> available, so I suppose the latter is less optimal.
>
> But if you use normal memcpy calls to copy a few GB of RAM in an
> incoming migrate and then do a single flush at the end, isn't that
> better?

Not really, because now you've needlessly polluted the cache and are
spending CPU looping over the cachelines that could have been bypassed
with movnt.


* Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
  2018-02-07 18:05             ` Dan Williams
@ 2018-02-07 18:08               ` Dr. David Alan Gilbert
  2018-02-07 18:31                 ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-07 18:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Qemu Developers, Eduardo Habkost, Igor Mammedov, Paolo Bonzini,
	Michael S. Tsirkin, Xiao Guangrong, Juan Quintela,
	Stefan Hajnoczi

* Dan Williams (dan.j.williams@intel.com) wrote:
> On Wed, Feb 7, 2018 at 5:24 AM, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> >> On 02/07/18 13:03 +0000, Dr. David Alan Gilbert wrote:
> >> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> >> > > On 02/07/18 11:54 +0000, Dr. David Alan Gilbert wrote:
> >> > > > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> >> > > > > When loading a compressed page to persistent memory, flush CPU cache
> >> > > > > after the data is decompressed. Combined with a call to pmem_drain()
> >> > > > > at the end of memory loading, we can guarantee those compressed pages
> >> > > > > are persistently loaded to PMEM.
> >> > > >
> >> > > > Can you explain why this can use the flush and doesn't need the special
> >> > > > memset?
> >> > >
> >> > > The best approach to ensure the write persistence is to operate pmem
> >> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
> >> > > the write to pmem in this case is performed by uncompress() which is
> >> > > implemented out of QEMU and libpmem. It may or may not use libpmem,
> >> > > which is not controlled by QEMU. Therefore, we have to use the less
> >> > > optimal approach, that is to flush cache for all pmem addresses that
> >> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in
> >> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU.
> >> >
> >> > In what way is it less optimal?
> >> > If that's a legal thing to do, then why not just do a pmem_flush +
> >> > pmem_drain right at the end of the ram loading and leave all the rest of
> >> > the code untouched?
> >>
> >> For example, the implementation pmem_memcpy_nodrain() prefers to use
> >> movnt instructions w/o flush to write pmem if those instructions are
> >> available, and falls back to memcpy() + flush if movnt are not
> >> available, so I suppose the latter is less optimal.
> >
> > But if you use normal memcpy calls to copy a few GB of RAM in an
> > incoming migrate and then do a single flush at the end, isn't that
> > better?
> 
> Not really, because now you've needlessly polluted the cache and are
> spending CPU looping over the cachelines that could have been bypassed
> with movnt.

What's different in the pmem case?   Isn't what you've said true in the
normal migrate case as well?

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
  2018-02-07 18:08               ` Dr. David Alan Gilbert
@ 2018-02-07 18:31                 ` Dan Williams
  2018-02-07 18:37                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2018-02-07 18:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Qemu Developers, Eduardo Habkost, Igor Mammedov, Paolo Bonzini,
	Michael S. Tsirkin, Xiao Guangrong, Juan Quintela,
	Stefan Hajnoczi

On Wed, Feb 7, 2018 at 10:08 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * Dan Williams (dan.j.williams@intel.com) wrote:
>> On Wed, Feb 7, 2018 at 5:24 AM, Dr. David Alan Gilbert
>> <dgilbert@redhat.com> wrote:
>> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
>> >> On 02/07/18 13:03 +0000, Dr. David Alan Gilbert wrote:
>> >> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
>> >> > > On 02/07/18 11:54 +0000, Dr. David Alan Gilbert wrote:
>> >> > > > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
>> >> > > > > When loading a compressed page to persistent memory, flush CPU cache
>> >> > > > > after the data is decompressed. Combined with a call to pmem_drain()
>> >> > > > > at the end of memory loading, we can guarantee those compressed pages
>> >> > > > > are persistently loaded to PMEM.
>> >> > > >
>> >> > > > Can you explain why this can use the flush and doesn't need the special
>> >> > > > memset?
>> >> > >
>> >> > > The best approach to ensure the write persistence is to operate pmem
>> >> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
>> >> > > the write to pmem in this case is performed by uncompress() which is
>> >> > > implemented out of QEMU and libpmem. It may or may not use libpmem,
>> >> > > which is not controlled by QEMU. Therefore, we have to use the less
>> >> > > optimal approach, that is to flush cache for all pmem addresses that
>> >> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in
>> >> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU.
>> >> >
>> >> > In what way is it less optimal?
>> >> > If that's a legal thing to do, then why not just do a pmem_flush +
>> >> > pmem_drain right at the end of the ram loading and leave all the rest of
>> >> > the code untouched?
>> >>
>> >> For example, the implementation pmem_memcpy_nodrain() prefers to use
>> >> movnt instructions w/o flush to write pmem if those instructions are
>> >> available, and falls back to memcpy() + flush if movnt are not
>> >> available, so I suppose the latter is less optimal.
>> >
>> > But if you use normal memcpy calls to copy a few GB of RAM in an
>> > incoming migrate and then do a single flush at the end, isn't that
>> > better?
>>
>> Not really, because now you've needlessly polluted the cache and are
>> spending CPU looping over the cachelines that could have been bypassed
>> with movnt.
>
> What's different in the pmem case?   Isn't what you've said true in the
> normal migrate case as well?
>

In the normal migrate case the memory is volatile so once the copy is
globally visible you're done. In the pmem case the migration is not
complete until the persistent state is synchronized.

Note, I'm talking in generalities because I don't know the deeper
details of the migrate flow.


* Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
  2018-02-07 18:31                 ` Dan Williams
@ 2018-02-07 18:37                   ` Dr. David Alan Gilbert
  2018-02-07 22:43                     ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-07 18:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Qemu Developers, Eduardo Habkost, Igor Mammedov, Paolo Bonzini,
	Michael S. Tsirkin, Xiao Guangrong, Juan Quintela,
	Stefan Hajnoczi

* Dan Williams (dan.j.williams@intel.com) wrote:
> On Wed, Feb 7, 2018 at 10:08 AM, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > * Dan Williams (dan.j.williams@intel.com) wrote:
> >> On Wed, Feb 7, 2018 at 5:24 AM, Dr. David Alan Gilbert
> >> <dgilbert@redhat.com> wrote:
> >> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> >> >> On 02/07/18 13:03 +0000, Dr. David Alan Gilbert wrote:
> >> >> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> >> >> > > On 02/07/18 11:54 +0000, Dr. David Alan Gilbert wrote:
> >> >> > > > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
> >> >> > > > > When loading a compressed page to persistent memory, flush CPU cache
> >> >> > > > > after the data is decompressed. Combined with a call to pmem_drain()
> >> >> > > > > at the end of memory loading, we can guarantee those compressed pages
> >> >> > > > > are persistently loaded to PMEM.
> >> >> > > >
> >> >> > > > Can you explain why this can use the flush and doesn't need the special
> >> >> > > > memset?
> >> >> > >
> >> >> > > The best approach to ensure the write persistence is to operate pmem
> >> >> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
> >> >> > > the write to pmem in this case is performed by uncompress() which is
> >> >> > > implemented out of QEMU and libpmem. It may or may not use libpmem,
> >> >> > > which is not controlled by QEMU. Therefore, we have to use the less
> >> >> > > optimal approach, that is to flush cache for all pmem addresses that
> >> >> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in
> >> >> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU.
> >> >> >
> >> >> > In what way is it less optimal?
> >> >> > If that's a legal thing to do, then why not just do a pmem_flush +
> >> >> > pmem_drain right at the end of the ram loading and leave all the rest of
> >> >> > the code untouched?
> >> >>
> >> >> For example, the implementation pmem_memcpy_nodrain() prefers to use
> >> >> movnt instructions w/o flush to write pmem if those instructions are
> >> >> available, and falls back to memcpy() + flush if movnt are not
> >> >> available, so I suppose the latter is less optimal.
> >> >
> >> > But if you use normal memcpy calls to copy a few GB of RAM in an
> >> > incoming migrate and then do a single flush at the end, isn't that
> >> > better?
> >>
> >> Not really, because now you've needlessly polluted the cache and are
> >> spending CPU looping over the cachelines that could have been bypassed
> >> with movnt.
> >
> > What's different in the pmem case?   Isn't what you've said true in the
> > normal migrate case as well?
> >
> 
> In the normal migrate case the memory is volatile so once the copy is
> globally visible you're done. In the pmem case the migration is not
> complete until the persistent state is synchronized.
> 
> Note, I'm talking in generalities because I don't know the deeper
> details of the migrate flow.

On the receive side of a migrate, during a normal precopy migrate
(without xbzrle) nothing is going to be reading that RAM until
the whole RAM load has completed anyway - so we're not benefiting from
it being in the caches either.

In the pmem case, again since nothing is going to be reading from that
RAM until the end anyway, why bother flushing as we go as opposed to at
the end?

Note, I'm also talking in generalities because I don't know the deeper
details of the pmem flow. Hmm.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
  2018-02-07 18:37                   ` Dr. David Alan Gilbert
@ 2018-02-07 22:43                     ` Dan Williams
  0 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2018-02-07 22:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Qemu Developers, Eduardo Habkost, Igor Mammedov, Paolo Bonzini,
	Michael S. Tsirkin, Xiao Guangrong, Juan Quintela,
	Stefan Hajnoczi

On Wed, Feb 7, 2018 at 10:37 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * Dan Williams (dan.j.williams@intel.com) wrote:
>> On Wed, Feb 7, 2018 at 10:08 AM, Dr. David Alan Gilbert
>> <dgilbert@redhat.com> wrote:
>> > * Dan Williams (dan.j.williams@intel.com) wrote:
>> >> On Wed, Feb 7, 2018 at 5:24 AM, Dr. David Alan Gilbert
>> >> <dgilbert@redhat.com> wrote:
>> >> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
>> >> >> On 02/07/18 13:03 +0000, Dr. David Alan Gilbert wrote:
>> >> >> > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
>> >> >> > > On 02/07/18 11:54 +0000, Dr. David Alan Gilbert wrote:
>> >> >> > > > * Haozhong Zhang (haozhong.zhang@intel.com) wrote:
>> >> >> > > > > When loading a compressed page to persistent memory, flush CPU cache
>> >> >> > > > > after the data is decompressed. Combined with a call to pmem_drain()
>> >> >> > > > > at the end of memory loading, we can guarantee those compressed pages
>> >> >> > > > > are persistently loaded to PMEM.
>> >> >> > > >
>> >> >> > > > Can you explain why this can use the flush and doesn't need the special
>> >> >> > > > memset?
>> >> >> > >
>> >> >> > > The best approach to ensure the write persistence is to operate pmem
>> >> >> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
>> >> >> > > the write to pmem in this case is performed by uncompress() which is
>> >> >> > > implemented out of QEMU and libpmem. It may or may not use libpmem,
>> >> >> > > which is not controlled by QEMU. Therefore, we have to use the less
>> >> >> > > optimal approach, that is to flush cache for all pmem addresses that
>> >> >> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in
>> >> >> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU.
>> >> >> >
>> >> >> > In what way is it less optimal?
>> >> >> > If that's a legal thing to do, then why not just do a pmem_flush +
>> >> >> > pmem_drain right at the end of the ram loading and leave all the rest of
>> >> >> > the code untouched?
>> >> >>
>> >> >> For example, the implementation pmem_memcpy_nodrain() prefers to use
>> >> >> movnt instructions w/o flush to write pmem if those instructions are
>> >> >> available, and falls back to memcpy() + flush if movnt are not
>> >> >> available, so I suppose the latter is less optimal.
>> >> >
>> >> > But if you use normal memcpy calls to copy a few GB of RAM in an
>> >> > incoming migrate and then do a single flush at the end, isn't that
>> >> > better?
>> >>
>> >> Not really, because now you've needlessly polluted the cache and are
>> >> spending CPU looping over the cachelines that could have been bypassed
>> >> with movnt.
>> >
>> > What's different in the pmem case?   Isn't what you've said true in the
>> > normal migrate case as well?
>> >
>>
>> In the normal migrate case the memory is volatile so once the copy is
>> globally visible you're done. In the pmem case the migration is not
>> complete until the persistent state is synchronized.
>>
>> Note, I'm talking in generalities because I don't know the deeper
>> details of the migrate flow.
>
> On the receive side of a migrate, during a normal precopy migrate
> (without xbzrle) nothing is going to be reading that RAM until
> the whole RAM load has completed anyway - so we're not benefiting from
> it being in the caches either.
>
> In the pmem case, again since nothing is going to be reading from that
> RAM until the end anyway, why bother flushing as we go as opposed to at
> the end?

Flushing at the end implies doing a large loop flushing the caches at
the end of the transfer because the x86 ISA only exposes a
line-by-line flush to unprivileged code rather than a full cache flush
like what the kernel can do with wbinvd. So, better to flush as we go
rather than incur the overhead of the loop at the end. I.e. I'm
assuming it is more efficient to do 'movnt' in the first instance and
not worry about the flush loop.


* Re: [Qemu-devel] [PATCH v2 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation
  2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation Haozhong Zhang
@ 2018-02-09 14:27   ` Stefan Hajnoczi
  2018-02-09 14:57     ` Haozhong Zhang
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2018-02-09 14:27 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Dan Williams

On Wed, Feb 07, 2018 at 03:33:27PM +0800, Haozhong Zhang wrote:
> @@ -156,11 +157,17 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
>  {
>      MemoryRegion *mr;
>      PCDIMMDevice *dimm = PC_DIMM(nvdimm);
> +    bool is_pmem = object_property_get_bool(OBJECT(dimm->hostmem),
> +                                            "pmem", NULL);
>      uint64_t backend_offset;
>  
>      nvdimm_validate_rw_label_data(nvdimm, size, offset);
>  
> -    memcpy(nvdimm->label_data + offset, buf, size);
> +    if (!is_pmem) {
> +        memcpy(nvdimm->label_data + offset, buf, size);
> +    } else {
> +        pmem_memcpy_persist(nvdimm->label_data + offset, buf, size);
> +    }

Is this enough to prevent label corruption in case of power failure?

pmem_memcpy_persist() is not atomic.  Power failure can result in a mix
of the old and new label data.

If we want this operation to be 100% safe there needs to be some kind of
update protocol that makes the change atomic, like a Label A and Label B
area with a single Label Index field that can be updated atomically to
point to the active Label A/B area.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v2 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation
  2018-02-09 14:27   ` Stefan Hajnoczi
@ 2018-02-09 14:57     ` Haozhong Zhang
  2018-02-12 13:55       ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Haozhong Zhang @ 2018-02-09 14:57 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Dan Williams

On 02/09/18 14:27 +0000, Stefan Hajnoczi wrote:
> On Wed, Feb 07, 2018 at 03:33:27PM +0800, Haozhong Zhang wrote:
> > @@ -156,11 +157,17 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
> >  {
> >      MemoryRegion *mr;
> >      PCDIMMDevice *dimm = PC_DIMM(nvdimm);
> > +    bool is_pmem = object_property_get_bool(OBJECT(dimm->hostmem),
> > +                                            "pmem", NULL);
> >      uint64_t backend_offset;
> >  
> >      nvdimm_validate_rw_label_data(nvdimm, size, offset);
> >  
> > -    memcpy(nvdimm->label_data + offset, buf, size);
> > +    if (!is_pmem) {
> > +        memcpy(nvdimm->label_data + offset, buf, size);
> > +    } else {
> > +        pmem_memcpy_persist(nvdimm->label_data + offset, buf, size);
> > +    }
> 
> Is this enough to prevent label corruption in case of power failure?
> 
> pmem_memcpy_persist() is not atomic.  Power failure can result in a mix
> of the old and new label data.
> 
> If we want this operation to be 100% safe there needs to be some kind of
> update protocol that makes the change atomic, like a Label A and Label B
> area with a single Label Index field that can be updated atomically to
> point to the active Label A/B area.

What this patch series guarantees is: if the guest is still alive and
running, all of its previous writes to pmem that were performed by
QEMU are still persistent on pmem.

If a power failure happens before QEMU returns to the guest, e.g., in
the middle of the above pmem_memcpy_persist(), then yes, the guest
label data may be left in an inconsistent state, but the guest also
had no chance to make progress. That is what could happen in a
non-virtualized environment as well, and it is the responsibility of
the (guest) software to defend against such failures, e.g., by the
protocol you mentioned.

Haozhong

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v2 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation
  2018-02-09 14:57     ` Haozhong Zhang
@ 2018-02-12 13:55       ` Stefan Hajnoczi
  0 siblings, 0 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2018-02-12 13:55 UTC (permalink / raw)
  To: qemu-devel, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst,
	Xiao Guangrong, Juan Quintela, dgilbert, Dan Williams

[-- Attachment #1: Type: text/plain, Size: 2394 bytes --]

On Fri, Feb 09, 2018 at 10:57:26PM +0800, Haozhong Zhang wrote:
> On 02/09/18 14:27 +0000, Stefan Hajnoczi wrote:
> > On Wed, Feb 07, 2018 at 03:33:27PM +0800, Haozhong Zhang wrote:
> > > @@ -156,11 +157,17 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
> > >  {
> > >      MemoryRegion *mr;
> > >      PCDIMMDevice *dimm = PC_DIMM(nvdimm);
> > > +    bool is_pmem = object_property_get_bool(OBJECT(dimm->hostmem),
> > > +                                            "pmem", NULL);
> > >      uint64_t backend_offset;
> > >  
> > >      nvdimm_validate_rw_label_data(nvdimm, size, offset);
> > >  
> > > -    memcpy(nvdimm->label_data + offset, buf, size);
> > > +    if (!is_pmem) {
> > > +        memcpy(nvdimm->label_data + offset, buf, size);
> > > +    } else {
> > > +        pmem_memcpy_persist(nvdimm->label_data + offset, buf, size);
> > > +    }
> > 
> > Is this enough to prevent label corruption in case of power failure?
> > 
> > pmem_memcpy_persist() is not atomic.  Power failure can result in a mix
> > of the old and new label data.
> > 
> > If we want this operation to be 100% safe there needs to be some kind of
> > update protocol that makes the change atomic, like a Label A and Label B
> > area with a single Label Index field that can be updated atomically to
> > point to the active Label A/B area.
> 
> What this patch series guarantees is: if the guest is still alive and
> running, all of its previous writes to pmem that were performed by
> QEMU are still persistent on pmem.
> 
> If a power failure happens before QEMU returns to the guest, e.g., in
> the middle of the above pmem_memcpy_persist(), then yes, the guest
> label data may be left in an inconsistent state, but the guest also
> had no chance to make progress. That is what could happen in a
> non-virtualized environment as well, and it is the responsibility of
> the (guest) software to defend against such failures, e.g., by the
> protocol you mentioned.

Thanks for explaining!  I thought the atomic update was done by the ACPI
_LSW DSM method, but it turns out the driver must do it.

For anyone else who is interested, a detailed description of the Label
Storage Area is on page 652 of the UEFI Specification version 2.7 and
the Linux driver implements the atomic update algorithm in
drivers/nvdimm/label.c:__pmem_label_update().

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2018-02-12 13:56 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-07  7:33 [Qemu-devel] [PATCH v2 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory Haozhong Zhang
2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 1/8] memory, exec: switch file ram allocation functions to 'flags' parameters Haozhong Zhang
2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 2/8] hostmem-file: add the 'pmem' option Haozhong Zhang
2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 3/8] configure: add libpmem support Haozhong Zhang
2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation Haozhong Zhang
2018-02-09 14:27   ` Stefan Hajnoczi
2018-02-09 14:57     ` Haozhong Zhang
2018-02-12 13:55       ` Stefan Hajnoczi
2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM Haozhong Zhang
2018-02-07 10:17   ` Pankaj Gupta
2018-02-07 11:18     ` Haozhong Zhang
2018-02-07 11:30       ` Pankaj Gupta
2018-02-07 11:38   ` Dr. David Alan Gilbert
2018-02-07 11:52     ` Haozhong Zhang
2018-02-07 12:51       ` Haozhong Zhang
2018-02-07 12:59         ` Dr. David Alan Gilbert
2018-02-07 14:10         ` Pankaj Gupta
2018-02-07 12:56       ` Dr. David Alan Gilbert
2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 6/8] migration/ram: ensure write persistence on loading normal " Haozhong Zhang
2018-02-07 11:49   ` Dr. David Alan Gilbert
2018-02-07 12:02     ` Haozhong Zhang
2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed " Haozhong Zhang
2018-02-07 11:54   ` Dr. David Alan Gilbert
2018-02-07 12:15     ` Haozhong Zhang
2018-02-07 13:03       ` Dr. David Alan Gilbert
2018-02-07 13:20         ` Haozhong Zhang
2018-02-07 13:24           ` Dr. David Alan Gilbert
2018-02-07 18:05             ` Dan Williams
2018-02-07 18:08               ` Dr. David Alan Gilbert
2018-02-07 18:31                 ` Dan Williams
2018-02-07 18:37                   ` Dr. David Alan Gilbert
2018-02-07 22:43                     ` Dan Williams
2018-02-07  7:33 ` [Qemu-devel] [PATCH v2 8/8] migration/ram: ensure write persistence on loading xbzrle " Haozhong Zhang
2018-02-07 13:08   ` Dr. David Alan Gilbert
