All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
@ 2018-01-31  6:02 Haozhong Zhang
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 1/6] util/mmap-alloc: switch qemu_ram_mmap() to 'flags' parameter Haozhong Zhang
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Haozhong Zhang @ 2018-01-31  6:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst, dgilbert,
	Xiao Guangrong, Stefan Hajnoczi, Dan Williams, Haozhong Zhang

Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
files on ext4/xfs file system mounted with '-o dax').

A description of MAP_SYNC and MAP_SHARED_VALIDATE can be found at
    https://patchwork.kernel.org/patch/10028151/

This patchset enables QEMU to use MAP_SYNC flag for memory-backend-file,
in order to guarantee the guest write persistence to backend files
supporting DAX.

A new auto on/off option 'sync' is added to memory-backend-file:
 - on:  try to pass MAP_SYNC to mmap(2); if MAP_SYNC is not supported or
        'share=off', QEMU will abort
 - off: never pass MAP_SYNC to mmap(2)
 - auto (default): if MAP_SYNC is supported and 'share=on', work as if
        'sync=on'; otherwise, work as if 'sync=off'

Changes in v4:
 * Add patch 1-3 to switch some functions to a single 'flags'
   parameters. (Michael S. Tsirkin)
 * v3 patch 1-3 become v4 patch 4-6.
 * Patch 4: move definitions of MAP_SYNC and MAP_SHARED_VALIDATE to a
   new header file under include/standard-headers/linux/. (Michael S. Tsirkin)
 * Patch 6: refine the description of the 'sync' option. (Michael S. Tsirkin)

Changes in v3:
 * Patch 1: add MAP_SHARED_VALIDATE in both sync=on and sync=auto
   cases, and add back the retry mechanism. MAP_SYNC will be ignored
   by Linux kernel 4.15 if MAP_SHARED_VALIDATE is missed.
 * Patch 1: define MAP_SYNC and MAP_SHARED_VALIDATE as 0 on non-Linux
   platforms in order to make qemu_ram_mmap() compile on those platforms.
 * Patch 2&3: include more information in error messages of
   memory-backend in hope to help user to identify the error.
   (Dr. David Alan Gilbert)
 * Patch 3: fix typo in the commit message. (Dr. David Alan Gilbert)

Changes in v2:
 * Add 'sync' option to control the use of MAP_SYNC. (Eduardo Habkost)
 * Remove the unnecessary set of MAP_SHARED_VALIDATE in some cases and
   the retry mechanism in qemu_ram_mmap(). (Michael S. Tsirkin)
 * Move OS dependent definitions of MAP_SYNC and MAP_SHARED_VALIDATE
   to osdep.h. (Michael S. Tsirkin)


Haozhong Zhang (6):
  util/mmap-alloc: switch qemu_ram_mmap() to 'flags' parameter
  exec: switch qemu_ram_alloc_from_{file,fd} to the 'flags' parameter
  memory: switch memory_region_init_ram_from_file() to 'flags' parameter
  util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
  hostmem: add more information in error messages
  hostmem-file: add 'sync' option

 backends/hostmem-file.c               | 51 ++++++++++++++++++++++++++++++++---
 backends/hostmem.c                    | 11 +++++---
 docs/nvdimm.txt                       | 20 +++++++++++++-
 exec.c                                | 15 ++++++-----
 include/exec/memory.h                 | 36 +++++++++++++++++++++++--
 include/exec/ram_addr.h               | 29 ++++++++++++++++++--
 include/qemu/mmap-alloc.h             | 23 +++++++++++++++-
 include/standard-headers/linux/mman.h | 42 +++++++++++++++++++++++++++++
 memory.c                              |  8 +++---
 numa.c                                |  2 +-
 qemu-options.hx                       | 22 ++++++++++++++-
 util/mmap-alloc.c                     | 31 ++++++++++++++++++---
 util/oslib-posix.c                    |  2 +-
 13 files changed, 261 insertions(+), 31 deletions(-)
 create mode 100644 include/standard-headers/linux/mman.h

-- 
2.14.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 1/6] util/mmap-alloc: switch qemu_ram_mmap() to 'flags' parameter
  2018-01-31  6:02 [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Haozhong Zhang
@ 2018-01-31  6:02 ` Haozhong Zhang
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 2/6] exec: switch qemu_ram_alloc_from_{file, fd} to the " Haozhong Zhang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Haozhong Zhang @ 2018-01-31  6:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst, dgilbert,
	Xiao Guangrong, Stefan Hajnoczi, Dan Williams, Haozhong Zhang

As more flag parameters besides the existing 'shared' are going to be
added to qemu_ram_mmap(), let's switch 'shared' to a 'flags' parameter
in advance, so as to ease the further additions.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Suggested-by: "Michael S. Tsirkin" <mst@redhat.com>
---
 exec.c                    |  2 +-
 include/exec/memory.h     |  3 +++
 include/qemu/mmap-alloc.h | 19 ++++++++++++++++++-
 util/mmap-alloc.c         |  8 +++++---
 util/oslib-posix.c        |  2 +-
 5 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index 629a508385..c0a5a52c4a 100644
--- a/exec.c
+++ b/exec.c
@@ -1653,7 +1653,7 @@ static void *file_ram_alloc(RAMBlock *block,
     }
 
     area = qemu_ram_mmap(fd, memory, block->mr->align,
-                         block->flags & RAM_SHARED);
+                         (block->flags & RAM_SHARED) ? QEMU_RAM_SHARE : 0);
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
                          "unable to map backing store for guest RAM");
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 07c5d6d597..4790cd9e13 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -455,6 +455,9 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
                                                        uint64_t length,
                                                        void *host),
                                        Error **errp);
+
+#define QEMU_RAM_SHARE      (1UL << 0)
+
 #ifdef __linux__
 /**
  * memory_region_init_ram_from_file:  Initialize RAM memory region with a
diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 50385e3f81..dc5e8b5efb 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -7,7 +7,24 @@ size_t qemu_fd_getpagesize(int fd);
 
 size_t qemu_mempath_getpagesize(const char *mem_path);
 
-void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared);
+/**
+ * qemu_ram_mmap: mmap the specified file or device.
+ *
+ * Parameters:
+ *  @fd: the file or the device to mmap
+ *  @size: the number of bytes to be mmaped
+ *  @align: if not zero, specify the alignment of the starting mapping address;
+ *          otherwise, the alignment in use will be determined by QEMU.
+ *  @flags: specifies additional properties of the mapping, which can be one or
+ *          bit-or of following values
+ *          - QEMU_RAM_SHARE: mmap with MAP_SHARED flag
+ *          Other bits are ignored.
+ *
+ * Return:
+ *  On success, return a pointer to the mapped area.
+ *  On failure, return MAP_FAILED.
+ */
+void *qemu_ram_mmap(int fd, size_t size, size_t align, uint64_t flags);
 
 void qemu_ram_munmap(void *ptr, size_t size);
 
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 2fd8cbcc6f..cd95566800 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -13,6 +13,7 @@
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
+#include "exec/memory.h"
 
 #define HUGETLBFS_MAGIC       0x958458f6
 
@@ -73,7 +74,7 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
     return getpagesize();
 }
 
-void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared)
+void *qemu_ram_mmap(int fd, size_t size, size_t align, uint64_t flags)
 {
     /*
      * Note: this always allocates at least one extra page of virtual address
@@ -90,11 +91,12 @@ void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared)
      * anonymous memory is OK.
      */
     int anonfd = fd == -1 || qemu_fd_getpagesize(fd) == getpagesize() ? -1 : fd;
-    int flags = anonfd == -1 ? MAP_ANONYMOUS : MAP_NORESERVE;
-    void *ptr = mmap(0, total, PROT_NONE, flags | MAP_PRIVATE, anonfd, 0);
+    int mmap_flags = anonfd == -1 ? MAP_ANONYMOUS : MAP_NORESERVE;
+    void *ptr = mmap(0, total, PROT_NONE, mmap_flags | MAP_PRIVATE, anonfd, 0);
 #else
     void *ptr = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
 #endif
+    bool shared = flags & QEMU_RAM_SHARE;
     size_t offset;
     void *ptr1;
 
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 77369c92ce..2a78cfb67e 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -130,7 +130,7 @@ void *qemu_memalign(size_t alignment, size_t size)
 void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment)
 {
     size_t align = QEMU_VMALLOC_ALIGN;
-    void *ptr = qemu_ram_mmap(-1, size, align, false);
+    void *ptr = qemu_ram_mmap(-1, size, align, 0);
 
     if (ptr == MAP_FAILED) {
         return NULL;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 2/6] exec: switch qemu_ram_alloc_from_{file, fd} to the 'flags' parameter
  2018-01-31  6:02 [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Haozhong Zhang
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 1/6] util/mmap-alloc: switch qemu_ram_mmap() to 'flags' parameter Haozhong Zhang
@ 2018-01-31  6:02 ` Haozhong Zhang
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 3/6] memory: switch memory_region_init_ram_from_file() to " Haozhong Zhang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Haozhong Zhang @ 2018-01-31  6:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst, dgilbert,
	Xiao Guangrong, Stefan Hajnoczi, Dan Williams, Haozhong Zhang

As more flag parameters besides the existing 'share' are going to be
added to qemu_ram_alloc_from_{file,fd}(), let's swith 'share' to a
'flags' parameters in advance, so as to ease the further additions.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 exec.c                  | 15 ++++++++-------
 include/exec/ram_addr.h | 25 +++++++++++++++++++++++--
 memory.c                |  8 ++++++--
 3 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/exec.c b/exec.c
index c0a5a52c4a..0b46b03d87 100644
--- a/exec.c
+++ b/exec.c
@@ -1607,6 +1607,7 @@ static void *file_ram_alloc(RAMBlock *block,
                             ram_addr_t memory,
                             int fd,
                             bool truncate,
+                            uint64_t flags,
                             Error **errp)
 {
     void *area;
@@ -1652,8 +1653,7 @@ static void *file_ram_alloc(RAMBlock *block,
         perror("ftruncate");
     }
 
-    area = qemu_ram_mmap(fd, memory, block->mr->align,
-                         (block->flags & RAM_SHARED) ? QEMU_RAM_SHARE : 0);
+    area = qemu_ram_mmap(fd, memory, block->mr->align, flags);
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
                          "unable to map backing store for guest RAM");
@@ -2000,7 +2000,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
 
 #ifdef __linux__
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
-                                 bool share, int fd,
+                                 uint64_t flags, int fd,
                                  Error **errp)
 {
     RAMBlock *new_block;
@@ -2042,8 +2042,9 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     new_block->mr = mr;
     new_block->used_length = size;
     new_block->max_length = size;
-    new_block->flags = share ? RAM_SHARED : 0;
-    new_block->host = file_ram_alloc(new_block, size, fd, !file_size, errp);
+    new_block->flags = (flags & QEMU_RAM_SHARE) ? RAM_SHARED : 0;
+    new_block->host = file_ram_alloc(new_block, size, fd, !file_size, flags,
+                                     errp);
     if (!new_block->host) {
         g_free(new_block);
         return NULL;
@@ -2061,7 +2062,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
 
 
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
-                                   bool share, const char *mem_path,
+                                   uint64_t flags, const char *mem_path,
                                    Error **errp)
 {
     int fd;
@@ -2073,7 +2074,7 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
-    block = qemu_ram_alloc_from_fd(size, mr, share, fd, errp);
+    block = qemu_ram_alloc_from_fd(size, mr, flags, fd, errp);
     if (!block) {
         if (created) {
             unlink(mem_path);
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 7633ef6342..e24aae75a2 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -72,12 +72,33 @@ static inline unsigned long int ramblock_recv_bitmap_offset(void *host_addr,
 
 long qemu_getrampagesize(void);
 unsigned long last_ram_page(void);
+
+/**
+ * qemu_ram_alloc_from_file,
+ * qemu_ram_alloc_from_fd: Allocate a ram block from the specified back
+ *                         file or device
+ *
+ * Parameters:
+ *  @size: the size in bytes of the ram block
+ *  @mr: the memory region where the ram block is
+ *  @flags: specify the properties of the ram block, which can be one
+ *          or bit-or of following values
+ *          - QEMU_RAM_SHARE: mmap the back file or device with MAP_SHARED
+ *          Other bits are ignored.
+ *  @mem_path or @fd: specify the back file or device
+ *  @errp: pointer to Error*, to store an error if it happens
+ *
+ * Return:
+ *  On success, return a pointer to the ram block.
+ *  On failure, return NULL.
+ */
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
-                                   bool share, const char *mem_path,
+                                   uint64_t flags, const char *mem_path,
                                    Error **errp);
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
-                                 bool share, int fd,
+                                 uint64_t flags, int fd,
                                  Error **errp);
+
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                   MemoryRegion *mr, Error **errp);
 RAMBlock *qemu_ram_alloc(ram_addr_t size, MemoryRegion *mr, Error **errp);
diff --git a/memory.c b/memory.c
index 449a1429b9..1ac4ebcaca 100644
--- a/memory.c
+++ b/memory.c
@@ -1580,7 +1580,9 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
     mr->align = align;
-    mr->ram_block = qemu_ram_alloc_from_file(size, mr, share, path, errp);
+    mr->ram_block = qemu_ram_alloc_from_file(size, mr,
+                                             share ? QEMU_RAM_SHARE : 0,
+                                             path, errp);
     mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
 }
 
@@ -1596,7 +1598,9 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc_from_fd(size, mr, share, fd, errp);
+    mr->ram_block = qemu_ram_alloc_from_fd(size, mr,
+                                           share ? QEMU_RAM_SHARE : 0,
+                                           fd, errp);
     mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
 }
 #endif
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 3/6] memory: switch memory_region_init_ram_from_file() to 'flags' parameter
  2018-01-31  6:02 [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Haozhong Zhang
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 1/6] util/mmap-alloc: switch qemu_ram_mmap() to 'flags' parameter Haozhong Zhang
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 2/6] exec: switch qemu_ram_alloc_from_{file, fd} to the " Haozhong Zhang
@ 2018-01-31  6:02 ` Haozhong Zhang
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 4/6] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap() Haozhong Zhang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Haozhong Zhang @ 2018-01-31  6:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst, dgilbert,
	Xiao Guangrong, Stefan Hajnoczi, Dan Williams, Haozhong Zhang

As more flag parameters besides the existing 'share' are going to be
added to memory_region_init_ram_from_file(), let's switch 'share' to
a 'flags' parameter in advance, so as to ease the further additions.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 backends/hostmem-file.c | 3 ++-
 include/exec/memory.h   | 7 +++++--
 memory.c                | 6 ++----
 numa.c                  | 2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index e319ec1ad8..67ecfed895 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -59,7 +59,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
         path = object_get_canonical_path(OBJECT(backend));
         memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
                                  path,
-                                 backend->size, fb->align, fb->share,
+                                 backend->size, fb->align,
+                                 fb->share ? QEMU_RAM_SHARE : 0,
                                  fb->mem_path, errp);
         g_free(path);
     }
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 4790cd9e13..6b547da6a3 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -470,7 +470,10 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  *         (getpagesize()) will be used.
- * @share: %true if memory must be mmaped with the MAP_SHARED flag
+ * @flags: specify properties of this memory region, which can be one or bit-or
+ *         of following values:
+ *         - QEMU_RAM_SHARE: memory must be mmaped with the MAP_SHARED flag
+ *         Other bits are ignored.
  * @path: the path in which to allocate the RAM.
  * @errp: pointer to Error*, to store an error if it happens.
  *
@@ -482,7 +485,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
                                       const char *name,
                                       uint64_t size,
                                       uint64_t align,
-                                      bool share,
+                                      uint64_t flags,
                                       const char *path,
                                       Error **errp);
 
diff --git a/memory.c b/memory.c
index 1ac4ebcaca..a4f19a5d30 100644
--- a/memory.c
+++ b/memory.c
@@ -1571,7 +1571,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
                                       const char *name,
                                       uint64_t size,
                                       uint64_t align,
-                                      bool share,
+                                      uint64_t flags,
                                       const char *path,
                                       Error **errp)
 {
@@ -1580,9 +1580,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
     mr->align = align;
-    mr->ram_block = qemu_ram_alloc_from_file(size, mr,
-                                             share ? QEMU_RAM_SHARE : 0,
-                                             path, errp);
+    mr->ram_block = qemu_ram_alloc_from_file(size, mr, flags, path, errp);
     mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
 }
 
diff --git a/numa.c b/numa.c
index 83675a03f3..fa202a376d 100644
--- a/numa.c
+++ b/numa.c
@@ -456,7 +456,7 @@ static void allocate_system_memory_nonnuma(MemoryRegion *mr, Object *owner,
     if (mem_path) {
 #ifdef __linux__
         Error *err = NULL;
-        memory_region_init_ram_from_file(mr, owner, name, ram_size, 0, false,
+        memory_region_init_ram_from_file(mr, owner, name, ram_size, 0, 0,
                                          mem_path, &err);
         if (err) {
             error_report_err(err);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 4/6] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
  2018-01-31  6:02 [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Haozhong Zhang
                   ` (2 preceding siblings ...)
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 3/6] memory: switch memory_region_init_ram_from_file() to " Haozhong Zhang
@ 2018-01-31  6:02 ` Haozhong Zhang
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 5/6] hostmem: add more information in error messages Haozhong Zhang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Haozhong Zhang @ 2018-01-31  6:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst, dgilbert,
	Xiao Guangrong, Stefan Hajnoczi, Dan Williams, Haozhong Zhang

When a file supporting DAX is used as vNVDIMM backend, mmap it with
MAP_SYNC flag in addition can guarantee the persistence of guest write
to the backend file without other QEMU actions (e.g., periodic fsync()
by QEMU).

A set of QEMU_RAM_SYNC_{AUTO,ON,OFF} flags are added to qemu_ram_mmap():

- If QEMU_RAM_SYNC_ON is present, qemu_ram_mmap() will try to pass
  MAP_SYNC to mmap(). It will then fail if the host OS or the backend
  file do not support MAP_SYNC, or MAP_SYNC is conflict with other
  flags.

- If QEMU_RAM_SYNC_OFF is present, qemu_ram_mmap() will never pass
  MAP_SYNC to mmap().

- If QEMU_RAM_SYNC_AUTO is present, and
  * if the host OS and the backend file support MAP_SYNC, and MAP_SYNC
    is not conflict with other flags, qemu_ram_mmap() will work as if
    QEMU_RAM_SYNC_ON is present;
  * otherwise, qemu_ram_mmap() will work as if QEMU_RAM_SYNC_OFF is
    present.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 include/exec/memory.h                 | 26 ++++++++++++++++++++++
 include/exec/ram_addr.h               |  4 ++++
 include/qemu/mmap-alloc.h             |  4 ++++
 include/standard-headers/linux/mman.h | 42 +++++++++++++++++++++++++++++++++++
 util/mmap-alloc.c                     | 23 ++++++++++++++++++-
 5 files changed, 98 insertions(+), 1 deletion(-)
 create mode 100644 include/standard-headers/linux/mman.h

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 6b547da6a3..96a60e9c1d 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -458,6 +458,28 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
 
 #define QEMU_RAM_SHARE      (1UL << 0)
 
+#define QEMU_RAM_SYNC_SHIFT 1
+#define QEMU_RAM_SYNC_MASK  0x6
+#define QEMU_RAM_SYNC_OFF   ((0UL << QEMU_RAM_SYNC_SHIFT) & QEMU_RAM_SYNC_MASK)
+#define QEMU_RAM_SYNC_ON    ((1UL << QEMU_RAM_SYNC_SHIFT) & QEMU_RAM_SYNC_MASK)
+#define QEMU_RAM_SYNC_AUTO  ((2UL << QEMU_RAM_SYNC_SHIFT) & QEMU_RAM_SYNC_MASK)
+
+static inline uint64_t qemu_ram_sync_flags(OnOffAuto v)
+{
+    return v == ON_OFF_AUTO_OFF ? QEMU_RAM_SYNC_OFF :
+           v == ON_OFF_AUTO_ON ? QEMU_RAM_SYNC_ON : QEMU_RAM_SYNC_AUTO;
+}
+
+static inline OnOffAuto qemu_ram_sync_val(uint64_t flags)
+{
+    unsigned int v = (flags & QEMU_RAM_SYNC_MASK) >> QEMU_RAM_SYNC_SHIFT;
+
+    assert(v < 3);
+
+    return v == 0 ? ON_OFF_AUTO_OFF :
+           v == 1 ? ON_OFF_AUTO_ON : ON_OFF_AUTO_AUTO;
+}
+
 #ifdef __linux__
 /**
  * memory_region_init_ram_from_file:  Initialize RAM memory region with a
@@ -473,6 +495,10 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @flags: specify properties of this memory region, which can be one or bit-or
  *         of following values:
  *         - QEMU_RAM_SHARE: memory must be mmaped with the MAP_SHARED flag
+ *         - One of
+ *           QEMU_RAM_SYNC_ON:   mmap with MAP_SYNC flag
+ *           QEMU_RAM_SYNC_OFF:  do not mmap with MAP_SYNC flag
+ *           QEMU_RAM_SYNC_AUTO: automatically decide the use of MAP_SYNC flag
  *         Other bits are ignored.
  * @path: the path in which to allocate the RAM.
  * @errp: pointer to Error*, to store an error if it happens.
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index e24aae75a2..a2cc5a9f60 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -84,6 +84,10 @@ unsigned long last_ram_page(void);
  *  @flags: specify the properties of the ram block, which can be one
  *          or bit-or of following values
  *          - QEMU_RAM_SHARE: mmap the back file or device with MAP_SHARED
+ *          - One of
+ *            QEMU_RAM_SYNC_ON:   mmap with MAP_SYNC flag
+ *            QEMU_RAM_SYNC_OFF:  do not mmap with MAP_SYNC flag
+ *            QEMU_RAM_SYNC_AUTO: automatically decide the use of MAP_SYNC flag
  *          Other bits are ignored.
  *  @mem_path or @fd: specify the back file or device
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index dc5e8b5efb..74346bdd3a 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -18,6 +18,10 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
  *  @flags: specifies additional properties of the mapping, which can be one or
  *          bit-or of following values
  *          - QEMU_RAM_SHARE: mmap with MAP_SHARED flag
+ *          - One of
+ *            QEMU_RAM_SYNC_ON:   mmap with MAP_SYNC flag
+ *            QEMU_RAM_SYNC_OFF:  do not mmap with MAP_SYNC flag
+ *            QEMU_RAM_SYNC_AUTO: automatically decide the use of MAP_SYNC flag
  *          Other bits are ignored.
  *
  * Return:
diff --git a/include/standard-headers/linux/mman.h b/include/standard-headers/linux/mman.h
new file mode 100644
index 0000000000..033332ad4f
--- /dev/null
+++ b/include/standard-headers/linux/mman.h
@@ -0,0 +1,42 @@
+/*
+ * Definitions of Linux-specific mmap flags.
+ *
+ * Copyright Intel Corporation, 2018
+ *
+ * Author: Haozhong Zhang <haozhong.zhang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later. See the COPYING file in the top-level directory.
+ */
+
+#ifndef _LINUX_MMAN_H
+#define _LINUX_MMAN_H
+
+/*
+ * MAP_SHARED_VALIDATE and MAP_SYNC are introduced in Linux kernel
+ * 4.15, so they may not be defined when compiling on older kernels.
+ */
+#ifdef CONFIG_LINUX
+
+#include <sys/mman.h>
+
+#ifndef MAP_SHARED_VALIDATE
+#define MAP_SHARED_VALIDATE   0x3
+#endif
+
+#ifndef MAP_SYNC
+#define MAP_SYNC              0x80000
+#endif
+
+#define QEMU_HAS_MAP_SYNC     true
+
+#else  /* !CONFIG_LINUX */
+
+#define MAP_SHARED_VALIDATE   0x0
+#define MAP_SYNC              0x0
+
+#define QEMU_HAS_MAP_SYNC     false
+
+#endif /* CONFIG_LINUX */
+
+#endif /* !_LINUX_MMAN_H */
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index cd95566800..6df2f6d2c4 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -14,6 +14,7 @@
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
 #include "exec/memory.h"
+#include "standard-headers/linux/mman.h"
 
 #define HUGETLBFS_MAGIC       0x958458f6
 
@@ -97,6 +98,8 @@ void *qemu_ram_mmap(int fd, size_t size, size_t align, uint64_t flags)
     void *ptr = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
 #endif
     bool shared = flags & QEMU_RAM_SHARE;
+    OnOffAuto sync = qemu_ram_sync_val(flags);
+    int mmap_xflags = 0;
     size_t offset;
     void *ptr1;
 
@@ -108,13 +111,31 @@ void *qemu_ram_mmap(int fd, size_t size, size_t align, uint64_t flags)
     /* Always align to host page size */
     assert(align >= getpagesize());
 
+    if (!QEMU_HAS_MAP_SYNC || !shared) {
+        if (sync == ON_OFF_AUTO_ON) {
+            return MAP_FAILED;
+        }
+        sync = ON_OFF_AUTO_OFF;
+    }
+    if (sync != ON_OFF_AUTO_OFF) {
+        /* MAP_SYNC is only available with MAP_SHARED_VALIDATE. */
+        mmap_xflags |= MAP_SYNC | MAP_SHARED_VALIDATE;
+    }
+
     offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
+ retry_mmap_fd:
     ptr1 = mmap(ptr + offset, size, PROT_READ | PROT_WRITE,
                 MAP_FIXED |
                 (fd == -1 ? MAP_ANONYMOUS : 0) |
-                (shared ? MAP_SHARED : MAP_PRIVATE),
+                (shared ? MAP_SHARED : MAP_PRIVATE) | mmap_xflags,
                 fd, 0);
     if (ptr1 == MAP_FAILED) {
+        if (sync == ON_OFF_AUTO_AUTO) {
+            mmap_xflags &= ~(MAP_SYNC | MAP_SHARED_VALIDATE);
+            sync = ON_OFF_AUTO_OFF;
+            goto retry_mmap_fd;
+        }
+
         munmap(ptr, total);
         return MAP_FAILED;
     }
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 5/6] hostmem: add more information in error messages
  2018-01-31  6:02 [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Haozhong Zhang
                   ` (3 preceding siblings ...)
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 4/6] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap() Haozhong Zhang
@ 2018-01-31  6:02 ` Haozhong Zhang
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 6/6] hostmem-file: add 'sync' option Haozhong Zhang
  2018-01-31 22:25 ` [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Dan Williams
  6 siblings, 0 replies; 18+ messages in thread
From: Haozhong Zhang @ 2018-01-31  6:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst, dgilbert,
	Xiao Guangrong, Stefan Hajnoczi, Dan Williams, Haozhong Zhang

When there are multiple memory backends in use, including the object type
name, ID and the property name in the error message can help users to
locate the error.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Suggested-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
---
 backends/hostmem-file.c |  9 ++++++---
 backends/hostmem.c      | 11 +++++++----
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 67ecfed895..df06b547a6 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -80,7 +80,8 @@ static void set_mem_path(Object *o, const char *str, Error **errp)
     HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
 
     if (host_memory_backend_mr_inited(backend)) {
-        error_setg(errp, "cannot change property value");
+        error_setg(errp, "cannot change property 'mem-path' of %s '%s'",
+                   object_get_typename(o), backend->id);
         return;
     }
     g_free(fb->mem_path);
@@ -100,7 +101,8 @@ static void file_memory_backend_set_share(Object *o, bool value, Error **errp)
     HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
 
     if (host_memory_backend_mr_inited(backend)) {
-        error_setg(errp, "cannot change property value");
+        error_setg(errp, "cannot change property 'share' of %s '%s'",
+                   object_get_typename(o), backend->id);
         return;
     }
     fb->share = value;
@@ -137,7 +139,8 @@ static void file_memory_backend_set_align(Object *o, Visitor *v,
     uint64_t val;
 
     if (host_memory_backend_mr_inited(backend)) {
-        error_setg(&local_err, "cannot change property value");
+        error_setg(&local_err, "cannot change property '%s' of %s '%s'",
+                   name, object_get_typename(o), backend->id);
         goto out;
     }
 
diff --git a/backends/hostmem.c b/backends/hostmem.c
index ee2c2d5bfd..6853d19bc5 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -46,7 +46,8 @@ host_memory_backend_set_size(Object *obj, Visitor *v, const char *name,
     uint64_t value;
 
     if (host_memory_backend_mr_inited(backend)) {
-        error_setg(&local_err, "cannot change property value");
+        error_setg(&local_err, "cannot change property %s of %s '%s'",
+                   name, object_get_typename(obj), backend->id);
         goto out;
     }
 
@@ -55,8 +56,9 @@ host_memory_backend_set_size(Object *obj, Visitor *v, const char *name,
         goto out;
     }
     if (!value) {
-        error_setg(&local_err, "Property '%s.%s' doesn't take value '%"
-                   PRIu64 "'", object_get_typename(obj), name, value);
+        error_setg(&local_err,
+                   "property '%s' of %s '%s' doesn't take value '%" PRIu64 "'",
+                   name, object_get_typename(obj), backend->id, value);
         goto out;
     }
     backend->size = value;
@@ -363,7 +365,8 @@ static void set_id(Object *o, const char *str, Error **errp)
     HostMemoryBackend *backend = MEMORY_BACKEND(o);
 
     if (backend->id) {
-        error_setg(errp, "cannot change property value");
+        error_setg(errp, "cannot change property 'id' of %s '%s'",
+                   object_get_typename(o), backend->id);
         return;
     }
     backend->id = g_strdup(str);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 6/6] hostmem-file: add 'sync' option
  2018-01-31  6:02 [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Haozhong Zhang
                   ` (4 preceding siblings ...)
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 5/6] hostmem: add more information in error messages Haozhong Zhang
@ 2018-01-31  6:02 ` Haozhong Zhang
  2018-01-31 22:25 ` [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Dan Williams
  6 siblings, 0 replies; 18+ messages in thread
From: Haozhong Zhang @ 2018-01-31  6:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Igor Mammedov, Paolo Bonzini, mst, dgilbert,
	Xiao Guangrong, Stefan Hajnoczi, Dan Williams, Haozhong Zhang

This option controls whether QEMU mmap(2) the memory backend file with
MAP_SYNC flag, which can fully guarantee the guest write persistence
to the backend, if MAP_SYNC flag is supported by the host kernel
(Linux kernel 4.15 and later) and the backend is a file supporting
DAX (e.g., file on ext4/xfs file system mounted with '-o dax').

It can take one of following values:
 - on:  try to pass MAP_SYNC to mmap(2); if MAP_SYNC is not supported or
        'share=off', QEMU will abort
 - off: never pass MAP_SYNC to mmap(2)
 - auto (default): if MAP_SYNC is supported and 'share=on', work as if
        'sync=on'; otherwise, work as if 'sync=off'

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
---
 backends/hostmem-file.c | 41 ++++++++++++++++++++++++++++++++++++++++-
 docs/nvdimm.txt         | 20 +++++++++++++++++++-
 qemu-options.hx         | 22 +++++++++++++++++++++-
 3 files changed, 80 insertions(+), 3 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index df06b547a6..ade80d76f1 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -15,6 +15,7 @@
 #include "sysemu/hostmem.h"
 #include "sysemu/sysemu.h"
 #include "qom/object_interfaces.h"
+#include "qapi-visit.h"
 
 /* hostmem-file.c */
 /**
@@ -35,6 +36,7 @@ struct HostMemoryBackendFile {
     bool discard_data;
     char *mem_path;
     uint64_t align;
+    OnOffAuto sync;
 };
 
 static void
@@ -60,7 +62,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
         memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
                                  path,
                                  backend->size, fb->align,
-                                 fb->share ? QEMU_RAM_SHARE : 0,
+                                 (fb->share ? QEMU_RAM_SHARE : 0) |
+                                 qemu_ram_sync_flags(fb->sync),
                                  fb->mem_path, errp);
         g_free(path);
     }
@@ -154,6 +157,39 @@ static void file_memory_backend_set_align(Object *o, Visitor *v,
     error_propagate(errp, local_err);
 }
 
+static void file_memory_backend_get_sync(
+    Object *obj, Visitor *v, const char *name, void *opaque, Error **errp)
+{
+    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(obj);
+    OnOffAuto value = fb->sync;
+
+    visit_type_OnOffAuto(v, name, &value, errp);
+}
+
+static void file_memory_backend_set_sync(
+    Object *obj, Visitor *v, const char *name, void *opaque, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(obj);
+    Error *local_err = NULL;
+    OnOffAuto value;
+
+    if (host_memory_backend_mr_inited(backend)) {
+        error_setg(&local_err, "cannot change property '%s' of %s '%s'",
+                   name, object_get_typename(obj), backend->id);
+        goto out;
+    }
+
+    visit_type_OnOffAuto(v, name, &value, &local_err);
+    if (local_err) {
+        goto out;
+    }
+    fb->sync = value;
+
+ out:
+    error_propagate(errp, local_err);
+}
+
 static void file_backend_unparent(Object *obj)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -188,6 +224,9 @@ file_backend_class_init(ObjectClass *oc, void *data)
         file_memory_backend_get_align,
         file_memory_backend_set_align,
         NULL, NULL, &error_abort);
+    object_class_property_add(oc, "sync", "OnOffAuto",
+        file_memory_backend_get_sync, file_memory_backend_set_sync,
+        NULL, NULL, &error_abort);
 }
 
 static void file_backend_instance_finalize(Object *o)
diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
index e903d8bb09..5e9cee5f5e 100644
--- a/docs/nvdimm.txt
+++ b/docs/nvdimm.txt
@@ -142,11 +142,29 @@ backend of vNVDIMM:
 Guest Data Persistence
 ----------------------
 
+vNVDIMM is designed and implemented to guarantee the guest data
+persistence on the backends even on the host crash and power
+failures. However, there are still some requirements and limitations
+as explained below.
+
 Though QEMU supports multiple types of vNVDIMM backends on Linux,
-currently the only one that can guarantee the guest write persistence
+if MAP_SYNC is not supported by the host kernel and the backends,
+the only backend that can guarantee the guest write persistence
 is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
 which all guest access do not involve any host-side kernel cache.
 
+mmap(2) flag MAP_SYNC is added since Linux kernel 4.15. On such
+systems, QEMU can mmap(2) the backend with MAP_SYNC, which can
+guarantee the guest write persistence to vNVDIMM. Besides the host
+kernel support, enabling MAP_SYNC in QEMU also requires:
+
+ - the backend is a file supporting DAX, e.g., a file on an ext4 or
+   xfs file system mounted with '-o dax',
+
+ - 'sync' option of memory-backend-file is not 'off', and
+
+ - 'share' option of memory-backend-file is 'on'.
+
 When using other types of backends, it's suggested to set 'unarmed'
 option of '-device nvdimm' to 'on', which sets the unarmed flag of the
 guest NVDIMM region mapping structure.  This unarmed flag indicates
diff --git a/qemu-options.hx b/qemu-options.hx
index 8ce427da78..4e8ec5ed97 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -3957,7 +3957,7 @@ property must be set.  These objects are placed in the
 
 @table @option
 
-@item -object memory-backend-file,id=@var{id},size=@var{size},mem-path=@var{dir},share=@var{on|off},discard-data=@var{on|off},merge=@var{on|off},dump=@var{on|off},prealloc=@var{on|off},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave},align=@var{align}
+@item -object memory-backend-file,id=@var{id},size=@var{size},mem-path=@var{dir},share=@var{on|off},discard-data=@var{on|off},merge=@var{on|off},dump=@var{on|off},prealloc=@var{on|off},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave},align=@var{align},sync=@var{on|off|auto}
 
 Creates a memory file backend object, which can be used to back
 the guest RAM with huge pages.
@@ -4017,6 +4017,26 @@ requires an alignment different than the default one used by QEMU, eg
 the device DAX /dev/dax0.0 requires 2M alignment rather than 4K. In
 such cases, users can specify the required alignment via this option.
 
+The @option{sync} option specifies whether QEMU mmap(2) @option{mem-path}
+with MAP_SYNC flag, which can guarantee the guest write persistence to
+@option{mem-path} even on the host crash and power failures. MAP_SYNC
+requires supports from both the host kernel (since Linux kernel 4.15)
+and @option{mem-path} (only files supporting DAX). It can take one of
+following values:
+
+@table @option
+@item @var{on}
+try to pass MAP_SYNC to mmap(2); if MAP_SYNC is not supported or
+@option{share}=@var{off}, QEMU will abort
+
+@item @var{off}
+never pass MAP_SYNC to mmap(2)
+
+@item @var{auto} (default)
+if MAP_SYNC is supported and @option{share}=@var{on}, work as if
+@option{sync}=@var{on}; otherwise, work as if @option{sync}=@var{off}
+@end table
+
 @item -object memory-backend-ram,id=@var{id},merge=@var{on|off},dump=@var{on|off},prealloc=@var{on|off},size=@var{size},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave}
 
 Creates a memory backend object, which can be used to back the guest RAM.
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
  2018-01-31  6:02 [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Haozhong Zhang
                   ` (5 preceding siblings ...)
  2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 6/6] hostmem-file: add 'sync' option Haozhong Zhang
@ 2018-01-31 22:25 ` Dan Williams
  2018-02-01  0:02   ` Haozhong Zhang
  6 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-01-31 22:25 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Qemu Developers, Eduardo Habkost, Igor Mammedov, Paolo Bonzini,
	Michael S. Tsirkin, dgilbert, Xiao Guangrong, Stefan Hajnoczi

On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> files on ext4/xfs file system mounted with '-o dax').

Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
metadata is in sync after a fault. However, that does not make
filesystem-DAX safe for use with QEMU, because we still need to
coordinate DMA with fileystem operations. There is no way to do that
coordination from within a guest. QEMU needs to use device-dax if the
guest might ever perform DMA to a virtual-pmem range. See this patch
set for more details on the DAX vs DMA problem [1]. I think we need to
enforce this in the host kernel. I.e. do not allow file backed DAX
pages to be mapped in EPT entries unless / until we have a solution to
the DMA synchronization problem. Apologies for not noticing this
earlier.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
  2018-01-31 22:25 ` [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Dan Williams
@ 2018-02-01  0:02   ` Haozhong Zhang
  2018-02-01  0:08     ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Haozhong Zhang @ 2018-02-01  0:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Qemu Developers, Eduardo Habkost, Igor Mammedov, Paolo Bonzini,
	Michael S. Tsirkin, dgilbert, Xiao Guangrong, Stefan Hajnoczi

On 01/31/18 14:25 -0800, Dan Williams wrote:
> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> > files on ext4/xfs file system mounted with '-o dax').
> 
> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> metadata is in sync after a fault. However, that does not make
> filesystem-DAX safe for use with QEMU, because we still need to
> coordinate DMA with fileystem operations. There is no way to do that
> coordination from within a guest. QEMU needs to use device-dax if the
> guest might ever perform DMA to a virtual-pmem range. See this patch
> set for more details on the DAX vs DMA problem [1]. I think we need to
> enforce this in the host kernel. I.e. do not allow file backed DAX
> pages to be mapped in EPT entries unless / until we have a solution to
> the DMA synchronization problem. Apologies for not noticing this
> earlier.

QEMU does not truncate or punch holes of the file once it has been
mmap()'ed. Does the problem [1] still exist in such case?

Thanks,
Haozhong

> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
  2018-02-01  0:02   ` Haozhong Zhang
@ 2018-02-01  0:08     ` Dan Williams
  2018-02-01  0:24       ` Haozhong Zhang
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-02-01  0:08 UTC (permalink / raw)
  To: Dan Williams, Qemu Developers, Eduardo Habkost, Igor Mammedov,
	Paolo Bonzini, Michael S. Tsirkin, dgilbert, Xiao Guangrong,
	Stefan Hajnoczi

On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 01/31/18 14:25 -0800, Dan Williams wrote:
>> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>> > files on ext4/xfs file system mounted with '-o dax').
>>
>> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>> metadata is in sync after a fault. However, that does not make
>> filesystem-DAX safe for use with QEMU, because we still need to
>> coordinate DMA with fileystem operations. There is no way to do that
>> coordination from within a guest. QEMU needs to use device-dax if the
>> guest might ever perform DMA to a virtual-pmem range. See this patch
>> set for more details on the DAX vs DMA problem [1]. I think we need to
>> enforce this in the host kernel. I.e. do not allow file backed DAX
>> pages to be mapped in EPT entries unless / until we have a solution to
>> the DMA synchronization problem. Apologies for not noticing this
>> earlier.
>
> QEMU does not truncate or punch holes of the file once it has been
> mmap()'ed. Does the problem [1] still exist in such case?

Something else on the system might. The only agent that could enforce
protection is the kernel, and the kernel will likely just disallow
passing addresses from filesystem-dax vmas through to a guest
altogether. I think there's even a problem in the non-DAX case unless
KVM is pinning pages while they are handed out to a guest. The problem
is that we don't have a page cache page to pin in the DAX case.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
  2018-02-01  0:08     ` Dan Williams
@ 2018-02-01  0:24       ` Haozhong Zhang
  2018-02-01  0:32         ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Haozhong Zhang @ 2018-02-01  0:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Qemu Developers, Eduardo Habkost, Igor Mammedov, Paolo Bonzini,
	Michael S. Tsirkin, dgilbert, Xiao Guangrong, Stefan Hajnoczi

On 01/31/18 16:08 -0800, Dan Williams wrote:
> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> >> <haozhong.zhang@intel.com> wrote:
> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> >> > files on ext4/xfs file system mounted with '-o dax').
> >>
> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> >> metadata is in sync after a fault. However, that does not make
> >> filesystem-DAX safe for use with QEMU, because we still need to
> >> coordinate DMA with fileystem operations. There is no way to do that
> >> coordination from within a guest. QEMU needs to use device-dax if the
> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> >> pages to be mapped in EPT entries unless / until we have a solution to
> >> the DMA synchronization problem. Apologies for not noticing this
> >> earlier.
> >
> > QEMU does not truncate or punch holes of the file once it has been
> > mmap()'ed. Does the problem [1] still exist in such case?
> 
> Something else on the system might. The only agent that could enforce
> protection is the kernel, and the kernel will likely just disallow
> passing addresses from filesystem-dax vmas through to a guest
> altogether. I think there's even a problem in the non-DAX case unless
> KVM is pinning pages while they are handed out to a guest. The problem
> is that we don't have a page cache page to pin in the DAX case.
> 

Does it mean any user-space code like
  ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
  // make DMA to ptr
is unsafe?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
  2018-02-01  0:24       ` Haozhong Zhang
@ 2018-02-01  0:32         ` Dan Williams
  2018-02-01  2:29           ` Haozhong Zhang
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-02-01  0:32 UTC (permalink / raw)
  To: Dan Williams, Qemu Developers, Eduardo Habkost, Igor Mammedov,
	Paolo Bonzini, Michael S. Tsirkin, dgilbert, Xiao Guangrong,
	Stefan Hajnoczi

On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 01/31/18 16:08 -0800, Dan Williams wrote:
>> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>> >> <haozhong.zhang@intel.com> wrote:
>> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>> >> > files on ext4/xfs file system mounted with '-o dax').
>> >>
>> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>> >> metadata is in sync after a fault. However, that does not make
>> >> filesystem-DAX safe for use with QEMU, because we still need to
>> >> coordinate DMA with fileystem operations. There is no way to do that
>> >> coordination from within a guest. QEMU needs to use device-dax if the
>> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>> >> enforce this in the host kernel. I.e. do not allow file backed DAX
>> >> pages to be mapped in EPT entries unless / until we have a solution to
>> >> the DMA synchronization problem. Apologies for not noticing this
>> >> earlier.
>> >
>> > QEMU does not truncate or punch holes of the file once it has been
>> > mmap()'ed. Does the problem [1] still exist in such case?
>>
>> Something else on the system might. The only agent that could enforce
>> protection is the kernel, and the kernel will likely just disallow
>> passing addresses from filesystem-dax vmas through to a guest
>> altogether. I think there's even a problem in the non-DAX case unless
>> KVM is pinning pages while they are handed out to a guest. The problem
>> is that we don't have a page cache page to pin in the DAX case.
>>
>
> Does it mean any user-space code like
>   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>   // make DMA to ptr
> is unsafe?

Yes, it is currently unsafe because there is no coordination with the
filesytem if it decides to make block layout changes. We can fix that
in the non-virtualization case by having the filesystem wait for DMA
completion callbacks (i.e. what for all pages to be idle), but as far
as I can see we can't do the same coordination for DMA initiated by a
guest device driver.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
  2018-02-01  0:32         ` Dan Williams
@ 2018-02-01  2:29           ` Haozhong Zhang
  2018-02-01  3:02             ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Haozhong Zhang @ 2018-02-01  2:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: Qemu Developers, Eduardo Habkost, Igor Mammedov, Paolo Bonzini,
	Michael S. Tsirkin, dgilbert, Xiao Guangrong, Stefan Hajnoczi,
	Alex Williamson

+ vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.

On 01/31/18 16:32 -0800, Dan Williams wrote:
> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > On 01/31/18 16:08 -0800, Dan Williams wrote:
> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> >> <haozhong.zhang@intel.com> wrote:
> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> >> >> <haozhong.zhang@intel.com> wrote:
> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> >> >> > files on ext4/xfs file system mounted with '-o dax').
> >> >>
> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> >> >> metadata is in sync after a fault. However, that does not make
> >> >> filesystem-DAX safe for use with QEMU, because we still need to
> >> >> coordinate DMA with fileystem operations. There is no way to do that
> >> >> coordination from within a guest. QEMU needs to use device-dax if the
> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> >> >> pages to be mapped in EPT entries unless / until we have a solution to
> >> >> the DMA synchronization problem. Apologies for not noticing this
> >> >> earlier.
> >> >
> >> > QEMU does not truncate or punch holes of the file once it has been
> >> > mmap()'ed. Does the problem [1] still exist in such case?
> >>
> >> Something else on the system might. The only agent that could enforce
> >> protection is the kernel, and the kernel will likely just disallow
> >> passing addresses from filesystem-dax vmas through to a guest
> >> altogether. I think there's even a problem in the non-DAX case unless
> >> KVM is pinning pages while they are handed out to a guest. The problem
> >> is that we don't have a page cache page to pin in the DAX case.
> >>
> >
> > Does it mean any user-space code like
> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> >   // make DMA to ptr
> > is unsafe?
> 
> Yes, it is currently unsafe because there is no coordination with the
> filesytem if it decides to make block layout changes. We can fix that
> in the non-virtualization case by having the filesystem wait for DMA
> completion callbacks (i.e. what for all pages to be idle), but as far
> as I can see we can't do the same coordination for DMA initiated by a
> guest device driver.
> 

I think that fix [1] also works for KVM/QEMU. The guest DMA are
performed on two types of devices:

1. For emulated devices, the guest DMA requests are trapped and
   actually performed by QEMU on the host side. The host side fix [1]
   can cover this case.

2. For passthrough devices, vfio pins all pages, including those
   backed by dax mode files, used by the guest if any device is
   passthroughed to it. If I read the commit message in [2] correctly,
   operations that change the page-to-file offset association of pages
   from dax mode files will be deferred until the reference count of
   the affected pages becomes 1.  That is, if any passthrough device
   is used with a VM, the changes of page-to-file offset will not be
   able to happen until the VM is shutdown, so the fix [1] still takes
   effect here.

Another question is how a user-space application (e.g., QEMU) knows
whether it's safe to mmap a file on the DAX file system?

[1] https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2017-December/013713.html


Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
  2018-02-01  2:29           ` Haozhong Zhang
@ 2018-02-01  3:02             ` Dan Williams
  2018-02-01  3:11               ` Dan Williams
                                 ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Dan Williams @ 2018-02-01  3:02 UTC (permalink / raw)
  To: Dan Williams, Qemu Developers, Eduardo Habkost, Igor Mammedov,
	Paolo Bonzini, Michael S. Tsirkin, dgilbert, Xiao Guangrong,
	Stefan Hajnoczi, Alex Williamson

On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.
>
> On 01/31/18 16:32 -0800, Dan Williams wrote:
>> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > On 01/31/18 16:08 -0800, Dan Williams wrote:
>> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>> >> <haozhong.zhang@intel.com> wrote:
>> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>> >> >> <haozhong.zhang@intel.com> wrote:
>> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>> >> >> > files on ext4/xfs file system mounted with '-o dax').
>> >> >>
>> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>> >> >> metadata is in sync after a fault. However, that does not make
>> >> >> filesystem-DAX safe for use with QEMU, because we still need to
>> >> >> coordinate DMA with fileystem operations. There is no way to do that
>> >> >> coordination from within a guest. QEMU needs to use device-dax if the
>> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
>> >> >> pages to be mapped in EPT entries unless / until we have a solution to
>> >> >> the DMA synchronization problem. Apologies for not noticing this
>> >> >> earlier.
>> >> >
>> >> > QEMU does not truncate or punch holes of the file once it has been
>> >> > mmap()'ed. Does the problem [1] still exist in such case?
>> >>
>> >> Something else on the system might. The only agent that could enforce
>> >> protection is the kernel, and the kernel will likely just disallow
>> >> passing addresses from filesystem-dax vmas through to a guest
>> >> altogether. I think there's even a problem in the non-DAX case unless
>> >> KVM is pinning pages while they are handed out to a guest. The problem
>> >> is that we don't have a page cache page to pin in the DAX case.
>> >>
>> >
>> > Does it mean any user-space code like
>> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>> >   // make DMA to ptr
>> > is unsafe?
>>
>> Yes, it is currently unsafe because there is no coordination with the
>> filesytem if it decides to make block layout changes. We can fix that
>> in the non-virtualization case by having the filesystem wait for DMA
>> completion callbacks (i.e. what for all pages to be idle), but as far
>> as I can see we can't do the same coordination for DMA initiated by a
>> guest device driver.
>>
>
> I think that fix [1] also works for KVM/QEMU. The guest DMA are
> performed on two types of devices:
>
> 1. For emulated devices, the guest DMA requests are trapped and
>    actually performed by QEMU on the host side. The host side fix [1]
>    can cover this case.
>
> 2. For passthrough devices, vfio pins all pages, including those
>    backed by dax mode files, used by the guest if any device is
>    passthroughed to it. If I read the commit message in [2] correctly,
>    operations that change the page-to-file offset association of pages
>    from dax mode files will be deferred until the reference count of
>    the affected pages becomes 1.  That is, if any passthrough device
>    is used with a VM, the changes of page-to-file offset will not be
>    able to happen until the VM is shutdown, so the fix [1] still takes
>    effect here.

This sounds like a longterm mapping under control of vfio and not the
filesystem. See get_user_pages_longterm(), it is a problem if pages
are pinned indefinitely especially DAX. It sounds like vfio is in the
same boat as RDMA and cannot support long lived pins of DAX pages. As
of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
fix will be to create a "memory-registration with lease" semantic
available for RDMA so that the kernel can forcibly revoke page pinning
to perform physical layout changes. In the near it seems
vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
that filesystem-dax mappings are explicitly disallowed.

> Another question is how a user-space application (e.g., QEMU) knows
> whether it's safe to mmap a file on the DAX file system?

I think we fix vaddr_get_pfn() to start failing for DAX mappings
unless/until we can add a "with lease" mechanism. Userspace will know
when it is safe again when vfio stops failing.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
  2018-02-01  3:02             ` Dan Williams
@ 2018-02-01  3:11               ` Dan Williams
  2018-02-01 10:17               ` Haozhong Zhang
  2018-02-01 17:05               ` Michael S. Tsirkin
  2 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2018-02-01  3:11 UTC (permalink / raw)
  To: Dan Williams, Qemu Developers, Eduardo Habkost, Igor Mammedov,
	Paolo Bonzini, Michael S. Tsirkin, dgilbert, Xiao Guangrong,
	Stefan Hajnoczi, Alex Williamson, Michal Hocko, lsf-pc

[ adding Michal and lsf-pci ]

On Wed, Jan 31, 2018 at 7:02 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
>> + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.
>>
>> On 01/31/18 16:32 -0800, Dan Williams wrote:
>>> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
>>> <haozhong.zhang@intel.com> wrote:
>>> > On 01/31/18 16:08 -0800, Dan Williams wrote:
>>> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>>> >> <haozhong.zhang@intel.com> wrote:
>>> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>>> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>>> >> >> <haozhong.zhang@intel.com> wrote:
>>> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>>> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>>> >> >> > files on ext4/xfs file system mounted with '-o dax').
>>> >> >>
>>> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>>> >> >> metadata is in sync after a fault. However, that does not make
>>> >> >> filesystem-DAX safe for use with QEMU, because we still need to
>>> >> >> coordinate DMA with fileystem operations. There is no way to do that
>>> >> >> coordination from within a guest. QEMU needs to use device-dax if the
>>> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>>> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>>> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
>>> >> >> pages to be mapped in EPT entries unless / until we have a solution to
>>> >> >> the DMA synchronization problem. Apologies for not noticing this
>>> >> >> earlier.
>>> >> >
>>> >> > QEMU does not truncate or punch holes of the file once it has been
>>> >> > mmap()'ed. Does the problem [1] still exist in such case?
>>> >>
>>> >> Something else on the system might. The only agent that could enforce
>>> >> protection is the kernel, and the kernel will likely just disallow
>>> >> passing addresses from filesystem-dax vmas through to a guest
>>> >> altogether. I think there's even a problem in the non-DAX case unless
>>> >> KVM is pinning pages while they are handed out to a guest. The problem
>>> >> is that we don't have a page cache page to pin in the DAX case.
>>> >>
>>> >
>>> > Does it mean any user-space code like
>>> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>>> >   // make DMA to ptr
>>> > is unsafe?
>>>
>>> Yes, it is currently unsafe because there is no coordination with the
>>> filesytem if it decides to make block layout changes. We can fix that
>>> in the non-virtualization case by having the filesystem wait for DMA
>>> completion callbacks (i.e. what for all pages to be idle), but as far
>>> as I can see we can't do the same coordination for DMA initiated by a
>>> guest device driver.
>>>
>>
>> I think that fix [1] also works for KVM/QEMU. The guest DMA are
>> performed on two types of devices:
>>
>> 1. For emulated devices, the guest DMA requests are trapped and
>>    actually performed by QEMU on the host side. The host side fix [1]
>>    can cover this case.
>>
>> 2. For passthrough devices, vfio pins all pages, including those
>>    backed by dax mode files, used by the guest if any device is
>>    passthroughed to it. If I read the commit message in [2] correctly,
>>    operations that change the page-to-file offset association of pages
>>    from dax mode files will be deferred until the reference count of
>>    the affected pages becomes 1.  That is, if any passthrough device
>>    is used with a VM, the changes of page-to-file offset will not be
>>    able to happen until the VM is shutdown, so the fix [1] still takes
>>    effect here.
>
> This sounds like a longterm mapping under control of vfio and not the
> filesystem. See get_user_pages_longterm(), it is a problem if pages
> are pinned indefinitely especially DAX. It sounds like vfio is in the
> same boat as RDMA and cannot support long lived pins of DAX pages. As
> of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> fix will be to create a "memory-registration with lease" semantic
> available for RDMA so that the kernel can forcibly revoke page pinning
> to perform physical layout changes. In the near it seems
> vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> that filesystem-dax mappings are explicitly disallowed.
>
>> Another question is how a user-space application (e.g., QEMU) knows
>> whether it's safe to mmap a file on the DAX file system?
>
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.

Btw, there is an LSF/MM topic proposal on this subject [1].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-January/013935.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
  2018-02-01  3:02             ` Dan Williams
  2018-02-01  3:11               ` Dan Williams
@ 2018-02-01 10:17               ` Haozhong Zhang
  2018-02-01 17:43                 ` Alex Williamson
  2018-02-01 17:05               ` Michael S. Tsirkin
  2 siblings, 1 reply; 18+ messages in thread
From: Haozhong Zhang @ 2018-02-01 10:17 UTC (permalink / raw)
  To: Dan Williams, Paolo Bonzini, Radim Krčmář,
	Alex Williamson
  Cc: kvm, Qemu Developers, Eduardo Habkost, Igor Mammedov,
	Michael S. Tsirkin, dgilbert, Xiao Guangrong, Stefan Hajnoczi

On 01/31/18 19:02 -0800, Dan Williams wrote:
> On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.
> >
> > On 01/31/18 16:32 -0800, Dan Williams wrote:
> >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
> >> <haozhong.zhang@intel.com> wrote:
> >> > On 01/31/18 16:08 -0800, Dan Williams wrote:
> >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> >> >> <haozhong.zhang@intel.com> wrote:
> >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> >> >> >> <haozhong.zhang@intel.com> wrote:
> >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> >> >> >> > files on ext4/xfs file system mounted with '-o dax').
> >> >> >>
> >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> >> >> >> metadata is in sync after a fault. However, that does not make
> >> >> >> filesystem-DAX safe for use with QEMU, because we still need to
> >> >> >> coordinate DMA with fileystem operations. There is no way to do that
> >> >> >> coordination from within a guest. QEMU needs to use device-dax if the
> >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> >> >> >> pages to be mapped in EPT entries unless / until we have a solution to
> >> >> >> the DMA synchronization problem. Apologies for not noticing this
> >> >> >> earlier.
> >> >> >
> >> >> > QEMU does not truncate or punch holes of the file once it has been
> >> >> > mmap()'ed. Does the problem [1] still exist in such case?
> >> >>
> >> >> Something else on the system might. The only agent that could enforce
> >> >> protection is the kernel, and the kernel will likely just disallow
> >> >> passing addresses from filesystem-dax vmas through to a guest
> >> >> altogether. I think there's even a problem in the non-DAX case unless
> >> >> KVM is pinning pages while they are handed out to a guest. The problem
> >> >> is that we don't have a page cache page to pin in the DAX case.
> >> >>
> >> >
> >> > Does it mean any user-space code like
> >> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> >> >   // make DMA to ptr
> >> > is unsafe?
> >>
> >> Yes, it is currently unsafe because there is no coordination with the
> >> filesytem if it decides to make block layout changes. We can fix that
> >> in the non-virtualization case by having the filesystem wait for DMA
> >> completion callbacks (i.e. what for all pages to be idle), but as far
> >> as I can see we can't do the same coordination for DMA initiated by a
> >> guest device driver.
> >>
> >
> > I think that fix [1] also works for KVM/QEMU. The guest DMA are
> > performed on two types of devices:
> >
> > 1. For emulated devices, the guest DMA requests are trapped and
> >    actually performed by QEMU on the host side. The host side fix [1]
> >    can cover this case.
> >
> > 2. For passthrough devices, vfio pins all pages, including those
> >    backed by dax mode files, used by the guest if any device is
> >    passthroughed to it. If I read the commit message in [2] correctly,
> >    operations that change the page-to-file offset association of pages
> >    from dax mode files will be deferred until the reference count of
> >    the affected pages becomes 1.  That is, if any passthrough device
> >    is used with a VM, the changes of page-to-file offset will not be
> >    able to happen until the VM is shutdown, so the fix [1] still takes
> >    effect here.
> 
> This sounds like a longterm mapping under control of vfio and not the
> filesystem. See get_user_pages_longterm(), it is a problem if pages
> are pinned indefinitely especially DAX. It sounds like vfio is in the
> same boat as RDMA and cannot support long lived pins of DAX pages. As
> of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> fix will be to create a "memory-registration with lease" semantic
> available for RDMA so that the kernel can forcibly revoke page pinning
> to perform physical layout changes. In the near it seems
> vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> that filesystem-dax mappings are explicitly disallowed.

It seems that KVM and VFIO need to switch to get_user_pages_longterm()
which fails getting pages backed by dax mode files.

However, as get_user_pages() and its variants in the current KVM and
VFIO may be called after a VM starts running, e.g., handling EPT
violation on demand, and hotplugging a passthrough device to VM,
simply switching to the longterm version would cause VM crash in those
cases. Therefore, it also needs to patch or document in QEMU to not
use dax files with memory-backend-file. Paolo, Radim and Alex, what do
you think?

Thanks,
Haozhong

> 
> > Another question is how a user-space application (e.g., QEMU) knows
> > whether it's safe to mmap a file on the DAX file system?
> 
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
  2018-02-01  3:02             ` Dan Williams
  2018-02-01  3:11               ` Dan Williams
  2018-02-01 10:17               ` Haozhong Zhang
@ 2018-02-01 17:05               ` Michael S. Tsirkin
  2 siblings, 0 replies; 18+ messages in thread
From: Michael S. Tsirkin @ 2018-02-01 17:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Qemu Developers, Eduardo Habkost, Igor Mammedov, Paolo Bonzini,
	dgilbert, Xiao Guangrong, Stefan Hajnoczi, Alex Williamson

On Wed, Jan 31, 2018 at 07:02:27PM -0800, Dan Williams wrote:
> > Another question is how a user-space application (e.g., QEMU) knows
> > whether it's safe to mmap a file on the DAX file system?
> 
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.

I read some of the discussion around that but could not figure out what
exactly does happen if a file is truncated by a malicious userspace.
Could you let me know pls? Thanks!

-- 
MST

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
  2018-02-01 10:17               ` Haozhong Zhang
@ 2018-02-01 17:43                 ` Alex Williamson
  0 siblings, 0 replies; 18+ messages in thread
From: Alex Williamson @ 2018-02-01 17:43 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Dan Williams, Paolo Bonzini, Radim Krčmář,
	kvm, Qemu Developers, Eduardo Habkost, Igor Mammedov,
	Michael S. Tsirkin, dgilbert, Xiao Guangrong, Stefan Hajnoczi

On Thu, 1 Feb 2018 18:17:44 +0800
Haozhong Zhang <haozhong.zhang@intel.com> wrote:

> On 01/31/18 19:02 -0800, Dan Williams wrote:
> > On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> > <haozhong.zhang@intel.com> wrote:  
> > > + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.
> > >
> > > On 01/31/18 16:32 -0800, Dan Williams wrote:  
> > >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
> > >> <haozhong.zhang@intel.com> wrote:  
> > >> > On 01/31/18 16:08 -0800, Dan Williams wrote:  
> > >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> > >> >> <haozhong.zhang@intel.com> wrote:  
> > >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:  
> > >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> > >> >> >> <haozhong.zhang@intel.com> wrote:  
> > >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> > >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> > >> >> >> > files on ext4/xfs file system mounted with '-o dax').  
> > >> >> >>
> > >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> > >> >> >> metadata is in sync after a fault. However, that does not make
> > >> >> >> filesystem-DAX safe for use with QEMU, because we still need to
> > >> >> >> coordinate DMA with fileystem operations. There is no way to do that
> > >> >> >> coordination from within a guest. QEMU needs to use device-dax if the
> > >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> > >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> > >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> > >> >> >> pages to be mapped in EPT entries unless / until we have a solution to
> > >> >> >> the DMA synchronization problem. Apologies for not noticing this
> > >> >> >> earlier.  
> > >> >> >
> > >> >> > QEMU does not truncate or punch holes of the file once it has been
> > >> >> > mmap()'ed. Does the problem [1] still exist in such case?  
> > >> >>
> > >> >> Something else on the system might. The only agent that could enforce
> > >> >> protection is the kernel, and the kernel will likely just disallow
> > >> >> passing addresses from filesystem-dax vmas through to a guest
> > >> >> altogether. I think there's even a problem in the non-DAX case unless
> > >> >> KVM is pinning pages while they are handed out to a guest. The problem
> > >> >> is that we don't have a page cache page to pin in the DAX case.
> > >> >>  
> > >> >
> > >> > Does it mean any user-space code like
> > >> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> > >> >   // make DMA to ptr
> > >> > is unsafe?  
> > >>
> > >> Yes, it is currently unsafe because there is no coordination with the
> > >> filesytem if it decides to make block layout changes. We can fix that
> > >> in the non-virtualization case by having the filesystem wait for DMA
> > >> completion callbacks (i.e. what for all pages to be idle), but as far
> > >> as I can see we can't do the same coordination for DMA initiated by a
> > >> guest device driver.
> > >>  
> > >
> > > I think that fix [1] also works for KVM/QEMU. The guest DMA are
> > > performed on two types of devices:
> > >
> > > 1. For emulated devices, the guest DMA requests are trapped and
> > >    actually performed by QEMU on the host side. The host side fix [1]
> > >    can cover this case.
> > >
> > > 2. For passthrough devices, vfio pins all pages, including those
> > >    backed by dax mode files, used by the guest if any device is
> > >    passthroughed to it. If I read the commit message in [2] correctly,
> > >    operations that change the page-to-file offset association of pages
> > >    from dax mode files will be deferred until the reference count of
> > >    the affected pages becomes 1.  That is, if any passthrough device
> > >    is used with a VM, the changes of page-to-file offset will not be
> > >    able to happen until the VM is shutdown, so the fix [1] still takes
> > >    effect here.  
> > 
> > This sounds like a longterm mapping under control of vfio and not the
> > filesystem. See get_user_pages_longterm(), it is a problem if pages
> > are pinned indefinitely especially DAX. It sounds like vfio is in the
> > same boat as RDMA and cannot support long lived pins of DAX pages. As
> > of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> > fix will be to create a "memory-registration with lease" semantic
> > available for RDMA so that the kernel can forcibly revoke page pinning
> > to perform physical layout changes. In the near it seems
> > vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> > that filesystem-dax mappings are explicitly disallowed.  
> 
> It seems that KVM and VFIO need to switch to get_user_pages_longterm()
> which fails getting pages backed by dax mode files.
> 
> However, as get_user_pages() and its variants in the current KVM and
> VFIO may be called after a VM starts running, e.g., handling EPT
> violation on demand, and hotplugging a passthrough device to VM,
> simply switching to the longterm version would cause VM crash in those
> cases. Therefore, it also needs to patch or document in QEMU to not
> use dax files with memory-backend-file. Paolo, Radim and Alex, what do
> you think?

Yeah, it looks like vaddr_get_pfn() needs to do its own vma_is_fsdax()
check or convert it to the _longterm gup variant.  On hot-adding an
assigned device to a VM, QEMU should just fail the initfn of the
device, which would be non-fatal to the VM.  OTOH, if one of these
problem mappings can be hot added to the VM, such as via memory
hotplug, I think the mapping failure would be fatal to the VM.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-02-01 17:43 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-31  6:02 [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 1/6] util/mmap-alloc: switch qemu_ram_mmap() to 'flags' parameter Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 2/6] exec: switch qemu_ram_alloc_from_{file, fd} to the " Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 3/6] memory: switch memory_region_init_ram_from_file() to " Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 4/6] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap() Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 5/6] hostmem: add more information in error messages Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 6/6] hostmem-file: add 'sync' option Haozhong Zhang
2018-01-31 22:25 ` [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Dan Williams
2018-02-01  0:02   ` Haozhong Zhang
2018-02-01  0:08     ` Dan Williams
2018-02-01  0:24       ` Haozhong Zhang
2018-02-01  0:32         ` Dan Williams
2018-02-01  2:29           ` Haozhong Zhang
2018-02-01  3:02             ` Dan Williams
2018-02-01  3:11               ` Dan Williams
2018-02-01 10:17               ` Haozhong Zhang
2018-02-01 17:43                 ` Alex Williamson
2018-02-01 17:05               ` Michael S. Tsirkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.