All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property
@ 2021-02-09 13:49 David Hildenbrand
  2021-02-09 13:49 ` [PATCH v1 1/9] softmmu/physmem: drop "shared" parameter from ram_block_add() David Hildenbrand
                   ` (9 more replies)
  0 siblings, 10 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-02-09 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

Some cleanups previously sent in other context (resizeable allocations),
followed by RAM_NORESERVE, implementing it under POSIX using MAP_NORESERVE,
and letting users configure it for memory backens using the "reserve"
property (default: true).

MAP_NORESERVE under Linux has in the context of QEMU an effect on
1) Private anonymous memory
-> memory-backend-ram,id=mem0,size=10G
2) Private file-based mappings
-> memory-backend-file,id=mem0,size=10G,mem-path=/dev/shm/0
3) Private/shared hugetlbfs memory
-> memory-backend-memfd,id=mem0,size=10G,hugetlb=on,hugetlbsize=2M

With MAP_NORESERVE/"reserve=off", we won't be reserving swap space (1/2) or
huge pages (3) for the whole memory region.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM. MAP_NORESERVE tells the OS
"this mapping might be very sparse". This essentially allows
avoiding to set "/proc/sys/vm/overcommit_memory == 0") when using
virtio-mem and also supporting hugetlbfs in the future.

virtio-mem currently only supports anonymous memory, in the future we want
to also support shared file-based and shared hugetlbfs mappings. We most
probably won't be supporting private mappings as they can end up behaving
very weird when it comes to memory consumption.

Future work for virtio-mem I am currently working on includes
1. Introducing a prealloc option for virtio-mem (e.g., using fallocate()
   when plugging blocks) to fail nicely when running out of
   backing storage like huge pages.
2. Supporting resizable RAM block/memoryr egions, such that we won't always
   expose a large, sparse memory region to the VM.
3. Protecting unplugged memory e.g., using userfaultfd.
4. (resizeable allocations / optimized mmap handling when resizing RAM
    blocks)

Cc: Peter Xu <peterx@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "Philippe Mathieu-Daudé" <philmd@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Cc: Greg Kurz <groug@kaod.org>
Cc: Liam Merwick <liam.merwick@oracle.com>

David Hildenbrand (9):
  softmmu/physmem: drop "shared" parameter from ram_block_add()
  util/mmap-alloc: factor out calculation of the pagesize for the guard
    page
  util/mmap-alloc: factor out reserving of a memory region to
    mmap_reserve()
  util/mmap-alloc: factor out activating of memory to mmap_activate()
  softmmu/memory: pass ram_flags into qemu_ram_alloc_from_fd()
  softmmu/memory: pass ram_flags into
    memory_region_init_ram_shared_nomigrate()
  memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE
  hostmem: wire up RAM_NORESERVE via "reserve" property

 backends/hostmem-file.c                       |  11 +-
 backends/hostmem-memfd.c                      |   8 +-
 backends/hostmem-ram.c                        |   7 +-
 backends/hostmem.c                            |  33 +++
 hw/m68k/next-cube.c                           |   4 +-
 hw/misc/ivshmem.c                             |   4 +-
 include/exec/cpu-common.h                     |   1 +
 include/exec/memory.h                         |  43 ++--
 include/exec/ram_addr.h                       |   9 +-
 include/qemu/mmap-alloc.h                     |   4 +-
 include/qemu/osdep.h                          |   3 +-
 include/sysemu/hostmem.h                      |   2 +-
 include/sysemu/kvm.h                          |   3 +-
 migration/ram.c                               |   3 +-
 .../memory-region-housekeeping.cocci          |   8 +-
 softmmu/memory.c                              |  27 ++-
 softmmu/physmem.c                             |  48 +++--
 target/s390x/kvm.c                            |   6 +-
 util/mmap-alloc.c                             | 188 ++++++++++++------
 util/oslib-posix.c                            |   5 +-
 util/oslib-win32.c                            |  13 +-
 21 files changed, 281 insertions(+), 149 deletions(-)

-- 
2.29.2



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v1 1/9] softmmu/physmem: drop "shared" parameter from ram_block_add()
  2021-02-09 13:49 [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
@ 2021-02-09 13:49 ` David Hildenbrand
  2021-02-09 13:49 ` [PATCH v1 2/9] util/mmap-alloc: factor out calculation of the pagesize for the guard page David Hildenbrand
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-02-09 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

Properly store it in the flags of the ram block instead (and the flag
even already exists and is used).

E.g., qemu_ram_is_shared() now properly succeeds on all ram blocks that are
actually shared.

Reviewed-by: Igor Kotrasinski <i.kotrasinsk@partner.samsung.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/physmem.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 243c3097d3..6b6c3be605 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1938,7 +1938,7 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
     }
 }
 
-static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
+static void ram_block_add(RAMBlock *new_block, Error **errp)
 {
     RAMBlock *block;
     RAMBlock *last_block = NULL;
@@ -1961,7 +1961,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
             }
         } else {
             new_block->host = phys_mem_alloc(new_block->max_length,
-                                             &new_block->mr->align, shared);
+                                             &new_block->mr->align,
+                                             qemu_ram_is_shared(new_block));
             if (!new_block->host) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
@@ -2085,7 +2086,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
-    ram_block_add(new_block, &local_err, ram_flags & RAM_SHARED);
+    ram_block_add(new_block, &local_err);
     if (local_err) {
         g_free(new_block);
         error_propagate(errp, local_err);
@@ -2148,10 +2149,13 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     if (host) {
         new_block->flags |= RAM_PREALLOC;
     }
+    if (share) {
+        new_block->flags |= RAM_SHARED;
+    }
     if (resizeable) {
         new_block->flags |= RAM_RESIZEABLE;
     }
-    ram_block_add(new_block, &local_err, share);
+    ram_block_add(new_block, &local_err);
     if (local_err) {
         g_free(new_block);
         error_propagate(errp, local_err);
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 2/9] util/mmap-alloc: factor out calculation of the pagesize for the guard page
  2021-02-09 13:49 [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
  2021-02-09 13:49 ` [PATCH v1 1/9] softmmu/physmem: drop "shared" parameter from ram_block_add() David Hildenbrand
@ 2021-02-09 13:49 ` David Hildenbrand
  2021-02-09 13:49 ` [PATCH v1 3/9] util/mmap-alloc: factor out reserving of a memory region to mmap_reserve() David Hildenbrand
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-02-09 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

Let's factor out calculating the size of the guard page and rename the
variable to make it clearer that this pagesize only applies to the
guard page.

Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Cc: Igor Kotrasinski <i.kotrasinsk@partner.samsung.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 890fda6a35..8bdf1f9df8 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -82,6 +82,16 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
     return qemu_real_host_page_size;
 }
 
+static inline size_t mmap_guard_pagesize(int fd)
+{
+#if defined(__powerpc64__) && defined(__linux__)
+    /* Mappings in the same segment must share the same page size */
+    return qemu_fd_getpagesize(fd);
+#else
+    return qemu_real_host_page_size;
+#endif
+}
+
 void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
@@ -89,12 +99,12 @@ void *qemu_ram_mmap(int fd,
                     bool shared,
                     bool is_pmem)
 {
+    const size_t guard_pagesize = mmap_guard_pagesize(fd);
     int prot;
     int flags;
     int map_sync_flags = 0;
     int guardfd;
     size_t offset;
-    size_t pagesize;
     size_t total;
     void *guardptr;
     void *ptr;
@@ -115,8 +125,7 @@ void *qemu_ram_mmap(int fd,
      * anonymous memory is OK.
      */
     flags = MAP_PRIVATE;
-    pagesize = qemu_fd_getpagesize(fd);
-    if (fd == -1 || pagesize == qemu_real_host_page_size) {
+    if (fd == -1 || guard_pagesize == qemu_real_host_page_size) {
         guardfd = -1;
         flags |= MAP_ANONYMOUS;
     } else {
@@ -125,7 +134,6 @@ void *qemu_ram_mmap(int fd,
     }
 #else
     guardfd = -1;
-    pagesize = qemu_real_host_page_size;
     flags = MAP_PRIVATE | MAP_ANONYMOUS;
 #endif
 
@@ -137,7 +145,7 @@ void *qemu_ram_mmap(int fd,
 
     assert(is_power_of_2(align));
     /* Always align to host page size */
-    assert(align >= pagesize);
+    assert(align >= guard_pagesize);
 
     flags = MAP_FIXED;
     flags |= fd == -1 ? MAP_ANONYMOUS : 0;
@@ -191,8 +199,8 @@ void *qemu_ram_mmap(int fd,
      * a guard page guarding against potential buffer overflows.
      */
     total -= offset;
-    if (total > size + pagesize) {
-        munmap(ptr + size + pagesize, total - size - pagesize);
+    if (total > size + guard_pagesize) {
+        munmap(ptr + size + guard_pagesize, total - size - guard_pagesize);
     }
 
     return ptr;
@@ -200,15 +208,8 @@ void *qemu_ram_mmap(int fd,
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size)
 {
-    size_t pagesize;
-
     if (ptr) {
         /* Unmap both the RAM block and the guard page */
-#if defined(__powerpc64__) && defined(__linux__)
-        pagesize = qemu_fd_getpagesize(fd);
-#else
-        pagesize = qemu_real_host_page_size;
-#endif
-        munmap(ptr, size + pagesize);
+        munmap(ptr, size + mmap_guard_pagesize(fd));
     }
 }
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 3/9] util/mmap-alloc: factor out reserving of a memory region to mmap_reserve()
  2021-02-09 13:49 [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
  2021-02-09 13:49 ` [PATCH v1 1/9] softmmu/physmem: drop "shared" parameter from ram_block_add() David Hildenbrand
  2021-02-09 13:49 ` [PATCH v1 2/9] util/mmap-alloc: factor out calculation of the pagesize for the guard page David Hildenbrand
@ 2021-02-09 13:49 ` David Hildenbrand
  2021-02-09 13:49 ` [PATCH v1 4/9] util/mmap-alloc: factor out activating of memory to mmap_activate() David Hildenbrand
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-02-09 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

We want to reserve a memory region without actually populating memory.
Let's factor that out.

Reviewed-by: Igor Kotrasinski <i.kotrasinsk@partner.samsung.com>
Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 58 +++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 8bdf1f9df8..5c2bfe4c99 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -82,6 +82,38 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
     return qemu_real_host_page_size;
 }
 
+/*
+ * Reserve a new memory region of the requested size to be used for mapping
+ * from the given fd (if any).
+ */
+static void *mmap_reserve(size_t size, int fd)
+{
+    int flags = MAP_PRIVATE;
+
+#if defined(__powerpc64__) && defined(__linux__)
+    /*
+     * On ppc64 mappings in the same segment (aka slice) must share the same
+     * page size. Since we will be re-allocating part of this segment
+     * from the supplied fd, we should make sure to use the same page size, to
+     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
+     * avoid allocating backing store memory.
+     * We do this unless we are using the system page size, in which case
+     * anonymous memory is OK.
+     */
+    if (fd == -1 || qemu_fd_getpagesize(fd) == qemu_real_host_page_size) {
+        fd = -1;
+        flags |= MAP_ANONYMOUS;
+    } else {
+        flags |= MAP_NORESERVE;
+    }
+#else
+    fd = -1;
+    flags |= MAP_ANONYMOUS;
+#endif
+
+    return mmap(0, size, PROT_NONE, flags, fd, 0);
+}
+
 static inline size_t mmap_guard_pagesize(int fd)
 {
 #if defined(__powerpc64__) && defined(__linux__)
@@ -103,7 +135,6 @@ void *qemu_ram_mmap(int fd,
     int prot;
     int flags;
     int map_sync_flags = 0;
-    int guardfd;
     size_t offset;
     size_t total;
     void *guardptr;
@@ -115,30 +146,7 @@ void *qemu_ram_mmap(int fd,
      */
     total = size + align;
 
-#if defined(__powerpc64__) && defined(__linux__)
-    /* On ppc64 mappings in the same segment (aka slice) must share the same
-     * page size. Since we will be re-allocating part of this segment
-     * from the supplied fd, we should make sure to use the same page size, to
-     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
-     * avoid allocating backing store memory.
-     * We do this unless we are using the system page size, in which case
-     * anonymous memory is OK.
-     */
-    flags = MAP_PRIVATE;
-    if (fd == -1 || guard_pagesize == qemu_real_host_page_size) {
-        guardfd = -1;
-        flags |= MAP_ANONYMOUS;
-    } else {
-        guardfd = fd;
-        flags |= MAP_NORESERVE;
-    }
-#else
-    guardfd = -1;
-    flags = MAP_PRIVATE | MAP_ANONYMOUS;
-#endif
-
-    guardptr = mmap(0, total, PROT_NONE, flags, guardfd, 0);
-
+    guardptr = mmap_reserve(total, fd);
     if (guardptr == MAP_FAILED) {
         return MAP_FAILED;
     }
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 4/9] util/mmap-alloc: factor out activating of memory to mmap_activate()
  2021-02-09 13:49 [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (2 preceding siblings ...)
  2021-02-09 13:49 ` [PATCH v1 3/9] util/mmap-alloc: factor out reserving of a memory region to mmap_reserve() David Hildenbrand
@ 2021-02-09 13:49 ` David Hildenbrand
  2021-02-09 13:49 ` [PATCH v1 5/9] softmmu/memory: pass ram_flags into qemu_ram_alloc_from_fd() David Hildenbrand
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-02-09 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

We want to activate memory within a reserved memory region, to make it
accessible. Let's factor that out.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 91 +++++++++++++++++++++++++----------------------
 1 file changed, 48 insertions(+), 43 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 5c2bfe4c99..b50dc86a3c 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -114,6 +114,51 @@ static void *mmap_reserve(size_t size, int fd)
     return mmap(0, size, PROT_NONE, flags, fd, 0);
 }
 
+/*
+ * Activate memory in a reserved region from the given fd (if any), to make
+ * it accessible.
+ */
+static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly,
+                           bool shared, bool is_pmem)
+{
+    const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
+    int map_sync_flags = 0;
+    int flags = MAP_FIXED;
+    void *activated_ptr;
+
+    flags |= fd == -1 ? MAP_ANONYMOUS : 0;
+    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
+    if (shared && is_pmem) {
+        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
+    }
+
+    activated_ptr = mmap(ptr, size, prot, flags | map_sync_flags, fd, 0);
+    if (activated_ptr == MAP_FAILED && map_sync_flags) {
+        if (errno == ENOTSUP) {
+            char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
+            char *file_name = g_malloc0(PATH_MAX);
+            int len = readlink(proc_link, file_name, PATH_MAX - 1);
+
+            if (len < 0) {
+                len = 0;
+            }
+            file_name[len] = '\0';
+            fprintf(stderr, "Warning: requesting persistence across crashes "
+                    "for backend file %s failed. Proceeding without "
+                    "persistence, data might become corrupted in case of host "
+                    "crash.\n", file_name);
+            g_free(proc_link);
+            g_free(file_name);
+        }
+        /*
+         * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
+         * again without these flags to handle backwards compatibility.
+         */
+        activated_ptr = mmap(ptr, size, prot, flags, fd, 0);
+    }
+    return activated_ptr;
+}
+
 static inline size_t mmap_guard_pagesize(int fd)
 {
 #if defined(__powerpc64__) && defined(__linux__)
@@ -132,13 +177,8 @@ void *qemu_ram_mmap(int fd,
                     bool is_pmem)
 {
     const size_t guard_pagesize = mmap_guard_pagesize(fd);
-    int prot;
-    int flags;
-    int map_sync_flags = 0;
-    size_t offset;
-    size_t total;
-    void *guardptr;
-    void *ptr;
+    size_t offset, total;
+    void *ptr, *guardptr;
 
     /*
      * Note: this always allocates at least one extra page of virtual address
@@ -155,44 +195,9 @@ void *qemu_ram_mmap(int fd,
     /* Always align to host page size */
     assert(align >= guard_pagesize);
 
-    flags = MAP_FIXED;
-    flags |= fd == -1 ? MAP_ANONYMOUS : 0;
-    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
-    if (shared && is_pmem) {
-        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
-    }
-
     offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
 
-    prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
-
-    ptr = mmap(guardptr + offset, size, prot, flags | map_sync_flags, fd, 0);
-
-    if (ptr == MAP_FAILED && map_sync_flags) {
-        if (errno == ENOTSUP) {
-            char *proc_link, *file_name;
-            int len;
-            proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
-            file_name = g_malloc0(PATH_MAX);
-            len = readlink(proc_link, file_name, PATH_MAX - 1);
-            if (len < 0) {
-                len = 0;
-            }
-            file_name[len] = '\0';
-            fprintf(stderr, "Warning: requesting persistence across crashes "
-                    "for backend file %s failed. Proceeding without "
-                    "persistence, data might become corrupted in case of host "
-                    "crash.\n", file_name);
-            g_free(proc_link);
-            g_free(file_name);
-        }
-        /*
-         * if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
-         * we will remove these flags to handle compatibility.
-         */
-        ptr = mmap(guardptr + offset, size, prot, flags, fd, 0);
-    }
-
+    ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem);
     if (ptr == MAP_FAILED) {
         munmap(guardptr, total);
         return MAP_FAILED;
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 5/9] softmmu/memory: pass ram_flags into qemu_ram_alloc_from_fd()
  2021-02-09 13:49 [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (3 preceding siblings ...)
  2021-02-09 13:49 ` [PATCH v1 4/9] util/mmap-alloc: factor out activating of memory to mmap_activate() David Hildenbrand
@ 2021-02-09 13:49 ` David Hildenbrand
  2021-03-02 17:17   ` Peter Xu
  2021-02-09 13:49 ` [PATCH v1 6/9] softmmu/memory: pass ram_flags into memory_region_init_ram_shared_nomigrate() David Hildenbrand
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2021-02-09 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

Let's pass in ram flags just like we do with qemu_ram_alloc_from_file(),
to clean up and prepare for more flags.

Simplify the documentation of passed ram flags: Looking at our
documentation of RAM_SHARED and RAM_PMEM is sufficient, no need to be
repetitive.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem-memfd.c | 7 ++++---
 hw/misc/ivshmem.c        | 4 ++--
 include/exec/memory.h    | 9 +++------
 include/exec/ram_addr.h  | 6 +-----
 softmmu/memory.c         | 7 +++----
 5 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index e5626d4330..54db08de0d 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -36,6 +36,7 @@ static void
 memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
     HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend);
+    uint32_t ram_flags;
     char *name;
     int fd;
 
@@ -53,9 +54,9 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     name = host_memory_backend_get_name(backend);
-    memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
-                                   name, backend->size,
-                                   backend->share, fd, errp);
+    ram_flags = backend->share ? RAM_SHARED : 0;
+    memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
+                                   backend->size, ram_flags, fd, errp);
     g_free(name);
 }
 
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 0505b52c98..4cdce136e1 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -494,8 +494,8 @@ static void process_msg_shmem(IVShmemState *s, int fd, Error **errp)
     size = buf.st_size;
 
     /* mmap the region and map into the BAR2 */
-    memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s),
-                                   "ivshmem.bar2", size, true, fd, &local_err);
+    memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s), "ivshmem.bar2",
+                                   size, RAM_SHARED, fd, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
         return;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index e58c09f130..7b635b334f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -969,10 +969,7 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  *         (getpagesize()) will be used.
- * @ram_flags: Memory region features:
- *             - RAM_SHARED: memory must be mmaped with the MAP_SHARED flag
- *             - RAM_PMEM: the memory is persistent memory
- *             Other bits are ignored now.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  * @path: the path in which to allocate the RAM.
  * @readonly: true to open @path for reading, false for read/write.
  * @errp: pointer to Error*, to store an error if it happens.
@@ -998,7 +995,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @owner: the object that tracks the region's reference count
  * @name: the name of the region.
  * @size: size of the region.
- * @share: %true if memory must be mmaped with the MAP_SHARED flag
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  * @fd: the fd to mmap.
  * @errp: pointer to Error*, to store an error if it happens.
  *
@@ -1009,7 +1006,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
                                     struct Object *owner,
                                     const char *name,
                                     uint64_t size,
-                                    bool share,
+                                    uint32_t ram_flags,
                                     int fd,
                                     Error **errp);
 #endif
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 40b16609ab..86fa712134 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -104,11 +104,7 @@ long qemu_maxrampagesize(void);
  * Parameters:
  *  @size: the size in bytes of the ram block
  *  @mr: the memory region where the ram block is
- *  @ram_flags: specify the properties of the ram block, which can be one
- *              or bit-or of following values
- *              - RAM_SHARED: mmap the backing file or device with MAP_SHARED
- *              - RAM_PMEM: the backend @mem_path or @fd is persistent memory
- *              Other bits are ignored.
+ *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  *  @mem_path or @fd: specify the backing file or device
  *  @readonly: true to open @path for reading, false for read/write.
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/softmmu/memory.c b/softmmu/memory.c
index c0c814fbb9..25120ecae7 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1610,7 +1610,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
                                     struct Object *owner,
                                     const char *name,
                                     uint64_t size,
-                                    bool share,
+                                    uint32_t ram_flags,
                                     int fd,
                                     Error **errp)
 {
@@ -1619,9 +1619,8 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc_from_fd(size, mr,
-                                           share ? RAM_SHARED : 0,
-                                           fd, false, &err);
+    mr->ram_block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, false,
+                                           &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 6/9] softmmu/memory: pass ram_flags into memory_region_init_ram_shared_nomigrate()
  2021-02-09 13:49 [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (4 preceding siblings ...)
  2021-02-09 13:49 ` [PATCH v1 5/9] softmmu/memory: pass ram_flags into qemu_ram_alloc_from_fd() David Hildenbrand
@ 2021-02-09 13:49 ` David Hildenbrand
  2021-03-02 17:17   ` Peter Xu
  2021-02-09 13:49   ` David Hildenbrand
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2021-02-09 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

Let's forward ram_flags instead, renaming
memory_region_init_ram_shared_nomigrate() into
memory_region_init_ram_flags_nomigrate(). Forward flags to
qemu_ram_alloc() and qemu_ram_alloc_internal().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem-ram.c                        |  6 +++--
 hw/m68k/next-cube.c                           |  4 ++--
 include/exec/memory.h                         | 24 +++++++++----------
 include/exec/ram_addr.h                       |  2 +-
 .../memory-region-housekeeping.cocci          |  8 +++----
 softmmu/memory.c                              | 20 ++++++++--------
 softmmu/physmem.c                             | 24 ++++++++-----------
 7 files changed, 43 insertions(+), 45 deletions(-)

diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index 5cc53e76c9..741e701062 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -19,6 +19,7 @@
 static void
 ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
+    uint32_t ram_flags;
     char *name;
 
     if (!backend->size) {
@@ -27,8 +28,9 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     name = host_memory_backend_get_name(backend);
-    memory_region_init_ram_shared_nomigrate(&backend->mr, OBJECT(backend), name,
-                           backend->size, backend->share, errp);
+    ram_flags = backend->share ? RAM_SHARED : 0;
+    memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
+                                           backend->size, ram_flags, errp);
     g_free(name);
 }
 
diff --git a/hw/m68k/next-cube.c b/hw/m68k/next-cube.c
index 92b45d760f..59ccae0d5e 100644
--- a/hw/m68k/next-cube.c
+++ b/hw/m68k/next-cube.c
@@ -986,8 +986,8 @@ static void next_cube_init(MachineState *machine)
     sysbus_mmio_map(SYS_BUS_DEVICE(pcdev), 1, 0x02100000);
 
     /* BMAP memory */
-    memory_region_init_ram_shared_nomigrate(bmapm1, NULL, "next.bmapmem", 64,
-                                            true, &error_fatal);
+    memory_region_init_ram_flags_nomigrate(bmapm1, NULL, "next.bmapmem", 64,
+                                           RAM_SHARED, &error_fatal);
     memory_region_add_subregion(sysmem, 0x020c0000, bmapm1);
     /* The Rev_2.5_v66.bin firmware accesses it at 0x820c0020, too */
     memory_region_init_alias(bmapm2, NULL, "next.bmapmem2", bmapm1, 0x0, 64);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7b635b334f..7d2db168c7 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -906,27 +906,27 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       Error **errp);
 
 /**
- * memory_region_init_ram_shared_nomigrate:  Initialize RAM memory region.
- *                                           Accesses into the region will
- *                                           modify memory directly.
+ * memory_region_init_ram_flags_nomigrate:  Initialize RAM memory region.
+ *                                          Accesses into the region will
+ *                                          modify memory directly.
  *
  * @mr: the #MemoryRegion to be initialized.
  * @owner: the object that tracks the region's reference count
  * @name: Region name, becomes part of RAMBlock name used in migration stream
  *        must be unique within any device
  * @size: size of the region.
- * @share: allow remapping RAM to different addresses
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED.
  * @errp: pointer to Error*, to store an error if it happens.
  *
- * Note that this function is similar to memory_region_init_ram_nomigrate.
- * The only difference is part of the RAM region can be remapped.
+ * Note that this function does not do anything to cause the data in the
+ * RAM memory region to be migrated; that is the responsibility of the caller.
  */
-void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr,
-                                             struct Object *owner,
-                                             const char *name,
-                                             uint64_t size,
-                                             bool share,
-                                             Error **errp);
+void memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
+                                            struct Object *owner,
+                                            const char *name,
+                                            uint64_t size,
+                                            uint32_t ram_flags,
+                                            Error **errp);
 
 /**
  * memory_region_init_resizeable_ram:  Initialize memory region with resizeable
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 86fa712134..ce9e140c54 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -122,7 +122,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
 
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                   MemoryRegion *mr, Error **errp);
-RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share, MemoryRegion *mr,
+RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags, MemoryRegion *mr,
                          Error **errp);
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t max_size,
                                     void (*resized)(const char*,
diff --git a/scripts/coccinelle/memory-region-housekeeping.cocci b/scripts/coccinelle/memory-region-housekeeping.cocci
index c768d8140a..29651ebde9 100644
--- a/scripts/coccinelle/memory-region-housekeeping.cocci
+++ b/scripts/coccinelle/memory-region-housekeeping.cocci
@@ -127,8 +127,8 @@ static void device_fn(DeviceState *dev, ...)
 - memory_region_init_rom(E1, NULL, E2, E3, E4);
 + memory_region_init_rom(E1, obj, E2, E3, E4);
 |
-- memory_region_init_ram_shared_nomigrate(E1, NULL, E2, E3, E4, E5);
-+ memory_region_init_ram_shared_nomigrate(E1, obj, E2, E3, E4, E5);
+- memory_region_init_ram_flags_nomigrate(E1, NULL, E2, E3, E4, E5);
++ memory_region_init_ram_flags_nomigrate(E1, obj, E2, E3, E4, E5);
 )
   ...+>
 }
@@ -152,8 +152,8 @@ static void device_fn(DeviceState *dev, ...)
 - memory_region_init_rom(E1, NULL, E2, E3, E4);
 + memory_region_init_rom(E1, OBJECT(dev), E2, E3, E4);
 |
-- memory_region_init_ram_shared_nomigrate(E1, NULL, E2, E3, E4, E5);
-+ memory_region_init_ram_shared_nomigrate(E1, OBJECT(dev), E2, E3, E4, E5);
+- memory_region_init_ram_flags_nomigrate(E1, NULL, E2, E3, E4, E5);
++ memory_region_init_ram_flags_nomigrate(E1, OBJECT(dev), E2, E3, E4, E5);
 )
   ...+>
 }
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 25120ecae7..78c9dd6d5e 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1532,22 +1532,22 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
-    memory_region_init_ram_shared_nomigrate(mr, owner, name, size, false, errp);
+    memory_region_init_ram_flags_nomigrate(mr, owner, name, size, 0, errp);
 }
 
-void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr,
-                                             Object *owner,
-                                             const char *name,
-                                             uint64_t size,
-                                             bool share,
-                                             Error **errp)
+void memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
+                                            Object *owner,
+                                            const char *name,
+                                            uint64_t size,
+                                            uint32_t ram_flags,
+                                            Error **errp)
 {
     Error *err = NULL;
     memory_region_init(mr, owner, name, size);
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, share, mr, &err);
+    mr->ram_block = qemu_ram_alloc(size, ram_flags, mr, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
@@ -1682,7 +1682,7 @@ void memory_region_init_rom_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
-    memory_region_init_ram_shared_nomigrate(mr, owner, name, size, false, errp);
+    memory_region_init_ram_flags_nomigrate(mr, owner, name, size, 0, errp);
     mr->readonly = true;
 }
 
@@ -1702,7 +1702,7 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
     mr->terminates = true;
     mr->rom_device = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, false,  mr, &err);
+    mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 6b6c3be605..2243e2a87a 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2129,12 +2129,14 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                   void (*resized)(const char*,
                                                   uint64_t length,
                                                   void *host),
-                                  void *host, bool resizeable, bool share,
+                                  void *host, uint32_t ram_flags,
                                   MemoryRegion *mr, Error **errp)
 {
     RAMBlock *new_block;
     Error *local_err = NULL;
 
+    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE)) == 0);
+
     size = HOST_PAGE_ALIGN(size);
     max_size = HOST_PAGE_ALIGN(max_size);
     new_block = g_malloc0(sizeof(*new_block));
@@ -2146,15 +2148,10 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     new_block->fd = -1;
     new_block->page_size = qemu_real_host_page_size;
     new_block->host = host;
+    new_block->flags = ram_flags;
     if (host) {
         new_block->flags |= RAM_PREALLOC;
     }
-    if (share) {
-        new_block->flags |= RAM_SHARED;
-    }
-    if (resizeable) {
-        new_block->flags |= RAM_RESIZEABLE;
-    }
     ram_block_add(new_block, &local_err);
     if (local_err) {
         g_free(new_block);
@@ -2167,15 +2164,14 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                    MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, size, NULL, host, false,
-                                   false, mr, errp);
+    return qemu_ram_alloc_internal(size, size, NULL, host, 0, mr, errp);
 }
 
-RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share,
+RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                          MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, size, NULL, NULL, false,
-                                   share, mr, errp);
+    assert((ram_flags & ~RAM_SHARED) == 0);
+    return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
 }
 
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
@@ -2184,8 +2180,8 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
                                                      void *host),
                                      MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, maxsz, resized, NULL, true,
-                                   false, mr, errp);
+    return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
+                                   RAM_RESIZEABLE, mr, errp);
 }
 
 static void reclaim_ramblock(RAMBlock *block)
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-02-09 13:49 [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
@ 2021-02-09 13:49   ` David Hildenbrand
  2021-02-09 13:49 ` [PATCH v1 2/9] util/mmap-alloc: factor out calculation of the pagesize for the guard page David Hildenbrand
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-02-09 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Peter Xu, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Christian Borntraeger, Cornelia Huck, Halil Pasic,
	Igor Kotrasinski, Juan Quintela, Stefan Weil, Thomas Huth, kvm,
	qemu-s390x

Let's introduce RAM_NORESERVE, allowing mmap'ing with MAP_NORESERVE. The
new flag has the following semantics:

  RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or
  huge pages on Linux) is skipped: will bail out if not supported. When not
  set, the OS might reserve swap space (or huge pages on Linux), depending
  on OS support.

Allow passing it into:
- memory_region_init_ram_nomigrate()
- memory_region_init_resizeable_ram()
- memory_region_init_ram_from_file()

... and teach qemu_ram_mmap() and qemu_anon_ram_alloc() about the flag.
Bail out if the flag is not supported, which is the case right now for
both, POSIX and win32. We will add the POSIX mmap implementation next and
allow specifying RAM_NORESERVE via memory backends.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

Cc: Juan Quintela <quintela@redhat.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Stefan Weil <sw@weilnetz.de>
Cc: kvm@vger.kernel.org
Cc: qemu-s390x@nongnu.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/cpu-common.h |  1 +
 include/exec/memory.h     | 16 +++++++++++++---
 include/exec/ram_addr.h   |  3 ++-
 include/qemu/mmap-alloc.h |  4 +++-
 include/qemu/osdep.h      |  3 ++-
 include/sysemu/kvm.h      |  3 ++-
 migration/ram.c           |  3 +--
 softmmu/physmem.c         | 23 ++++++++++++++++-------
 target/s390x/kvm.c        |  6 +++++-
 util/mmap-alloc.c         |  9 ++++++++-
 util/oslib-posix.c        |  5 +++--
 util/oslib-win32.c        | 13 ++++++++++++-
 12 files changed, 68 insertions(+), 21 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 5a0a2d93e0..38a47ad4ac 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -58,6 +58,7 @@ void *qemu_ram_get_host_addr(RAMBlock *rb);
 ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
+bool qemu_ram_is_noreserve(RAMBlock *rb);
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
 void qemu_ram_set_uf_zeroable(RAMBlock *rb);
 bool qemu_ram_is_migratable(RAMBlock *rb);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7d2db168c7..587d14257c 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -157,6 +157,14 @@ typedef struct IOMMUTLBEvent {
  */
 #define RAM_UF_WRITEPROTECT (1 << 6)
 
+/*
+ * RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge
+ * pages Linux) is skipped: will bail out if not supported. When not set, the
+ * OS might reserve swap space (or huge pages on Linux), depending on OS
+ * support.
+ */
+#define RAM_NORESERVE (1 << 7)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
@@ -915,7 +923,7 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
  * @name: Region name, becomes part of RAMBlock name used in migration stream
  *        must be unique within any device
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_NORESERVE.
  * @errp: pointer to Error*, to store an error if it happens.
  *
  * Note that this function does not do anything to cause the data in the
@@ -969,7 +977,8 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  *         (getpagesize()) will be used.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *             RAM_NORESERVE,
  * @path: the path in which to allocate the RAM.
  * @readonly: true to open @path for reading, false for read/write.
  * @errp: pointer to Error*, to store an error if it happens.
@@ -995,7 +1004,8 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @owner: the object that tracks the region's reference count
  * @name: the name of the region.
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *             RAM_NORESERVE.
  * @fd: the fd to mmap.
  * @errp: pointer to Error*, to store an error if it happens.
  *
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index ce9e140c54..1325c7760e 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -104,7 +104,8 @@ long qemu_maxrampagesize(void);
  * Parameters:
  *  @size: the size in bytes of the ram block
  *  @mr: the memory region where the ram block is
- *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *              RAM_NORESERVE.
  *  @mem_path or @fd: specify the backing file or device
  *  @readonly: true to open @path for reading, false for read/write.
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 8b7a5c70f3..a996d9b15a 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -17,6 +17,7 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
  *  @readonly: true for a read-only mapping, false for read/write.
  *  @shared: map has RAM_SHARED flag.
  *  @is_pmem: map has RAM_PMEM flag.
+ *  @noreserve: map has RAM_NORESERVE flag.
  *
  * Return:
  *  On success, return a pointer to the mapped area.
@@ -27,7 +28,8 @@ void *qemu_ram_mmap(int fd,
                     size_t align,
                     bool readonly,
                     bool shared,
-                    bool is_pmem);
+                    bool is_pmem,
+                    bool noreserve);
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size);
 
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index ba15be9c56..d6d8ef0999 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -343,7 +343,8 @@ extern int daemon(int, int);
 int qemu_daemon(int nochdir, int noclose);
 void *qemu_try_memalign(size_t alignment, size_t size);
 void *qemu_memalign(size_t alignment, size_t size);
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared);
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
+                          bool noreserve);
 void qemu_vfree(void *ptr);
 void qemu_anon_ram_free(void *ptr, size_t size);
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index c5546bdecc..4a0a7a4e89 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -251,7 +251,8 @@ int kvm_on_sigbus(int code, void *addr);
 
 /* interface with exec.c */
 
-void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align, bool shared));
+void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align, bool shared,
+                                       bool noreserve));
 
 /* internal API */
 
diff --git a/migration/ram.c b/migration/ram.c
index 72143da0ac..dd8daad386 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3322,8 +3322,7 @@ int colo_init_ram_cache(void)
     WITH_RCU_READ_LOCK_GUARD() {
         RAMBLOCK_FOREACH_NOT_IGNORED(block) {
             block->colo_cache = qemu_anon_ram_alloc(block->used_length,
-                                                    NULL,
-                                                    false);
+                                                    NULL, false, false);
             if (!block->colo_cache) {
                 error_report("%s: Can't alloc memory for COLO cache of block %s,"
                              "size 0x" RAM_ADDR_FMT, __func__, block->idstr,
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 2243e2a87a..9820d845c0 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1144,7 +1144,8 @@ static int subpage_register(subpage_t *mmio, uint32_t start, uint32_t end,
                             uint16_t section);
 static subpage_t *subpage_init(FlatView *fv, hwaddr base);
 
-static void *(*phys_mem_alloc)(size_t size, uint64_t *align, bool shared) =
+static void *(*phys_mem_alloc)(size_t size, uint64_t *align, bool shared,
+                               bool noreserve) =
                                qemu_anon_ram_alloc;
 
 /*
@@ -1152,7 +1153,8 @@ static void *(*phys_mem_alloc)(size_t size, uint64_t *align, bool shared) =
  * Accelerators with unusual needs may need this.  Hopefully, we can
  * get rid of it eventually.
  */
-void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align, bool shared))
+void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align,
+                        bool shared, bool noreserve))
 {
     phys_mem_alloc = alloc;
 }
@@ -1593,7 +1595,8 @@ static void *file_ram_alloc(RAMBlock *block,
     }
 
     area = qemu_ram_mmap(fd, memory, block->mr->align, readonly,
-                         block->flags & RAM_SHARED, block->flags & RAM_PMEM);
+                         block->flags & RAM_SHARED, block->flags & RAM_PMEM,
+                         block->flags & RAM_NORESERVE);
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
                          "unable to map backing store for guest RAM");
@@ -1713,6 +1716,11 @@ bool qemu_ram_is_shared(RAMBlock *rb)
     return rb->flags & RAM_SHARED;
 }
 
+bool qemu_ram_is_noreserve(RAMBlock *rb)
+{
+    return rb->flags & RAM_NORESERVE;
+}
+
 /* Note: Only set at the start of postcopy */
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
 {
@@ -1962,7 +1970,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         } else {
             new_block->host = phys_mem_alloc(new_block->max_length,
                                              &new_block->mr->align,
-                                             qemu_ram_is_shared(new_block));
+                                             qemu_ram_is_shared(new_block),
+                                             qemu_ram_is_noreserve(new_block));
             if (!new_block->host) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
@@ -2033,7 +2042,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     int64_t file_size, file_align;
 
     /* Just support these ram flags by now. */
-    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE)) == 0);
 
     if (xen_enabled()) {
         error_setg(errp, "-mem-path not supported with Xen");
@@ -2135,7 +2144,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     RAMBlock *new_block;
     Error *local_err = NULL;
 
-    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_NORESERVE)) == 0);
 
     size = HOST_PAGE_ALIGN(size);
     max_size = HOST_PAGE_ALIGN(max_size);
@@ -2170,7 +2179,7 @@ RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
 RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                          MemoryRegion *mr, Error **errp)
 {
-    assert((ram_flags & ~RAM_SHARED) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
     return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
 }
 
diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index dc27fa36c9..bd5178bf81 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
  * to grow. We also have to use MAP parameters that avoid
  * read-only mapping of guest pages.
  */
-static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
+static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
+                               bool noreserve)
 {
     static void *mem;
 
     if (mem) {
         /* we only support one allocation, which is enough for initial ram */
         return NULL;
+    } else if (noreserve) {
+        error_report("Skipping reservation of swap space is not supported.");
+        return NULL
     }
 
     mem = mmap((void *) 0x800000000ULL, size,
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index b50dc86a3c..bb99843106 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
+#include "qemu/error-report.h"
 
 #define HUGETLBFS_MAGIC       0x958458f6
 
@@ -174,12 +175,18 @@ void *qemu_ram_mmap(int fd,
                     size_t align,
                     bool readonly,
                     bool shared,
-                    bool is_pmem)
+                    bool is_pmem,
+                    bool noreserve)
 {
     const size_t guard_pagesize = mmap_guard_pagesize(fd);
     size_t offset, total;
     void *ptr, *guardptr;
 
+    if (noreserve) {
+        error_report("Skipping reservation of swap space is not supported");
+        return MAP_FAILED;
+    }
+
     /*
      * Note: this always allocates at least one extra page of virtual address
      * space, even if size is already aligned.
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index bf57d3b030..7c9d870723 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -227,10 +227,11 @@ void *qemu_memalign(size_t alignment, size_t size)
 }
 
 /* alloc shared memory pages */
-void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared,
+                          bool noreserve)
 {
     size_t align = QEMU_VMALLOC_ALIGN;
-    void *ptr = qemu_ram_mmap(-1, size, align, false, shared, false);
+    void *ptr = qemu_ram_mmap(-1, size, align, false, shared, false, noreserve);
 
     if (ptr == MAP_FAILED) {
         return NULL;
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index f68b8012bb..8cafe44179 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -39,6 +39,7 @@
 #include "trace.h"
 #include "qemu/sockets.h"
 #include "qemu/cutils.h"
+#include "qemu/error-report.h"
 #include <malloc.h>
 
 /* this must come after including "trace.h" */
@@ -77,10 +78,20 @@ static int get_allocation_granularity(void)
     return system_info.dwAllocationGranularity;
 }
 
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
+                          bool noreserve)
 {
     void *ptr;
 
+    if (noreserve) {
+        /*
+         * We need a MEM_COMMIT before accessing any memory in a MEM_RESERVE
+         * area; we cannot easily mimic POSIX MAP_NORESERVE semantics.
+         */
+        error_report("Skipping reservation of swap space is not supported.");
+        return NULL;
+    }
+
     ptr = VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE);
     trace_qemu_anon_ram_alloc(size, ptr);
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-02-09 13:49   ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-02-09 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Kotrasinski, kvm, Michael S. Tsirkin, Peter Xu,
	Paolo Bonzini, Juan Quintela, David Hildenbrand, Halil Pasic,
	Christian Borntraeger, Murilo Opsfelder Araujo,
	Philippe Mathieu-Daudé,
	Thomas Huth, Eduardo Habkost, Stefan Weil, Richard Henderson,
	Dr. David Alan Gilbert, Greg Kurz, qemu-s390x, Stefan Hajnoczi,
	Igor Mammedov, Cornelia Huck

Let's introduce RAM_NORESERVE, allowing mmap'ing with MAP_NORESERVE. The
new flag has the following semantics:

  RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or
  huge pages on Linux) is skipped: will bail out if not supported. When not
  set, the OS might reserve swap space (or huge pages on Linux), depending
  on OS support.

Allow passing it into:
- memory_region_init_ram_nomigrate()
- memory_region_init_resizeable_ram()
- memory_region_init_ram_from_file()

... and teach qemu_ram_mmap() and qemu_anon_ram_alloc() about the flag.
Bail out if the flag is not supported, which is the case right now for
both, POSIX and win32. We will add the POSIX mmap implementation next and
allow specifying RAM_NORESERVE via memory backends.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

Cc: Juan Quintela <quintela@redhat.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Stefan Weil <sw@weilnetz.de>
Cc: kvm@vger.kernel.org
Cc: qemu-s390x@nongnu.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/cpu-common.h |  1 +
 include/exec/memory.h     | 16 +++++++++++++---
 include/exec/ram_addr.h   |  3 ++-
 include/qemu/mmap-alloc.h |  4 +++-
 include/qemu/osdep.h      |  3 ++-
 include/sysemu/kvm.h      |  3 ++-
 migration/ram.c           |  3 +--
 softmmu/physmem.c         | 23 ++++++++++++++++-------
 target/s390x/kvm.c        |  6 +++++-
 util/mmap-alloc.c         |  9 ++++++++-
 util/oslib-posix.c        |  5 +++--
 util/oslib-win32.c        | 13 ++++++++++++-
 12 files changed, 68 insertions(+), 21 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 5a0a2d93e0..38a47ad4ac 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -58,6 +58,7 @@ void *qemu_ram_get_host_addr(RAMBlock *rb);
 ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
+bool qemu_ram_is_noreserve(RAMBlock *rb);
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
 void qemu_ram_set_uf_zeroable(RAMBlock *rb);
 bool qemu_ram_is_migratable(RAMBlock *rb);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7d2db168c7..587d14257c 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -157,6 +157,14 @@ typedef struct IOMMUTLBEvent {
  */
 #define RAM_UF_WRITEPROTECT (1 << 6)
 
+/*
+ * RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge
+ * pages Linux) is skipped: will bail out if not supported. When not set, the
+ * OS might reserve swap space (or huge pages on Linux), depending on OS
+ * support.
+ */
+#define RAM_NORESERVE (1 << 7)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
@@ -915,7 +923,7 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
  * @name: Region name, becomes part of RAMBlock name used in migration stream
  *        must be unique within any device
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_NORESERVE.
  * @errp: pointer to Error*, to store an error if it happens.
  *
  * Note that this function does not do anything to cause the data in the
@@ -969,7 +977,8 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  *         (getpagesize()) will be used.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *             RAM_NORESERVE,
  * @path: the path in which to allocate the RAM.
  * @readonly: true to open @path for reading, false for read/write.
  * @errp: pointer to Error*, to store an error if it happens.
@@ -995,7 +1004,8 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @owner: the object that tracks the region's reference count
  * @name: the name of the region.
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *             RAM_NORESERVE.
  * @fd: the fd to mmap.
  * @errp: pointer to Error*, to store an error if it happens.
  *
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index ce9e140c54..1325c7760e 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -104,7 +104,8 @@ long qemu_maxrampagesize(void);
  * Parameters:
  *  @size: the size in bytes of the ram block
  *  @mr: the memory region where the ram block is
- *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *              RAM_NORESERVE.
  *  @mem_path or @fd: specify the backing file or device
  *  @readonly: true to open @path for reading, false for read/write.
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 8b7a5c70f3..a996d9b15a 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -17,6 +17,7 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
  *  @readonly: true for a read-only mapping, false for read/write.
  *  @shared: map has RAM_SHARED flag.
  *  @is_pmem: map has RAM_PMEM flag.
+ *  @noreserve: map has RAM_NORESERVE flag.
  *
  * Return:
  *  On success, return a pointer to the mapped area.
@@ -27,7 +28,8 @@ void *qemu_ram_mmap(int fd,
                     size_t align,
                     bool readonly,
                     bool shared,
-                    bool is_pmem);
+                    bool is_pmem,
+                    bool noreserve);
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size);
 
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index ba15be9c56..d6d8ef0999 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -343,7 +343,8 @@ extern int daemon(int, int);
 int qemu_daemon(int nochdir, int noclose);
 void *qemu_try_memalign(size_t alignment, size_t size);
 void *qemu_memalign(size_t alignment, size_t size);
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared);
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
+                          bool noreserve);
 void qemu_vfree(void *ptr);
 void qemu_anon_ram_free(void *ptr, size_t size);
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index c5546bdecc..4a0a7a4e89 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -251,7 +251,8 @@ int kvm_on_sigbus(int code, void *addr);
 
 /* interface with exec.c */
 
-void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align, bool shared));
+void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align, bool shared,
+                                       bool noreserve));
 
 /* internal API */
 
diff --git a/migration/ram.c b/migration/ram.c
index 72143da0ac..dd8daad386 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3322,8 +3322,7 @@ int colo_init_ram_cache(void)
     WITH_RCU_READ_LOCK_GUARD() {
         RAMBLOCK_FOREACH_NOT_IGNORED(block) {
             block->colo_cache = qemu_anon_ram_alloc(block->used_length,
-                                                    NULL,
-                                                    false);
+                                                    NULL, false, false);
             if (!block->colo_cache) {
                 error_report("%s: Can't alloc memory for COLO cache of block %s,"
                              "size 0x" RAM_ADDR_FMT, __func__, block->idstr,
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 2243e2a87a..9820d845c0 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1144,7 +1144,8 @@ static int subpage_register(subpage_t *mmio, uint32_t start, uint32_t end,
                             uint16_t section);
 static subpage_t *subpage_init(FlatView *fv, hwaddr base);
 
-static void *(*phys_mem_alloc)(size_t size, uint64_t *align, bool shared) =
+static void *(*phys_mem_alloc)(size_t size, uint64_t *align, bool shared,
+                               bool noreserve) =
                                qemu_anon_ram_alloc;
 
 /*
@@ -1152,7 +1153,8 @@ static void *(*phys_mem_alloc)(size_t size, uint64_t *align, bool shared) =
  * Accelerators with unusual needs may need this.  Hopefully, we can
  * get rid of it eventually.
  */
-void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align, bool shared))
+void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align,
+                        bool shared, bool noreserve))
 {
     phys_mem_alloc = alloc;
 }
@@ -1593,7 +1595,8 @@ static void *file_ram_alloc(RAMBlock *block,
     }
 
     area = qemu_ram_mmap(fd, memory, block->mr->align, readonly,
-                         block->flags & RAM_SHARED, block->flags & RAM_PMEM);
+                         block->flags & RAM_SHARED, block->flags & RAM_PMEM,
+                         block->flags & RAM_NORESERVE);
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
                          "unable to map backing store for guest RAM");
@@ -1713,6 +1716,11 @@ bool qemu_ram_is_shared(RAMBlock *rb)
     return rb->flags & RAM_SHARED;
 }
 
+bool qemu_ram_is_noreserve(RAMBlock *rb)
+{
+    return rb->flags & RAM_NORESERVE;
+}
+
 /* Note: Only set at the start of postcopy */
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
 {
@@ -1962,7 +1970,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         } else {
             new_block->host = phys_mem_alloc(new_block->max_length,
                                              &new_block->mr->align,
-                                             qemu_ram_is_shared(new_block));
+                                             qemu_ram_is_shared(new_block),
+                                             qemu_ram_is_noreserve(new_block));
             if (!new_block->host) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
@@ -2033,7 +2042,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     int64_t file_size, file_align;
 
     /* Just support these ram flags by now. */
-    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE)) == 0);
 
     if (xen_enabled()) {
         error_setg(errp, "-mem-path not supported with Xen");
@@ -2135,7 +2144,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     RAMBlock *new_block;
     Error *local_err = NULL;
 
-    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_NORESERVE)) == 0);
 
     size = HOST_PAGE_ALIGN(size);
     max_size = HOST_PAGE_ALIGN(max_size);
@@ -2170,7 +2179,7 @@ RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
 RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                          MemoryRegion *mr, Error **errp)
 {
-    assert((ram_flags & ~RAM_SHARED) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
     return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
 }
 
diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index dc27fa36c9..bd5178bf81 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
  * to grow. We also have to use MAP parameters that avoid
  * read-only mapping of guest pages.
  */
-static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
+static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
+                               bool noreserve)
 {
     static void *mem;
 
     if (mem) {
         /* we only support one allocation, which is enough for initial ram */
         return NULL;
+    } else if (noreserve) {
+        error_report("Skipping reservation of swap space is not supported.");
+        return NULL
     }
 
     mem = mmap((void *) 0x800000000ULL, size,
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index b50dc86a3c..bb99843106 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
+#include "qemu/error-report.h"
 
 #define HUGETLBFS_MAGIC       0x958458f6
 
@@ -174,12 +175,18 @@ void *qemu_ram_mmap(int fd,
                     size_t align,
                     bool readonly,
                     bool shared,
-                    bool is_pmem)
+                    bool is_pmem,
+                    bool noreserve)
 {
     const size_t guard_pagesize = mmap_guard_pagesize(fd);
     size_t offset, total;
     void *ptr, *guardptr;
 
+    if (noreserve) {
+        error_report("Skipping reservation of swap space is not supported");
+        return MAP_FAILED;
+    }
+
     /*
      * Note: this always allocates at least one extra page of virtual address
      * space, even if size is already aligned.
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index bf57d3b030..7c9d870723 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -227,10 +227,11 @@ void *qemu_memalign(size_t alignment, size_t size)
 }
 
 /* alloc shared memory pages */
-void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared,
+                          bool noreserve)
 {
     size_t align = QEMU_VMALLOC_ALIGN;
-    void *ptr = qemu_ram_mmap(-1, size, align, false, shared, false);
+    void *ptr = qemu_ram_mmap(-1, size, align, false, shared, false, noreserve);
 
     if (ptr == MAP_FAILED) {
         return NULL;
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index f68b8012bb..8cafe44179 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -39,6 +39,7 @@
 #include "trace.h"
 #include "qemu/sockets.h"
 #include "qemu/cutils.h"
+#include "qemu/error-report.h"
 #include <malloc.h>
 
 /* this must come after including "trace.h" */
@@ -77,10 +78,20 @@ static int get_allocation_granularity(void)
     return system_info.dwAllocationGranularity;
 }
 
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
+                          bool noreserve)
 {
     void *ptr;
 
+    if (noreserve) {
+        /*
+         * We need a MEM_COMMIT before accessing any memory in a MEM_RESERVE
+         * area; we cannot easily mimic POSIX MAP_NORESERVE semantics.
+         */
+        error_report("Skipping reservation of swap space is not supported.");
+        return NULL;
+    }
+
     ptr = VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE);
     trace_qemu_anon_ram_alloc(size, ptr);
 
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE
  2021-02-09 13:49 [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (6 preceding siblings ...)
  2021-02-09 13:49   ` David Hildenbrand
@ 2021-02-09 13:49 ` David Hildenbrand
  2021-03-02 17:51   ` Peter Xu
  2021-02-09 13:49 ` [PATCH v1 9/9] hostmem: wire up RAM_NORESERVE via "reserve" property David Hildenbrand
  2021-03-02 13:12 ` [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem " David Hildenbrand
  9 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2021-02-09 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

Let's support RAM_NORESERVE via MAP_NORESERVE. At least on Linux,
the flag has no effect on shared mappings - except for anonymous memory
and hugetlbfs.

Linux man page:
  "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
  space is reserved, one has the guarantee that it is possible to modify
  the mapping. When swap space is not reserved one might get SIGSEGV
  upon a write if no physical memory is available. See also the discussion
  of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
  2.6, this flag had effect only for private writable mappings."

Note that the "guarantee" part is wrong with memory overcommit in Linux.

Also, in Linux hugetlbfs is treated differently - we configure reservation
of huge pages from the pool, not reservation of swap space (huge pages
cannot be swapped).

The rough behavior is [1]:
a) !Hugetlbfs:

  1) Without MAP_NORESERVE *or* with memory overcommit under Linux
     disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
     accounting/reservation happens:
      For a file backed map
       SHARED or READ-only - 0 cost (the file is the map not swap)
       PRIVATE WRITABLE - size of mapping per instance

      For an anonymous or /dev/zero map
       SHARED   - size of mapping
       PRIVATE READ-only - 0 cost (but of little use)
       PRIVATE WRITABLE - size of mapping per instance

  2) With MAP_NORESERVE, no accounting/reservation happens.

b) Hugetlbfs:

  1) Without MAP_NORESERVE, huge pages are reserved.

  2) With MAP_NORESERVE, no huge pages are reserved.

Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
to configure it for !hugetlbfs globally; this toggle now allows
configuring it more fine-grained, not for the whole system.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

[1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/physmem.c |  1 +
 util/mmap-alloc.c | 47 +++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 9820d845c0..f45a85add1 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2248,6 +2248,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 abort();
             } else {
                 flags = MAP_FIXED;
+                flags |= (block->flags & RAM_NORESERVE) ? MAP_NORESERVE : 0;
                 if (block->fd >= 0) {
                     flags |= (block->flags & RAM_SHARED ?
                               MAP_SHARED : MAP_PRIVATE);
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index bb99843106..f77d9ca574 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
+#include "qemu/cutils.h"
 #include "qemu/error-report.h"
 
 #define HUGETLBFS_MAGIC       0x958458f6
@@ -120,7 +121,7 @@ static void *mmap_reserve(size_t size, int fd)
  * it accessible.
  */
 static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly,
-                           bool shared, bool is_pmem)
+                           bool shared, bool is_pmem, bool noreserve)
 {
     const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
     int map_sync_flags = 0;
@@ -129,6 +130,7 @@ static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly,
 
     flags |= fd == -1 ? MAP_ANONYMOUS : 0;
     flags |= shared ? MAP_SHARED : MAP_PRIVATE;
+    flags |= noreserve ? MAP_NORESERVE : 0;
     if (shared && is_pmem) {
         map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
     }
@@ -170,6 +172,43 @@ static inline size_t mmap_guard_pagesize(int fd)
 #endif
 }
 
+#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory"
+static bool map_noreserve_effective(int fd, bool shared)
+{
+#if defined(__linux__)
+    gchar *content = NULL;
+    const char *endptr;
+    unsigned int tmp;
+
+    /* hugetlbfs behaves differently */
+    if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) {
+        return true;
+    }
+
+    /* only private shared mappings are accounted (ignoring /dev/zero) */
+    if (fd != -1 && shared) {
+        return true;
+    }
+
+    if (g_file_get_contents(OVERCOMMIT_MEMORY_PATH, &content, NULL, NULL) &&
+        !qemu_strtoui(content, &endptr, 0, &tmp) &&
+        (!endptr || *endptr == '\n')) {
+        if (tmp == 2) {
+            error_report("Skipping reservation of swap space is not supported: "
+                         " \"" OVERCOMMIT_MEMORY_PATH "\" is \"2\"");
+            return false;
+        }
+        return true;
+    }
+    /* this interface has been around since Linux 2.6 */
+    error_report("Skipping reservation of swap space is not supported: "
+                 " Could not read: \"" OVERCOMMIT_MEMORY_PATH "\"");
+    return false;
+#else
+    return true;
+#endif
+}
+
 void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
@@ -182,8 +221,7 @@ void *qemu_ram_mmap(int fd,
     size_t offset, total;
     void *ptr, *guardptr;
 
-    if (noreserve) {
-        error_report("Skipping reservation of swap space is not supported");
+    if (noreserve && !map_noreserve_effective(fd, shared)) {
         return MAP_FAILED;
     }
 
@@ -204,7 +242,8 @@ void *qemu_ram_mmap(int fd,
 
     offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
 
-    ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem);
+    ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem,
+                        noreserve);
     if (ptr == MAP_FAILED) {
         munmap(guardptr, total);
         return MAP_FAILED;
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 9/9] hostmem: wire up RAM_NORESERVE via "reserve" property
  2021-02-09 13:49 [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (7 preceding siblings ...)
  2021-02-09 13:49 ` [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE David Hildenbrand
@ 2021-02-09 13:49 ` David Hildenbrand
  2021-03-02 17:55   ` Peter Xu
  2021-03-02 13:12 ` [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem " David Hildenbrand
  9 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2021-02-09 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

Let's provide a way to control the use of RAM_NORESERVE via memory
backends using the "reserve" property which defaults to true (old
behavior).

Only POSIX supports setting the flag (and Linux support is checked at
runtime, depending on the setting of "/proc/sys/vm/overcommit_memory").
Windows will bail out.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM. This essentially allows
avoiding to set "/proc/sys/vm/overcommit_memory == 0") when using
virtio-mem and also supporting hugetlbfs in the future.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem-file.c  | 11 ++++++-----
 backends/hostmem-memfd.c |  1 +
 backends/hostmem-ram.c   |  1 +
 backends/hostmem.c       | 33 +++++++++++++++++++++++++++++++++
 include/sysemu/hostmem.h |  2 +-
 5 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 733408e076..a2d7e83f9c 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -40,6 +40,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
                object_get_typename(OBJECT(backend)));
 #else
     HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
+    uint32_t ram_flags;
     gchar *name;
 
     if (!backend->size) {
@@ -52,11 +53,11 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     name = host_memory_backend_get_name(backend);
-    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
-                                     name,
-                                     backend->size, fb->align,
-                                     (backend->share ? RAM_SHARED : 0) |
-                                     (fb->is_pmem ? RAM_PMEM : 0),
+    ram_flags = backend->share ? RAM_SHARED : 0;
+    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
+    ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
+    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
+                                     backend->size, fb->align, ram_flags,
                                      fb->mem_path, fb->readonly, errp);
     g_free(name);
 #endif
diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 54db08de0d..5412e67bad 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 
     name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
+    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
                                    backend->size, ram_flags, fd, errp);
     g_free(name);
diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index 741e701062..b8e55cdbd0 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -29,6 +29,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 
     name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
+    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
                                            backend->size, ram_flags, errp);
     g_free(name);
diff --git a/backends/hostmem.c b/backends/hostmem.c
index be0c3b079f..c979595115 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -217,6 +217,11 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
     Error *local_err = NULL;
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
 
+    if (!backend->reserve && value) {
+        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
+        return;
+    }
+
     if (!host_memory_backend_mr_inited(backend)) {
         backend->prealloc = value;
         return;
@@ -268,6 +273,7 @@ static void host_memory_backend_init(Object *obj)
     /* TODO: convert access to globals to compat properties */
     backend->merge = machine_mem_merge(machine);
     backend->dump = machine_dump_guest_core(machine);
+    backend->reserve = true;
     backend->prealloc_threads = 1;
 }
 
@@ -426,6 +432,28 @@ static void host_memory_backend_set_share(Object *o, bool value, Error **errp)
     backend->share = value;
 }
 
+static bool host_memory_backend_get_reserve(Object *o, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+    return backend->reserve;
+}
+
+static void host_memory_backend_set_reserve(Object *o, bool value, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+    if (host_memory_backend_mr_inited(backend)) {
+        error_setg(errp, "cannot change property value");
+        return;
+    }
+    if (backend->prealloc && !value) {
+        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
+        return;
+    }
+    backend->reserve = value;
+}
+
 static bool
 host_memory_backend_get_use_canonical_path(Object *obj, Error **errp)
 {
@@ -494,6 +522,11 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
         host_memory_backend_get_share, host_memory_backend_set_share);
     object_class_property_set_description(oc, "share",
         "Mark the memory as private to QEMU or shared");
+    object_class_property_add_bool(oc, "reserve",
+        host_memory_backend_get_reserve, host_memory_backend_set_reserve);
+    object_class_property_set_description(oc, "reserve",
+        "Reserve swap space (or huge pages under Linux) for the whole memory"
+        " backend, if supported by the OS.");
     object_class_property_add_bool(oc, "x-use-canonical-path-for-ramblock-id",
         host_memory_backend_get_use_canonical_path,
         host_memory_backend_set_use_canonical_path);
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index df5644723a..9ff5c16963 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -64,7 +64,7 @@ struct HostMemoryBackend {
     /* protected */
     uint64_t size;
     bool merge, dump, use_canonical_path;
-    bool prealloc, is_mapped, share;
+    bool prealloc, is_mapped, share, reserve;
     uint32_t prealloc_threads;
     DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
     HostMemPolicy policy;
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property
  2021-02-09 13:49 [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (8 preceding siblings ...)
  2021-02-09 13:49 ` [PATCH v1 9/9] hostmem: wire up RAM_NORESERVE via "reserve" property David Hildenbrand
@ 2021-03-02 13:12 ` David Hildenbrand
  9 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-02 13:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Thomas Huth, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, Greg Kurz,
	Halil Pasic, Christian Borntraeger, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 09.02.21 14:49, David Hildenbrand wrote:
> Some cleanups previously sent in other context (resizeable allocations),
> followed by RAM_NORESERVE, implementing it under POSIX using MAP_NORESERVE,
> and letting users configure it for memory backens using the "reserve"
> property (default: true).
> 
> MAP_NORESERVE under Linux has in the context of QEMU an effect on
> 1) Private anonymous memory
> -> memory-backend-ram,id=mem0,size=10G
> 2) Private file-based mappings
> -> memory-backend-file,id=mem0,size=10G,mem-path=/dev/shm/0
> 3) Private/shared hugetlbfs memory
> -> memory-backend-memfd,id=mem0,size=10G,hugetlb=on,hugetlbsize=2M
> 
> With MAP_NORESERVE/"reserve=off", we won't be reserving swap space (1/2) or
> huge pages (3) for the whole memory region.
> 
> The target use case is virtio-mem, which dynamically exposes memory
> inside a large, sparse memory area to the VM. MAP_NORESERVE tells the OS
> "this mapping might be very sparse". This essentially allows
> avoiding to set "/proc/sys/vm/overcommit_memory == 0") when using
> virtio-mem and also supporting hugetlbfs in the future.
> 
> virtio-mem currently only supports anonymous memory, in the future we want
> to also support shared file-based and shared hugetlbfs mappings. We most
> probably won't be supporting private mappings as they can end up behaving
> very weird when it comes to memory consumption.
> 
> Future work for virtio-mem I am currently working on includes
> 1. Introducing a prealloc option for virtio-mem (e.g., using fallocate()
>     when plugging blocks) to fail nicely when running out of
>     backing storage like huge pages.
> 2. Supporting resizable RAM block/memoryr egions, such that we won't always
>     expose a large, sparse memory region to the VM.
> 3. Protecting unplugged memory e.g., using userfaultfd.
> 4. (resizeable allocations / optimized mmap handling when resizing RAM
>      blocks)

Ping

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 5/9] softmmu/memory: pass ram_flags into qemu_ram_alloc_from_fd()
  2021-02-09 13:49 ` [PATCH v1 5/9] softmmu/memory: pass ram_flags into qemu_ram_alloc_from_fd() David Hildenbrand
@ 2021-03-02 17:17   ` Peter Xu
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Xu @ 2021-03-02 17:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, Michael S. Tsirkin,
	Stefan Weil, Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Tue, Feb 09, 2021 at 02:49:35PM +0100, David Hildenbrand wrote:
> Let's pass in ram flags just like we do with qemu_ram_alloc_from_file(),
> to clean up and prepare for more flags.
> 
> Simplify the documentation of passed ram flags: Looking at our
> documentation of RAM_SHARED and RAM_PMEM is sufficient, no need to be
> repetitive.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 6/9] softmmu/memory: pass ram_flags into memory_region_init_ram_shared_nomigrate()
  2021-02-09 13:49 ` [PATCH v1 6/9] softmmu/memory: pass ram_flags into memory_region_init_ram_shared_nomigrate() David Hildenbrand
@ 2021-03-02 17:17   ` Peter Xu
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Xu @ 2021-03-02 17:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, Michael S. Tsirkin,
	Stefan Weil, Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Tue, Feb 09, 2021 at 02:49:36PM +0100, David Hildenbrand wrote:
> Let's forward ram_flags instead, renaming
> memory_region_init_ram_shared_nomigrate() into
> memory_region_init_ram_flags_nomigrate(). Forward flags to
> qemu_ram_alloc() and qemu_ram_alloc_internal().
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-02-09 13:49   ` David Hildenbrand
@ 2021-03-02 17:32     ` Peter Xu
  -1 siblings, 0 replies; 41+ messages in thread
From: Peter Xu @ 2021-03-02 17:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Christian Borntraeger, Cornelia Huck, Halil Pasic,
	Igor Kotrasinski, Juan Quintela, Stefan Weil, Thomas Huth, kvm,
	qemu-s390x

On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
>   * to grow. We also have to use MAP parameters that avoid
>   * read-only mapping of guest pages.
>   */
> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
> +                               bool noreserve)
>  {
>      static void *mem;
>  
>      if (mem) {
>          /* we only support one allocation, which is enough for initial ram */
>          return NULL;
> +    } else if (noreserve) {
> +        error_report("Skipping reservation of swap space is not supported.");
> +        return NULL

Semicolon missing.

>      }
>  
>      mem = mmap((void *) 0x800000000ULL, size,
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index b50dc86a3c..bb99843106 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -20,6 +20,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/mmap-alloc.h"
>  #include "qemu/host-utils.h"
> +#include "qemu/error-report.h"
>  
>  #define HUGETLBFS_MAGIC       0x958458f6
>  
> @@ -174,12 +175,18 @@ void *qemu_ram_mmap(int fd,
>                      size_t align,
>                      bool readonly,
>                      bool shared,
> -                    bool is_pmem)
> +                    bool is_pmem,
> +                    bool noreserve)

Maybe at some point we should use flags too here to cover all bools.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-03-02 17:32     ` Peter Xu
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Xu @ 2021-03-02 17:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, kvm,
	Michael S. Tsirkin, qemu-s390x, Stefan Weil,
	Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
>   * to grow. We also have to use MAP parameters that avoid
>   * read-only mapping of guest pages.
>   */
> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
> +                               bool noreserve)
>  {
>      static void *mem;
>  
>      if (mem) {
>          /* we only support one allocation, which is enough for initial ram */
>          return NULL;
> +    } else if (noreserve) {
> +        error_report("Skipping reservation of swap space is not supported.");
> +        return NULL

Semicolon missing.

>      }
>  
>      mem = mmap((void *) 0x800000000ULL, size,
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index b50dc86a3c..bb99843106 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -20,6 +20,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/mmap-alloc.h"
>  #include "qemu/host-utils.h"
> +#include "qemu/error-report.h"
>  
>  #define HUGETLBFS_MAGIC       0x958458f6
>  
> @@ -174,12 +175,18 @@ void *qemu_ram_mmap(int fd,
>                      size_t align,
>                      bool readonly,
>                      bool shared,
> -                    bool is_pmem)
> +                    bool is_pmem,
> +                    bool noreserve)

Maybe at some point we should use flags too here to cover all bools.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE
  2021-02-09 13:49 ` [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE David Hildenbrand
@ 2021-03-02 17:51   ` Peter Xu
  2021-03-02 19:01     ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2021-03-02 17:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, Michael S. Tsirkin,
	Stefan Weil, Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Tue, Feb 09, 2021 at 02:49:38PM +0100, David Hildenbrand wrote:
> +#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory"
> +static bool map_noreserve_effective(int fd, bool shared)
> +{
> +#if defined(__linux__)
> +    gchar *content = NULL;
> +    const char *endptr;
> +    unsigned int tmp;
> +
> +    /* hugetlbfs behaves differently */
> +    if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) {
> +        return true;
> +    }
> +
> +    /* only private shared mappings are accounted (ignoring /dev/zero) */
> +    if (fd != -1 && shared) {
> +        return true;
> +    }
> +
> +    if (g_file_get_contents(OVERCOMMIT_MEMORY_PATH, &content, NULL, NULL) &&
> +        !qemu_strtoui(content, &endptr, 0, &tmp) &&
> +        (!endptr || *endptr == '\n')) {
> +        if (tmp == 2) {
> +            error_report("Skipping reservation of swap space is not supported: "
> +                         " \"" OVERCOMMIT_MEMORY_PATH "\" is \"2\"");
> +            return false;
> +        }
> +        return true;
> +    }
> +    /* this interface has been around since Linux 2.6 */
> +    error_report("Skipping reservation of swap space is not supported: "
> +                 " Could not read: \"" OVERCOMMIT_MEMORY_PATH "\"");
> +    return false;
> +#else
> +    return true;
> +#endif
> +}

I feel like this helper wants to fail gracefully for some conditions.  Could
you elaborate one example and attach to the commit log?

I'm also wondering whether it would worth to check the global value.  Even if
overcommit is globally disabled, do we (as an application process) need to care
about it?  I think the MAP_NORESERVE would simply be silently ignored by the
kernel and that seems to be design of it, otherwise would all apps who uses
MAP_NORESERVE would need to do similar things too?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 9/9] hostmem: wire up RAM_NORESERVE via "reserve" property
  2021-02-09 13:49 ` [PATCH v1 9/9] hostmem: wire up RAM_NORESERVE via "reserve" property David Hildenbrand
@ 2021-03-02 17:55   ` Peter Xu
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Xu @ 2021-03-02 17:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, Michael S. Tsirkin,
	Stefan Weil, Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Tue, Feb 09, 2021 at 02:49:39PM +0100, David Hildenbrand wrote:
> Let's provide a way to control the use of RAM_NORESERVE via memory
> backends using the "reserve" property which defaults to true (old
> behavior).
> 
> Only POSIX supports setting the flag (and Linux support is checked at
> runtime, depending on the setting of "/proc/sys/vm/overcommit_memory").
> Windows will bail out.
> 
> The target use case is virtio-mem, which dynamically exposes memory
> inside a large, sparse memory area to the VM. This essentially allows
> avoiding to set "/proc/sys/vm/overcommit_memory == 0") when using
> virtio-mem and also supporting hugetlbfs in the future.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE
  2021-03-02 17:51   ` Peter Xu
@ 2021-03-02 19:01     ` David Hildenbrand
  2021-03-02 21:44       ` Peter Xu
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2021-03-02 19:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, Michael S. Tsirkin,
	Stefan Weil, Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 02.03.21 18:51, Peter Xu wrote:
> On Tue, Feb 09, 2021 at 02:49:38PM +0100, David Hildenbrand wrote:
>> +#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory"
>> +static bool map_noreserve_effective(int fd, bool shared)
>> +{
>> +#if defined(__linux__)
>> +    gchar *content = NULL;
>> +    const char *endptr;
>> +    unsigned int tmp;
>> +
>> +    /* hugetlbfs behaves differently */
>> +    if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) {
>> +        return true;
>> +    }
>> +
>> +    /* only private shared mappings are accounted (ignoring /dev/zero) */
>> +    if (fd != -1 && shared) {
>> +        return true;
>> +    }
>> +
>> +    if (g_file_get_contents(OVERCOMMIT_MEMORY_PATH, &content, NULL, NULL) &&
>> +        !qemu_strtoui(content, &endptr, 0, &tmp) &&
>> +        (!endptr || *endptr == '\n')) {
>> +        if (tmp == 2) {
>> +            error_report("Skipping reservation of swap space is not supported: "
>> +                         " \"" OVERCOMMIT_MEMORY_PATH "\" is \"2\"");
>> +            return false;
>> +        }
>> +        return true;
>> +    }
>> +    /* this interface has been around since Linux 2.6 */
>> +    error_report("Skipping reservation of swap space is not supported: "
>> +                 " Could not read: \"" OVERCOMMIT_MEMORY_PATH "\"");
>> +    return false;
>> +#else
>> +    return true;
>> +#endif
>> +}
> 
> I feel like this helper wants to fail gracefully for some conditions.  Could
> you elaborate one example and attach to the commit log?

Sure. The case is "/proc/sys/vm/overcommit_memory == 2" (never overcommit)

MAP_NORESERVE is without effect and sparse memory regions are somewhat 
impossible.

> 
> I'm also wondering whether it would worth to check the global value.  Even if
> overcommit is globally disabled, do we (as an application process) need to care
> about it?  I think the MAP_NORESERVE would simply be silently ignored by the
> kernel and that seems to be design of it, otherwise would all apps who uses > MAP_NORESERVE would need to do similar things too?

Right, I want to catch the "gets silently ignored" part, because someone 
requested "reserved=off" (!default) but does not actually get what he 
asked for.

As one example, glibc manages heaps via:

a) Creating a new heap: mmap(PROT_NONE, MAP_NORESERVE) the maximum size, 
then mprotect(PROT_READ|PROT_WRITE) the initial heap size. Even if 
MAP_NORESERVE is ignored, only !PROT_NONE memory ever gets committed 
("reserve swap space") in Linux.

b) Growing the heap via mprotect(PROT_READ|PROT_WRITE) within the 
existing mmap. This will commit memory in case MAP_NORESERVE got ignored.

c) Shrinking the heap ("discard memory") via MADV_DONTNEED *unless* 
"/proc/sys/vm/overcommit_memory == 2" - the only way to undo 
mprotect(PROT_READ|PROT_WRITE) and to un-commit memory is by doing a 
mmap(PROT_NONE, MAP_FIXED) over the problematic region.

If you're interested, you can take a look at:

malloc/arena.c
sysdeps/unix/sysv/linux/malloc-sysdep.h:check_may_shrink_heap()


Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-03-02 17:32     ` Peter Xu
@ 2021-03-02 19:02       ` David Hildenbrand
  -1 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-02 19:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Christian Borntraeger, Cornelia Huck, Halil Pasic,
	Igor Kotrasinski, Juan Quintela, Stefan Weil, Thomas Huth, kvm,
	qemu-s390x

On 02.03.21 18:32, Peter Xu wrote:
> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
>>    * to grow. We also have to use MAP parameters that avoid
>>    * read-only mapping of guest pages.
>>    */
>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>> +                               bool noreserve)
>>   {
>>       static void *mem;
>>   
>>       if (mem) {
>>           /* we only support one allocation, which is enough for initial ram */
>>           return NULL;
>> +    } else if (noreserve) {
>> +        error_report("Skipping reservation of swap space is not supported.");
>> +        return NULL
> 
> Semicolon missing.

Thanks for catching that!

> 
>>       }
>>   
>>       mem = mmap((void *) 0x800000000ULL, size,
>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>> index b50dc86a3c..bb99843106 100644
>> --- a/util/mmap-alloc.c
>> +++ b/util/mmap-alloc.c
>> @@ -20,6 +20,7 @@
>>   #include "qemu/osdep.h"
>>   #include "qemu/mmap-alloc.h"
>>   #include "qemu/host-utils.h"
>> +#include "qemu/error-report.h"
>>   
>>   #define HUGETLBFS_MAGIC       0x958458f6
>>   
>> @@ -174,12 +175,18 @@ void *qemu_ram_mmap(int fd,
>>                       size_t align,
>>                       bool readonly,
>>                       bool shared,
>> -                    bool is_pmem)
>> +                    bool is_pmem,
>> +                    bool noreserve)
> 
> Maybe at some point we should use flags too here to cover all bools.
> 

Right. I guess the main point was to not reuse RAM_XXX.

Should I introduce RAM_MMAP_XXX ?

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-03-02 19:02       ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-02 19:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, kvm,
	Michael S. Tsirkin, qemu-s390x, Stefan Weil,
	Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 02.03.21 18:32, Peter Xu wrote:
> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
>>    * to grow. We also have to use MAP parameters that avoid
>>    * read-only mapping of guest pages.
>>    */
>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>> +                               bool noreserve)
>>   {
>>       static void *mem;
>>   
>>       if (mem) {
>>           /* we only support one allocation, which is enough for initial ram */
>>           return NULL;
>> +    } else if (noreserve) {
>> +        error_report("Skipping reservation of swap space is not supported.");
>> +        return NULL
> 
> Semicolon missing.

Thanks for catching that!

> 
>>       }
>>   
>>       mem = mmap((void *) 0x800000000ULL, size,
>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>> index b50dc86a3c..bb99843106 100644
>> --- a/util/mmap-alloc.c
>> +++ b/util/mmap-alloc.c
>> @@ -20,6 +20,7 @@
>>   #include "qemu/osdep.h"
>>   #include "qemu/mmap-alloc.h"
>>   #include "qemu/host-utils.h"
>> +#include "qemu/error-report.h"
>>   
>>   #define HUGETLBFS_MAGIC       0x958458f6
>>   
>> @@ -174,12 +175,18 @@ void *qemu_ram_mmap(int fd,
>>                       size_t align,
>>                       bool readonly,
>>                       bool shared,
>> -                    bool is_pmem)
>> +                    bool is_pmem,
>> +                    bool noreserve)
> 
> Maybe at some point we should use flags too here to cover all bools.
> 

Right. I guess the main point was to not reuse RAM_XXX.

Should I introduce RAM_MMAP_XXX ?

Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-03-02 19:02       ` David Hildenbrand
@ 2021-03-02 20:54         ` Peter Xu
  -1 siblings, 0 replies; 41+ messages in thread
From: Peter Xu @ 2021-03-02 20:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Christian Borntraeger, Cornelia Huck, Halil Pasic,
	Igor Kotrasinski, Juan Quintela, Stefan Weil, Thomas Huth, kvm,
	qemu-s390x

On Tue, Mar 02, 2021 at 08:02:34PM +0100, David Hildenbrand wrote:
> > > @@ -174,12 +175,18 @@ void *qemu_ram_mmap(int fd,
> > >                       size_t align,
> > >                       bool readonly,
> > >                       bool shared,
> > > -                    bool is_pmem)
> > > +                    bool is_pmem,
> > > +                    bool noreserve)
> > 
> > Maybe at some point we should use flags too here to cover all bools.
> > 
> 
> Right. I guess the main point was to not reuse RAM_XXX.
> 
> Should I introduce RAM_MMAP_XXX ?

Maybe we can directly use MAP_*?  Since I see qemu_ram_mmap() should only exist
with CONFIG_POSIX.  However indeed I see no sign to extend more bools in the
near future either, so maybe also fine to keep it as is, as 4 bools still looks
okay - your call. :)

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-03-02 20:54         ` Peter Xu
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Xu @ 2021-03-02 20:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, kvm,
	Michael S. Tsirkin, qemu-s390x, Stefan Weil,
	Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Tue, Mar 02, 2021 at 08:02:34PM +0100, David Hildenbrand wrote:
> > > @@ -174,12 +175,18 @@ void *qemu_ram_mmap(int fd,
> > >                       size_t align,
> > >                       bool readonly,
> > >                       bool shared,
> > > -                    bool is_pmem)
> > > +                    bool is_pmem,
> > > +                    bool noreserve)
> > 
> > Maybe at some point we should use flags too here to cover all bools.
> > 
> 
> Right. I guess the main point was to not reuse RAM_XXX.
> 
> Should I introduce RAM_MMAP_XXX ?

Maybe we can directly use MAP_*?  Since I see qemu_ram_mmap() should only exist
with CONFIG_POSIX.  However indeed I see no sign to extend more bools in the
near future either, so maybe also fine to keep it as is, as 4 bools still looks
okay - your call. :)

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-03-02 20:54         ` Peter Xu
@ 2021-03-02 20:58           ` David Hildenbrand
  -1 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-02 20:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Christian Borntraeger, Cornelia Huck, Halil Pasic,
	Igor Kotrasinski, Juan Quintela, Stefan Weil, Thomas Huth, kvm,
	qemu-s390x

On 02.03.21 21:54, Peter Xu wrote:
> On Tue, Mar 02, 2021 at 08:02:34PM +0100, David Hildenbrand wrote:
>>>> @@ -174,12 +175,18 @@ void *qemu_ram_mmap(int fd,
>>>>                        size_t align,
>>>>                        bool readonly,
>>>>                        bool shared,
>>>> -                    bool is_pmem)
>>>> +                    bool is_pmem,
>>>> +                    bool noreserve)
>>>
>>> Maybe at some point we should use flags too here to cover all bools.
>>>
>>
>> Right. I guess the main point was to not reuse RAM_XXX.
>>
>> Should I introduce RAM_MMAP_XXX ?
> 
> Maybe we can directly use MAP_*?  Since I see qemu_ram_mmap() should only exist

I think the issue is that there is for example no flag that corresponds 
to "is_pmem" - and the fallback logic in our mmap code to make "is_pmem" 
still work is a little bit more involved. In addition, "readonly" 
translates to PROT_READ ...

> with CONFIG_POSIX.  However indeed I see no sign to extend more bools in the
> near future either, so maybe also fine to keep it as is, as 4 bools still looks
> okay - your call. :)

Well, I had the same idea when I added yet another bool :) But I guess 
we won't be adding a lot of additional flags in the near future. 
(MAP_POPULATE? ;) fortunately we use a different approach to populate 
memory)

I'll think about it, not sure yet if this is worth proper flags. Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-03-02 20:58           ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-02 20:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, kvm,
	Michael S. Tsirkin, qemu-s390x, Stefan Weil,
	Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 02.03.21 21:54, Peter Xu wrote:
> On Tue, Mar 02, 2021 at 08:02:34PM +0100, David Hildenbrand wrote:
>>>> @@ -174,12 +175,18 @@ void *qemu_ram_mmap(int fd,
>>>>                        size_t align,
>>>>                        bool readonly,
>>>>                        bool shared,
>>>> -                    bool is_pmem)
>>>> +                    bool is_pmem,
>>>> +                    bool noreserve)
>>>
>>> Maybe at some point we should use flags too here to cover all bools.
>>>
>>
>> Right. I guess the main point was to not reuse RAM_XXX.
>>
>> Should I introduce RAM_MMAP_XXX ?
> 
> Maybe we can directly use MAP_*?  Since I see qemu_ram_mmap() should only exist

I think the issue is that there is for example no flag that corresponds 
to "is_pmem" - and the fallback logic in our mmap code to make "is_pmem" 
still work is a little bit more involved. In addition, "readonly" 
translates to PROT_READ ...

> with CONFIG_POSIX.  However indeed I see no sign to extend more bools in the
> near future either, so maybe also fine to keep it as is, as 4 bools still looks
> okay - your call. :)

Well, I had the same idea when I added yet another bool :) But I guess 
we won't be adding a lot of additional flags in the near future. 
(MAP_POPULATE? ;) fortunately we use a different approach to populate 
memory)

I'll think about it, not sure yet if this is worth proper flags. Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE
  2021-03-02 19:01     ` David Hildenbrand
@ 2021-03-02 21:44       ` Peter Xu
  2021-03-03 10:14         ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2021-03-02 21:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, Michael S. Tsirkin,
	Stefan Weil, Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Tue, Mar 02, 2021 at 08:01:11PM +0100, David Hildenbrand wrote:
> On 02.03.21 18:51, Peter Xu wrote:
> > On Tue, Feb 09, 2021 at 02:49:38PM +0100, David Hildenbrand wrote:
> > > +#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory"
> > > +static bool map_noreserve_effective(int fd, bool shared)
> > > +{
> > > +#if defined(__linux__)
> > > +    gchar *content = NULL;
> > > +    const char *endptr;
> > > +    unsigned int tmp;
> > > +
> > > +    /* hugetlbfs behaves differently */
> > > +    if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) {
> > > +        return true;
> > > +    }
> > > +
> > > +    /* only private shared mappings are accounted (ignoring /dev/zero) */
> > > +    if (fd != -1 && shared) {
> > > +        return true;
> > > +    }

[1]

> > > +
> > > +    if (g_file_get_contents(OVERCOMMIT_MEMORY_PATH, &content, NULL, NULL) &&
> > > +        !qemu_strtoui(content, &endptr, 0, &tmp) &&
> > > +        (!endptr || *endptr == '\n')) {
> > > +        if (tmp == 2) {
> > > +            error_report("Skipping reservation of swap space is not supported: "
> > > +                         " \"" OVERCOMMIT_MEMORY_PATH "\" is \"2\"");
> > > +            return false;
> > > +        }
> > > +        return true;
> > > +    }
> > > +    /* this interface has been around since Linux 2.6 */
> > > +    error_report("Skipping reservation of swap space is not supported: "
> > > +                 " Could not read: \"" OVERCOMMIT_MEMORY_PATH "\"");
> > > +    return false;
> > > +#else
> > > +    return true;
> > > +#endif
> > > +}
> > 
> > I feel like this helper wants to fail gracefully for some conditions.  Could
> > you elaborate one example and attach to the commit log?
> 
> Sure. The case is "/proc/sys/vm/overcommit_memory == 2" (never overcommit)
> 
> MAP_NORESERVE is without effect and sparse memory regions are somewhat
> impossible.
> 
> > 
> > I'm also wondering whether it would worth to check the global value.  Even if
> > overcommit is globally disabled, do we (as an application process) need to care
> > about it?  I think the MAP_NORESERVE would simply be silently ignored by the
> > kernel and that seems to be design of it, otherwise would all apps who uses > MAP_NORESERVE would need to do similar things too?
> 
> Right, I want to catch the "gets silently ignored" part, because someone
> requested "reserved=off" (!default) but does not actually get what he asked
> for.
> 
> As one example, glibc manages heaps via:
> 
> a) Creating a new heap: mmap(PROT_NONE, MAP_NORESERVE) the maximum size,
> then mprotect(PROT_READ|PROT_WRITE) the initial heap size. Even if
> MAP_NORESERVE is ignored, only !PROT_NONE memory ever gets committed
> ("reserve swap space") in Linux.
> 
> b) Growing the heap via mprotect(PROT_READ|PROT_WRITE) within the existing
> mmap. This will commit memory in case MAP_NORESERVE got ignored.
> 
> c) Shrinking the heap ("discard memory") via MADV_DONTNEED *unless*
> "/proc/sys/vm/overcommit_memory == 2" - the only way to undo
> mprotect(PROT_READ|PROT_WRITE) and to un-commit memory is by doing a
> mmap(PROT_NONE, MAP_FIXED) over the problematic region.
> 
> If you're interested, you can take a look at:
> 
> malloc/arena.c
> sysdeps/unix/sysv/linux/malloc-sysdep.h:check_may_shrink_heap()

Thanks for the context.  It's interesting to know libc has such special heap
operations.

Glibc shrinks heap to save memory for the no-over-commit case, however in our
case currently we'd like to fail some users using global_overcommit=2 but
reserve=off - it means even if we don't fail the user, mmap() could also fail
if it's overcommitted. Even if this mmap() didn't fail, it'll fail very easily
later on iiuc, right?

I think it's fine to have that early failure, it just seems less helpful than
what glibc was doing which shrinks active memory for real, meanwhile there
seems to encode some very detailed OS information into this helper, so just
less charming.

Btw above [1] "fd != -1 && shared" looks weird to me.

Firstly it'll bypass overcommit_memory==2 check and return true directly, is
that right?  I thought the global will be meaningful for all memories except
hugetlbfs (in do_mmap() of Linux).

Meanwhile, I don't see why file-backed share memories is so special too..  From
your commit message, I'm not sure whether you wanted to return false instead,
however that's still not the case IIUC, since e.g. /dev/shmem still does
accounting iiuc, while MAP_NORESERVE will skip it.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE
  2021-03-02 21:44       ` Peter Xu
@ 2021-03-03 10:14         ` David Hildenbrand
  2021-03-03 17:05           ` Peter Xu
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2021-03-03 10:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, Michael S. Tsirkin,
	Stefan Weil, Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 02.03.21 22:44, Peter Xu wrote:
> On Tue, Mar 02, 2021 at 08:01:11PM +0100, David Hildenbrand wrote:
>> On 02.03.21 18:51, Peter Xu wrote:
>>> On Tue, Feb 09, 2021 at 02:49:38PM +0100, David Hildenbrand wrote:
>>>> +#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory"
>>>> +static bool map_noreserve_effective(int fd, bool shared)
>>>> +{
>>>> +#if defined(__linux__)
>>>> +    gchar *content = NULL;
>>>> +    const char *endptr;
>>>> +    unsigned int tmp;
>>>> +
>>>> +    /* hugetlbfs behaves differently */
>>>> +    if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) {
>>>> +        return true;
>>>> +    }
>>>> +
>>>> +    /* only private shared mappings are accounted (ignoring /dev/zero) */
>>>> +    if (fd != -1 && shared) {
>>>> +        return true;
>>>> +    }
> 
> [1]
> 
>>>> +
>>>> +    if (g_file_get_contents(OVERCOMMIT_MEMORY_PATH, &content, NULL, NULL) &&
>>>> +        !qemu_strtoui(content, &endptr, 0, &tmp) &&
>>>> +        (!endptr || *endptr == '\n')) {
>>>> +        if (tmp == 2) {
>>>> +            error_report("Skipping reservation of swap space is not supported: "
>>>> +                         " \"" OVERCOMMIT_MEMORY_PATH "\" is \"2\"");
>>>> +            return false;
>>>> +        }
>>>> +        return true;
>>>> +    }
>>>> +    /* this interface has been around since Linux 2.6 */
>>>> +    error_report("Skipping reservation of swap space is not supported: "
>>>> +                 " Could not read: \"" OVERCOMMIT_MEMORY_PATH "\"");
>>>> +    return false;
>>>> +#else
>>>> +    return true;
>>>> +#endif
>>>> +}
>>>
>>> I feel like this helper wants to fail gracefully for some conditions.  Could
>>> you elaborate one example and attach to the commit log?
>>
>> Sure. The case is "/proc/sys/vm/overcommit_memory == 2" (never overcommit)
>>
>> MAP_NORESERVE is without effect and sparse memory regions are somewhat
>> impossible.
>>
>>>
>>> I'm also wondering whether it would worth to check the global value.  Even if
>>> overcommit is globally disabled, do we (as an application process) need to care
>>> about it?  I think the MAP_NORESERVE would simply be silently ignored by the
>>> kernel and that seems to be design of it, otherwise would all apps who uses > MAP_NORESERVE would need to do similar things too?
>>
>> Right, I want to catch the "gets silently ignored" part, because someone
>> requested "reserved=off" (!default) but does not actually get what he asked
>> for.
>>
>> As one example, glibc manages heaps via:
>>
>> a) Creating a new heap: mmap(PROT_NONE, MAP_NORESERVE) the maximum size,
>> then mprotect(PROT_READ|PROT_WRITE) the initial heap size. Even if
>> MAP_NORESERVE is ignored, only !PROT_NONE memory ever gets committed
>> ("reserve swap space") in Linux.
>>
>> b) Growing the heap via mprotect(PROT_READ|PROT_WRITE) within the existing
>> mmap. This will commit memory in case MAP_NORESERVE got ignored.
>>
>> c) Shrinking the heap ("discard memory") via MADV_DONTNEED *unless*
>> "/proc/sys/vm/overcommit_memory == 2" - the only way to undo
>> mprotect(PROT_READ|PROT_WRITE) and to un-commit memory is by doing a
>> mmap(PROT_NONE, MAP_FIXED) over the problematic region.
>>
>> If you're interested, you can take a look at:
>>
>> malloc/arena.c
>> sysdeps/unix/sysv/linux/malloc-sysdep.h:check_may_shrink_heap()
> 
> Thanks for the context.  It's interesting to know libc has such special heap
> operations.
> 
> Glibc shrinks heap to save memory for the no-over-commit case, however in our
> case currently we'd like to fail some users using global_overcommit=2 but
> reserve=off - it means even if we don't fail the user, mmap() could also fail
> if it's overcommitted. Even if this mmap() didn't fail, it'll fail very easily
> later on iiuc, right?

Here is the issue I want to catch.

Assume you create a VM with a virtio-mem device that can grow big in the 
future. You specify reserved=off. mmap() succeeds and you assume the 
very sparse memory region does not actually commit memory. But it 
commits all memory in the sparse mmap.

Assume you want to create another VM. It just fails although you still 
have plenty of free memory - because the previous MAP_NORESERVE got 
ignored.

I want to warn the user right away that the configuration is messed up 
and that "reserved=off" is not effective.

For anonymous memory, "reserved=off" will start really being useful when 
having a way to dynamically reserve swap space.

> 
> I think it's fine to have that early failure, it just seems less helpful than
> what glibc was doing which shrinks active memory for real, meanwhile there
> seems to encode some very detailed OS information into this helper, so just
> less charming.

It's not nice, but the messed-up Linux implementation is to blame. Just 
read the Linux man page of the mmap where there is an explicit link to 
"/proc/sys/vm/overcommit_memory" for this very reason.

> 
> Btw above [1] "fd != -1 && shared" looks weird to me.

As we never have shared anonymous memory (and that's good, because it is 
super weird) in QEMU, we can simplify to

"if (shared) { return true; }"

> 
> Firstly it'll bypass overcommit_memory==2 check and return true directly, is
> that right?  I thought the global will be meaningful for all memories except
> hugetlbfs (in do_mmap() of Linux).

See the description in the patch. On basically all (except anonymous 
shared) shared memory we don't ever reserve swap space.

See mmap_region():

if (accountable_mapping(file, vm_flags)) {
	charged = len >> PAGE_SHIFT;
	if (security_vm_enough_memory_mm(mm, charged))
		return -ENOMEM;
	vm_flags |= VM_ACCOUNT;
}

whereby

accountable_mapping():

if (file && is_file_hugepages(file))
	return 0;
return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;


in do_mmap(), we only affect if VM_NORESERVE is set or not.


And this makes perfect sense: Most* shared mappings don't need swap 
space because they can just be written back to the original backing 
storage (file). This is different to private mappings, which cannot be 
written back to the file - so we need swap space.

> 
> Meanwhile, I don't see why file-backed share memories is so special too..  From

Just think about 100000 processes mapping the same file shared. Would 
you want to reserve 100000 the same swap space for the same file?

Although you never* ever really need swap space because you can just 
writeback to the file instead of swapping.

> your commit message, I'm not sure whether you wanted to return false instead,
> however that's still not the case IIUC, since e.g. /dev/shmem still does
> accounting iiuc, while MAP_NORESERVE will skip it.

*except /dev/shmem. It actually needs swap space because the actual file 
resides in tmpfs->RAM but Linux will never ever commit that memory.

Main reason is: because you don't know when/how much to commit.

a) When creating/truncating the file we don't know if it's going to be
    sparse. So how much should we commit?
b) When mapping the file, the same comment as above applies: you don't
    want to reserve per-mapper but instead per-file.

You can create and map gigantic memfd/shmem files even with 
"/proc/sys/vm/overcommit_memory == 2", because we will never ever commit 
any of that memory. Yes, shmem is dangerous.

Shared mappings behave like always having MAP_NORESERVE, thus the 
"return true;"

It's interesting that you also raise this point: I also want to propose 
dynamic reservation of swap space for shmem in the future. Instead of 
being per process, this would have to be per file and similar allow to 
coordinate with the kernel how much memory we are actually intending to 
use (->commit) such that the kernel can properly account and reject if 
it wouldn't be possible. If only a single day would have more than 24 
hours :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-03-02 19:02       ` David Hildenbrand
@ 2021-03-03 11:35         ` Cornelia Huck
  -1 siblings, 0 replies; 41+ messages in thread
From: Cornelia Huck @ 2021-03-03 11:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, qemu-devel, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Christian Borntraeger, Halil Pasic,
	Igor Kotrasinski, Juan Quintela, Stefan Weil, Thomas Huth, kvm,
	qemu-s390x

On Tue, 2 Mar 2021 20:02:34 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 02.03.21 18:32, Peter Xu wrote:
> > On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:  
> >> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
> >>    * to grow. We also have to use MAP parameters that avoid
> >>    * read-only mapping of guest pages.
> >>    */
> >> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
> >> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
> >> +                               bool noreserve)
> >>   {
> >>       static void *mem;
> >>   
> >>       if (mem) {
> >>           /* we only support one allocation, which is enough for initial ram */
> >>           return NULL;
> >> +    } else if (noreserve) {
> >> +        error_report("Skipping reservation of swap space is not supported.");
> >> +        return NULL  
> > 
> > Semicolon missing.  
> 
> Thanks for catching that!

Regardless of that (and this patch set), can we finally get rid of
legacy_s390_alloc? We already fence off running with a kernel prior to
3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
still relevant? This seems to be a generation 10 feature; do we
realistically expect anyone running this on e.g. a z/VM host that
doesn't provide ESOP?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-03-03 11:35         ` Cornelia Huck
  0 siblings, 0 replies; 41+ messages in thread
From: Cornelia Huck @ 2021-03-03 11:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Igor Kotrasinski, Eduardo Habkost, kvm,
	Michael S. Tsirkin, qemu-s390x, Stefan Weil,
	Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Peter Xu, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Juan Quintela

On Tue, 2 Mar 2021 20:02:34 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 02.03.21 18:32, Peter Xu wrote:
> > On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:  
> >> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
> >>    * to grow. We also have to use MAP parameters that avoid
> >>    * read-only mapping of guest pages.
> >>    */
> >> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
> >> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
> >> +                               bool noreserve)
> >>   {
> >>       static void *mem;
> >>   
> >>       if (mem) {
> >>           /* we only support one allocation, which is enough for initial ram */
> >>           return NULL;
> >> +    } else if (noreserve) {
> >> +        error_report("Skipping reservation of swap space is not supported.");
> >> +        return NULL  
> > 
> > Semicolon missing.  
> 
> Thanks for catching that!

Regardless of that (and this patch set), can we finally get rid of
legacy_s390_alloc? We already fence off running with a kernel prior to
3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
still relevant? This seems to be a generation 10 feature; do we
realistically expect anyone running this on e.g. a z/VM host that
doesn't provide ESOP?



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-03-03 11:35         ` Cornelia Huck
@ 2021-03-03 11:37           ` David Hildenbrand
  -1 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-03 11:37 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Peter Xu, qemu-devel, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Christian Borntraeger, Halil Pasic,
	Igor Kotrasinski, Juan Quintela, Stefan Weil, Thomas Huth, kvm,
	qemu-s390x

On 03.03.21 12:35, Cornelia Huck wrote:
> On Tue, 2 Mar 2021 20:02:34 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 02.03.21 18:32, Peter Xu wrote:
>>> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>>>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
>>>>     * to grow. We also have to use MAP parameters that avoid
>>>>     * read-only mapping of guest pages.
>>>>     */
>>>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>>>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>>>> +                               bool noreserve)
>>>>    {
>>>>        static void *mem;
>>>>    
>>>>        if (mem) {
>>>>            /* we only support one allocation, which is enough for initial ram */
>>>>            return NULL;
>>>> +    } else if (noreserve) {
>>>> +        error_report("Skipping reservation of swap space is not supported.");
>>>> +        return NULL
>>>
>>> Semicolon missing.
>>
>> Thanks for catching that!
> 
> Regardless of that (and this patch set), can we finally get rid of
> legacy_s390_alloc? We already fence off running with a kernel prior to
> 3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
> still relevant? This seems to be a generation 10 feature; do we
> realistically expect anyone running this on e.g. a z/VM host that
> doesn't provide ESOP?

Good question - last time I asked that question (~2 years ago) I was 
told that such z/VM environemnts are still relevant.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-03-03 11:37           ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-03 11:37 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Thomas Huth, Igor Kotrasinski, Eduardo Habkost, kvm,
	Michael S. Tsirkin, qemu-s390x, Stefan Weil,
	Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Peter Xu, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Juan Quintela

On 03.03.21 12:35, Cornelia Huck wrote:
> On Tue, 2 Mar 2021 20:02:34 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 02.03.21 18:32, Peter Xu wrote:
>>> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>>>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
>>>>     * to grow. We also have to use MAP parameters that avoid
>>>>     * read-only mapping of guest pages.
>>>>     */
>>>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>>>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>>>> +                               bool noreserve)
>>>>    {
>>>>        static void *mem;
>>>>    
>>>>        if (mem) {
>>>>            /* we only support one allocation, which is enough for initial ram */
>>>>            return NULL;
>>>> +    } else if (noreserve) {
>>>> +        error_report("Skipping reservation of swap space is not supported.");
>>>> +        return NULL
>>>
>>> Semicolon missing.
>>
>> Thanks for catching that!
> 
> Regardless of that (and this patch set), can we finally get rid of
> legacy_s390_alloc? We already fence off running with a kernel prior to
> 3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
> still relevant? This seems to be a generation 10 feature; do we
> realistically expect anyone running this on e.g. a z/VM host that
> doesn't provide ESOP?

Good question - last time I asked that question (~2 years ago) I was 
told that such z/VM environemnts are still relevant.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-03-03 11:35         ` Cornelia Huck
@ 2021-03-03 11:39           ` Thomas Huth
  -1 siblings, 0 replies; 41+ messages in thread
From: Thomas Huth @ 2021-03-03 11:39 UTC (permalink / raw)
  To: Cornelia Huck, David Hildenbrand
  Cc: Peter Xu, qemu-devel, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Christian Borntraeger, Halil Pasic,
	Igor Kotrasinski, Juan Quintela, Stefan Weil, kvm, qemu-s390x

On 03/03/2021 12.35, Cornelia Huck wrote:
> On Tue, 2 Mar 2021 20:02:34 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 02.03.21 18:32, Peter Xu wrote:
>>> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>>>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
>>>>     * to grow. We also have to use MAP parameters that avoid
>>>>     * read-only mapping of guest pages.
>>>>     */
>>>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>>>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>>>> +                               bool noreserve)
>>>>    {
>>>>        static void *mem;
>>>>    
>>>>        if (mem) {
>>>>            /* we only support one allocation, which is enough for initial ram */
>>>>            return NULL;
>>>> +    } else if (noreserve) {
>>>> +        error_report("Skipping reservation of swap space is not supported.");
>>>> +        return NULL
>>>
>>> Semicolon missing.
>>
>> Thanks for catching that!
> 
> Regardless of that (and this patch set), can we finally get rid of
> legacy_s390_alloc? We already fence off running with a kernel prior to
> 3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
> still relevant? This seems to be a generation 10 feature; do we
> realistically expect anyone running this on e.g. a z/VM host that
> doesn't provide ESOP?

Looking at the support charts ( 
https://www.ibm.com/support/pages/ibm-mainframe-life-cycle-history ), the 
z10 is already unsupported. So if all newer mainframes have ESOP, I guess it 
should be fine to get rid of this code now.

  Thomas


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-03-03 11:39           ` Thomas Huth
  0 siblings, 0 replies; 41+ messages in thread
From: Thomas Huth @ 2021-03-03 11:39 UTC (permalink / raw)
  To: Cornelia Huck, David Hildenbrand
  Cc: Igor Kotrasinski, Eduardo Habkost, kvm, Michael S. Tsirkin,
	qemu-s390x, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Halil Pasic, Christian Borntraeger, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Philippe Mathieu-Daudé,
	Juan Quintela

On 03/03/2021 12.35, Cornelia Huck wrote:
> On Tue, 2 Mar 2021 20:02:34 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 02.03.21 18:32, Peter Xu wrote:
>>> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>>>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
>>>>     * to grow. We also have to use MAP parameters that avoid
>>>>     * read-only mapping of guest pages.
>>>>     */
>>>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>>>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>>>> +                               bool noreserve)
>>>>    {
>>>>        static void *mem;
>>>>    
>>>>        if (mem) {
>>>>            /* we only support one allocation, which is enough for initial ram */
>>>>            return NULL;
>>>> +    } else if (noreserve) {
>>>> +        error_report("Skipping reservation of swap space is not supported.");
>>>> +        return NULL
>>>
>>> Semicolon missing.
>>
>> Thanks for catching that!
> 
> Regardless of that (and this patch set), can we finally get rid of
> legacy_s390_alloc? We already fence off running with a kernel prior to
> 3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
> still relevant? This seems to be a generation 10 feature; do we
> realistically expect anyone running this on e.g. a z/VM host that
> doesn't provide ESOP?

Looking at the support charts ( 
https://www.ibm.com/support/pages/ibm-mainframe-life-cycle-history ), the 
z10 is already unsupported. So if all newer mainframes have ESOP, I guess it 
should be fine to get rid of this code now.

  Thomas



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-03-03 11:39           ` Thomas Huth
@ 2021-03-03 11:41             ` David Hildenbrand
  -1 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-03 11:41 UTC (permalink / raw)
  To: Thomas Huth, Cornelia Huck
  Cc: Peter Xu, qemu-devel, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Christian Borntraeger, Halil Pasic,
	Igor Kotrasinski, Juan Quintela, Stefan Weil, kvm, qemu-s390x

On 03.03.21 12:39, Thomas Huth wrote:
> On 03/03/2021 12.35, Cornelia Huck wrote:
>> On Tue, 2 Mar 2021 20:02:34 +0100
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 02.03.21 18:32, Peter Xu wrote:
>>>> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>>>>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
>>>>>      * to grow. We also have to use MAP parameters that avoid
>>>>>      * read-only mapping of guest pages.
>>>>>      */
>>>>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>>>>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>>>>> +                               bool noreserve)
>>>>>     {
>>>>>         static void *mem;
>>>>>     
>>>>>         if (mem) {
>>>>>             /* we only support one allocation, which is enough for initial ram */
>>>>>             return NULL;
>>>>> +    } else if (noreserve) {
>>>>> +        error_report("Skipping reservation of swap space is not supported.");
>>>>> +        return NULL
>>>>
>>>> Semicolon missing.
>>>
>>> Thanks for catching that!
>>
>> Regardless of that (and this patch set), can we finally get rid of
>> legacy_s390_alloc? We already fence off running with a kernel prior to
>> 3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
>> still relevant? This seems to be a generation 10 feature; do we
>> realistically expect anyone running this on e.g. a z/VM host that
>> doesn't provide ESOP?
> 
> Looking at the support charts (
> https://www.ibm.com/support/pages/ibm-mainframe-life-cycle-history ), the
> z10 is already unsupported. So if all newer mainframes have ESOP, I guess it
> should be fine to get rid of this code now.

I remember this was a z/VM issue, which would not provide this facility 
to its guests.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-03-03 11:41             ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-03 11:41 UTC (permalink / raw)
  To: Thomas Huth, Cornelia Huck
  Cc: Igor Kotrasinski, Eduardo Habkost, kvm, Michael S. Tsirkin,
	qemu-s390x, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Halil Pasic, Christian Borntraeger, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Philippe Mathieu-Daudé,
	Juan Quintela

On 03.03.21 12:39, Thomas Huth wrote:
> On 03/03/2021 12.35, Cornelia Huck wrote:
>> On Tue, 2 Mar 2021 20:02:34 +0100
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 02.03.21 18:32, Peter Xu wrote:
>>>> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>>>>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t offset, void *hostbuf,
>>>>>      * to grow. We also have to use MAP parameters that avoid
>>>>>      * read-only mapping of guest pages.
>>>>>      */
>>>>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>>>>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>>>>> +                               bool noreserve)
>>>>>     {
>>>>>         static void *mem;
>>>>>     
>>>>>         if (mem) {
>>>>>             /* we only support one allocation, which is enough for initial ram */
>>>>>             return NULL;
>>>>> +    } else if (noreserve) {
>>>>> +        error_report("Skipping reservation of swap space is not supported.");
>>>>> +        return NULL
>>>>
>>>> Semicolon missing.
>>>
>>> Thanks for catching that!
>>
>> Regardless of that (and this patch set), can we finally get rid of
>> legacy_s390_alloc? We already fence off running with a kernel prior to
>> 3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
>> still relevant? This seems to be a generation 10 feature; do we
>> realistically expect anyone running this on e.g. a z/VM host that
>> doesn't provide ESOP?
> 
> Looking at the support charts (
> https://www.ibm.com/support/pages/ibm-mainframe-life-cycle-history ), the
> z10 is already unsupported. So if all newer mainframes have ESOP, I guess it
> should be fine to get rid of this code now.

I remember this was a z/VM issue, which would not provide this facility 
to its guests.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-03-03 11:37           ` David Hildenbrand
@ 2021-03-03 12:12             ` Thomas Huth
  -1 siblings, 0 replies; 41+ messages in thread
From: Thomas Huth @ 2021-03-03 12:12 UTC (permalink / raw)
  To: David Hildenbrand, Cornelia Huck
  Cc: Peter Xu, qemu-devel, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Christian Borntraeger, Halil Pasic,
	Igor Kotrasinski, Juan Quintela, Stefan Weil, kvm, qemu-s390x

On 03/03/2021 12.37, David Hildenbrand wrote:
> On 03.03.21 12:35, Cornelia Huck wrote:
>> On Tue, 2 Mar 2021 20:02:34 +0100
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 02.03.21 18:32, Peter Xu wrote:
>>>> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>>>>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t 
>>>>> offset, void *hostbuf,
>>>>>     * to grow. We also have to use MAP parameters that avoid
>>>>>     * read-only mapping of guest pages.
>>>>>     */
>>>>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>>>>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>>>>> +                               bool noreserve)
>>>>>    {
>>>>>        static void *mem;
>>>>>        if (mem) {
>>>>>            /* we only support one allocation, which is enough for 
>>>>> initial ram */
>>>>>            return NULL;
>>>>> +    } else if (noreserve) {
>>>>> +        error_report("Skipping reservation of swap space is not 
>>>>> supported.");
>>>>> +        return NULL
>>>>
>>>> Semicolon missing.
>>>
>>> Thanks for catching that!
>>
>> Regardless of that (and this patch set), can we finally get rid of
>> legacy_s390_alloc? We already fence off running with a kernel prior to
>> 3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
>> still relevant? This seems to be a generation 10 feature; do we
>> realistically expect anyone running this on e.g. a z/VM host that
>> doesn't provide ESOP?
> 
> Good question - last time I asked that question (~2 years ago) I was told 
> that such z/VM environemnts are still relevant.

Now that you've mentioned it ... I've even wrote a blog post about z/VM and 
ESOP some years ago:

 
http://people.redhat.com/~thuth/blog/qemu/2017/04/05/s390x-selinux-problem.html

So if I've got that right again, the z/VM ESOP problem only exists on 
versions older than 6.3. And according to 
https://www.ibm.com/support/lifecycle/search?q=z%2FVM those old versions are 
now unsupported since June 2017 ... thus I guess it's valid to assume that 
nobody is running such an old z/VM version anymore (at least not to use it 
as an environment to run nested KVM guests).

  Thomas


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-03-03 12:12             ` Thomas Huth
  0 siblings, 0 replies; 41+ messages in thread
From: Thomas Huth @ 2021-03-03 12:12 UTC (permalink / raw)
  To: David Hildenbrand, Cornelia Huck
  Cc: Igor Kotrasinski, Eduardo Habkost, kvm, Michael S. Tsirkin,
	qemu-s390x, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Halil Pasic, Christian Borntraeger, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Philippe Mathieu-Daudé,
	Juan Quintela

On 03/03/2021 12.37, David Hildenbrand wrote:
> On 03.03.21 12:35, Cornelia Huck wrote:
>> On Tue, 2 Mar 2021 20:02:34 +0100
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 02.03.21 18:32, Peter Xu wrote:
>>>> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>>>>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t 
>>>>> offset, void *hostbuf,
>>>>>     * to grow. We also have to use MAP parameters that avoid
>>>>>     * read-only mapping of guest pages.
>>>>>     */
>>>>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>>>>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>>>>> +                               bool noreserve)
>>>>>    {
>>>>>        static void *mem;
>>>>>        if (mem) {
>>>>>            /* we only support one allocation, which is enough for 
>>>>> initial ram */
>>>>>            return NULL;
>>>>> +    } else if (noreserve) {
>>>>> +        error_report("Skipping reservation of swap space is not 
>>>>> supported.");
>>>>> +        return NULL
>>>>
>>>> Semicolon missing.
>>>
>>> Thanks for catching that!
>>
>> Regardless of that (and this patch set), can we finally get rid of
>> legacy_s390_alloc? We already fence off running with a kernel prior to
>> 3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
>> still relevant? This seems to be a generation 10 feature; do we
>> realistically expect anyone running this on e.g. a z/VM host that
>> doesn't provide ESOP?
> 
> Good question - last time I asked that question (~2 years ago) I was told 
> that such z/VM environemnts are still relevant.

Now that you've mentioned it ... I've even wrote a blog post about z/VM and 
ESOP some years ago:

 
http://people.redhat.com/~thuth/blog/qemu/2017/04/05/s390x-selinux-problem.html

So if I've got that right again, the z/VM ESOP problem only exists on 
versions older than 6.3. And according to 
https://www.ibm.com/support/lifecycle/search?q=z%2FVM those old versions are 
now unsupported since June 2017 ... thus I guess it's valid to assume that 
nobody is running such an old z/VM version anymore (at least not to use it 
as an environment to run nested KVM guests).

  Thomas



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-03-03 12:12             ` Thomas Huth
@ 2021-03-03 12:24               ` David Hildenbrand
  -1 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-03 12:24 UTC (permalink / raw)
  To: Thomas Huth, Cornelia Huck
  Cc: Peter Xu, qemu-devel, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Christian Borntraeger, Halil Pasic,
	Igor Kotrasinski, Juan Quintela, Stefan Weil, kvm, qemu-s390x

On 03.03.21 13:12, Thomas Huth wrote:
> On 03/03/2021 12.37, David Hildenbrand wrote:
>> On 03.03.21 12:35, Cornelia Huck wrote:
>>> On Tue, 2 Mar 2021 20:02:34 +0100
>>> David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> On 02.03.21 18:32, Peter Xu wrote:
>>>>> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>>>>>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t
>>>>>> offset, void *hostbuf,
>>>>>>      * to grow. We also have to use MAP parameters that avoid
>>>>>>      * read-only mapping of guest pages.
>>>>>>      */
>>>>>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>>>>>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>>>>>> +                               bool noreserve)
>>>>>>     {
>>>>>>         static void *mem;
>>>>>>         if (mem) {
>>>>>>             /* we only support one allocation, which is enough for
>>>>>> initial ram */
>>>>>>             return NULL;
>>>>>> +    } else if (noreserve) {
>>>>>> +        error_report("Skipping reservation of swap space is not
>>>>>> supported.");
>>>>>> +        return NULL
>>>>>
>>>>> Semicolon missing.
>>>>
>>>> Thanks for catching that!
>>>
>>> Regardless of that (and this patch set), can we finally get rid of
>>> legacy_s390_alloc? We already fence off running with a kernel prior to
>>> 3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
>>> still relevant? This seems to be a generation 10 feature; do we
>>> realistically expect anyone running this on e.g. a z/VM host that
>>> doesn't provide ESOP?
>>
>> Good question - last time I asked that question (~2 years ago) I was told
>> that such z/VM environemnts are still relevant.
> 
> Now that you've mentioned it ... I've even wrote a blog post about z/VM and
> ESOP some years ago:
> 
>   
> http://people.redhat.com/~thuth/blog/qemu/2017/04/05/s390x-selinux-problem.html
> 
> So if I've got that right again, the z/VM ESOP problem only exists on
> versions older than 6.3. And according to
> https://www.ibm.com/support/lifecycle/search?q=z%2FVM those old versions are
> now unsupported since June 2017 ... thus I guess it's valid to assume that
> nobody is running such an old z/VM version anymore (at least not to use it
> as an environment to run nested KVM guests).

Thanks for that info, I'll send a patch proposing to rip it out - that 
will make things nicer.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-03-03 12:24               ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-03 12:24 UTC (permalink / raw)
  To: Thomas Huth, Cornelia Huck
  Cc: Igor Kotrasinski, Eduardo Habkost, kvm, Michael S. Tsirkin,
	qemu-s390x, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Halil Pasic, Christian Borntraeger, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Philippe Mathieu-Daudé,
	Juan Quintela

On 03.03.21 13:12, Thomas Huth wrote:
> On 03/03/2021 12.37, David Hildenbrand wrote:
>> On 03.03.21 12:35, Cornelia Huck wrote:
>>> On Tue, 2 Mar 2021 20:02:34 +0100
>>> David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> On 02.03.21 18:32, Peter Xu wrote:
>>>>> On Tue, Feb 09, 2021 at 02:49:37PM +0100, David Hildenbrand wrote:
>>>>>> @@ -899,13 +899,17 @@ int kvm_s390_mem_op_pv(S390CPU *cpu, uint64_t
>>>>>> offset, void *hostbuf,
>>>>>>      * to grow. We also have to use MAP parameters that avoid
>>>>>>      * read-only mapping of guest pages.
>>>>>>      */
>>>>>> -static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
>>>>>> +static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared,
>>>>>> +                               bool noreserve)
>>>>>>     {
>>>>>>         static void *mem;
>>>>>>         if (mem) {
>>>>>>             /* we only support one allocation, which is enough for
>>>>>> initial ram */
>>>>>>             return NULL;
>>>>>> +    } else if (noreserve) {
>>>>>> +        error_report("Skipping reservation of swap space is not
>>>>>> supported.");
>>>>>> +        return NULL
>>>>>
>>>>> Semicolon missing.
>>>>
>>>> Thanks for catching that!
>>>
>>> Regardless of that (and this patch set), can we finally get rid of
>>> legacy_s390_alloc? We already fence off running with a kernel prior to
>>> 3.15, and KVM_CAP_S390_COW depends on ESOP -- are non-ESOP kvm hosts
>>> still relevant? This seems to be a generation 10 feature; do we
>>> realistically expect anyone running this on e.g. a z/VM host that
>>> doesn't provide ESOP?
>>
>> Good question - last time I asked that question (~2 years ago) I was told
>> that such z/VM environemnts are still relevant.
> 
> Now that you've mentioned it ... I've even wrote a blog post about z/VM and
> ESOP some years ago:
> 
>   
> http://people.redhat.com/~thuth/blog/qemu/2017/04/05/s390x-selinux-problem.html
> 
> So if I've got that right again, the z/VM ESOP problem only exists on
> versions older than 6.3. And according to
> https://www.ibm.com/support/lifecycle/search?q=z%2FVM those old versions are
> now unsupported since June 2017 ... thus I guess it's valid to assume that
> nobody is running such an old z/VM version anymore (at least not to use it
> as an environment to run nested KVM guests).

Thanks for that info, I'll send a patch proposing to rip it out - that 
will make things nicer.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE
  2021-03-03 10:14         ` David Hildenbrand
@ 2021-03-03 17:05           ` Peter Xu
  2021-03-04 16:15             ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2021-03-03 17:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, Michael S. Tsirkin,
	Stefan Weil, Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Wed, Mar 03, 2021 at 11:14:10AM +0100, David Hildenbrand wrote:
> On 02.03.21 22:44, Peter Xu wrote:
> > On Tue, Mar 02, 2021 at 08:01:11PM +0100, David Hildenbrand wrote:
> > > On 02.03.21 18:51, Peter Xu wrote:
> > > > On Tue, Feb 09, 2021 at 02:49:38PM +0100, David Hildenbrand wrote:
> > > > > +#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory"
> > > > > +static bool map_noreserve_effective(int fd, bool shared)
> > > > > +{
> > > > > +#if defined(__linux__)
> > > > > +    gchar *content = NULL;
> > > > > +    const char *endptr;
> > > > > +    unsigned int tmp;
> > > > > +
> > > > > +    /* hugetlbfs behaves differently */
> > > > > +    if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) {
> > > > > +        return true;
> > > > > +    }
> > > > > +
> > > > > +    /* only private shared mappings are accounted (ignoring /dev/zero) */
> > > > > +    if (fd != -1 && shared) {
> > > > > +        return true;
> > > > > +    }
> > 
> > [1]
> > 
> > > > > +
> > > > > +    if (g_file_get_contents(OVERCOMMIT_MEMORY_PATH, &content, NULL, NULL) &&
> > > > > +        !qemu_strtoui(content, &endptr, 0, &tmp) &&
> > > > > +        (!endptr || *endptr == '\n')) {
> > > > > +        if (tmp == 2) {
> > > > > +            error_report("Skipping reservation of swap space is not supported: "
> > > > > +                         " \"" OVERCOMMIT_MEMORY_PATH "\" is \"2\"");
> > > > > +            return false;
> > > > > +        }
> > > > > +        return true;
> > > > > +    }
> > > > > +    /* this interface has been around since Linux 2.6 */
> > > > > +    error_report("Skipping reservation of swap space is not supported: "
> > > > > +                 " Could not read: \"" OVERCOMMIT_MEMORY_PATH "\"");
> > > > > +    return false;
> > > > > +#else
> > > > > +    return true;
> > > > > +#endif
> > > > > +}
> > > > 
> > > > I feel like this helper wants to fail gracefully for some conditions.  Could
> > > > you elaborate one example and attach to the commit log?
> > > 
> > > Sure. The case is "/proc/sys/vm/overcommit_memory == 2" (never overcommit)
> > > 
> > > MAP_NORESERVE is without effect and sparse memory regions are somewhat
> > > impossible.
> > > 
> > > > 
> > > > I'm also wondering whether it would worth to check the global value.  Even if
> > > > overcommit is globally disabled, do we (as an application process) need to care
> > > > about it?  I think the MAP_NORESERVE would simply be silently ignored by the
> > > > kernel and that seems to be design of it, otherwise would all apps who uses > MAP_NORESERVE would need to do similar things too?
> > > 
> > > Right, I want to catch the "gets silently ignored" part, because someone
> > > requested "reserved=off" (!default) but does not actually get what he asked
> > > for.
> > > 
> > > As one example, glibc manages heaps via:
> > > 
> > > a) Creating a new heap: mmap(PROT_NONE, MAP_NORESERVE) the maximum size,
> > > then mprotect(PROT_READ|PROT_WRITE) the initial heap size. Even if
> > > MAP_NORESERVE is ignored, only !PROT_NONE memory ever gets committed
> > > ("reserve swap space") in Linux.
> > > 
> > > b) Growing the heap via mprotect(PROT_READ|PROT_WRITE) within the existing
> > > mmap. This will commit memory in case MAP_NORESERVE got ignored.
> > > 
> > > c) Shrinking the heap ("discard memory") via MADV_DONTNEED *unless*
> > > "/proc/sys/vm/overcommit_memory == 2" - the only way to undo
> > > mprotect(PROT_READ|PROT_WRITE) and to un-commit memory is by doing a
> > > mmap(PROT_NONE, MAP_FIXED) over the problematic region.
> > > 
> > > If you're interested, you can take a look at:
> > > 
> > > malloc/arena.c
> > > sysdeps/unix/sysv/linux/malloc-sysdep.h:check_may_shrink_heap()
> > 
> > Thanks for the context.  It's interesting to know libc has such special heap
> > operations.
> > 
> > Glibc shrinks heap to save memory for the no-over-commit case, however in our
> > case currently we'd like to fail some users using global_overcommit=2 but
> > reserve=off - it means even if we don't fail the user, mmap() could also fail
> > if it's overcommitted. Even if this mmap() didn't fail, it'll fail very easily
> > later on iiuc, right?
> 
> Here is the issue I want to catch.
> 
> Assume you create a VM with a virtio-mem device that can grow big in the
> future. You specify reserved=off. mmap() succeeds and you assume the very
> sparse memory region does not actually commit memory. But it commits all
> memory in the sparse mmap.
> 
> Assume you want to create another VM. It just fails although you still have
> plenty of free memory - because the previous MAP_NORESERVE got ignored.

Right.  So I think that's moving the failure earlier.  Again I am not sure how
much it would help but it should be slightly better than a latter failure.

> 
> I want to warn the user right away that the configuration is messed up and
> that "reserved=off" is not effective.
> 
> For anonymous memory, "reserved=off" will start really being useful when
> having a way to dynamically reserve swap space.
> 
> > 
> > I think it's fine to have that early failure, it just seems less helpful than
> > what glibc was doing which shrinks active memory for real, meanwhile there
> > seems to encode some very detailed OS information into this helper, so just
> > less charming.
> 
> It's not nice, but the messed-up Linux implementation is to blame. Just read
> the Linux man page of the mmap where there is an explicit link to
> "/proc/sys/vm/overcommit_memory" for this very reason.
> 
> > 
> > Btw above [1] "fd != -1 && shared" looks weird to me.
> 
> As we never have shared anonymous memory (and that's good, because it is
> super weird) in QEMU, we can simplify to
> 
> "if (shared) { return true; }"

Yes this looks easier to read.  Maybe you can assert the fd in the block too if
you are pretty sure of it.

> 
> > 
> > Firstly it'll bypass overcommit_memory==2 check and return true directly, is
> > that right?  I thought the global will be meaningful for all memories except
> > hugetlbfs (in do_mmap() of Linux).
> 
> See the description in the patch. On basically all (except anonymous shared)
> shared memory we don't ever reserve swap space.
> 
> See mmap_region():
> 
> if (accountable_mapping(file, vm_flags)) {
> 	charged = len >> PAGE_SHIFT;
> 	if (security_vm_enough_memory_mm(mm, charged))
> 		return -ENOMEM;
> 	vm_flags |= VM_ACCOUNT;
> }
> 
> whereby
> 
> accountable_mapping():
> 
> if (file && is_file_hugepages(file))
> 	return 0;
> return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
> 
> 
> in do_mmap(), we only affect if VM_NORESERVE is set or not.
> 
> 
> And this makes perfect sense: Most* shared mappings don't need swap space
> because they can just be written back to the original backing storage
> (file). This is different to private mappings, which cannot be written back
> to the file - so we need swap space.
> 
> > 
> > Meanwhile, I don't see why file-backed share memories is so special too..  From
> 
> Just think about 100000 processes mapping the same file shared. Would you
> want to reserve 100000 the same swap space for the same file?
> 
> Although you never* ever really need swap space because you can just
> writeback to the file instead of swapping.
> 
> > your commit message, I'm not sure whether you wanted to return false instead,
> > however that's still not the case IIUC, since e.g. /dev/shmem still does
> > accounting iiuc, while MAP_NORESERVE will skip it.
> 
> *except /dev/shmem. It actually needs swap space because the actual file
> resides in tmpfs->RAM but Linux will never ever commit that memory.
> 
> Main reason is: because you don't know when/how much to commit.
> 
> a) When creating/truncating the file we don't know if it's going to be
>    sparse. So how much should we commit?
> b) When mapping the file, the same comment as above applies: you don't
>    want to reserve per-mapper but instead per-file.
> 
> You can create and map gigantic memfd/shmem files even with
> "/proc/sys/vm/overcommit_memory == 2", because we will never ever commit any
> of that memory. Yes, shmem is dangerous.
> 
> Shared mappings behave like always having MAP_NORESERVE, thus the "return
> true;"
> 
> It's interesting that you also raise this point: I also want to propose
> dynamic reservation of swap space for shmem in the future. Instead of being
> per process, this would have to be per file and similar allow to coordinate
> with the kernel how much memory we are actually intending to use (->commit)
> such that the kernel can properly account and reject if it wouldn't be
> possible. If only a single day would have more than 24 hours :)

Thanks, all you said looks sane.

I can't say I fully understand what you'd like to propose.  To me, a sane user
of /dev/shm should really fallocate() on that before using, but I could also be
wrong.

Another trivial suggestion is you can also add these function pointers into the
comment of the helper, e.g., accountable_mapping() as you referenced.  So it
would be easier for people to understand where these code came from.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE
  2021-03-03 17:05           ` Peter Xu
@ 2021-03-04 16:15             ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2021-03-04 16:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Thomas Huth, Cornelia Huck, Eduardo Habkost, Michael S. Tsirkin,
	Stefan Weil, Murilo Opsfelder Araujo, Richard Henderson,
	Dr. David Alan Gilbert, Juan Quintela, qemu-devel, Halil Pasic,
	Christian Borntraeger, Greg Kurz, Stefan Hajnoczi, Igor Mammedov,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

>> I want to warn the user right away that the configuration is messed up and
>> that "reserved=off" is not effective.
>>
>> For anonymous memory, "reserved=off" will start really being useful when
>> having a way to dynamically reserve swap space.
>>
>>>
>>> I think it's fine to have that early failure, it just seems less helpful than
>>> what glibc was doing which shrinks active memory for real, meanwhile there
>>> seems to encode some very detailed OS information into this helper, so just
>>> less charming.
>>
>> It's not nice, but the messed-up Linux implementation is to blame. Just read
>> the Linux man page of the mmap where there is an explicit link to
>> "/proc/sys/vm/overcommit_memory" for this very reason.
>>
>>>
>>> Btw above [1] "fd != -1 && shared" looks weird to me.
>>
>> As we never have shared anonymous memory (and that's good, because it is
>> super weird) in QEMU, we can simplify to
>>
>> "if (shared) { return true; }"
> 
> Yes this looks easier to read.  Maybe you can assert the fd in the block too if
> you are pretty sure of it.

Yes, and I'll also add more comments to clarify.

[...]

>> It's interesting that you also raise this point: I also want to propose
>> dynamic reservation of swap space for shmem in the future. Instead of being
>> per process, this would have to be per file and similar allow to coordinate
>> with the kernel how much memory we are actually intending to use (->commit)
>> such that the kernel can properly account and reject if it wouldn't be
>> possible. If only a single day would have more than 24 hours :)
> 
> Thanks, all you said looks sane.
> 
> I can't say I fully understand what you'd like to propose.  To me, a sane user
> of /dev/shm should really fallocate() on that before using, but I could also be
> wrong.

Even then, no swap space gets reserved / memory gets accounted; as 
accounting is a bit expensive we don't want to do that on fault/discard 
time. It's still a bit of a puzzle.

Also, fallocate() gets pretty useless once you bring memory overcommit / 
virtio-balloon into the picture :(

> 
> Another trivial suggestion is you can also add these function pointers into the
> comment of the helper, e.g., accountable_mapping() as you referenced.  So it
> would be easier for people to understand where these code came from.

Yes, I'll try adding more comments.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2021-03-04 16:18 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-09 13:49 [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
2021-02-09 13:49 ` [PATCH v1 1/9] softmmu/physmem: drop "shared" parameter from ram_block_add() David Hildenbrand
2021-02-09 13:49 ` [PATCH v1 2/9] util/mmap-alloc: factor out calculation of the pagesize for the guard page David Hildenbrand
2021-02-09 13:49 ` [PATCH v1 3/9] util/mmap-alloc: factor out reserving of a memory region to mmap_reserve() David Hildenbrand
2021-02-09 13:49 ` [PATCH v1 4/9] util/mmap-alloc: factor out activating of memory to mmap_activate() David Hildenbrand
2021-02-09 13:49 ` [PATCH v1 5/9] softmmu/memory: pass ram_flags into qemu_ram_alloc_from_fd() David Hildenbrand
2021-03-02 17:17   ` Peter Xu
2021-02-09 13:49 ` [PATCH v1 6/9] softmmu/memory: pass ram_flags into memory_region_init_ram_shared_nomigrate() David Hildenbrand
2021-03-02 17:17   ` Peter Xu
2021-02-09 13:49 ` [PATCH v1 7/9] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap() David Hildenbrand
2021-02-09 13:49   ` David Hildenbrand
2021-03-02 17:32   ` Peter Xu
2021-03-02 17:32     ` Peter Xu
2021-03-02 19:02     ` David Hildenbrand
2021-03-02 19:02       ` David Hildenbrand
2021-03-02 20:54       ` Peter Xu
2021-03-02 20:54         ` Peter Xu
2021-03-02 20:58         ` David Hildenbrand
2021-03-02 20:58           ` David Hildenbrand
2021-03-03 11:35       ` Cornelia Huck
2021-03-03 11:35         ` Cornelia Huck
2021-03-03 11:37         ` David Hildenbrand
2021-03-03 11:37           ` David Hildenbrand
2021-03-03 12:12           ` Thomas Huth
2021-03-03 12:12             ` Thomas Huth
2021-03-03 12:24             ` David Hildenbrand
2021-03-03 12:24               ` David Hildenbrand
2021-03-03 11:39         ` Thomas Huth
2021-03-03 11:39           ` Thomas Huth
2021-03-03 11:41           ` David Hildenbrand
2021-03-03 11:41             ` David Hildenbrand
2021-02-09 13:49 ` [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE David Hildenbrand
2021-03-02 17:51   ` Peter Xu
2021-03-02 19:01     ` David Hildenbrand
2021-03-02 21:44       ` Peter Xu
2021-03-03 10:14         ` David Hildenbrand
2021-03-03 17:05           ` Peter Xu
2021-03-04 16:15             ` David Hildenbrand
2021-02-09 13:49 ` [PATCH v1 9/9] hostmem: wire up RAM_NORESERVE via "reserve" property David Hildenbrand
2021-03-02 17:55   ` Peter Xu
2021-03-02 13:12 ` [PATCH v1 0/9] RAM_NORESERVE, MAP_NORESERVE and hostmem " David Hildenbrand

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.