qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property
@ 2021-04-28 13:37 David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 01/15] util/mmap-alloc: Factor out calculation of the pagesize for the guard page David Hildenbrand
                   ` (14 more replies)
  0 siblings, 15 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, David Hildenbrand, Richard Henderson,
	Dr. David Alan Gilbert, Peter Xu, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Based-on: 20210406080126.24010-1-david@redhat.com

Some cleanups previously sent in other context (resizeable allocations),
followed by RAM_NORESERVE, implementing it under Linux using MAP_NORESERVE,
and letting users configure it for memory backens using the "reserve"
property (default: true).

MAP_NORESERVE under Linux has in the context of QEMU an effect on
1) Private/shared anonymous memory
-> memory-backend-ram,id=mem0,size=10G
2) Private fd-based mappings
-> memory-backend-file,id=mem0,size=10G,mem-path=/dev/shm/0
-> memory-backend-memfd,id=mem0,size=10G
3) Private/shared hugetlb mappings
-> memory-backend-memfd,id=mem0,size=10G,hugetlb=on,hugetlbsize=2M

With MAP_NORESERVE/"reserve=off", we won't be reserving swap space (1/2) or
huge pages (3) for the whole memory region.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM. MAP_NORESERVE tells the OS
"this mapping might be very sparse". This essentially allows
avoiding having to set "/proc/sys/vm/overcommit_memory == 1") when using
virtio-mem and also supporting hugetlbfs in the future.

v6 -> v7:
- Collected ACKs and RBs
- "hostmem: Wire up RAM_NORESERVE via "reserve" property"
-- Extended qapi/qom.json documentation
- "qmp: Include "reserve" property of memory backends"
-- Extended property description

v5 -> v6:
- "softmmu/memory: Pass ram_flags to memory_region_init ..."
-- Split up into two patches
---> "softmmu/memory: Pass ram_flags to memory_region.."
---> "softmmu/memory: Pass ram_flags to qemu_ram_alloc() ..."
-- Also set RAM_PREALLOC from qemu_ram_alloc_from_ptr()
- Collected acks/rbs

v4 -> v5:
- Sent out shared anonymous RAM fixes separately
- Rebased
- "hostmem: Wire up RAM_NORESERVE via "reserve" property"
-- Adjusted/simplified description of new "reserve" property
-- Properly add it to qapi/qom.json
- "qmp: Clarify memory backend properties returned via query-memdev"
-- Added
- "qmp: Include "share" property of memory backends"
-- Added
- "hmp: Print "share" property of memory backends with "info memdev""
- Added
- "qmp: Include "reserve" property of memory backends"
-- Adjust description of new "reserve" property

v3 -> v4:
- Minor comment/description updates
- "softmmu/physmem: Fix ram_block_discard_range() to handle shared ..."
-- Extended description
- "util/mmap-alloc: Pass flags instead of separate bools to ..."
-- Move flags to include/qemu/osdep.h and rename to "QEMU_MAP_*"
- "memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()"
-- Adjust to new flags. Handle errors in mmap_activate() for now.
- "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux"
-- Restrict support to Linux only for now
- "qmp: Include "reserve" property of memory backends"
-- Added
- "hmp: Print "reserve" property of memory backends with ..."
-- Added

v2 -> v3:
- Renamed "softmmu/physmem: Drop "shared" parameter from ram_block_add()"
  to "softmmu/physmem: Mark shared anonymous memory RAM_SHARED" and
  adjusted the description
- Added "softmmu/physmem: Fix ram_block_discard_range() to handle shared
  anonymous memory"
- Added "softmmu/physmem: Fix qemu_ram_remap() to handle shared anonymous
  memory"
- Added "util/mmap-alloc: Pass flags instead of separate bools to
  qemu_ram_mmap()"
- "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE"
-- Further tweak code comments
-- Handle shared anonymous memory

v1 -> v2:
- Rebased to upstream and phs_mem_alloc simplifications
-- Upsteam added the "map_offset" parameter to many RAM allocation
   interfaces.
- "softmmu/physmem: Drop "shared" parameter from ram_block_add()"
-- Use local variable "shared"
- "memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()"
-- Simplify due to phs_mem_alloc changes
- "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE"
-- Add a whole bunch of comments.
-- Exclude shared anonymous memory that QEMU doesn't use
-- Special-case readonly mappings

Cc: Peter Xu <peterx@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "Philippe Mathieu-Daudé" <philmd@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Cc: Greg Kurz <groug@kaod.org>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Marcel Apfelbaum <mapfelba@redhat.com>
Cc: Nitesh Lal <nilal@redhat.com>

David Hildenbrand (15):
  util/mmap-alloc: Factor out calculation of the pagesize for the guard
    page
  util/mmap-alloc: Factor out reserving of a memory region to
    mmap_reserve()
  util/mmap-alloc: Factor out activating of memory to mmap_activate()
  softmmu/memory: Pass ram_flags to qemu_ram_alloc_from_fd()
  softmmu/memory: Pass ram_flags to
    memory_region_init_ram_shared_nomigrate()
  softmmu/memory: Pass ram_flags to qemu_ram_alloc() and
    qemu_ram_alloc_internal()
  util/mmap-alloc: Pass flags instead of separate bools to
    qemu_ram_mmap()
  memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
  hostmem: Wire up RAM_NORESERVE via "reserve" property
  qmp: Clarify memory backend properties returned via query-memdev
  qmp: Include "share" property of memory backends
  hmp: Print "share" property of memory backends with "info memdev"
  qmp: Include "reserve" property of memory backends
  hmp: Print "reserve" property of memory backends with "info memdev"

 backends/hostmem-file.c                       |  11 +-
 backends/hostmem-memfd.c                      |   8 +-
 backends/hostmem-ram.c                        |   7 +-
 backends/hostmem.c                            |  32 +++
 hw/core/machine-hmp-cmds.c                    |   4 +
 hw/core/machine-qmp-cmds.c                    |   2 +
 hw/m68k/next-cube.c                           |   4 +-
 hw/misc/ivshmem.c                             |   5 +-
 include/exec/cpu-common.h                     |   1 +
 include/exec/memory.h                         |  42 ++--
 include/exec/ram_addr.h                       |   9 +-
 include/qemu/mmap-alloc.h                     |  16 +-
 include/qemu/osdep.h                          |  30 ++-
 include/sysemu/hostmem.h                      |   2 +-
 migration/ram.c                               |   3 +-
 qapi/machine.json                             |  16 +-
 qapi/qom.json                                 |  10 +
 .../memory-region-housekeeping.cocci          |   8 +-
 softmmu/memory.c                              |  27 ++-
 softmmu/physmem.c                             |  51 +++--
 util/mmap-alloc.c                             | 212 +++++++++++++-----
 util/oslib-posix.c                            |   7 +-
 util/oslib-win32.c                            |  13 +-
 23 files changed, 362 insertions(+), 158 deletions(-)

-- 
2.30.2



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v7 01/15] util/mmap-alloc: Factor out calculation of the pagesize for the guard page
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 02/15] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve() David Hildenbrand
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Igor Kotrasinski,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's factor out calculating the size of the guard page and rename the
variable to make it clearer that this pagesize only applies to the
guard page.

Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Cc: Igor Kotrasinski <i.kotrasinsk@partner.samsung.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index e6fa8b598b..24854064b4 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -82,6 +82,16 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
     return qemu_real_host_page_size;
 }
 
+static inline size_t mmap_guard_pagesize(int fd)
+{
+#if defined(__powerpc64__) && defined(__linux__)
+    /* Mappings in the same segment must share the same page size */
+    return qemu_fd_getpagesize(fd);
+#else
+    return qemu_real_host_page_size;
+#endif
+}
+
 void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
@@ -90,12 +100,12 @@ void *qemu_ram_mmap(int fd,
                     bool is_pmem,
                     off_t map_offset)
 {
+    const size_t guard_pagesize = mmap_guard_pagesize(fd);
     int prot;
     int flags;
     int map_sync_flags = 0;
     int guardfd;
     size_t offset;
-    size_t pagesize;
     size_t total;
     void *guardptr;
     void *ptr;
@@ -116,8 +126,7 @@ void *qemu_ram_mmap(int fd,
      * anonymous memory is OK.
      */
     flags = MAP_PRIVATE;
-    pagesize = qemu_fd_getpagesize(fd);
-    if (fd == -1 || pagesize == qemu_real_host_page_size) {
+    if (fd == -1 || guard_pagesize == qemu_real_host_page_size) {
         guardfd = -1;
         flags |= MAP_ANONYMOUS;
     } else {
@@ -126,7 +135,6 @@ void *qemu_ram_mmap(int fd,
     }
 #else
     guardfd = -1;
-    pagesize = qemu_real_host_page_size;
     flags = MAP_PRIVATE | MAP_ANONYMOUS;
 #endif
 
@@ -138,7 +146,7 @@ void *qemu_ram_mmap(int fd,
 
     assert(is_power_of_2(align));
     /* Always align to host page size */
-    assert(align >= pagesize);
+    assert(align >= guard_pagesize);
 
     flags = MAP_FIXED;
     flags |= fd == -1 ? MAP_ANONYMOUS : 0;
@@ -193,8 +201,8 @@ void *qemu_ram_mmap(int fd,
      * a guard page guarding against potential buffer overflows.
      */
     total -= offset;
-    if (total > size + pagesize) {
-        munmap(ptr + size + pagesize, total - size - pagesize);
+    if (total > size + guard_pagesize) {
+        munmap(ptr + size + guard_pagesize, total - size - guard_pagesize);
     }
 
     return ptr;
@@ -202,15 +210,8 @@ void *qemu_ram_mmap(int fd,
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size)
 {
-    size_t pagesize;
-
     if (ptr) {
         /* Unmap both the RAM block and the guard page */
-#if defined(__powerpc64__) && defined(__linux__)
-        pagesize = qemu_fd_getpagesize(fd);
-#else
-        pagesize = qemu_real_host_page_size;
-#endif
-        munmap(ptr, size + pagesize);
+        munmap(ptr, size + mmap_guard_pagesize(fd));
     }
 }
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 02/15] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve()
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 01/15] util/mmap-alloc: Factor out calculation of the pagesize for the guard page David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 03/15] util/mmap-alloc: Factor out activating of memory to mmap_activate() David Hildenbrand
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Igor Kotrasinski,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

We want to reserve a memory region without actually populating memory.
Let's factor that out.

Reviewed-by: Igor Kotrasinski <i.kotrasinsk@partner.samsung.com>
Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 58 +++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 24854064b4..223d66219c 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -82,6 +82,38 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
     return qemu_real_host_page_size;
 }
 
+/*
+ * Reserve a new memory region of the requested size to be used for mapping
+ * from the given fd (if any).
+ */
+static void *mmap_reserve(size_t size, int fd)
+{
+    int flags = MAP_PRIVATE;
+
+#if defined(__powerpc64__) && defined(__linux__)
+    /*
+     * On ppc64 mappings in the same segment (aka slice) must share the same
+     * page size. Since we will be re-allocating part of this segment
+     * from the supplied fd, we should make sure to use the same page size, to
+     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
+     * avoid allocating backing store memory.
+     * We do this unless we are using the system page size, in which case
+     * anonymous memory is OK.
+     */
+    if (fd == -1 || qemu_fd_getpagesize(fd) == qemu_real_host_page_size) {
+        fd = -1;
+        flags |= MAP_ANONYMOUS;
+    } else {
+        flags |= MAP_NORESERVE;
+    }
+#else
+    fd = -1;
+    flags |= MAP_ANONYMOUS;
+#endif
+
+    return mmap(0, size, PROT_NONE, flags, fd, 0);
+}
+
 static inline size_t mmap_guard_pagesize(int fd)
 {
 #if defined(__powerpc64__) && defined(__linux__)
@@ -104,7 +136,6 @@ void *qemu_ram_mmap(int fd,
     int prot;
     int flags;
     int map_sync_flags = 0;
-    int guardfd;
     size_t offset;
     size_t total;
     void *guardptr;
@@ -116,30 +147,7 @@ void *qemu_ram_mmap(int fd,
      */
     total = size + align;
 
-#if defined(__powerpc64__) && defined(__linux__)
-    /* On ppc64 mappings in the same segment (aka slice) must share the same
-     * page size. Since we will be re-allocating part of this segment
-     * from the supplied fd, we should make sure to use the same page size, to
-     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
-     * avoid allocating backing store memory.
-     * We do this unless we are using the system page size, in which case
-     * anonymous memory is OK.
-     */
-    flags = MAP_PRIVATE;
-    if (fd == -1 || guard_pagesize == qemu_real_host_page_size) {
-        guardfd = -1;
-        flags |= MAP_ANONYMOUS;
-    } else {
-        guardfd = fd;
-        flags |= MAP_NORESERVE;
-    }
-#else
-    guardfd = -1;
-    flags = MAP_PRIVATE | MAP_ANONYMOUS;
-#endif
-
-    guardptr = mmap(0, total, PROT_NONE, flags, guardfd, 0);
-
+    guardptr = mmap_reserve(total, fd);
     if (guardptr == MAP_FAILED) {
         return MAP_FAILED;
     }
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 03/15] util/mmap-alloc: Factor out activating of memory to mmap_activate()
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 01/15] util/mmap-alloc: Factor out calculation of the pagesize for the guard page David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 02/15] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve() David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 04/15] softmmu/memory: Pass ram_flags to qemu_ram_alloc_from_fd() David Hildenbrand
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, David Hildenbrand, Richard Henderson,
	Dr. David Alan Gilbert, Peter Xu, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

We want to activate memory within a reserved memory region, to make it
accessible. Let's factor that out.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 94 +++++++++++++++++++++++++----------------------
 1 file changed, 50 insertions(+), 44 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 223d66219c..0e2bd7bc0e 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -114,6 +114,52 @@ static void *mmap_reserve(size_t size, int fd)
     return mmap(0, size, PROT_NONE, flags, fd, 0);
 }
 
+/*
+ * Activate memory in a reserved region from the given fd (if any), to make
+ * it accessible.
+ */
+static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly,
+                           bool shared, bool is_pmem, off_t map_offset)
+{
+    const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
+    int map_sync_flags = 0;
+    int flags = MAP_FIXED;
+    void *activated_ptr;
+
+    flags |= fd == -1 ? MAP_ANONYMOUS : 0;
+    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
+    if (shared && is_pmem) {
+        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
+    }
+
+    activated_ptr = mmap(ptr, size, prot, flags | map_sync_flags, fd,
+                         map_offset);
+    if (activated_ptr == MAP_FAILED && map_sync_flags) {
+        if (errno == ENOTSUP) {
+            char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
+            char *file_name = g_malloc0(PATH_MAX);
+            int len = readlink(proc_link, file_name, PATH_MAX - 1);
+
+            if (len < 0) {
+                len = 0;
+            }
+            file_name[len] = '\0';
+            fprintf(stderr, "Warning: requesting persistence across crashes "
+                    "for backend file %s failed. Proceeding without "
+                    "persistence, data might become corrupted in case of host "
+                    "crash.\n", file_name);
+            g_free(proc_link);
+            g_free(file_name);
+        }
+        /*
+         * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
+         * again without these flags to handle backwards compatibility.
+         */
+        activated_ptr = mmap(ptr, size, prot, flags, fd, map_offset);
+    }
+    return activated_ptr;
+}
+
 static inline size_t mmap_guard_pagesize(int fd)
 {
 #if defined(__powerpc64__) && defined(__linux__)
@@ -133,13 +179,8 @@ void *qemu_ram_mmap(int fd,
                     off_t map_offset)
 {
     const size_t guard_pagesize = mmap_guard_pagesize(fd);
-    int prot;
-    int flags;
-    int map_sync_flags = 0;
-    size_t offset;
-    size_t total;
-    void *guardptr;
-    void *ptr;
+    size_t offset, total;
+    void *ptr, *guardptr;
 
     /*
      * Note: this always allocates at least one extra page of virtual address
@@ -156,45 +197,10 @@ void *qemu_ram_mmap(int fd,
     /* Always align to host page size */
     assert(align >= guard_pagesize);
 
-    flags = MAP_FIXED;
-    flags |= fd == -1 ? MAP_ANONYMOUS : 0;
-    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
-    if (shared && is_pmem) {
-        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
-    }
-
     offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
 
-    prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
-
-    ptr = mmap(guardptr + offset, size, prot,
-               flags | map_sync_flags, fd, map_offset);
-
-    if (ptr == MAP_FAILED && map_sync_flags) {
-        if (errno == ENOTSUP) {
-            char *proc_link, *file_name;
-            int len;
-            proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
-            file_name = g_malloc0(PATH_MAX);
-            len = readlink(proc_link, file_name, PATH_MAX - 1);
-            if (len < 0) {
-                len = 0;
-            }
-            file_name[len] = '\0';
-            fprintf(stderr, "Warning: requesting persistence across crashes "
-                    "for backend file %s failed. Proceeding without "
-                    "persistence, data might become corrupted in case of host "
-                    "crash.\n", file_name);
-            g_free(proc_link);
-            g_free(file_name);
-        }
-        /*
-         * if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
-         * we will remove these flags to handle compatibility.
-         */
-        ptr = mmap(guardptr + offset, size, prot, flags, fd, map_offset);
-    }
-
+    ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem,
+                        map_offset);
     if (ptr == MAP_FAILED) {
         munmap(guardptr, total);
         return MAP_FAILED;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 04/15] softmmu/memory: Pass ram_flags to qemu_ram_alloc_from_fd()
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (2 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 03/15] util/mmap-alloc: Factor out activating of memory to mmap_activate() David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 05/15] softmmu/memory: Pass ram_flags to memory_region_init_ram_shared_nomigrate() David Hildenbrand
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, David Hildenbrand, Richard Henderson,
	Dr. David Alan Gilbert, Peter Xu, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's pass in ram flags just like we do with qemu_ram_alloc_from_file(),
to clean up and prepare for more flags.

Simplify the documentation of passed ram flags: Looking at our
documentation of RAM_SHARED and RAM_PMEM is sufficient, no need to be
repetitive.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem-memfd.c | 7 ++++---
 hw/misc/ivshmem.c        | 5 ++---
 include/exec/memory.h    | 9 +++------
 include/exec/ram_addr.h  | 6 +-----
 softmmu/memory.c         | 7 +++----
 5 files changed, 13 insertions(+), 21 deletions(-)

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 69b0ae30bb..93b5d1a4cf 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -36,6 +36,7 @@ static void
 memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
     HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend);
+    uint32_t ram_flags;
     char *name;
     int fd;
 
@@ -53,9 +54,9 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     name = host_memory_backend_get_name(backend);
-    memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
-                                   name, backend->size,
-                                   backend->share, fd, 0, errp);
+    ram_flags = backend->share ? RAM_SHARED : 0;
+    memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
+                                   backend->size, ram_flags, fd, 0, errp);
     g_free(name);
 }
 
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index a1fa4878be..1ba4a98377 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -493,9 +493,8 @@ static void process_msg_shmem(IVShmemState *s, int fd, Error **errp)
     size = buf.st_size;
 
     /* mmap the region and map into the BAR2 */
-    memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s),
-                                   "ivshmem.bar2", size, true, fd, 0,
-                                   &local_err);
+    memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s), "ivshmem.bar2",
+                                   size, RAM_SHARED, fd, 0, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
         return;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 5728a681b2..8ad280e532 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -991,10 +991,7 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  *         (getpagesize()) will be used.
- * @ram_flags: Memory region features:
- *             - RAM_SHARED: memory must be mmaped with the MAP_SHARED flag
- *             - RAM_PMEM: the memory is persistent memory
- *             Other bits are ignored now.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  * @path: the path in which to allocate the RAM.
  * @readonly: true to open @path for reading, false for read/write.
  * @errp: pointer to Error*, to store an error if it happens.
@@ -1020,7 +1017,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @owner: the object that tracks the region's reference count
  * @name: the name of the region.
  * @size: size of the region.
- * @share: %true if memory must be mmaped with the MAP_SHARED flag
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  * @fd: the fd to mmap.
  * @offset: offset within the file referenced by fd
  * @errp: pointer to Error*, to store an error if it happens.
@@ -1032,7 +1029,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
                                     Object *owner,
                                     const char *name,
                                     uint64_t size,
-                                    bool share,
+                                    uint32_t ram_flags,
                                     int fd,
                                     ram_addr_t offset,
                                     Error **errp);
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 3cb9791df3..a7e3378340 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -104,11 +104,7 @@ long qemu_maxrampagesize(void);
  * Parameters:
  *  @size: the size in bytes of the ram block
  *  @mr: the memory region where the ram block is
- *  @ram_flags: specify the properties of the ram block, which can be one
- *              or bit-or of following values
- *              - RAM_SHARED: mmap the backing file or device with MAP_SHARED
- *              - RAM_PMEM: the backend @mem_path or @fd is persistent memory
- *              Other bits are ignored.
+ *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  *  @mem_path or @fd: specify the backing file or device
  *  @readonly: true to open @path for reading, false for read/write.
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/softmmu/memory.c b/softmmu/memory.c
index d4493ef9e4..8c3acd839e 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1611,7 +1611,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
                                     Object *owner,
                                     const char *name,
                                     uint64_t size,
-                                    bool share,
+                                    uint32_t ram_flags,
                                     int fd,
                                     ram_addr_t offset,
                                     Error **errp)
@@ -1621,9 +1621,8 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc_from_fd(size, mr,
-                                           share ? RAM_SHARED : 0,
-                                           fd, offset, false, &err);
+    mr->ram_block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, offset,
+                                           false, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 05/15] softmmu/memory: Pass ram_flags to memory_region_init_ram_shared_nomigrate()
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (3 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 04/15] softmmu/memory: Pass ram_flags to qemu_ram_alloc_from_fd() David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 06/15] softmmu/memory: Pass ram_flags to qemu_ram_alloc() and qemu_ram_alloc_internal() David Hildenbrand
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, David Hildenbrand, Richard Henderson,
	Dr. David Alan Gilbert, Peter Xu, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's forward ram_flags instead, renaming
memory_region_init_ram_shared_nomigrate() into
memory_region_init_ram_flags_nomigrate().

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem-ram.c                        |  6 +++--
 hw/m68k/next-cube.c                           |  4 ++--
 include/exec/memory.h                         | 24 +++++++++----------
 .../memory-region-housekeeping.cocci          |  8 +++----
 softmmu/memory.c                              | 18 +++++++-------
 5 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index 5cc53e76c9..741e701062 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -19,6 +19,7 @@
 static void
 ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
+    uint32_t ram_flags;
     char *name;
 
     if (!backend->size) {
@@ -27,8 +28,9 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     name = host_memory_backend_get_name(backend);
-    memory_region_init_ram_shared_nomigrate(&backend->mr, OBJECT(backend), name,
-                           backend->size, backend->share, errp);
+    ram_flags = backend->share ? RAM_SHARED : 0;
+    memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
+                                           backend->size, ram_flags, errp);
     g_free(name);
 }
 
diff --git a/hw/m68k/next-cube.c b/hw/m68k/next-cube.c
index 92b45d760f..59ccae0d5e 100644
--- a/hw/m68k/next-cube.c
+++ b/hw/m68k/next-cube.c
@@ -986,8 +986,8 @@ static void next_cube_init(MachineState *machine)
     sysbus_mmio_map(SYS_BUS_DEVICE(pcdev), 1, 0x02100000);
 
     /* BMAP memory */
-    memory_region_init_ram_shared_nomigrate(bmapm1, NULL, "next.bmapmem", 64,
-                                            true, &error_fatal);
+    memory_region_init_ram_flags_nomigrate(bmapm1, NULL, "next.bmapmem", 64,
+                                           RAM_SHARED, &error_fatal);
     memory_region_add_subregion(sysmem, 0x020c0000, bmapm1);
     /* The Rev_2.5_v66.bin firmware accesses it at 0x820c0020, too */
     memory_region_init_alias(bmapm2, NULL, "next.bmapmem2", bmapm1, 0x0, 64);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 8ad280e532..10179c6695 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -928,27 +928,27 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       Error **errp);
 
 /**
- * memory_region_init_ram_shared_nomigrate:  Initialize RAM memory region.
- *                                           Accesses into the region will
- *                                           modify memory directly.
+ * memory_region_init_ram_flags_nomigrate:  Initialize RAM memory region.
+ *                                          Accesses into the region will
+ *                                          modify memory directly.
  *
  * @mr: the #MemoryRegion to be initialized.
  * @owner: the object that tracks the region's reference count
  * @name: Region name, becomes part of RAMBlock name used in migration stream
  *        must be unique within any device
  * @size: size of the region.
- * @share: allow remapping RAM to different addresses
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED.
  * @errp: pointer to Error*, to store an error if it happens.
  *
- * Note that this function is similar to memory_region_init_ram_nomigrate.
- * The only difference is part of the RAM region can be remapped.
+ * Note that this function does not do anything to cause the data in the
+ * RAM memory region to be migrated; that is the responsibility of the caller.
  */
-void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr,
-                                             Object *owner,
-                                             const char *name,
-                                             uint64_t size,
-                                             bool share,
-                                             Error **errp);
+void memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
+                                            Object *owner,
+                                            const char *name,
+                                            uint64_t size,
+                                            uint32_t ram_flags,
+                                            Error **errp);
 
 /**
  * memory_region_init_resizeable_ram:  Initialize memory region with resizeable
diff --git a/scripts/coccinelle/memory-region-housekeeping.cocci b/scripts/coccinelle/memory-region-housekeeping.cocci
index c768d8140a..29651ebde9 100644
--- a/scripts/coccinelle/memory-region-housekeeping.cocci
+++ b/scripts/coccinelle/memory-region-housekeeping.cocci
@@ -127,8 +127,8 @@ static void device_fn(DeviceState *dev, ...)
 - memory_region_init_rom(E1, NULL, E2, E3, E4);
 + memory_region_init_rom(E1, obj, E2, E3, E4);
 |
-- memory_region_init_ram_shared_nomigrate(E1, NULL, E2, E3, E4, E5);
-+ memory_region_init_ram_shared_nomigrate(E1, obj, E2, E3, E4, E5);
+- memory_region_init_ram_flags_nomigrate(E1, NULL, E2, E3, E4, E5);
++ memory_region_init_ram_flags_nomigrate(E1, obj, E2, E3, E4, E5);
 )
   ...+>
 }
@@ -152,8 +152,8 @@ static void device_fn(DeviceState *dev, ...)
 - memory_region_init_rom(E1, NULL, E2, E3, E4);
 + memory_region_init_rom(E1, OBJECT(dev), E2, E3, E4);
 |
-- memory_region_init_ram_shared_nomigrate(E1, NULL, E2, E3, E4, E5);
-+ memory_region_init_ram_shared_nomigrate(E1, OBJECT(dev), E2, E3, E4, E5);
+- memory_region_init_ram_flags_nomigrate(E1, NULL, E2, E3, E4, E5);
++ memory_region_init_ram_flags_nomigrate(E1, OBJECT(dev), E2, E3, E4, E5);
 )
   ...+>
 }
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 8c3acd839e..67be0aa152 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1533,22 +1533,22 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
-    memory_region_init_ram_shared_nomigrate(mr, owner, name, size, false, errp);
+    memory_region_init_ram_flags_nomigrate(mr, owner, name, size, 0, errp);
 }
 
-void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr,
-                                             Object *owner,
-                                             const char *name,
-                                             uint64_t size,
-                                             bool share,
-                                             Error **errp)
+void memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
+                                            Object *owner,
+                                            const char *name,
+                                            uint64_t size,
+                                            uint32_t ram_flags,
+                                            Error **errp)
 {
     Error *err = NULL;
     memory_region_init(mr, owner, name, size);
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, share, mr, &err);
+    mr->ram_block = qemu_ram_alloc(size, ram_flags & RAM_SHARED, mr, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
@@ -1684,7 +1684,7 @@ void memory_region_init_rom_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
-    memory_region_init_ram_shared_nomigrate(mr, owner, name, size, false, errp);
+    memory_region_init_ram_flags_nomigrate(mr, owner, name, size, 0, errp);
     mr->readonly = true;
 }
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 06/15] softmmu/memory: Pass ram_flags to qemu_ram_alloc() and qemu_ram_alloc_internal()
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (4 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 05/15] softmmu/memory: Pass ram_flags to memory_region_init_ram_shared_nomigrate() David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 07/15] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap() David Hildenbrand
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, David Hildenbrand, Richard Henderson,
	Dr. David Alan Gilbert, Peter Xu, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's pass ram_flags to qemu_ram_alloc() and qemu_ram_alloc_internal(),
preparing for passing additional flags.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/ram_addr.h |  2 +-
 softmmu/memory.c        |  4 ++--
 softmmu/physmem.c       | 29 ++++++++++++-----------------
 3 files changed, 15 insertions(+), 20 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index a7e3378340..6d4513f8e2 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -122,7 +122,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
 
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                   MemoryRegion *mr, Error **errp);
-RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share, MemoryRegion *mr,
+RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags, MemoryRegion *mr,
                          Error **errp);
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t max_size,
                                     void (*resized)(const char*,
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 67be0aa152..580d2c06f5 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1548,7 +1548,7 @@ void memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, ram_flags & RAM_SHARED, mr, &err);
+    mr->ram_block = qemu_ram_alloc(size, ram_flags, mr, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
@@ -1704,7 +1704,7 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
     mr->terminates = true;
     mr->rom_device = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, false,  mr, &err);
+    mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index cc59f05593..11b45be271 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2108,12 +2108,15 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                   void (*resized)(const char*,
                                                   uint64_t length,
                                                   void *host),
-                                  void *host, bool resizeable, bool share,
+                                  void *host, uint32_t ram_flags,
                                   MemoryRegion *mr, Error **errp)
 {
     RAMBlock *new_block;
     Error *local_err = NULL;
 
+    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC)) == 0);
+    assert(!host ^ (ram_flags & RAM_PREALLOC));
+
     size = HOST_PAGE_ALIGN(size);
     max_size = HOST_PAGE_ALIGN(max_size);
     new_block = g_malloc0(sizeof(*new_block));
@@ -2125,15 +2128,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     new_block->fd = -1;
     new_block->page_size = qemu_real_host_page_size;
     new_block->host = host;
-    if (host) {
-        new_block->flags |= RAM_PREALLOC;
-    }
-    if (share) {
-        new_block->flags |= RAM_SHARED;
-    }
-    if (resizeable) {
-        new_block->flags |= RAM_RESIZEABLE;
-    }
+    new_block->flags = ram_flags;
     ram_block_add(new_block, &local_err);
     if (local_err) {
         g_free(new_block);
@@ -2146,15 +2141,15 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                    MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, size, NULL, host, false,
-                                   false, mr, errp);
+    return qemu_ram_alloc_internal(size, size, NULL, host, RAM_PREALLOC, mr,
+                                   errp);
 }
 
-RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share,
+RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                          MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, size, NULL, NULL, false,
-                                   share, mr, errp);
+    assert((ram_flags & ~RAM_SHARED) == 0);
+    return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
 }
 
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
@@ -2163,8 +2158,8 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
                                                      void *host),
                                      MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, maxsz, resized, NULL, true,
-                                   false, mr, errp);
+    return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
+                                   RAM_RESIZEABLE, mr, errp);
 }
 
 static void reclaim_ramblock(RAMBlock *block)
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 07/15] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (5 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 06/15] softmmu/memory: Pass ram_flags to qemu_ram_alloc() and qemu_ram_alloc_internal() David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 08/15] memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap() David Hildenbrand
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, David Hildenbrand, Richard Henderson,
	Dr. David Alan Gilbert, Peter Xu, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's pass flags instead of bools to prepare for passing other flags and
update the documentation of qemu_ram_mmap(). Introduce new QEMU_MAP_
flags that abstract the mmap() PROT_ and MAP_ flag handling and simplify
it.

We expose only flags that are currently supported by qemu_ram_mmap().
Maybe, we'll see qemu_mmap() in the future as well that can implement these
flags.

Note: We don't use MAP_ flags as some flags (e.g., MAP_SYNC) are only
defined for some systems and we want to always be able to identify
these flags reliably inside qemu_ram_mmap() -- for example, to properly
warn when some future flags are not available or effective on a system.
Also, this way we can simplify PROT_ handling as well.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/qemu/mmap-alloc.h | 16 +++++++++-------
 include/qemu/osdep.h      | 18 ++++++++++++++++++
 softmmu/physmem.c         |  8 +++++---
 util/mmap-alloc.c         | 15 ++++++++-------
 util/oslib-posix.c        |  3 ++-
 5 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 456ff87df1..90d0eee705 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -7,18 +7,22 @@ size_t qemu_fd_getpagesize(int fd);
 size_t qemu_mempath_getpagesize(const char *mem_path);
 
 /**
- * qemu_ram_mmap: mmap the specified file or device.
+ * qemu_ram_mmap: mmap anonymous memory, the specified file or device.
+ *
+ * mmap() abstraction to map guest RAM, simplifying flag handling, taking
+ * care of alignment requirements and installing guard pages.
  *
  * Parameters:
  *  @fd: the file or the device to mmap
  *  @size: the number of bytes to be mmaped
  *  @align: if not zero, specify the alignment of the starting mapping address;
  *          otherwise, the alignment in use will be determined by QEMU.
- *  @readonly: true for a read-only mapping, false for read/write.
- *  @shared: map has RAM_SHARED flag.
- *  @is_pmem: map has RAM_PMEM flag.
+ *  @qemu_map_flags: QEMU_MAP_* flags
  *  @map_offset: map starts at offset of map_offset from the start of fd
  *
+ * Internally, MAP_PRIVATE, MAP_ANONYMOUS and MAP_SHARED_VALIDATE are set
+ * implicitly based on other parameters.
+ *
  * Return:
  *  On success, return a pointer to the mapped area.
  *  On failure, return MAP_FAILED.
@@ -26,9 +30,7 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
 void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
-                    bool readonly,
-                    bool shared,
-                    bool is_pmem,
+                    uint32_t qemu_map_flags,
                     off_t map_offset);
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size);
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index cb2a07e472..a96d6cb7ac 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -366,6 +366,24 @@ void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared);
 void qemu_vfree(void *ptr);
 void qemu_anon_ram_free(void *ptr, size_t size);
 
+/*
+ * Abstraction of PROT_ and MAP_ flags as passed to mmap(), for example,
+ * consumed by qemu_ram_mmap().
+ */
+
+/* Map PROT_READ instead of PROT_READ | PROT_WRITE. */
+#define QEMU_MAP_READONLY   (1 << 0)
+
+/* Use MAP_SHARED instead of MAP_PRIVATE. */
+#define QEMU_MAP_SHARED     (1 << 1)
+
+/*
+ * Use MAP_SYNC | MAP_SHARED_VALIDATE if supported. Ignored without
+ * QEMU_MAP_SHARED. If mapping fails, warn and fallback to !QEMU_MAP_SYNC.
+ */
+#define QEMU_MAP_SYNC       (1 << 2)
+
+
 #define QEMU_MADV_INVALID -1
 
 #if defined(CONFIG_MADVISE)
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 11b45be271..2e8d1f47f0 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1533,6 +1533,7 @@ static void *file_ram_alloc(RAMBlock *block,
                             off_t offset,
                             Error **errp)
 {
+    uint32_t qemu_map_flags;
     void *area;
 
     block->page_size = qemu_fd_getpagesize(fd);
@@ -1580,9 +1581,10 @@ static void *file_ram_alloc(RAMBlock *block,
         perror("ftruncate");
     }
 
-    area = qemu_ram_mmap(fd, memory, block->mr->align, readonly,
-                         block->flags & RAM_SHARED, block->flags & RAM_PMEM,
-                         offset);
+    qemu_map_flags = readonly ? QEMU_MAP_READONLY : 0;
+    qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
+    qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
+    area = qemu_ram_mmap(fd, memory, block->mr->align, qemu_map_flags, offset);
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
                          "unable to map backing store for guest RAM");
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 0e2bd7bc0e..1ddc0e2a1e 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -118,9 +118,12 @@ static void *mmap_reserve(size_t size, int fd)
  * Activate memory in a reserved region from the given fd (if any), to make
  * it accessible.
  */
-static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly,
-                           bool shared, bool is_pmem, off_t map_offset)
+static void *mmap_activate(void *ptr, size_t size, int fd,
+                           uint32_t qemu_map_flags, off_t map_offset)
 {
+    const bool readonly = qemu_map_flags & QEMU_MAP_READONLY;
+    const bool shared = qemu_map_flags & QEMU_MAP_SHARED;
+    const bool sync = qemu_map_flags & QEMU_MAP_SYNC;
     const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
     int map_sync_flags = 0;
     int flags = MAP_FIXED;
@@ -128,7 +131,7 @@ static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly,
 
     flags |= fd == -1 ? MAP_ANONYMOUS : 0;
     flags |= shared ? MAP_SHARED : MAP_PRIVATE;
-    if (shared && is_pmem) {
+    if (shared && sync) {
         map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
     }
 
@@ -173,9 +176,7 @@ static inline size_t mmap_guard_pagesize(int fd)
 void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
-                    bool readonly,
-                    bool shared,
-                    bool is_pmem,
+                    uint32_t qemu_map_flags,
                     off_t map_offset)
 {
     const size_t guard_pagesize = mmap_guard_pagesize(fd);
@@ -199,7 +200,7 @@ void *qemu_ram_mmap(int fd,
 
     offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
 
-    ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem,
+    ptr = mmap_activate(guardptr + offset, size, fd, qemu_map_flags,
                         map_offset);
     if (ptr == MAP_FAILED) {
         munmap(guardptr, total);
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 36820fec16..c12108c6cb 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -229,8 +229,9 @@ void *qemu_memalign(size_t alignment, size_t size)
 /* alloc shared memory pages */
 void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
 {
+    const uint32_t qemu_map_flags = shared ? QEMU_MAP_SHARED : 0;
     size_t align = QEMU_VMALLOC_ALIGN;
-    void *ptr = qemu_ram_mmap(-1, size, align, false, shared, false, 0);
+    void *ptr = qemu_ram_mmap(-1, size, align, qemu_map_flags, 0);
 
     if (ptr == MAP_FAILED) {
         return NULL;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 08/15] memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (6 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 07/15] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap() David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux David Hildenbrand
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, David Hildenbrand, Richard Henderson,
	Dr. David Alan Gilbert, Peter Xu, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's introduce RAM_NORESERVE, allowing mmap'ing with MAP_NORESERVE. The
new flag has the following semantics:

"
RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge
pages if applicable) is skipped: will bail out if not supported. When not
set, the OS will do the reservation, if supported for the memory type.
"

Allow passing it into:
- memory_region_init_ram_nomigrate()
- memory_region_init_resizeable_ram()
- memory_region_init_ram_from_file()

... and teach qemu_ram_mmap() and qemu_anon_ram_alloc() about the flag.
Bail out if the flag is not supported, which is the case right now for
both, POSIX and win32. We will add Linux support next and allow specifying
RAM_NORESERVE via memory backends.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/cpu-common.h |  1 +
 include/exec/memory.h     | 15 ++++++++++++---
 include/exec/ram_addr.h   |  3 ++-
 include/qemu/osdep.h      |  9 ++++++++-
 migration/ram.c           |  3 +--
 softmmu/physmem.c         | 15 ++++++++++++---
 util/mmap-alloc.c         |  7 +++++++
 util/oslib-posix.c        |  6 ++++--
 util/oslib-win32.c        | 13 ++++++++++++-
 9 files changed, 59 insertions(+), 13 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 5a0a2d93e0..38a47ad4ac 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -58,6 +58,7 @@ void *qemu_ram_get_host_addr(RAMBlock *rb);
 ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
+bool qemu_ram_is_noreserve(RAMBlock *rb);
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
 void qemu_ram_set_uf_zeroable(RAMBlock *rb);
 bool qemu_ram_is_migratable(RAMBlock *rb);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 10179c6695..8d77819bcd 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -155,6 +155,13 @@ typedef struct IOMMUTLBEvent {
  */
 #define RAM_UF_WRITEPROTECT (1 << 6)
 
+/*
+ * RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge
+ * pages if applicable) is skipped: will bail out if not supported. When not
+ * set, the OS will do the reservation, if supported for the memory type.
+ */
+#define RAM_NORESERVE (1 << 7)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
@@ -937,7 +944,7 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
  * @name: Region name, becomes part of RAMBlock name used in migration stream
  *        must be unique within any device
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_NORESERVE.
  * @errp: pointer to Error*, to store an error if it happens.
  *
  * Note that this function does not do anything to cause the data in the
@@ -991,7 +998,8 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  *         (getpagesize()) will be used.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *             RAM_NORESERVE,
  * @path: the path in which to allocate the RAM.
  * @readonly: true to open @path for reading, false for read/write.
  * @errp: pointer to Error*, to store an error if it happens.
@@ -1017,7 +1025,8 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @owner: the object that tracks the region's reference count
  * @name: the name of the region.
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *             RAM_NORESERVE.
  * @fd: the fd to mmap.
  * @offset: offset within the file referenced by fd
  * @errp: pointer to Error*, to store an error if it happens.
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 6d4513f8e2..551876bed0 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -104,7 +104,8 @@ long qemu_maxrampagesize(void);
  * Parameters:
  *  @size: the size in bytes of the ram block
  *  @mr: the memory region where the ram block is
- *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *              RAM_NORESERVE.
  *  @mem_path or @fd: specify the backing file or device
  *  @readonly: true to open @path for reading, false for read/write.
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index a96d6cb7ac..af65b36698 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -362,7 +362,8 @@ extern "C" {
 int qemu_daemon(int nochdir, int noclose);
 void *qemu_try_memalign(size_t alignment, size_t size);
 void *qemu_memalign(size_t alignment, size_t size);
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared);
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
+                          bool noreserve);
 void qemu_vfree(void *ptr);
 void qemu_anon_ram_free(void *ptr, size_t size);
 
@@ -383,6 +384,12 @@ void qemu_anon_ram_free(void *ptr, size_t size);
  */
 #define QEMU_MAP_SYNC       (1 << 2)
 
+/*
+ * Use MAP_NORESERVE to skip reservation of swap space (or huge pages if
+ * applicable). Bail out if not supported/effective.
+ */
+#define QEMU_MAP_NORESERVE  (1 << 3)
+
 
 #define QEMU_MADV_INVALID -1
 
diff --git a/migration/ram.c b/migration/ram.c
index 4682f3625c..7b2d4dd449 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3371,8 +3371,7 @@ int colo_init_ram_cache(void)
     WITH_RCU_READ_LOCK_GUARD() {
         RAMBLOCK_FOREACH_NOT_IGNORED(block) {
             block->colo_cache = qemu_anon_ram_alloc(block->used_length,
-                                                    NULL,
-                                                    false);
+                                                    NULL, false, false);
             if (!block->colo_cache) {
                 error_report("%s: Can't alloc memory for COLO cache of block %s,"
                              "size 0x" RAM_ADDR_FMT, __func__, block->idstr,
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 2e8d1f47f0..1efb1d5193 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1584,6 +1584,7 @@ static void *file_ram_alloc(RAMBlock *block,
     qemu_map_flags = readonly ? QEMU_MAP_READONLY : 0;
     qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
     qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
+    qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
     area = qemu_ram_mmap(fd, memory, block->mr->align, qemu_map_flags, offset);
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
@@ -1704,6 +1705,11 @@ bool qemu_ram_is_shared(RAMBlock *rb)
     return rb->flags & RAM_SHARED;
 }
 
+bool qemu_ram_is_noreserve(RAMBlock *rb)
+{
+    return rb->flags & RAM_NORESERVE;
+}
+
 /* Note: Only set at the start of postcopy */
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
 {
@@ -1931,6 +1937,7 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
 
 static void ram_block_add(RAMBlock *new_block, Error **errp)
 {
+    const bool noreserve = qemu_ram_is_noreserve(new_block);
     const bool shared = qemu_ram_is_shared(new_block);
     RAMBlock *block;
     RAMBlock *last_block = NULL;
@@ -1954,7 +1961,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         } else {
             new_block->host = qemu_anon_ram_alloc(new_block->max_length,
                                                   &new_block->mr->align,
-                                                  shared);
+                                                  shared, noreserve);
             if (!new_block->host) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
@@ -2025,7 +2032,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     int64_t file_size, file_align;
 
     /* Just support these ram flags by now. */
-    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE)) == 0);
 
     if (xen_enabled()) {
         error_setg(errp, "-mem-path not supported with Xen");
@@ -2117,6 +2124,8 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     Error *local_err = NULL;
 
     assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC |
+                          RAM_NORESERVE)) == 0);
     assert(!host ^ (ram_flags & RAM_PREALLOC));
 
     size = HOST_PAGE_ALIGN(size);
@@ -2150,7 +2159,7 @@ RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
 RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                          MemoryRegion *mr, Error **errp)
 {
-    assert((ram_flags & ~RAM_SHARED) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
     return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
 }
 
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 1ddc0e2a1e..d0cf4aaee5 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
+#include "qemu/error-report.h"
 
 #define HUGETLBFS_MAGIC       0x958458f6
 
@@ -121,6 +122,7 @@ static void *mmap_reserve(size_t size, int fd)
 static void *mmap_activate(void *ptr, size_t size, int fd,
                            uint32_t qemu_map_flags, off_t map_offset)
 {
+    const bool noreserve = qemu_map_flags & QEMU_MAP_NORESERVE;
     const bool readonly = qemu_map_flags & QEMU_MAP_READONLY;
     const bool shared = qemu_map_flags & QEMU_MAP_SHARED;
     const bool sync = qemu_map_flags & QEMU_MAP_SYNC;
@@ -129,6 +131,11 @@ static void *mmap_activate(void *ptr, size_t size, int fd,
     int flags = MAP_FIXED;
     void *activated_ptr;
 
+    if (noreserve) {
+        error_report("Skipping reservation of swap space is not supported");
+        return MAP_FAILED;
+    }
+
     flags |= fd == -1 ? MAP_ANONYMOUS : 0;
     flags |= shared ? MAP_SHARED : MAP_PRIVATE;
     if (shared && sync) {
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index c12108c6cb..5bbdebae0e 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -227,9 +227,11 @@ void *qemu_memalign(size_t alignment, size_t size)
 }
 
 /* alloc shared memory pages */
-void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared,
+                          bool noreserve)
 {
-    const uint32_t qemu_map_flags = shared ? QEMU_MAP_SHARED : 0;
+    const uint32_t qemu_map_flags = (shared ? QEMU_MAP_SHARED : 0) |
+                                    (noreserve ? QEMU_MAP_NORESERVE : 0);
     size_t align = QEMU_VMALLOC_ALIGN;
     void *ptr = qemu_ram_mmap(-1, size, align, qemu_map_flags, 0);
 
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index f68b8012bb..8cafe44179 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -39,6 +39,7 @@
 #include "trace.h"
 #include "qemu/sockets.h"
 #include "qemu/cutils.h"
+#include "qemu/error-report.h"
 #include <malloc.h>
 
 /* this must come after including "trace.h" */
@@ -77,10 +78,20 @@ static int get_allocation_granularity(void)
     return system_info.dwAllocationGranularity;
 }
 
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
+                          bool noreserve)
 {
     void *ptr;
 
+    if (noreserve) {
+        /*
+         * We need a MEM_COMMIT before accessing any memory in a MEM_RESERVE
+         * area; we cannot easily mimic POSIX MAP_NORESERVE semantics.
+         */
+        error_report("Skipping reservation of swap space is not supported.");
+        return NULL;
+    }
+
     ptr = VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE);
     trace_qemu_anon_ram_alloc(size, ptr);
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (7 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 08/15] memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap() David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-05-04 10:09   ` Daniel P. Berrangé
  2021-04-28 13:37 ` [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property David Hildenbrand
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, David Hildenbrand, Richard Henderson,
	Dr. David Alan Gilbert, Peter Xu, Greg Kurz, Stefan Hajnoczi,
	Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
effect on most shared mappings - except for hugetlbfs and anonymous memory.

Linux man page:
  "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
  space is reserved, one has the guarantee that it is possible to modify
  the mapping. When swap space is not reserved one might get SIGSEGV
  upon a write if no physical memory is available. See also the discussion
  of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
  2.6, this flag had effect only for private writable mappings."

Note that the "guarantee" part is wrong with memory overcommit in Linux.

Also, in Linux hugetlbfs is treated differently - we configure reservation
of huge pages from the pool, not reservation of swap space (huge pages
cannot be swapped).

The rough behavior is [1]:
a) !Hugetlbfs:

  1) Without MAP_NORESERVE *or* with memory overcommit under Linux
     disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
     accounting/reservation happens:
      For a file backed map
       SHARED or READ-only - 0 cost (the file is the map not swap)
       PRIVATE WRITABLE - size of mapping per instance

      For an anonymous or /dev/zero map
       SHARED   - size of mapping
       PRIVATE READ-only - 0 cost (but of little use)
       PRIVATE WRITABLE - size of mapping per instance

  2) With MAP_NORESERVE, no accounting/reservation happens.

b) Hugetlbfs:

  1) Without MAP_NORESERVE, huge pages are reserved.

  2) With MAP_NORESERVE, no huge pages are reserved.

Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
to configure it for !hugetlbfs globally; this toggle now allows
configuring it more fine-grained, not for the whole system.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

[1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/qemu/osdep.h |  3 ++
 softmmu/physmem.c    |  1 +
 util/mmap-alloc.c    | 69 ++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index af65b36698..0a7384d15c 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -195,6 +195,9 @@ extern "C" {
 #ifndef MAP_FIXED_NOREPLACE
 #define MAP_FIXED_NOREPLACE 0
 #endif
+#ifndef MAP_NORESERVE
+#define MAP_NORESERVE 0
+#endif
 #ifndef ENOMEDIUM
 #define ENOMEDIUM ENODEV
 #endif
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 1efb1d5193..ccc5985324 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2230,6 +2230,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 flags = MAP_FIXED;
                 flags |= block->flags & RAM_SHARED ?
                          MAP_SHARED : MAP_PRIVATE;
+                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
                 if (block->fd >= 0) {
                     area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
                                 flags, block->fd, offset);
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index d0cf4aaee5..838e286ce5 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
+#include "qemu/cutils.h"
 #include "qemu/error-report.h"
 
 #define HUGETLBFS_MAGIC       0x958458f6
@@ -83,6 +84,70 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
     return qemu_real_host_page_size;
 }
 
+#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory"
+static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags)
+{
+#if defined(__linux__)
+    const bool readonly = qemu_map_flags & QEMU_MAP_READONLY;
+    const bool shared = qemu_map_flags & QEMU_MAP_SHARED;
+    gchar *content = NULL;
+    const char *endptr;
+    unsigned int tmp;
+
+    /*
+     * hugeltb accounting is different than ordinary swap reservation:
+     * a) Hugetlb pages from the pool are reserved for both private and
+     *    shared mappings. For shared mappings, all mappers have to specify
+     *    MAP_NORESERVE.
+     * b) MAP_NORESERVE is not affected by /proc/sys/vm/overcommit_memory.
+     */
+    if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) {
+        return true;
+    }
+
+    /*
+     * Accountable mappings in the kernel that can be affected by MAP_NORESEVE
+     * are private writable mappings (see mm/mmap.c:accountable_mapping() in
+     * Linux). For all shared or readonly mappings, MAP_NORESERVE is always
+     * implicitly active -- no reservation; this includes shmem. The only
+     * exception is shared anonymous memory, it is accounted like private
+     * anonymous memory.
+     */
+    if (readonly || (shared && fd >= 0)) {
+        return true;
+    }
+
+    /*
+     * MAP_NORESERVE is globally ignored for applicable !hugetlb mappings when
+     * memory overcommit is set to "never". Sparse memory regions aren't really
+     * possible in this system configuration.
+     *
+     * Bail out now instead of silently committing way more memory than
+     * currently desired by the user.
+     */
+    if (g_file_get_contents(OVERCOMMIT_MEMORY_PATH, &content, NULL, NULL) &&
+        !qemu_strtoui(content, &endptr, 0, &tmp) &&
+        (!endptr || *endptr == '\n')) {
+        if (tmp == 2) {
+            error_report("Skipping reservation of swap space is not supported:"
+                         " \"" OVERCOMMIT_MEMORY_PATH "\" is \"2\"");
+            return false;
+        }
+        return true;
+    }
+    /* this interface has been around since Linux 2.6 */
+    error_report("Skipping reservation of swap space is not supported:"
+                 " Could not read: \"" OVERCOMMIT_MEMORY_PATH "\"");
+    return false;
+#endif
+    /*
+     * E.g., FreeBSD used to define MAP_NORESERVE, never implemented it,
+     * and removed it a while ago.
+     */
+    error_report("Skipping reservation of swap space is not supported");
+    return false;
+}
+
 /*
  * Reserve a new memory region of the requested size to be used for mapping
  * from the given fd (if any).
@@ -131,13 +196,13 @@ static void *mmap_activate(void *ptr, size_t size, int fd,
     int flags = MAP_FIXED;
     void *activated_ptr;
 
-    if (noreserve) {
-        error_report("Skipping reservation of swap space is not supported");
+    if (noreserve && !map_noreserve_effective(fd, qemu_map_flags)) {
         return MAP_FAILED;
     }
 
     flags |= fd == -1 ? MAP_ANONYMOUS : 0;
     flags |= shared ? MAP_SHARED : MAP_PRIVATE;
+    flags |= noreserve ? MAP_NORESERVE : 0;
     if (shared && sync) {
         map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
     }
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (8 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-05-04  9:58   ` Paolo Bonzini
  2021-05-04 10:12   ` Daniel P. Berrangé
  2021-04-28 13:37 ` [PATCH v7 11/15] qmp: Clarify memory backend properties returned via query-memdev David Hildenbrand
                   ` (4 subsequent siblings)
  14 siblings, 2 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, Markus Armbruster, David Hildenbrand,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's provide a way to control the use of RAM_NORESERVE via memory
backends using the "reserve" property which defaults to true (old
behavior).

Only Linux currently supports clearing the flag (and support is checked at
runtime, depending on the setting of "/proc/sys/vm/overcommit_memory").
Windows and other POSIX systems will bail out with "reserve=false".

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM. This essentially allows
avoiding to set "/proc/sys/vm/overcommit_memory == 0") when using
virtio-mem and also supporting hugetlbfs in the future.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem-file.c  | 11 ++++++-----
 backends/hostmem-memfd.c |  1 +
 backends/hostmem-ram.c   |  1 +
 backends/hostmem.c       | 32 ++++++++++++++++++++++++++++++++
 include/sysemu/hostmem.h |  2 +-
 qapi/qom.json            | 10 ++++++++++
 6 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index b683da9daf..9d550e53d4 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -40,6 +40,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
                object_get_typename(OBJECT(backend)));
 #else
     HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
+    uint32_t ram_flags;
     gchar *name;
 
     if (!backend->size) {
@@ -52,11 +53,11 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     name = host_memory_backend_get_name(backend);
-    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
-                                     name,
-                                     backend->size, fb->align,
-                                     (backend->share ? RAM_SHARED : 0) |
-                                     (fb->is_pmem ? RAM_PMEM : 0),
+    ram_flags = backend->share ? RAM_SHARED : 0;
+    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
+    ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
+    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
+                                     backend->size, fb->align, ram_flags,
                                      fb->mem_path, fb->readonly, errp);
     g_free(name);
 #endif
diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 93b5d1a4cf..f3436b623d 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 
     name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
+    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
                                    backend->size, ram_flags, fd, 0, errp);
     g_free(name);
diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index 741e701062..b8e55cdbd0 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -29,6 +29,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 
     name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
+    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
                                            backend->size, ram_flags, errp);
     g_free(name);
diff --git a/backends/hostmem.c b/backends/hostmem.c
index c6c1ff5b99..58fdc1b658 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -217,6 +217,11 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
     Error *local_err = NULL;
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
 
+    if (!backend->reserve && value) {
+        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
+        return;
+    }
+
     if (!host_memory_backend_mr_inited(backend)) {
         backend->prealloc = value;
         return;
@@ -268,6 +273,7 @@ static void host_memory_backend_init(Object *obj)
     /* TODO: convert access to globals to compat properties */
     backend->merge = machine_mem_merge(machine);
     backend->dump = machine_dump_guest_core(machine);
+    backend->reserve = true;
     backend->prealloc_threads = 1;
 }
 
@@ -426,6 +432,28 @@ static void host_memory_backend_set_share(Object *o, bool value, Error **errp)
     backend->share = value;
 }
 
+static bool host_memory_backend_get_reserve(Object *o, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+    return backend->reserve;
+}
+
+static void host_memory_backend_set_reserve(Object *o, bool value, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+    if (host_memory_backend_mr_inited(backend)) {
+        error_setg(errp, "cannot change property value");
+        return;
+    }
+    if (backend->prealloc && !value) {
+        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
+        return;
+    }
+    backend->reserve = value;
+}
+
 static bool
 host_memory_backend_get_use_canonical_path(Object *obj, Error **errp)
 {
@@ -494,6 +522,10 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
         host_memory_backend_get_share, host_memory_backend_set_share);
     object_class_property_set_description(oc, "share",
         "Mark the memory as private to QEMU or shared");
+    object_class_property_add_bool(oc, "reserve",
+        host_memory_backend_get_reserve, host_memory_backend_set_reserve);
+    object_class_property_set_description(oc, "reserve",
+        "Reserve swap space (or huge pages) if applicable");
     /*
      * Do not delete/rename option. This option must be considered stable
      * (as if it didn't have the 'x-' prefix including deprecation period) as
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index df5644723a..9ff5c16963 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -64,7 +64,7 @@ struct HostMemoryBackend {
     /* protected */
     uint64_t size;
     bool merge, dump, use_canonical_path;
-    bool prealloc, is_mapped, share;
+    bool prealloc, is_mapped, share, reserve;
     uint32_t prealloc_threads;
     DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
     HostMemPolicy policy;
diff --git a/qapi/qom.json b/qapi/qom.json
index cd0e76d564..4fa3137aab 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -545,6 +545,9 @@
 # @share: if false, the memory is private to QEMU; if true, it is shared
 #         (default: false)
 #
+# @reserve: if true, reserve swap space (or huge pages) if applicable
+#           default: true)
+#
 # @size: size of the memory region in bytes
 #
 # @x-use-canonical-path-for-ramblock-id: if true, the canoncial path is used
@@ -556,6 +559,12 @@
 #                                        false generally, but true for machine
 #                                        types <= 4.0)
 #
+# Note: prealloc=true and reserve=false cannot be set at the same time. With
+#       reserve=true, the behavior depends on the operating system: for example,
+#       Linux will not reserve swap space for shared file mappings --
+#       "not applicable". In contrast, reserve=false will bail out if it cannot
+#       be configured accordingly.
+#
 # Since: 2.1
 ##
 { 'struct': 'MemoryBackendProperties',
@@ -566,6 +575,7 @@
             '*prealloc': 'bool',
             '*prealloc-threads': 'uint32',
             '*share': 'bool',
+            '*reserve': 'bool',
             'size': 'size',
             '*x-use-canonical-path-for-ramblock-id': 'bool' } }
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 11/15] qmp: Clarify memory backend properties returned via query-memdev
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (9 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 12/15] qmp: Include "share" property of memory backends David Hildenbrand
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, Markus Armbruster, David Hildenbrand,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

We return information on the currently configured memory backends and
don't configure them, so decribe what the currently set properties
express.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Suggested-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Cc: Eric Blake <eblake@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 qapi/machine.json | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/qapi/machine.json b/qapi/machine.json
index 6e90d463fc..758b901185 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -790,11 +790,11 @@
 #
 # @size: memory backend size
 #
-# @merge: enables or disables memory merge support
+# @merge: whether memory merge support is enabled
 #
-# @dump: includes memory backend's memory in a core dump or not
+# @dump: whether memory backend's memory is included in a core dump
 #
-# @prealloc: enables or disables memory preallocation
+# @prealloc: whether memory was preallocated
 #
 # @host-nodes: host nodes for its memory policy
 #
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 12/15] qmp: Include "share" property of memory backends
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (10 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 11/15] qmp: Clarify memory backend properties returned via query-memdev David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 13/15] hmp: Print "share" property of memory backends with "info memdev" David Hildenbrand
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, Markus Armbruster, David Hildenbrand,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's include the property, which can be helpful when debugging,
for example, to spot misuse of MAP_PRIVATE which can result in some ugly
corner cases (e.g., double-memory consumption on shmem).

Use the same description we also use for describing the property.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Cc: Eric Blake <eblake@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/core/machine-qmp-cmds.c | 1 +
 qapi/machine.json          | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c
index 68a942595a..d41db5b93b 100644
--- a/hw/core/machine-qmp-cmds.c
+++ b/hw/core/machine-qmp-cmds.c
@@ -174,6 +174,7 @@ static int query_memdev(Object *obj, void *opaque)
         m->merge = object_property_get_bool(obj, "merge", &error_abort);
         m->dump = object_property_get_bool(obj, "dump", &error_abort);
         m->prealloc = object_property_get_bool(obj, "prealloc", &error_abort);
+        m->share = object_property_get_bool(obj, "share", &error_abort);
         m->policy = object_property_get_enum(obj, "policy", "HostMemPolicy",
                                              &error_abort);
         host_nodes = object_property_get_qobject(obj,
diff --git a/qapi/machine.json b/qapi/machine.json
index 758b901185..32650bfe9e 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -796,6 +796,8 @@
 #
 # @prealloc: whether memory was preallocated
 #
+# @share: whether memory is private to QEMU or shared (since 6.1)
+#
 # @host-nodes: host nodes for its memory policy
 #
 # @policy: memory policy of memory backend
@@ -809,6 +811,7 @@
     'merge':      'bool',
     'dump':       'bool',
     'prealloc':   'bool',
+    'share':      'bool',
     'host-nodes': ['uint16'],
     'policy':     'HostMemPolicy' }}
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 13/15] hmp: Print "share" property of memory backends with "info memdev"
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (11 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 12/15] qmp: Include "share" property of memory backends David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 14/15] qmp: Include "reserve" property of memory backends David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 15/15] hmp: Print "reserve" property of memory backends with "info memdev" David Hildenbrand
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, Markus Armbruster, David Hildenbrand,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's print the property.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/core/machine-hmp-cmds.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c
index 58248cffa3..004a92b3d6 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -110,6 +110,8 @@ void hmp_info_memdev(Monitor *mon, const QDict *qdict)
                        m->value->dump ? "true" : "false");
         monitor_printf(mon, "  prealloc: %s\n",
                        m->value->prealloc ? "true" : "false");
+        monitor_printf(mon, "  share: %s\n",
+                       m->value->share ? "true" : "false");
         monitor_printf(mon, "  policy: %s\n",
                        HostMemPolicy_str(m->value->policy));
         visit_complete(v, &str);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 14/15] qmp: Include "reserve" property of memory backends
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (12 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 13/15] hmp: Print "share" property of memory backends with "info memdev" David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  2021-04-28 13:37 ` [PATCH v7 15/15] hmp: Print "reserve" property of memory backends with "info memdev" David Hildenbrand
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, Markus Armbruster, David Hildenbrand,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's include the new property.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Cc: Eric Blake <eblake@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/core/machine-qmp-cmds.c | 1 +
 qapi/machine.json          | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c
index d41db5b93b..2d135ecdd0 100644
--- a/hw/core/machine-qmp-cmds.c
+++ b/hw/core/machine-qmp-cmds.c
@@ -175,6 +175,7 @@ static int query_memdev(Object *obj, void *opaque)
         m->dump = object_property_get_bool(obj, "dump", &error_abort);
         m->prealloc = object_property_get_bool(obj, "prealloc", &error_abort);
         m->share = object_property_get_bool(obj, "share", &error_abort);
+        m->reserve = object_property_get_bool(obj, "reserve", &error_abort);
         m->policy = object_property_get_enum(obj, "policy", "HostMemPolicy",
                                              &error_abort);
         host_nodes = object_property_get_qobject(obj,
diff --git a/qapi/machine.json b/qapi/machine.json
index 32650bfe9e..ea68f1a083 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -798,6 +798,12 @@
 #
 # @share: whether memory is private to QEMU or shared (since 6.1)
 #
+# @reserve: whether swap space (or huge pages) was reserved if applicable.
+#           This corresponds to the user configuration and not the actual
+#           behavior implemented in the OS to perform the reservation.
+#           For example, Linux will never reserve swap space for shared
+#           file mappings. (since 6.1)
+#
 # @host-nodes: host nodes for its memory policy
 #
 # @policy: memory policy of memory backend
@@ -812,6 +818,7 @@
     'dump':       'bool',
     'prealloc':   'bool',
     'share':      'bool',
+    'reserve':    'bool',
     'host-nodes': ['uint16'],
     'policy':     'HostMemPolicy' }}
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 15/15] hmp: Print "reserve" property of memory backends with "info memdev"
  2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (13 preceding siblings ...)
  2021-04-28 13:37 ` [PATCH v7 14/15] qmp: Include "reserve" property of memory backends David Hildenbrand
@ 2021-04-28 13:37 ` David Hildenbrand
  14 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-04-28 13:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Eduardo Habkost,
	Michael S. Tsirkin, Markus Armbruster, David Hildenbrand,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Paolo Bonzini, Nitesh Lal,
	Philippe Mathieu-Daudé

Let's print the new property.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/core/machine-hmp-cmds.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c
index 004a92b3d6..9bedc77bb4 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -112,6 +112,8 @@ void hmp_info_memdev(Monitor *mon, const QDict *qdict)
                        m->value->prealloc ? "true" : "false");
         monitor_printf(mon, "  share: %s\n",
                        m->value->share ? "true" : "false");
+        monitor_printf(mon, "  reserve: %s\n",
+                       m->value->reserve ? "true" : "false");
         monitor_printf(mon, "  policy: %s\n",
                        HostMemPolicy_str(m->value->policy));
         visit_complete(v, &str);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property
  2021-04-28 13:37 ` [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property David Hildenbrand
@ 2021-05-04  9:58   ` Paolo Bonzini
  2021-05-06  9:25     ` David Hildenbrand
  2021-05-04 10:12   ` Daniel P. Berrangé
  1 sibling, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2021-05-04  9:58 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Markus Armbruster, Peter Xu,
	Dr. David Alan Gilbert, Greg Kurz, Stefan Hajnoczi,
	Murilo Opsfelder Araujo, Igor Mammedov, Nitesh Lal,
	Philippe Mathieu-Daudé

On 28/04/21 15:37, David Hildenbrand wrote:
> @@ -545,6 +545,9 @@
>   # @share: if false, the memory is private to QEMU; if true, it is shared
>   #         (default: false)
>   #
> +# @reserve: if true, reserve swap space (or huge pages) if applicable
> +#           default: true)

Missing open parenthesis and "since 6.1", otherwise the whole series 
looks good, thanks!

Paolo

> +#



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
  2021-04-28 13:37 ` [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux David Hildenbrand
@ 2021-05-04 10:09   ` Daniel P. Berrangé
  2021-05-04 10:21     ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel P. Berrangé @ 2021-05-04 10:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, qemu-devel, Peter Xu, Dr. David Alan Gilbert,
	Greg Kurz, Paolo Bonzini, Stefan Hajnoczi,
	Murilo Opsfelder Araujo, Igor Mammedov, Nitesh Lal,
	Philippe Mathieu-Daudé

On Wed, Apr 28, 2021 at 03:37:48PM +0200, David Hildenbrand wrote:
> Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
> effect on most shared mappings - except for hugetlbfs and anonymous memory.
> 
> Linux man page:
>   "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
>   space is reserved, one has the guarantee that it is possible to modify
>   the mapping. When swap space is not reserved one might get SIGSEGV
>   upon a write if no physical memory is available. See also the discussion
>   of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
>   2.6, this flag had effect only for private writable mappings."
> 
> Note that the "guarantee" part is wrong with memory overcommit in Linux.
> 
> Also, in Linux hugetlbfs is treated differently - we configure reservation
> of huge pages from the pool, not reservation of swap space (huge pages
> cannot be swapped).
> 
> The rough behavior is [1]:
> a) !Hugetlbfs:
> 
>   1) Without MAP_NORESERVE *or* with memory overcommit under Linux
>      disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
>      accounting/reservation happens:
>       For a file backed map
>        SHARED or READ-only - 0 cost (the file is the map not swap)
>        PRIVATE WRITABLE - size of mapping per instance
> 
>       For an anonymous or /dev/zero map
>        SHARED   - size of mapping
>        PRIVATE READ-only - 0 cost (but of little use)
>        PRIVATE WRITABLE - size of mapping per instance
> 
>   2) With MAP_NORESERVE, no accounting/reservation happens.
> 
> b) Hugetlbfs:
> 
>   1) Without MAP_NORESERVE, huge pages are reserved.
> 
>   2) With MAP_NORESERVE, no huge pages are reserved.
> 
> Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
> to configure it for !hugetlbfs globally; this toggle now allows
> configuring it more fine-grained, not for the whole system.
> 
> The target use case is virtio-mem, which dynamically exposes memory
> inside a large, sparse memory area to the VM.

Can you explain this use case in more real world terms, as I'm not
understanding what a mgmt app would actually do with this in
practice ?




Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property
  2021-04-28 13:37 ` [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property David Hildenbrand
  2021-05-04  9:58   ` Paolo Bonzini
@ 2021-05-04 10:12   ` Daniel P. Berrangé
  2021-05-04 11:08     ` David Hildenbrand
  1 sibling, 1 reply; 29+ messages in thread
From: Daniel P. Berrangé @ 2021-05-04 10:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Markus Armbruster, Peter Xu, qemu-devel,
	Greg Kurz, Paolo Bonzini, Stefan Hajnoczi,
	Murilo Opsfelder Araujo, Igor Mammedov, Nitesh Lal,
	Philippe Mathieu-Daudé,
	Dr. David Alan Gilbert

On Wed, Apr 28, 2021 at 03:37:49PM +0200, David Hildenbrand wrote:
> Let's provide a way to control the use of RAM_NORESERVE via memory
> backends using the "reserve" property which defaults to true (old
> behavior).
> 
> Only Linux currently supports clearing the flag (and support is checked at
> runtime, depending on the setting of "/proc/sys/vm/overcommit_memory").
> Windows and other POSIX systems will bail out with "reserve=false".
> 
> The target use case is virtio-mem, which dynamically exposes memory
> inside a large, sparse memory area to the VM. This essentially allows
> avoiding to set "/proc/sys/vm/overcommit_memory == 0") when using
> virtio-mem and also supporting hugetlbfs in the future.
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
> Reviewed-by: Markus Armbruster <armbru@redhat.com>
> Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  backends/hostmem-file.c  | 11 ++++++-----
>  backends/hostmem-memfd.c |  1 +
>  backends/hostmem-ram.c   |  1 +
>  backends/hostmem.c       | 32 ++++++++++++++++++++++++++++++++
>  include/sysemu/hostmem.h |  2 +-
>  qapi/qom.json            | 10 ++++++++++
>  6 files changed, 51 insertions(+), 6 deletions(-)
> 
> diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
> index b683da9daf..9d550e53d4 100644
> --- a/backends/hostmem-file.c
> +++ b/backends/hostmem-file.c
> @@ -40,6 +40,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>                 object_get_typename(OBJECT(backend)));
>  #else
>      HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
> +    uint32_t ram_flags;
>      gchar *name;
>  
>      if (!backend->size) {
> @@ -52,11 +53,11 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>      }
>  
>      name = host_memory_backend_get_name(backend);
> -    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
> -                                     name,
> -                                     backend->size, fb->align,
> -                                     (backend->share ? RAM_SHARED : 0) |
> -                                     (fb->is_pmem ? RAM_PMEM : 0),
> +    ram_flags = backend->share ? RAM_SHARED : 0;
> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
> +    ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
> +    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
> +                                     backend->size, fb->align, ram_flags,
>                                       fb->mem_path, fb->readonly, errp);
>      g_free(name);
>  #endif
> diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
> index 93b5d1a4cf..f3436b623d 100644
> --- a/backends/hostmem-memfd.c
> +++ b/backends/hostmem-memfd.c
> @@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>  
>      name = host_memory_backend_get_name(backend);
>      ram_flags = backend->share ? RAM_SHARED : 0;
> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>      memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
>                                     backend->size, ram_flags, fd, 0, errp);
>      g_free(name);
> diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
> index 741e701062..b8e55cdbd0 100644
> --- a/backends/hostmem-ram.c
> +++ b/backends/hostmem-ram.c
> @@ -29,6 +29,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>  
>      name = host_memory_backend_get_name(backend);
>      ram_flags = backend->share ? RAM_SHARED : 0;
> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>      memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
>                                             backend->size, ram_flags, errp);
>      g_free(name);
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index c6c1ff5b99..58fdc1b658 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -217,6 +217,11 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
>      Error *local_err = NULL;
>      HostMemoryBackend *backend = MEMORY_BACKEND(obj);
>  
> +    if (!backend->reserve && value) {
> +        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
> +        return;
> +    }
> +
>      if (!host_memory_backend_mr_inited(backend)) {
>          backend->prealloc = value;
>          return;
> @@ -268,6 +273,7 @@ static void host_memory_backend_init(Object *obj)
>      /* TODO: convert access to globals to compat properties */
>      backend->merge = machine_mem_merge(machine);
>      backend->dump = machine_dump_guest_core(machine);
> +    backend->reserve = true;
>      backend->prealloc_threads = 1;
>  }
>  
> @@ -426,6 +432,28 @@ static void host_memory_backend_set_share(Object *o, bool value, Error **errp)
>      backend->share = value;
>  }
>  
> +static bool host_memory_backend_get_reserve(Object *o, Error **errp)
> +{
> +    HostMemoryBackend *backend = MEMORY_BACKEND(o);
> +
> +    return backend->reserve;
> +}
> +
> +static void host_memory_backend_set_reserve(Object *o, bool value, Error **errp)
> +{
> +    HostMemoryBackend *backend = MEMORY_BACKEND(o);
> +
> +    if (host_memory_backend_mr_inited(backend)) {
> +        error_setg(errp, "cannot change property value");
> +        return;
> +    }
> +    if (backend->prealloc && !value) {
> +        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
> +        return;
> +    }
> +    backend->reserve = value;
> +}
> +
>  static bool
>  host_memory_backend_get_use_canonical_path(Object *obj, Error **errp)
>  {
> @@ -494,6 +522,10 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
>          host_memory_backend_get_share, host_memory_backend_set_share);
>      object_class_property_set_description(oc, "share",
>          "Mark the memory as private to QEMU or shared");
> +    object_class_property_add_bool(oc, "reserve",
> +        host_memory_backend_get_reserve, host_memory_backend_set_reserve);
> +    object_class_property_set_description(oc, "reserve",
> +        "Reserve swap space (or huge pages) if applicable");
>      /*
>       * Do not delete/rename option. This option must be considered stable
>       * (as if it didn't have the 'x-' prefix including deprecation period) as
> diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
> index df5644723a..9ff5c16963 100644
> --- a/include/sysemu/hostmem.h
> +++ b/include/sysemu/hostmem.h
> @@ -64,7 +64,7 @@ struct HostMemoryBackend {
>      /* protected */
>      uint64_t size;
>      bool merge, dump, use_canonical_path;
> -    bool prealloc, is_mapped, share;
> +    bool prealloc, is_mapped, share, reserve;
>      uint32_t prealloc_threads;
>      DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
>      HostMemPolicy policy;
> diff --git a/qapi/qom.json b/qapi/qom.json
> index cd0e76d564..4fa3137aab 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -545,6 +545,9 @@
>  # @share: if false, the memory is private to QEMU; if true, it is shared
>  #         (default: false)
>  #
> +# @reserve: if true, reserve swap space (or huge pages) if applicable
> +#           default: true)
> +#
>  # @size: size of the memory region in bytes
>  #
>  # @x-use-canonical-path-for-ramblock-id: if true, the canoncial path is used
> @@ -556,6 +559,12 @@
>  #                                        false generally, but true for machine
>  #                                        types <= 4.0)
>  #
> +# Note: prealloc=true and reserve=false cannot be set at the same time. With
> +#       reserve=true, the behavior depends on the operating system: for example,
> +#       Linux will not reserve swap space for shared file mappings --
> +#       "not applicable". In contrast, reserve=false will bail out if it cannot
> +#       be configured accordingly.
> +#
>  # Since: 2.1
>  ##
>  { 'struct': 'MemoryBackendProperties',
> @@ -566,6 +575,7 @@
>              '*prealloc': 'bool',
>              '*prealloc-threads': 'uint32',
>              '*share': 'bool',
> +            '*reserve': 'bool',
>              'size': 'size',
>              '*x-use-canonical-path-for-ramblock-id': 'bool' } }

IIUC from the previous patch in the series, 'reserve' is only implemented
on Linux.  If we make this QAPI prop dependant on CONFIG_LINUX, then mgmt
apps will do the right thing to detect what platform(s) it works on.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
  2021-05-04 10:09   ` Daniel P. Berrangé
@ 2021-05-04 10:21     ` David Hildenbrand
  2021-05-04 10:32       ` Daniel P. Berrangé
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2021-05-04 10:21 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, qemu-devel, Peter Xu, Dr. David Alan Gilbert,
	Greg Kurz, Paolo Bonzini, Stefan Hajnoczi,
	Murilo Opsfelder Araujo, Igor Mammedov, Nitesh Lal,
	Philippe Mathieu-Daudé

On 04.05.21 12:09, Daniel P. Berrangé wrote:
> On Wed, Apr 28, 2021 at 03:37:48PM +0200, David Hildenbrand wrote:
>> Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
>> effect on most shared mappings - except for hugetlbfs and anonymous memory.
>>
>> Linux man page:
>>    "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
>>    space is reserved, one has the guarantee that it is possible to modify
>>    the mapping. When swap space is not reserved one might get SIGSEGV
>>    upon a write if no physical memory is available. See also the discussion
>>    of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
>>    2.6, this flag had effect only for private writable mappings."
>>
>> Note that the "guarantee" part is wrong with memory overcommit in Linux.
>>
>> Also, in Linux hugetlbfs is treated differently - we configure reservation
>> of huge pages from the pool, not reservation of swap space (huge pages
>> cannot be swapped).
>>
>> The rough behavior is [1]:
>> a) !Hugetlbfs:
>>
>>    1) Without MAP_NORESERVE *or* with memory overcommit under Linux
>>       disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
>>       accounting/reservation happens:
>>        For a file backed map
>>         SHARED or READ-only - 0 cost (the file is the map not swap)
>>         PRIVATE WRITABLE - size of mapping per instance
>>
>>        For an anonymous or /dev/zero map
>>         SHARED   - size of mapping
>>         PRIVATE READ-only - 0 cost (but of little use)
>>         PRIVATE WRITABLE - size of mapping per instance
>>
>>    2) With MAP_NORESERVE, no accounting/reservation happens.
>>
>> b) Hugetlbfs:
>>
>>    1) Without MAP_NORESERVE, huge pages are reserved.
>>
>>    2) With MAP_NORESERVE, no huge pages are reserved.
>>
>> Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
>> to configure it for !hugetlbfs globally; this toggle now allows
>> configuring it more fine-grained, not for the whole system.
>>
>> The target use case is virtio-mem, which dynamically exposes memory
>> inside a large, sparse memory area to the VM.
> 
> Can you explain this use case in more real world terms, as I'm not
> understanding what a mgmt app would actually do with this in
> practice ?

Let's consider huge pages for simplicity. Assume you have 128 free huge 
pages in your hypervisor that you want to dynamically assign to VMs.

Further assume you have two VMs running. A workflow could look like

1. Assign all huge pages to VM 0
2. Reassign 64 huge pages to VM 1
3. Reassign another 32 huge pages to VM 1
4. Reasssign 16 huge pages to VM 0
5. ...

Basically what we're used to doing with "ordinary" memory.

For that to work with virtio-mem, you'll have to disable reservation of 
huge pages for the virtio-mem managed memory region.

(prealloction of huge pages in virtio-mem to protect from user mistakes 
is a separate work item)

reserve=off will be the default for virtio-mem, and actual 
reservation/preallcoation will be done within virtio-mem. There could be 
use for "reserve=off" for virtio-balloon use cases as well, but I'd like 
to exclude that from the discussion for now.

Hope that answers your question, thanks.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
  2021-05-04 10:21     ` David Hildenbrand
@ 2021-05-04 10:32       ` Daniel P. Berrangé
  2021-05-04 11:04         ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel P. Berrangé @ 2021-05-04 10:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, qemu-devel, Peter Xu, Dr. David Alan Gilbert,
	Greg Kurz, Paolo Bonzini, Stefan Hajnoczi,
	Murilo Opsfelder Araujo, Igor Mammedov, Nitesh Lal,
	Philippe Mathieu-Daudé

On Tue, May 04, 2021 at 12:21:25PM +0200, David Hildenbrand wrote:
> On 04.05.21 12:09, Daniel P. Berrangé wrote:
> > On Wed, Apr 28, 2021 at 03:37:48PM +0200, David Hildenbrand wrote:
> > > Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
> > > effect on most shared mappings - except for hugetlbfs and anonymous memory.
> > > 
> > > Linux man page:
> > >    "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
> > >    space is reserved, one has the guarantee that it is possible to modify
> > >    the mapping. When swap space is not reserved one might get SIGSEGV
> > >    upon a write if no physical memory is available. See also the discussion
> > >    of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
> > >    2.6, this flag had effect only for private writable mappings."
> > > 
> > > Note that the "guarantee" part is wrong with memory overcommit in Linux.
> > > 
> > > Also, in Linux hugetlbfs is treated differently - we configure reservation
> > > of huge pages from the pool, not reservation of swap space (huge pages
> > > cannot be swapped).
> > > 
> > > The rough behavior is [1]:
> > > a) !Hugetlbfs:
> > > 
> > >    1) Without MAP_NORESERVE *or* with memory overcommit under Linux
> > >       disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
> > >       accounting/reservation happens:
> > >        For a file backed map
> > >         SHARED or READ-only - 0 cost (the file is the map not swap)
> > >         PRIVATE WRITABLE - size of mapping per instance
> > > 
> > >        For an anonymous or /dev/zero map
> > >         SHARED   - size of mapping
> > >         PRIVATE READ-only - 0 cost (but of little use)
> > >         PRIVATE WRITABLE - size of mapping per instance
> > > 
> > >    2) With MAP_NORESERVE, no accounting/reservation happens.
> > > 
> > > b) Hugetlbfs:
> > > 
> > >    1) Without MAP_NORESERVE, huge pages are reserved.
> > > 
> > >    2) With MAP_NORESERVE, no huge pages are reserved.
> > > 
> > > Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
> > > to configure it for !hugetlbfs globally; this toggle now allows
> > > configuring it more fine-grained, not for the whole system.
> > > 
> > > The target use case is virtio-mem, which dynamically exposes memory
> > > inside a large, sparse memory area to the VM.
> > 
> > Can you explain this use case in more real world terms, as I'm not
> > understanding what a mgmt app would actually do with this in
> > practice ?
> 
> Let's consider huge pages for simplicity. Assume you have 128 free huge
> pages in your hypervisor that you want to dynamically assign to VMs.
> 
> Further assume you have two VMs running. A workflow could look like
> 
> 1. Assign all huge pages to VM 0
> 2. Reassign 64 huge pages to VM 1
> 3. Reassign another 32 huge pages to VM 1
> 4. Reasssign 16 huge pages to VM 0
> 5. ...
> 
> Basically what we're used to doing with "ordinary" memory.

What does this look like in terms of the memory backend configuration
when you boot VM 0 and VM 1 ?

Are you saying that we boot both VMs with

   -object hostmem-memfd,size=128G,hugetlb=yes,hugetlbsize=1G,reserve=off

and then we have another property set on 'virtio-mem' to tell it
how much/little of that 128 G, to actually give to the guest ?
How do we change that at runtime ?


> For that to work with virtio-mem, you'll have to disable reservation of huge
> pages for the virtio-mem managed memory region.
> 
> (prealloction of huge pages in virtio-mem to protect from user mistakes is a
> separate work item)
> 
> reserve=off will be the default for virtio-mem, and actual
> reservation/preallcoation will be done within virtio-mem. There could be use
> for "reserve=off" for virtio-balloon use cases as well, but I'd like to
> exclude that from the discussion for now.

The hostmem backend defaults are indepdant of frontend usage, so when you
say reserve=off is the default for virtio-mem, are you expecting the mgmt
app like libvirt to specify that ?

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
  2021-05-04 10:32       ` Daniel P. Berrangé
@ 2021-05-04 11:04         ` David Hildenbrand
  2021-05-04 11:14           ` Daniel P. Berrangé
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2021-05-04 11:04 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Michal Privoznik, Richard Henderson, qemu-devel, Peter Xu,
	Dr. David Alan Gilbert, Greg Kurz, Paolo Bonzini,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Igor Mammedov,
	Nitesh Lal, Philippe Mathieu-Daudé

On 04.05.21 12:32, Daniel P. Berrangé wrote:
> On Tue, May 04, 2021 at 12:21:25PM +0200, David Hildenbrand wrote:
>> On 04.05.21 12:09, Daniel P. Berrangé wrote:
>>> On Wed, Apr 28, 2021 at 03:37:48PM +0200, David Hildenbrand wrote:
>>>> Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
>>>> effect on most shared mappings - except for hugetlbfs and anonymous memory.
>>>>
>>>> Linux man page:
>>>>     "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
>>>>     space is reserved, one has the guarantee that it is possible to modify
>>>>     the mapping. When swap space is not reserved one might get SIGSEGV
>>>>     upon a write if no physical memory is available. See also the discussion
>>>>     of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
>>>>     2.6, this flag had effect only for private writable mappings."
>>>>
>>>> Note that the "guarantee" part is wrong with memory overcommit in Linux.
>>>>
>>>> Also, in Linux hugetlbfs is treated differently - we configure reservation
>>>> of huge pages from the pool, not reservation of swap space (huge pages
>>>> cannot be swapped).
>>>>
>>>> The rough behavior is [1]:
>>>> a) !Hugetlbfs:
>>>>
>>>>     1) Without MAP_NORESERVE *or* with memory overcommit under Linux
>>>>        disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
>>>>        accounting/reservation happens:
>>>>         For a file backed map
>>>>          SHARED or READ-only - 0 cost (the file is the map not swap)
>>>>          PRIVATE WRITABLE - size of mapping per instance
>>>>
>>>>         For an anonymous or /dev/zero map
>>>>          SHARED   - size of mapping
>>>>          PRIVATE READ-only - 0 cost (but of little use)
>>>>          PRIVATE WRITABLE - size of mapping per instance
>>>>
>>>>     2) With MAP_NORESERVE, no accounting/reservation happens.
>>>>
>>>> b) Hugetlbfs:
>>>>
>>>>     1) Without MAP_NORESERVE, huge pages are reserved.
>>>>
>>>>     2) With MAP_NORESERVE, no huge pages are reserved.
>>>>
>>>> Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
>>>> to configure it for !hugetlbfs globally; this toggle now allows
>>>> configuring it more fine-grained, not for the whole system.
>>>>
>>>> The target use case is virtio-mem, which dynamically exposes memory
>>>> inside a large, sparse memory area to the VM.
>>>
>>> Can you explain this use case in more real world terms, as I'm not
>>> understanding what a mgmt app would actually do with this in
>>> practice ?
>>
>> Let's consider huge pages for simplicity. Assume you have 128 free huge
>> pages in your hypervisor that you want to dynamically assign to VMs.
>>
>> Further assume you have two VMs running. A workflow could look like
>>
>> 1. Assign all huge pages to VM 0
>> 2. Reassign 64 huge pages to VM 1
>> 3. Reassign another 32 huge pages to VM 1
>> 4. Reasssign 16 huge pages to VM 0
>> 5. ...
>>
>> Basically what we're used to doing with "ordinary" memory.
> 
> What does this look like in terms of the memory backend configuration
> when you boot VM 0 and VM 1 ?
> 
> Are you saying that we boot both VMs with
> 
>     -object hostmem-memfd,size=128G,hugetlb=yes,hugetlbsize=1G,reserve=off
> 
> and then we have another property set on 'virtio-mem' to tell it
> how much/little of that 128 G, to actually give to the guest ?
> How do we change that at runtime ?

Roughly, yes. We only special-case memory backends managed by virtio-mem devices.

An advanced example for a single VM could look like this:

sudo build/qemu-system-x86_64 \
	... \
	-m 4G,maxmem=64G \
	-smp sockets=2,cores=2 \
	-object hostmem-memfd,id=bmem0,size=2G,hugetlb=yes,hugetlbsize=2M \
	-numa node,nodeid=0,cpus=0-1,memdev=bmem0 \
	-object hostmem-memfd,id=bmem1,size=2G,hugetlb=yes,hugetlbsize=2M \
	-numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
	... \
	-object hostmem-memfd,id=mem0,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
	-device virtio-mem-pci,id=vmem0,memdev=mem0,node=0,requested-size=0G \
	-object hostmem-memfd,id=mem1,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
	-device virtio-mem-pci,id=vmem1,memdev=mem1,node=1,requested-size=0G \
	... \

We can request a size change by adjusting the "requested-size" property (e.g., via qom-set)
and observe the current size by reading the "size" property (e.g., qom-get). Think of
it as an advanced device-local memory balloon mixed with the concept of a memory hotplug.


I suggest taking a look at the libvirt virito-mem implemetation
-- don't think it's upstream yet:

https://lkml.kernel.org/r/cover.1615982004.git.mprivozn@redhat.com

I'm CCing Michal -- I already gave him a note upfront which additional
properties we might see for memory backends (e.g., reserve, managed-size)
and virtio-mem devices (e.g., iothread, prealloc, reserve, prot).

> 
> 
>> For that to work with virtio-mem, you'll have to disable reservation of huge
>> pages for the virtio-mem managed memory region.
>>
>> (prealloction of huge pages in virtio-mem to protect from user mistakes is a
>> separate work item)
>>
>> reserve=off will be the default for virtio-mem, and actual
>> reservation/preallcoation will be done within virtio-mem. There could be use
>> for "reserve=off" for virtio-balloon use cases as well, but I'd like to
>> exclude that from the discussion for now.
> 
> The hostmem backend defaults are indepdant of frontend usage, so when you
> say reserve=off is the default for virtio-mem, are you expecting the mgmt
> app like libvirt to specify that ?

Sorry, yes exactly; only for the memory backend managed by a virtio-mem device.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property
  2021-05-04 10:12   ` Daniel P. Berrangé
@ 2021-05-04 11:08     ` David Hildenbrand
  2021-05-04 11:18       ` Daniel P. Berrangé
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2021-05-04 11:08 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Markus Armbruster, Peter Xu, qemu-devel,
	Greg Kurz, Paolo Bonzini, Stefan Hajnoczi,
	Murilo Opsfelder Araujo, Igor Mammedov, Nitesh Lal,
	Philippe Mathieu-Daudé,
	Dr. David Alan Gilbert

On 04.05.21 12:12, Daniel P. Berrangé wrote:
> On Wed, Apr 28, 2021 at 03:37:49PM +0200, David Hildenbrand wrote:
>> Let's provide a way to control the use of RAM_NORESERVE via memory
>> backends using the "reserve" property which defaults to true (old
>> behavior).
>>
>> Only Linux currently supports clearing the flag (and support is checked at
>> runtime, depending on the setting of "/proc/sys/vm/overcommit_memory").
>> Windows and other POSIX systems will bail out with "reserve=false".
>>
>> The target use case is virtio-mem, which dynamically exposes memory
>> inside a large, sparse memory area to the VM. This essentially allows
>> avoiding to set "/proc/sys/vm/overcommit_memory == 0") when using
>> virtio-mem and also supporting hugetlbfs in the future.
>>
>> Reviewed-by: Peter Xu <peterx@redhat.com>
>> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
>> Reviewed-by: Markus Armbruster <armbru@redhat.com>
>> Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
>> Cc: Markus Armbruster <armbru@redhat.com>
>> Cc: Eric Blake <eblake@redhat.com>
>> Cc: Igor Mammedov <imammedo@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   backends/hostmem-file.c  | 11 ++++++-----
>>   backends/hostmem-memfd.c |  1 +
>>   backends/hostmem-ram.c   |  1 +
>>   backends/hostmem.c       | 32 ++++++++++++++++++++++++++++++++
>>   include/sysemu/hostmem.h |  2 +-
>>   qapi/qom.json            | 10 ++++++++++
>>   6 files changed, 51 insertions(+), 6 deletions(-)
>>
>> diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
>> index b683da9daf..9d550e53d4 100644
>> --- a/backends/hostmem-file.c
>> +++ b/backends/hostmem-file.c
>> @@ -40,6 +40,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>                  object_get_typename(OBJECT(backend)));
>>   #else
>>       HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
>> +    uint32_t ram_flags;
>>       gchar *name;
>>   
>>       if (!backend->size) {
>> @@ -52,11 +53,11 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>       }
>>   
>>       name = host_memory_backend_get_name(backend);
>> -    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
>> -                                     name,
>> -                                     backend->size, fb->align,
>> -                                     (backend->share ? RAM_SHARED : 0) |
>> -                                     (fb->is_pmem ? RAM_PMEM : 0),
>> +    ram_flags = backend->share ? RAM_SHARED : 0;
>> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>> +    ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
>> +    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
>> +                                     backend->size, fb->align, ram_flags,
>>                                        fb->mem_path, fb->readonly, errp);
>>       g_free(name);
>>   #endif
>> diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
>> index 93b5d1a4cf..f3436b623d 100644
>> --- a/backends/hostmem-memfd.c
>> +++ b/backends/hostmem-memfd.c
>> @@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>   
>>       name = host_memory_backend_get_name(backend);
>>       ram_flags = backend->share ? RAM_SHARED : 0;
>> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>>       memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
>>                                      backend->size, ram_flags, fd, 0, errp);
>>       g_free(name);
>> diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
>> index 741e701062..b8e55cdbd0 100644
>> --- a/backends/hostmem-ram.c
>> +++ b/backends/hostmem-ram.c
>> @@ -29,6 +29,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>   
>>       name = host_memory_backend_get_name(backend);
>>       ram_flags = backend->share ? RAM_SHARED : 0;
>> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>>       memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
>>                                              backend->size, ram_flags, errp);
>>       g_free(name);
>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>> index c6c1ff5b99..58fdc1b658 100644
>> --- a/backends/hostmem.c
>> +++ b/backends/hostmem.c
>> @@ -217,6 +217,11 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
>>       Error *local_err = NULL;
>>       HostMemoryBackend *backend = MEMORY_BACKEND(obj);
>>   
>> +    if (!backend->reserve && value) {
>> +        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
>> +        return;
>> +    }
>> +
>>       if (!host_memory_backend_mr_inited(backend)) {
>>           backend->prealloc = value;
>>           return;
>> @@ -268,6 +273,7 @@ static void host_memory_backend_init(Object *obj)
>>       /* TODO: convert access to globals to compat properties */
>>       backend->merge = machine_mem_merge(machine);
>>       backend->dump = machine_dump_guest_core(machine);
>> +    backend->reserve = true;
>>       backend->prealloc_threads = 1;
>>   }
>>   
>> @@ -426,6 +432,28 @@ static void host_memory_backend_set_share(Object *o, bool value, Error **errp)
>>       backend->share = value;
>>   }
>>   
>> +static bool host_memory_backend_get_reserve(Object *o, Error **errp)
>> +{
>> +    HostMemoryBackend *backend = MEMORY_BACKEND(o);
>> +
>> +    return backend->reserve;
>> +}
>> +
>> +static void host_memory_backend_set_reserve(Object *o, bool value, Error **errp)
>> +{
>> +    HostMemoryBackend *backend = MEMORY_BACKEND(o);
>> +
>> +    if (host_memory_backend_mr_inited(backend)) {
>> +        error_setg(errp, "cannot change property value");
>> +        return;
>> +    }
>> +    if (backend->prealloc && !value) {
>> +        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
>> +        return;
>> +    }
>> +    backend->reserve = value;
>> +}
>> +
>>   static bool
>>   host_memory_backend_get_use_canonical_path(Object *obj, Error **errp)
>>   {
>> @@ -494,6 +522,10 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
>>           host_memory_backend_get_share, host_memory_backend_set_share);
>>       object_class_property_set_description(oc, "share",
>>           "Mark the memory as private to QEMU or shared");
>> +    object_class_property_add_bool(oc, "reserve",
>> +        host_memory_backend_get_reserve, host_memory_backend_set_reserve);
>> +    object_class_property_set_description(oc, "reserve",
>> +        "Reserve swap space (or huge pages) if applicable");
>>       /*
>>        * Do not delete/rename option. This option must be considered stable
>>        * (as if it didn't have the 'x-' prefix including deprecation period) as
>> diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
>> index df5644723a..9ff5c16963 100644
>> --- a/include/sysemu/hostmem.h
>> +++ b/include/sysemu/hostmem.h
>> @@ -64,7 +64,7 @@ struct HostMemoryBackend {
>>       /* protected */
>>       uint64_t size;
>>       bool merge, dump, use_canonical_path;
>> -    bool prealloc, is_mapped, share;
>> +    bool prealloc, is_mapped, share, reserve;
>>       uint32_t prealloc_threads;
>>       DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
>>       HostMemPolicy policy;
>> diff --git a/qapi/qom.json b/qapi/qom.json
>> index cd0e76d564..4fa3137aab 100644
>> --- a/qapi/qom.json
>> +++ b/qapi/qom.json
>> @@ -545,6 +545,9 @@
>>   # @share: if false, the memory is private to QEMU; if true, it is shared
>>   #         (default: false)
>>   #
>> +# @reserve: if true, reserve swap space (or huge pages) if applicable
>> +#           default: true)
>> +#
>>   # @size: size of the memory region in bytes
>>   #
>>   # @x-use-canonical-path-for-ramblock-id: if true, the canoncial path is used
>> @@ -556,6 +559,12 @@
>>   #                                        false generally, but true for machine
>>   #                                        types <= 4.0)
>>   #
>> +# Note: prealloc=true and reserve=false cannot be set at the same time. With
>> +#       reserve=true, the behavior depends on the operating system: for example,
>> +#       Linux will not reserve swap space for shared file mappings --
>> +#       "not applicable". In contrast, reserve=false will bail out if it cannot
>> +#       be configured accordingly.
>> +#
>>   # Since: 2.1
>>   ##
>>   { 'struct': 'MemoryBackendProperties',
>> @@ -566,6 +575,7 @@
>>               '*prealloc': 'bool',
>>               '*prealloc-threads': 'uint32',
>>               '*share': 'bool',
>> +            '*reserve': 'bool',
>>               'size': 'size',
>>               '*x-use-canonical-path-for-ramblock-id': 'bool' } }
> 
> IIUC from the previous patch in the series, 'reserve' is only implemented
> on Linux.  If we make this QAPI prop dependant on CONFIG_LINUX, then mgmt
> apps will do the right thing to detect what platform(s) it works on.

Would that mean only conditionally calling (or ifdeffing) the 
object_property_add_* in backends/hostmem.c?

Note that the "share" property is basically ignored on Windows and only 
implemented on POSIX and we don't even bail out when set ...

Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
  2021-05-04 11:04         ` David Hildenbrand
@ 2021-05-04 11:14           ` Daniel P. Berrangé
  2021-05-04 11:28             ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel P. Berrangé @ 2021-05-04 11:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Michal Privoznik, Richard Henderson, qemu-devel, Peter Xu,
	Dr. David Alan Gilbert, Greg Kurz, Paolo Bonzini,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Igor Mammedov,
	Nitesh Lal, Philippe Mathieu-Daudé

On Tue, May 04, 2021 at 01:04:17PM +0200, David Hildenbrand wrote:
> On 04.05.21 12:32, Daniel P. Berrangé wrote:
> > On Tue, May 04, 2021 at 12:21:25PM +0200, David Hildenbrand wrote:
> > > On 04.05.21 12:09, Daniel P. Berrangé wrote:
> > > > On Wed, Apr 28, 2021 at 03:37:48PM +0200, David Hildenbrand wrote:
> > > > > Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
> > > > > effect on most shared mappings - except for hugetlbfs and anonymous memory.
> > > > > 
> > > > > Linux man page:
> > > > >     "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
> > > > >     space is reserved, one has the guarantee that it is possible to modify
> > > > >     the mapping. When swap space is not reserved one might get SIGSEGV
> > > > >     upon a write if no physical memory is available. See also the discussion
> > > > >     of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
> > > > >     2.6, this flag had effect only for private writable mappings."
> > > > > 
> > > > > Note that the "guarantee" part is wrong with memory overcommit in Linux.
> > > > > 
> > > > > Also, in Linux hugetlbfs is treated differently - we configure reservation
> > > > > of huge pages from the pool, not reservation of swap space (huge pages
> > > > > cannot be swapped).
> > > > > 
> > > > > The rough behavior is [1]:
> > > > > a) !Hugetlbfs:
> > > > > 
> > > > >     1) Without MAP_NORESERVE *or* with memory overcommit under Linux
> > > > >        disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
> > > > >        accounting/reservation happens:
> > > > >         For a file backed map
> > > > >          SHARED or READ-only - 0 cost (the file is the map not swap)
> > > > >          PRIVATE WRITABLE - size of mapping per instance
> > > > > 
> > > > >         For an anonymous or /dev/zero map
> > > > >          SHARED   - size of mapping
> > > > >          PRIVATE READ-only - 0 cost (but of little use)
> > > > >          PRIVATE WRITABLE - size of mapping per instance
> > > > > 
> > > > >     2) With MAP_NORESERVE, no accounting/reservation happens.
> > > > > 
> > > > > b) Hugetlbfs:
> > > > > 
> > > > >     1) Without MAP_NORESERVE, huge pages are reserved.
> > > > > 
> > > > >     2) With MAP_NORESERVE, no huge pages are reserved.
> > > > > 
> > > > > Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
> > > > > to configure it for !hugetlbfs globally; this toggle now allows
> > > > > configuring it more fine-grained, not for the whole system.
> > > > > 
> > > > > The target use case is virtio-mem, which dynamically exposes memory
> > > > > inside a large, sparse memory area to the VM.
> > > > 
> > > > Can you explain this use case in more real world terms, as I'm not
> > > > understanding what a mgmt app would actually do with this in
> > > > practice ?
> > > 
> > > Let's consider huge pages for simplicity. Assume you have 128 free huge
> > > pages in your hypervisor that you want to dynamically assign to VMs.
> > > 
> > > Further assume you have two VMs running. A workflow could look like
> > > 
> > > 1. Assign all huge pages to VM 0
> > > 2. Reassign 64 huge pages to VM 1
> > > 3. Reassign another 32 huge pages to VM 1
> > > 4. Reasssign 16 huge pages to VM 0
> > > 5. ...
> > > 
> > > Basically what we're used to doing with "ordinary" memory.
> > 
> > What does this look like in terms of the memory backend configuration
> > when you boot VM 0 and VM 1 ?
> > 
> > Are you saying that we boot both VMs with
> > 
> >     -object hostmem-memfd,size=128G,hugetlb=yes,hugetlbsize=1G,reserve=off
> > 
> > and then we have another property set on 'virtio-mem' to tell it
> > how much/little of that 128 G, to actually give to the guest ?
> > How do we change that at runtime ?
> 
> Roughly, yes. We only special-case memory backends managed by virtio-mem devices.
> 
> An advanced example for a single VM could look like this:
> 
> sudo build/qemu-system-x86_64 \
> 	... \
> 	-m 4G,maxmem=64G \
> 	-smp sockets=2,cores=2 \
> 	-object hostmem-memfd,id=bmem0,size=2G,hugetlb=yes,hugetlbsize=2M \
> 	-numa node,nodeid=0,cpus=0-1,memdev=bmem0 \
> 	-object hostmem-memfd,id=bmem1,size=2G,hugetlb=yes,hugetlbsize=2M \
> 	-numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
> 	... \
> 	-object hostmem-memfd,id=mem0,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
> 	-device virtio-mem-pci,id=vmem0,memdev=mem0,node=0,requested-size=0G \
> 	-object hostmem-memfd,id=mem1,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
> 	-device virtio-mem-pci,id=vmem1,memdev=mem1,node=1,requested-size=0G \
> 	... \
> 
> We can request a size change by adjusting the "requested-size" property (e.g., via qom-set)
> and observe the current size by reading the "size" property (e.g., qom-get). Think of
> it as an advanced device-local memory balloon mixed with the concept of a memory hotplug.

Ok, so in this example, the initial  GB of RAM has normal reserve=on
so if there's insufficient hugepages we'll see the startup failure IIUC.

What happens when we set qom-set requested-size=10GB at runtime, but there
are only 8 GB of hugepages left available ?

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property
  2021-05-04 11:08     ` David Hildenbrand
@ 2021-05-04 11:18       ` Daniel P. Berrangé
  2021-05-04 12:47         ` David Hildenbrand
  2021-05-06  9:59         ` David Hildenbrand
  0 siblings, 2 replies; 29+ messages in thread
From: Daniel P. Berrangé @ 2021-05-04 11:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Markus Armbruster, Peter Xu, qemu-devel,
	Greg Kurz, Paolo Bonzini, Stefan Hajnoczi,
	Murilo Opsfelder Araujo, Igor Mammedov, Nitesh Lal,
	Philippe Mathieu-Daudé,
	Dr. David Alan Gilbert

On Tue, May 04, 2021 at 01:08:02PM +0200, David Hildenbrand wrote:
> On 04.05.21 12:12, Daniel P. Berrangé wrote:
> > On Wed, Apr 28, 2021 at 03:37:49PM +0200, David Hildenbrand wrote:
> > > Let's provide a way to control the use of RAM_NORESERVE via memory
> > > backends using the "reserve" property which defaults to true (old
> > > behavior).
> > > 
> > > Only Linux currently supports clearing the flag (and support is checked at
> > > runtime, depending on the setting of "/proc/sys/vm/overcommit_memory").
> > > Windows and other POSIX systems will bail out with "reserve=false".
> > > 
> > > The target use case is virtio-mem, which dynamically exposes memory
> > > inside a large, sparse memory area to the VM. This essentially allows
> > > avoiding to set "/proc/sys/vm/overcommit_memory == 0") when using
> > > virtio-mem and also supporting hugetlbfs in the future.
> > > 
> > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
> > > Reviewed-by: Markus Armbruster <armbru@redhat.com>
> > > Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
> > > Cc: Markus Armbruster <armbru@redhat.com>
> > > Cc: Eric Blake <eblake@redhat.com>
> > > Cc: Igor Mammedov <imammedo@redhat.com>
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >   backends/hostmem-file.c  | 11 ++++++-----
> > >   backends/hostmem-memfd.c |  1 +
> > >   backends/hostmem-ram.c   |  1 +
> > >   backends/hostmem.c       | 32 ++++++++++++++++++++++++++++++++
> > >   include/sysemu/hostmem.h |  2 +-
> > >   qapi/qom.json            | 10 ++++++++++
> > >   6 files changed, 51 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
> > > index b683da9daf..9d550e53d4 100644
> > > --- a/backends/hostmem-file.c
> > > +++ b/backends/hostmem-file.c
> > > @@ -40,6 +40,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
> > >                  object_get_typename(OBJECT(backend)));
> > >   #else
> > >       HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
> > > +    uint32_t ram_flags;
> > >       gchar *name;
> > >       if (!backend->size) {
> > > @@ -52,11 +53,11 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
> > >       }
> > >       name = host_memory_backend_get_name(backend);
> > > -    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
> > > -                                     name,
> > > -                                     backend->size, fb->align,
> > > -                                     (backend->share ? RAM_SHARED : 0) |
> > > -                                     (fb->is_pmem ? RAM_PMEM : 0),
> > > +    ram_flags = backend->share ? RAM_SHARED : 0;
> > > +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
> > > +    ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
> > > +    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
> > > +                                     backend->size, fb->align, ram_flags,
> > >                                        fb->mem_path, fb->readonly, errp);
> > >       g_free(name);
> > >   #endif
> > > diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
> > > index 93b5d1a4cf..f3436b623d 100644
> > > --- a/backends/hostmem-memfd.c
> > > +++ b/backends/hostmem-memfd.c
> > > @@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
> > >       name = host_memory_backend_get_name(backend);
> > >       ram_flags = backend->share ? RAM_SHARED : 0;
> > > +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
> > >       memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
> > >                                      backend->size, ram_flags, fd, 0, errp);
> > >       g_free(name);
> > > diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
> > > index 741e701062..b8e55cdbd0 100644
> > > --- a/backends/hostmem-ram.c
> > > +++ b/backends/hostmem-ram.c
> > > @@ -29,6 +29,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
> > >       name = host_memory_backend_get_name(backend);
> > >       ram_flags = backend->share ? RAM_SHARED : 0;
> > > +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
> > >       memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
> > >                                              backend->size, ram_flags, errp);
> > >       g_free(name);
> > > diff --git a/backends/hostmem.c b/backends/hostmem.c
> > > index c6c1ff5b99..58fdc1b658 100644
> > > --- a/backends/hostmem.c
> > > +++ b/backends/hostmem.c
> > > @@ -217,6 +217,11 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
> > >       Error *local_err = NULL;
> > >       HostMemoryBackend *backend = MEMORY_BACKEND(obj);
> > > +    if (!backend->reserve && value) {
> > > +        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
> > > +        return;
> > > +    }
> > > +
> > >       if (!host_memory_backend_mr_inited(backend)) {
> > >           backend->prealloc = value;
> > >           return;
> > > @@ -268,6 +273,7 @@ static void host_memory_backend_init(Object *obj)
> > >       /* TODO: convert access to globals to compat properties */
> > >       backend->merge = machine_mem_merge(machine);
> > >       backend->dump = machine_dump_guest_core(machine);
> > > +    backend->reserve = true;
> > >       backend->prealloc_threads = 1;
> > >   }
> > > @@ -426,6 +432,28 @@ static void host_memory_backend_set_share(Object *o, bool value, Error **errp)
> > >       backend->share = value;
> > >   }
> > > +static bool host_memory_backend_get_reserve(Object *o, Error **errp)
> > > +{
> > > +    HostMemoryBackend *backend = MEMORY_BACKEND(o);
> > > +
> > > +    return backend->reserve;
> > > +}
> > > +
> > > +static void host_memory_backend_set_reserve(Object *o, bool value, Error **errp)
> > > +{
> > > +    HostMemoryBackend *backend = MEMORY_BACKEND(o);
> > > +
> > > +    if (host_memory_backend_mr_inited(backend)) {
> > > +        error_setg(errp, "cannot change property value");
> > > +        return;
> > > +    }
> > > +    if (backend->prealloc && !value) {
> > > +        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
> > > +        return;
> > > +    }
> > > +    backend->reserve = value;
> > > +}
> > > +
> > >   static bool
> > >   host_memory_backend_get_use_canonical_path(Object *obj, Error **errp)
> > >   {
> > > @@ -494,6 +522,10 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
> > >           host_memory_backend_get_share, host_memory_backend_set_share);
> > >       object_class_property_set_description(oc, "share",
> > >           "Mark the memory as private to QEMU or shared");
> > > +    object_class_property_add_bool(oc, "reserve",
> > > +        host_memory_backend_get_reserve, host_memory_backend_set_reserve);
> > > +    object_class_property_set_description(oc, "reserve",
> > > +        "Reserve swap space (or huge pages) if applicable");
> > >       /*
> > >        * Do not delete/rename option. This option must be considered stable
> > >        * (as if it didn't have the 'x-' prefix including deprecation period) as
> > > diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
> > > index df5644723a..9ff5c16963 100644
> > > --- a/include/sysemu/hostmem.h
> > > +++ b/include/sysemu/hostmem.h
> > > @@ -64,7 +64,7 @@ struct HostMemoryBackend {
> > >       /* protected */
> > >       uint64_t size;
> > >       bool merge, dump, use_canonical_path;
> > > -    bool prealloc, is_mapped, share;
> > > +    bool prealloc, is_mapped, share, reserve;
> > >       uint32_t prealloc_threads;
> > >       DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
> > >       HostMemPolicy policy;
> > > diff --git a/qapi/qom.json b/qapi/qom.json
> > > index cd0e76d564..4fa3137aab 100644
> > > --- a/qapi/qom.json
> > > +++ b/qapi/qom.json
> > > @@ -545,6 +545,9 @@
> > >   # @share: if false, the memory is private to QEMU; if true, it is shared
> > >   #         (default: false)
> > >   #
> > > +# @reserve: if true, reserve swap space (or huge pages) if applicable
> > > +#           default: true)
> > > +#
> > >   # @size: size of the memory region in bytes
> > >   #
> > >   # @x-use-canonical-path-for-ramblock-id: if true, the canoncial path is used
> > > @@ -556,6 +559,12 @@
> > >   #                                        false generally, but true for machine
> > >   #                                        types <= 4.0)
> > >   #
> > > +# Note: prealloc=true and reserve=false cannot be set at the same time. With
> > > +#       reserve=true, the behavior depends on the operating system: for example,
> > > +#       Linux will not reserve swap space for shared file mappings --
> > > +#       "not applicable". In contrast, reserve=false will bail out if it cannot
> > > +#       be configured accordingly.
> > > +#
> > >   # Since: 2.1
> > >   ##
> > >   { 'struct': 'MemoryBackendProperties',
> > > @@ -566,6 +575,7 @@
> > >               '*prealloc': 'bool',
> > >               '*prealloc-threads': 'uint32',
> > >               '*share': 'bool',
> > > +            '*reserve': 'bool',
> > >               'size': 'size',
> > >               '*x-use-canonical-path-for-ramblock-id': 'bool' } }
> > 
> > IIUC from the previous patch in the series, 'reserve' is only implemented
> > on Linux.  If we make this QAPI prop dependant on CONFIG_LINUX, then mgmt
> > apps will do the right thing to detect what platform(s) it works on.
> 
> Would that mean only conditionally calling (or ifdeffing) the
> object_property_add_* in backends/hostmem.c?

Yes, any code which refers to the property would need to have
matching conditionals. The plus side is that you don't have to
do error reporting to say it doesn't exist, because it becomes
impossible for the mgmt app to even set the property in the
first place on platforms where it doesn't exist. 

> Note that the "share" property is basically ignored on Windows and only
> implemented on POSIX and we don't even bail out when set ...

It might pre-date the support for conditionals in QAPI. Silently ignoring
things that can't be supported is even worse !

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
  2021-05-04 11:14           ` Daniel P. Berrangé
@ 2021-05-04 11:28             ` David Hildenbrand
  0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-05-04 11:28 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Michal Privoznik, Richard Henderson, qemu-devel, Peter Xu,
	Dr. David Alan Gilbert, Greg Kurz, Paolo Bonzini,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Igor Mammedov,
	Nitesh Lal, Philippe Mathieu-Daudé

On 04.05.21 13:14, Daniel P. Berrangé wrote:
> On Tue, May 04, 2021 at 01:04:17PM +0200, David Hildenbrand wrote:
>> On 04.05.21 12:32, Daniel P. Berrangé wrote:
>>> On Tue, May 04, 2021 at 12:21:25PM +0200, David Hildenbrand wrote:
>>>> On 04.05.21 12:09, Daniel P. Berrangé wrote:
>>>>> On Wed, Apr 28, 2021 at 03:37:48PM +0200, David Hildenbrand wrote:
>>>>>> Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
>>>>>> effect on most shared mappings - except for hugetlbfs and anonymous memory.
>>>>>>
>>>>>> Linux man page:
>>>>>>      "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
>>>>>>      space is reserved, one has the guarantee that it is possible to modify
>>>>>>      the mapping. When swap space is not reserved one might get SIGSEGV
>>>>>>      upon a write if no physical memory is available. See also the discussion
>>>>>>      of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
>>>>>>      2.6, this flag had effect only for private writable mappings."
>>>>>>
>>>>>> Note that the "guarantee" part is wrong with memory overcommit in Linux.
>>>>>>
>>>>>> Also, in Linux hugetlbfs is treated differently - we configure reservation
>>>>>> of huge pages from the pool, not reservation of swap space (huge pages
>>>>>> cannot be swapped).
>>>>>>
>>>>>> The rough behavior is [1]:
>>>>>> a) !Hugetlbfs:
>>>>>>
>>>>>>      1) Without MAP_NORESERVE *or* with memory overcommit under Linux
>>>>>>         disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
>>>>>>         accounting/reservation happens:
>>>>>>          For a file backed map
>>>>>>           SHARED or READ-only - 0 cost (the file is the map not swap)
>>>>>>           PRIVATE WRITABLE - size of mapping per instance
>>>>>>
>>>>>>          For an anonymous or /dev/zero map
>>>>>>           SHARED   - size of mapping
>>>>>>           PRIVATE READ-only - 0 cost (but of little use)
>>>>>>           PRIVATE WRITABLE - size of mapping per instance
>>>>>>
>>>>>>      2) With MAP_NORESERVE, no accounting/reservation happens.
>>>>>>
>>>>>> b) Hugetlbfs:
>>>>>>
>>>>>>      1) Without MAP_NORESERVE, huge pages are reserved.
>>>>>>
>>>>>>      2) With MAP_NORESERVE, no huge pages are reserved.
>>>>>>
>>>>>> Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
>>>>>> to configure it for !hugetlbfs globally; this toggle now allows
>>>>>> configuring it more fine-grained, not for the whole system.
>>>>>>
>>>>>> The target use case is virtio-mem, which dynamically exposes memory
>>>>>> inside a large, sparse memory area to the VM.
>>>>>
>>>>> Can you explain this use case in more real world terms, as I'm not
>>>>> understanding what a mgmt app would actually do with this in
>>>>> practice ?
>>>>
>>>> Let's consider huge pages for simplicity. Assume you have 128 free huge
>>>> pages in your hypervisor that you want to dynamically assign to VMs.
>>>>
>>>> Further assume you have two VMs running. A workflow could look like
>>>>
>>>> 1. Assign all huge pages to VM 0
>>>> 2. Reassign 64 huge pages to VM 1
>>>> 3. Reassign another 32 huge pages to VM 1
>>>> 4. Reasssign 16 huge pages to VM 0
>>>> 5. ...
>>>>
>>>> Basically what we're used to doing with "ordinary" memory.
>>>
>>> What does this look like in terms of the memory backend configuration
>>> when you boot VM 0 and VM 1 ?
>>>
>>> Are you saying that we boot both VMs with
>>>
>>>      -object hostmem-memfd,size=128G,hugetlb=yes,hugetlbsize=1G,reserve=off
>>>
>>> and then we have another property set on 'virtio-mem' to tell it
>>> how much/little of that 128 G, to actually give to the guest ?
>>> How do we change that at runtime ?
>>
>> Roughly, yes. We only special-case memory backends managed by virtio-mem devices.
>>
>> An advanced example for a single VM could look like this:
>>
>> sudo build/qemu-system-x86_64 \
>> 	... \
>> 	-m 4G,maxmem=64G \
>> 	-smp sockets=2,cores=2 \
>> 	-object hostmem-memfd,id=bmem0,size=2G,hugetlb=yes,hugetlbsize=2M \
>> 	-numa node,nodeid=0,cpus=0-1,memdev=bmem0 \
>> 	-object hostmem-memfd,id=bmem1,size=2G,hugetlb=yes,hugetlbsize=2M \
>> 	-numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
>> 	... \
>> 	-object hostmem-memfd,id=mem0,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
>> 	-device virtio-mem-pci,id=vmem0,memdev=mem0,node=0,requested-size=0G \
>> 	-object hostmem-memfd,id=mem1,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
>> 	-device virtio-mem-pci,id=vmem1,memdev=mem1,node=1,requested-size=0G \
>> 	... \
>>
>> We can request a size change by adjusting the "requested-size" property (e.g., via qom-set)
>> and observe the current size by reading the "size" property (e.g., qom-get). Think of
>> it as an advanced device-local memory balloon mixed with the concept of a memory hotplug.
> 
> Ok, so in this example, the initial  GB of RAM has normal reserve=on
> so if there's insufficient hugepages we'll see the startup failure IIUC.

Yes, except in some NUMA configurations, as huge page reservation isn't 
numa aware; even with reservation there are cases where we can run out 
of applicable free huge pages. Usually we end up preallocating all 
memory in the memory backend just so we're on the safe side.

> 
> What happens when we set qom-set requested-size=10GB at runtime, but there
> are only 8 GB of hugepages left available ?

This is one of the user errors that will be tackled next by a dynamic 
preallocation (and/or reservation) inside virtio-mem.

Once the guest would actually touch >8 GiB, we run out of free huge 
pages and don't have huge page overcommit enabled (or huge page 
overcommit fails allocation which can happen easily), we'd essentially 
crash the VM.

Pretty much similar to messing up memory overcommit with "ordinary" 
memory and getting your VM killed by the OOM handler.

The solution is fairly easy: preallocate huge pages when resizing the 
virtio-mem device (making new huge pages available to the VM in this case).

In the simplest case this can be done using fallocate(). If you're 
interested about the dirty details where it's not that easy, take a look 
at my MADV_POPULATE_READ/MADV_POPULATE_WRITE kernel series [1]. Marek is 
working on handling virtio-mem device via an iothread, so we can do 
preallocation easily "concurrently" while the VM is running, avoiding 
holding the BQL for a long time.

[1] https://lkml.kernel.org/r/20210419135443.12822-1-david@redhat.com

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property
  2021-05-04 11:18       ` Daniel P. Berrangé
@ 2021-05-04 12:47         ` David Hildenbrand
  2021-05-06  9:59         ` David Hildenbrand
  1 sibling, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-05-04 12:47 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Markus Armbruster, Peter Xu, qemu-devel,
	Greg Kurz, Paolo Bonzini, Stefan Hajnoczi,
	Murilo Opsfelder Araujo, Igor Mammedov, Nitesh Lal,
	Philippe Mathieu-Daudé,
	Dr. David Alan Gilbert

On 04.05.21 13:18, Daniel P. Berrangé wrote:
> On Tue, May 04, 2021 at 01:08:02PM +0200, David Hildenbrand wrote:
>> On 04.05.21 12:12, Daniel P. Berrangé wrote:
>>> On Wed, Apr 28, 2021 at 03:37:49PM +0200, David Hildenbrand wrote:
>>>> Let's provide a way to control the use of RAM_NORESERVE via memory
>>>> backends using the "reserve" property which defaults to true (old
>>>> behavior).
>>>>
>>>> Only Linux currently supports clearing the flag (and support is checked at
>>>> runtime, depending on the setting of "/proc/sys/vm/overcommit_memory").
>>>> Windows and other POSIX systems will bail out with "reserve=false".
>>>>
>>>> The target use case is virtio-mem, which dynamically exposes memory
>>>> inside a large, sparse memory area to the VM. This essentially allows
>>>> avoiding to set "/proc/sys/vm/overcommit_memory == 0") when using
>>>> virtio-mem and also supporting hugetlbfs in the future.
>>>>
>>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>>> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
>>>> Reviewed-by: Markus Armbruster <armbru@redhat.com>
>>>> Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
>>>> Cc: Markus Armbruster <armbru@redhat.com>
>>>> Cc: Eric Blake <eblake@redhat.com>
>>>> Cc: Igor Mammedov <imammedo@redhat.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    backends/hostmem-file.c  | 11 ++++++-----
>>>>    backends/hostmem-memfd.c |  1 +
>>>>    backends/hostmem-ram.c   |  1 +
>>>>    backends/hostmem.c       | 32 ++++++++++++++++++++++++++++++++
>>>>    include/sysemu/hostmem.h |  2 +-
>>>>    qapi/qom.json            | 10 ++++++++++
>>>>    6 files changed, 51 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
>>>> index b683da9daf..9d550e53d4 100644
>>>> --- a/backends/hostmem-file.c
>>>> +++ b/backends/hostmem-file.c
>>>> @@ -40,6 +40,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>>                   object_get_typename(OBJECT(backend)));
>>>>    #else
>>>>        HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
>>>> +    uint32_t ram_flags;
>>>>        gchar *name;
>>>>        if (!backend->size) {
>>>> @@ -52,11 +53,11 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>>        }
>>>>        name = host_memory_backend_get_name(backend);
>>>> -    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
>>>> -                                     name,
>>>> -                                     backend->size, fb->align,
>>>> -                                     (backend->share ? RAM_SHARED : 0) |
>>>> -                                     (fb->is_pmem ? RAM_PMEM : 0),
>>>> +    ram_flags = backend->share ? RAM_SHARED : 0;
>>>> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>>>> +    ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
>>>> +    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
>>>> +                                     backend->size, fb->align, ram_flags,
>>>>                                         fb->mem_path, fb->readonly, errp);
>>>>        g_free(name);
>>>>    #endif
>>>> diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
>>>> index 93b5d1a4cf..f3436b623d 100644
>>>> --- a/backends/hostmem-memfd.c
>>>> +++ b/backends/hostmem-memfd.c
>>>> @@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>>        name = host_memory_backend_get_name(backend);
>>>>        ram_flags = backend->share ? RAM_SHARED : 0;
>>>> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>>>>        memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
>>>>                                       backend->size, ram_flags, fd, 0, errp);
>>>>        g_free(name);
>>>> diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
>>>> index 741e701062..b8e55cdbd0 100644
>>>> --- a/backends/hostmem-ram.c
>>>> +++ b/backends/hostmem-ram.c
>>>> @@ -29,6 +29,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>>        name = host_memory_backend_get_name(backend);
>>>>        ram_flags = backend->share ? RAM_SHARED : 0;
>>>> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>>>>        memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
>>>>                                               backend->size, ram_flags, errp);
>>>>        g_free(name);
>>>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>>>> index c6c1ff5b99..58fdc1b658 100644
>>>> --- a/backends/hostmem.c
>>>> +++ b/backends/hostmem.c
>>>> @@ -217,6 +217,11 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
>>>>        Error *local_err = NULL;
>>>>        HostMemoryBackend *backend = MEMORY_BACKEND(obj);
>>>> +    if (!backend->reserve && value) {
>>>> +        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
>>>> +        return;
>>>> +    }
>>>> +
>>>>        if (!host_memory_backend_mr_inited(backend)) {
>>>>            backend->prealloc = value;
>>>>            return;
>>>> @@ -268,6 +273,7 @@ static void host_memory_backend_init(Object *obj)
>>>>        /* TODO: convert access to globals to compat properties */
>>>>        backend->merge = machine_mem_merge(machine);
>>>>        backend->dump = machine_dump_guest_core(machine);
>>>> +    backend->reserve = true;
>>>>        backend->prealloc_threads = 1;
>>>>    }
>>>> @@ -426,6 +432,28 @@ static void host_memory_backend_set_share(Object *o, bool value, Error **errp)
>>>>        backend->share = value;
>>>>    }
>>>> +static bool host_memory_backend_get_reserve(Object *o, Error **errp)
>>>> +{
>>>> +    HostMemoryBackend *backend = MEMORY_BACKEND(o);
>>>> +
>>>> +    return backend->reserve;
>>>> +}
>>>> +
>>>> +static void host_memory_backend_set_reserve(Object *o, bool value, Error **errp)
>>>> +{
>>>> +    HostMemoryBackend *backend = MEMORY_BACKEND(o);
>>>> +
>>>> +    if (host_memory_backend_mr_inited(backend)) {
>>>> +        error_setg(errp, "cannot change property value");
>>>> +        return;
>>>> +    }
>>>> +    if (backend->prealloc && !value) {
>>>> +        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
>>>> +        return;
>>>> +    }
>>>> +    backend->reserve = value;
>>>> +}
>>>> +
>>>>    static bool
>>>>    host_memory_backend_get_use_canonical_path(Object *obj, Error **errp)
>>>>    {
>>>> @@ -494,6 +522,10 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
>>>>            host_memory_backend_get_share, host_memory_backend_set_share);
>>>>        object_class_property_set_description(oc, "share",
>>>>            "Mark the memory as private to QEMU or shared");
>>>> +    object_class_property_add_bool(oc, "reserve",
>>>> +        host_memory_backend_get_reserve, host_memory_backend_set_reserve);
>>>> +    object_class_property_set_description(oc, "reserve",
>>>> +        "Reserve swap space (or huge pages) if applicable");
>>>>        /*
>>>>         * Do not delete/rename option. This option must be considered stable
>>>>         * (as if it didn't have the 'x-' prefix including deprecation period) as
>>>> diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
>>>> index df5644723a..9ff5c16963 100644
>>>> --- a/include/sysemu/hostmem.h
>>>> +++ b/include/sysemu/hostmem.h
>>>> @@ -64,7 +64,7 @@ struct HostMemoryBackend {
>>>>        /* protected */
>>>>        uint64_t size;
>>>>        bool merge, dump, use_canonical_path;
>>>> -    bool prealloc, is_mapped, share;
>>>> +    bool prealloc, is_mapped, share, reserve;
>>>>        uint32_t prealloc_threads;
>>>>        DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
>>>>        HostMemPolicy policy;
>>>> diff --git a/qapi/qom.json b/qapi/qom.json
>>>> index cd0e76d564..4fa3137aab 100644
>>>> --- a/qapi/qom.json
>>>> +++ b/qapi/qom.json
>>>> @@ -545,6 +545,9 @@
>>>>    # @share: if false, the memory is private to QEMU; if true, it is shared
>>>>    #         (default: false)
>>>>    #
>>>> +# @reserve: if true, reserve swap space (or huge pages) if applicable
>>>> +#           default: true)
>>>> +#
>>>>    # @size: size of the memory region in bytes
>>>>    #
>>>>    # @x-use-canonical-path-for-ramblock-id: if true, the canoncial path is used
>>>> @@ -556,6 +559,12 @@
>>>>    #                                        false generally, but true for machine
>>>>    #                                        types <= 4.0)
>>>>    #
>>>> +# Note: prealloc=true and reserve=false cannot be set at the same time. With
>>>> +#       reserve=true, the behavior depends on the operating system: for example,
>>>> +#       Linux will not reserve swap space for shared file mappings --
>>>> +#       "not applicable". In contrast, reserve=false will bail out if it cannot
>>>> +#       be configured accordingly.
>>>> +#
>>>>    # Since: 2.1
>>>>    ##
>>>>    { 'struct': 'MemoryBackendProperties',
>>>> @@ -566,6 +575,7 @@
>>>>                '*prealloc': 'bool',
>>>>                '*prealloc-threads': 'uint32',
>>>>                '*share': 'bool',
>>>> +            '*reserve': 'bool',
>>>>                'size': 'size',
>>>>                '*x-use-canonical-path-for-ramblock-id': 'bool' } }
>>>
>>> IIUC from the previous patch in the series, 'reserve' is only implemented
>>> on Linux.  If we make this QAPI prop dependant on CONFIG_LINUX, then mgmt
>>> apps will do the right thing to detect what platform(s) it works on.
>>
>> Would that mean only conditionally calling (or ifdeffing) the
>> object_property_add_* in backends/hostmem.c?
> 
> Yes, any code which refers to the property would need to have
> matching conditionals. The plus side is that you don't have to
> do error reporting to say it doesn't exist, because it becomes
> impossible for the mgmt app to even set the property in the
> first place on platforms where it doesn't exist.

It's a little more complicated on Linux, see patch #9. We'll still need 
error reporting for some memory backends when the user cofigured 
"/proc/sys/vm/overcommit_memory = 2". So we cannot completely get rid of 
error reporting I'm afraid.

Which raises the question if registering the propert conditionally just 
to handle !Linux slightly better is worth the effort.

(in theory, we might see support for some other POSIX systems in the 
future; Windows is more tricky)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property
  2021-05-04  9:58   ` Paolo Bonzini
@ 2021-05-06  9:25     ` David Hildenbrand
  0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-05-06  9:25 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Markus Armbruster, Peter Xu,
	Dr. David Alan Gilbert, Greg Kurz, Stefan Hajnoczi,
	Murilo Opsfelder Araujo, Igor Mammedov, Nitesh Lal,
	Philippe Mathieu-Daudé

On 04.05.21 11:58, Paolo Bonzini wrote:
> On 28/04/21 15:37, David Hildenbrand wrote:
>> @@ -545,6 +545,9 @@
>>    # @share: if false, the memory is private to QEMU; if true, it is shared
>>    #         (default: false)
>>    #
>> +# @reserve: if true, reserve swap space (or huge pages) if applicable
>> +#           default: true)
> 
> Missing open parenthesis and "since 6.1", otherwise the whole series
> looks good, thanks!

I could have sworn I had the "since 6.1" in before :)

Thanks!


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property
  2021-05-04 11:18       ` Daniel P. Berrangé
  2021-05-04 12:47         ` David Hildenbrand
@ 2021-05-06  9:59         ` David Hildenbrand
  1 sibling, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2021-05-06  9:59 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Marcel Apfelbaum, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Markus Armbruster, Peter Xu, qemu-devel,
	Greg Kurz, Paolo Bonzini, Stefan Hajnoczi,
	Murilo Opsfelder Araujo, Igor Mammedov, Nitesh Lal,
	Philippe Mathieu-Daudé,
	Dr. David Alan Gilbert

On 04.05.21 13:18, Daniel P. Berrangé wrote:
> On Tue, May 04, 2021 at 01:08:02PM +0200, David Hildenbrand wrote:
>> On 04.05.21 12:12, Daniel P. Berrangé wrote:
>>> On Wed, Apr 28, 2021 at 03:37:49PM +0200, David Hildenbrand wrote:
>>>> Let's provide a way to control the use of RAM_NORESERVE via memory
>>>> backends using the "reserve" property which defaults to true (old
>>>> behavior).
>>>>
>>>> Only Linux currently supports clearing the flag (and support is checked at
>>>> runtime, depending on the setting of "/proc/sys/vm/overcommit_memory").
>>>> Windows and other POSIX systems will bail out with "reserve=false".
>>>>
>>>> The target use case is virtio-mem, which dynamically exposes memory
>>>> inside a large, sparse memory area to the VM. This essentially allows
>>>> avoiding to set "/proc/sys/vm/overcommit_memory == 0") when using
>>>> virtio-mem and also supporting hugetlbfs in the future.
>>>>
>>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>>> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
>>>> Reviewed-by: Markus Armbruster <armbru@redhat.com>
>>>> Acked-by: Eduardo Habkost <ehabkost@redhat.com> for memory backend and machine core
>>>> Cc: Markus Armbruster <armbru@redhat.com>
>>>> Cc: Eric Blake <eblake@redhat.com>
>>>> Cc: Igor Mammedov <imammedo@redhat.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    backends/hostmem-file.c  | 11 ++++++-----
>>>>    backends/hostmem-memfd.c |  1 +
>>>>    backends/hostmem-ram.c   |  1 +
>>>>    backends/hostmem.c       | 32 ++++++++++++++++++++++++++++++++
>>>>    include/sysemu/hostmem.h |  2 +-
>>>>    qapi/qom.json            | 10 ++++++++++
>>>>    6 files changed, 51 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
>>>> index b683da9daf..9d550e53d4 100644
>>>> --- a/backends/hostmem-file.c
>>>> +++ b/backends/hostmem-file.c
>>>> @@ -40,6 +40,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>>                   object_get_typename(OBJECT(backend)));
>>>>    #else
>>>>        HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
>>>> +    uint32_t ram_flags;
>>>>        gchar *name;
>>>>        if (!backend->size) {
>>>> @@ -52,11 +53,11 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>>        }
>>>>        name = host_memory_backend_get_name(backend);
>>>> -    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
>>>> -                                     name,
>>>> -                                     backend->size, fb->align,
>>>> -                                     (backend->share ? RAM_SHARED : 0) |
>>>> -                                     (fb->is_pmem ? RAM_PMEM : 0),
>>>> +    ram_flags = backend->share ? RAM_SHARED : 0;
>>>> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>>>> +    ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
>>>> +    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
>>>> +                                     backend->size, fb->align, ram_flags,
>>>>                                         fb->mem_path, fb->readonly, errp);
>>>>        g_free(name);
>>>>    #endif
>>>> diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
>>>> index 93b5d1a4cf..f3436b623d 100644
>>>> --- a/backends/hostmem-memfd.c
>>>> +++ b/backends/hostmem-memfd.c
>>>> @@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>>        name = host_memory_backend_get_name(backend);
>>>>        ram_flags = backend->share ? RAM_SHARED : 0;
>>>> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>>>>        memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
>>>>                                       backend->size, ram_flags, fd, 0, errp);
>>>>        g_free(name);
>>>> diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
>>>> index 741e701062..b8e55cdbd0 100644
>>>> --- a/backends/hostmem-ram.c
>>>> +++ b/backends/hostmem-ram.c
>>>> @@ -29,6 +29,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>>        name = host_memory_backend_get_name(backend);
>>>>        ram_flags = backend->share ? RAM_SHARED : 0;
>>>> +    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>>>>        memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
>>>>                                               backend->size, ram_flags, errp);
>>>>        g_free(name);
>>>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>>>> index c6c1ff5b99..58fdc1b658 100644
>>>> --- a/backends/hostmem.c
>>>> +++ b/backends/hostmem.c
>>>> @@ -217,6 +217,11 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
>>>>        Error *local_err = NULL;
>>>>        HostMemoryBackend *backend = MEMORY_BACKEND(obj);
>>>> +    if (!backend->reserve && value) {
>>>> +        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
>>>> +        return;
>>>> +    }
>>>> +
>>>>        if (!host_memory_backend_mr_inited(backend)) {
>>>>            backend->prealloc = value;
>>>>            return;
>>>> @@ -268,6 +273,7 @@ static void host_memory_backend_init(Object *obj)
>>>>        /* TODO: convert access to globals to compat properties */
>>>>        backend->merge = machine_mem_merge(machine);
>>>>        backend->dump = machine_dump_guest_core(machine);
>>>> +    backend->reserve = true;
>>>>        backend->prealloc_threads = 1;
>>>>    }
>>>> @@ -426,6 +432,28 @@ static void host_memory_backend_set_share(Object *o, bool value, Error **errp)
>>>>        backend->share = value;
>>>>    }
>>>> +static bool host_memory_backend_get_reserve(Object *o, Error **errp)
>>>> +{
>>>> +    HostMemoryBackend *backend = MEMORY_BACKEND(o);
>>>> +
>>>> +    return backend->reserve;
>>>> +}
>>>> +
>>>> +static void host_memory_backend_set_reserve(Object *o, bool value, Error **errp)
>>>> +{
>>>> +    HostMemoryBackend *backend = MEMORY_BACKEND(o);
>>>> +
>>>> +    if (host_memory_backend_mr_inited(backend)) {
>>>> +        error_setg(errp, "cannot change property value");
>>>> +        return;
>>>> +    }
>>>> +    if (backend->prealloc && !value) {
>>>> +        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
>>>> +        return;
>>>> +    }
>>>> +    backend->reserve = value;
>>>> +}
>>>> +
>>>>    static bool
>>>>    host_memory_backend_get_use_canonical_path(Object *obj, Error **errp)
>>>>    {
>>>> @@ -494,6 +522,10 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
>>>>            host_memory_backend_get_share, host_memory_backend_set_share);
>>>>        object_class_property_set_description(oc, "share",
>>>>            "Mark the memory as private to QEMU or shared");
>>>> +    object_class_property_add_bool(oc, "reserve",
>>>> +        host_memory_backend_get_reserve, host_memory_backend_set_reserve);
>>>> +    object_class_property_set_description(oc, "reserve",
>>>> +        "Reserve swap space (or huge pages) if applicable");
>>>>        /*
>>>>         * Do not delete/rename option. This option must be considered stable
>>>>         * (as if it didn't have the 'x-' prefix including deprecation period) as
>>>> diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
>>>> index df5644723a..9ff5c16963 100644
>>>> --- a/include/sysemu/hostmem.h
>>>> +++ b/include/sysemu/hostmem.h
>>>> @@ -64,7 +64,7 @@ struct HostMemoryBackend {
>>>>        /* protected */
>>>>        uint64_t size;
>>>>        bool merge, dump, use_canonical_path;
>>>> -    bool prealloc, is_mapped, share;
>>>> +    bool prealloc, is_mapped, share, reserve;
>>>>        uint32_t prealloc_threads;
>>>>        DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
>>>>        HostMemPolicy policy;
>>>> diff --git a/qapi/qom.json b/qapi/qom.json
>>>> index cd0e76d564..4fa3137aab 100644
>>>> --- a/qapi/qom.json
>>>> +++ b/qapi/qom.json
>>>> @@ -545,6 +545,9 @@
>>>>    # @share: if false, the memory is private to QEMU; if true, it is shared
>>>>    #         (default: false)
>>>>    #
>>>> +# @reserve: if true, reserve swap space (or huge pages) if applicable
>>>> +#           default: true)
>>>> +#
>>>>    # @size: size of the memory region in bytes
>>>>    #
>>>>    # @x-use-canonical-path-for-ramblock-id: if true, the canoncial path is used
>>>> @@ -556,6 +559,12 @@
>>>>    #                                        false generally, but true for machine
>>>>    #                                        types <= 4.0)
>>>>    #
>>>> +# Note: prealloc=true and reserve=false cannot be set at the same time. With
>>>> +#       reserve=true, the behavior depends on the operating system: for example,
>>>> +#       Linux will not reserve swap space for shared file mappings --
>>>> +#       "not applicable". In contrast, reserve=false will bail out if it cannot
>>>> +#       be configured accordingly.
>>>> +#
>>>>    # Since: 2.1
>>>>    ##
>>>>    { 'struct': 'MemoryBackendProperties',
>>>> @@ -566,6 +575,7 @@
>>>>                '*prealloc': 'bool',
>>>>                '*prealloc-threads': 'uint32',
>>>>                '*share': 'bool',
>>>> +            '*reserve': 'bool',
>>>>                'size': 'size',
>>>>                '*x-use-canonical-path-for-ramblock-id': 'bool' } }
>>>
>>> IIUC from the previous patch in the series, 'reserve' is only implemented
>>> on Linux.  If we make this QAPI prop dependant on CONFIG_LINUX, then mgmt
>>> apps will do the right thing to detect what platform(s) it works on.
>>
>> Would that mean only conditionally calling (or ifdeffing) the
>> object_property_add_* in backends/hostmem.c?
> 
> Yes, any code which refers to the property would need to have
> matching conditionals. The plus side is that you don't have to
> do error reporting to say it doesn't exist, because it becomes
> impossible for the mgmt app to even set the property in the
> first place on platforms where it doesn't exist.

I could fold in the following change, which should be sufficient.
In that case, setting the property to "false" could only fail if
the user messes with the memory overcommit configuration in /sys/.

diff --git a/backends/hostmem.c b/backends/hostmem.c
index e5038e9cab..4c05862ed5 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -431,6 +431,7 @@ static void host_memory_backend_set_share(Object *o, bool value, Error **errp)
      backend->share = value;
  }
  
+#ifdef CONFIG_LINUX
  static bool host_memory_backend_get_reserve(Object *o, Error **errp)
  {
      HostMemoryBackend *backend = MEMORY_BACKEND(o);
@@ -452,6 +453,7 @@ static void host_memory_backend_set_reserve(Object *o, bool value, Error **errp)
      }
      backend->reserve = value;
  }
+#endif /* CONFIG_LINUX */
  
  static bool
  host_memory_backend_get_use_canonical_path(Object *obj, Error **errp)
@@ -521,10 +523,12 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
          host_memory_backend_get_share, host_memory_backend_set_share);
      object_class_property_set_description(oc, "share",
          "Mark the memory as private to QEMU or shared");
+#ifdef CONFIG_LINUX
      object_class_property_add_bool(oc, "reserve",
          host_memory_backend_get_reserve, host_memory_backend_set_reserve);
      object_class_property_set_description(oc, "reserve",
          "Reserve swap space (or huge pages) if applicable");
+#endif /* CONFIG_LINUX */
      /*
       * Do not delete/rename option. This option must be considered stable
       * (as if it didn't have the 'x-' prefix including deprecation period) as
diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c
index 9bedc77bb4..76b22b00d6 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -112,8 +112,10 @@ void hmp_info_memdev(Monitor *mon, const QDict *qdict)
                         m->value->prealloc ? "true" : "false");
          monitor_printf(mon, "  share: %s\n",
                         m->value->share ? "true" : "false");
-        monitor_printf(mon, "  reserve: %s\n",
-                       m->value->reserve ? "true" : "false");
+        if (m->value->has_reserve) {
+            monitor_printf(mon, "  reserve: %s\n",
+                           m->value->reserve ? "true" : "false");
+        }
          monitor_printf(mon, "  policy: %s\n",
                         HostMemPolicy_str(m->value->policy));
          visit_complete(v, &str);
diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c
index 773904dbca..8922cc511f 100644
--- a/hw/core/machine-qmp-cmds.c
+++ b/hw/core/machine-qmp-cmds.c
@@ -173,7 +173,10 @@ static int query_memdev(Object *obj, void *opaque)
          m->dump = object_property_get_bool(obj, "dump", &error_abort);
          m->prealloc = object_property_get_bool(obj, "prealloc", &error_abort);
          m->share = object_property_get_bool(obj, "share", &error_abort);
+#ifdef CONFIG_LINUX
          m->reserve = object_property_get_bool(obj, "reserve", &error_abort);
+        m->has_reserve = true;
+#endif /* CONFIG_LINUX */
          m->policy = object_property_get_enum(obj, "policy", "HostMemPolicy",
                                               &error_abort);
          host_nodes = object_property_get_qobject(obj,
diff --git a/qapi/machine.json b/qapi/machine.json
index ea68f1a083..1cfb16e6eb 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -818,7 +818,7 @@
      'dump':       'bool',
      'prealloc':   'bool',
      'share':      'bool',
-    'reserve':    'bool',
+    '*reserve':    'bool',
      'host-nodes': ['uint16'],
      'policy':     'HostMemPolicy' }}
  
-- 
2.30.2


Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2021-05-06 10:00 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-28 13:37 [PATCH v7 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 01/15] util/mmap-alloc: Factor out calculation of the pagesize for the guard page David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 02/15] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve() David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 03/15] util/mmap-alloc: Factor out activating of memory to mmap_activate() David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 04/15] softmmu/memory: Pass ram_flags to qemu_ram_alloc_from_fd() David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 05/15] softmmu/memory: Pass ram_flags to memory_region_init_ram_shared_nomigrate() David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 06/15] softmmu/memory: Pass ram_flags to qemu_ram_alloc() and qemu_ram_alloc_internal() David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 07/15] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap() David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 08/15] memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap() David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux David Hildenbrand
2021-05-04 10:09   ` Daniel P. Berrangé
2021-05-04 10:21     ` David Hildenbrand
2021-05-04 10:32       ` Daniel P. Berrangé
2021-05-04 11:04         ` David Hildenbrand
2021-05-04 11:14           ` Daniel P. Berrangé
2021-05-04 11:28             ` David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property David Hildenbrand
2021-05-04  9:58   ` Paolo Bonzini
2021-05-06  9:25     ` David Hildenbrand
2021-05-04 10:12   ` Daniel P. Berrangé
2021-05-04 11:08     ` David Hildenbrand
2021-05-04 11:18       ` Daniel P. Berrangé
2021-05-04 12:47         ` David Hildenbrand
2021-05-06  9:59         ` David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 11/15] qmp: Clarify memory backend properties returned via query-memdev David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 12/15] qmp: Include "share" property of memory backends David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 13/15] hmp: Print "share" property of memory backends with "info memdev" David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 14/15] qmp: Include "reserve" property of memory backends David Hildenbrand
2021-04-28 13:37 ` [PATCH v7 15/15] hmp: Print "reserve" property of memory backends with "info memdev" David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).