* [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property
@ 2021-03-08 15:05 David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 01/12] softmmu/physmem: Mark shared anonymous memory RAM_SHARED David Hildenbrand
                   ` (11 more replies)
  0 siblings, 12 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

Some fixes for shared anonymous memory and cleanups previously sent in
another context (resizeable allocations), followed by RAM_NORESERVE:
implementing it under POSIX using MAP_NORESERVE and letting users
configure it for memory backends via the "reserve" property (default: true).

In the context of QEMU, MAP_NORESERVE under Linux has an effect on
1) Private/shared anonymous memory
-> memory-backend-ram,id=mem0,size=10G
2) Private fd-based mappings
-> memory-backend-file,id=mem0,size=10G,mem-path=/dev/shm/0
-> memory-backend-memfd,id=mem0,size=10G
3) Private/shared hugetlb mappings
-> memory-backend-memfd,id=mem0,size=10G,hugetlb=on,hugetlbsize=2M

With MAP_NORESERVE/"reserve=off", we won't be reserving swap space (1/2) or
huge pages (3) for the whole memory region.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM. MAP_NORESERVE tells the OS
"this mapping might be very sparse". This essentially avoids having to
set "/proc/sys/vm/overcommit_memory" to 1 when using virtio-mem, and
will also allow supporting hugetlbfs in the future.

virtio-mem currently only supports anonymous memory; in the future we want
to also support private memfd, shared file-based, and shared hugetlbfs
mappings.

virtio-mem features I am currently working on that will make it all
play together with this work include:
1. Introducing a prealloc option for virtio-mem (e.g., using fallocate()
   when plugging blocks) to fail nicely when running out of
   backing storage like huge pages ("prealloc=on").
2. Handling virtio-mem requests via an iothread to not hold the BQL while
   populating/preallocating memory ("iothread=X").
3. Protecting unplugged memory, e.g., using userfaultfd ("prot=uffd").
4. Dynamic reservation of swap space ("reserve=on").
5. Supporting resizable RAM blocks/memory regions, such that we won't
   always expose a large, sparse memory region to the VM.
6. (resizeable allocations / optimized mmap handling when resizing RAM
    blocks)

Based-on: 20210303130916.22553-1-david@redhat.com

v2 -> v3:
- Renamed "softmmu/physmem: Drop "shared" parameter from ram_block_add()"
  to "softmmu/physmem: Mark shared anonymous memory RAM_SHARED" and
  adjusted the description
- Added "softmmu/physmem: Fix ram_block_discard_range() to handle shared
  anonymous memory"
- Added "softmmu/physmem: Fix qemu_ram_remap() to handle shared anonymous
  memory"
- Added "util/mmap-alloc: Pass flags instead of separate bools to
  qemu_ram_mmap()"
- "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE"
-- Further tweak code comments
-- Handle shared anonymous memory

v1 -> v2:
- Rebased to upstream and phys_mem_alloc simplifications
-- Upstream added the "map_offset" parameter to many RAM allocation
   interfaces.
- "softmmu/physmem: Drop "shared" parameter from ram_block_add()"
-- Use local variable "shared"
- "memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()"
-- Simplify due to phys_mem_alloc changes
- "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE"
-- Add a whole bunch of comments.
-- Exclude shared anonymous memory that QEMU doesn't use
-- Special-case readonly mappings

Cc: Peter Xu <peterx@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "Philippe Mathieu-Daudé" <philmd@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Cc: Greg Kurz <groug@kaod.org>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Marcel Apfelbaum <mapfelba@redhat.com>

David Hildenbrand (12):
  softmmu/physmem: Mark shared anonymous memory RAM_SHARED
  softmmu/physmem: Fix ram_block_discard_range() to handle shared
    anonymous memory
  softmmu/physmem: Fix qemu_ram_remap() to handle shared anonymous
    memory
  util/mmap-alloc: Factor out calculation of the pagesize for the guard
    page
  util/mmap-alloc: Factor out reserving of a memory region to
    mmap_reserve()
  util/mmap-alloc: Factor out activating of memory to mmap_activate()
  softmmu/memory: Pass ram_flags into qemu_ram_alloc_from_fd()
  softmmu/memory: Pass ram_flags into
    memory_region_init_ram_shared_nomigrate()
  util/mmap-alloc: Pass flags instead of separate bools to
    qemu_ram_mmap()
  memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE
  hostmem: Wire up RAM_NORESERVE via "reserve" property

 backends/hostmem-file.c                       |  11 +-
 backends/hostmem-memfd.c                      |   8 +-
 backends/hostmem-ram.c                        |   7 +-
 backends/hostmem.c                            |  33 +++
 hw/m68k/next-cube.c                           |   4 +-
 hw/misc/ivshmem.c                             |   5 +-
 include/exec/cpu-common.h                     |   1 +
 include/exec/memory.h                         |  43 ++--
 include/exec/ram_addr.h                       |   9 +-
 include/qemu/mmap-alloc.h                     |  20 +-
 include/qemu/osdep.h                          |   3 +-
 include/sysemu/hostmem.h                      |   2 +-
 migration/ram.c                               |   3 +-
 .../memory-region-housekeeping.cocci          |   8 +-
 softmmu/memory.c                              |  27 ++-
 softmmu/physmem.c                             |  61 +++--
 util/mmap-alloc.c                             | 217 ++++++++++++------
 util/oslib-posix.c                            |   7 +-
 util/oslib-win32.c                            |  13 +-
 19 files changed, 323 insertions(+), 159 deletions(-)

-- 
2.29.2



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 01/12] softmmu/physmem: Mark shared anonymous memory RAM_SHARED
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
@ 2021-03-08 15:05 ` David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory David Hildenbrand
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

Let's drop the "shared" parameter from ram_block_add() and properly
store it in the flags of the ram block instead, such that
qemu_ram_is_shared() properly succeeds on all ram blocks that were mapped
MAP_SHARED.

We'll use this information next to fix some cases with shared anonymous
memory.

Reviewed-by: Igor Kotrasinski <i.kotrasinsk@partner.samsung.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/physmem.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 141fce79e8..62ea4abbdd 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1927,8 +1927,9 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
     }
 }
 
-static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
+static void ram_block_add(RAMBlock *new_block, Error **errp)
 {
+    const bool shared = qemu_ram_is_shared(new_block);
     RAMBlock *block;
     RAMBlock *last_block = NULL;
     ram_addr_t old_ram_size, new_ram_size;
@@ -2064,7 +2065,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
-    ram_block_add(new_block, &local_err, ram_flags & RAM_SHARED);
+    ram_block_add(new_block, &local_err);
     if (local_err) {
         g_free(new_block);
         error_propagate(errp, local_err);
@@ -2127,10 +2128,13 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     if (host) {
         new_block->flags |= RAM_PREALLOC;
     }
+    if (share) {
+        new_block->flags |= RAM_SHARED;
+    }
     if (resizeable) {
         new_block->flags |= RAM_RESIZEABLE;
     }
-    ram_block_add(new_block, &local_err, share);
+    ram_block_add(new_block, &local_err);
     if (local_err) {
         g_free(new_block);
         error_propagate(errp, local_err);
-- 
2.29.2




* [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 01/12] softmmu/physmem: Mark shared anonymous memory RAM_SHARED David Hildenbrand
@ 2021-03-08 15:05 ` David Hildenbrand
  2021-03-11 16:39   ` Dr. David Alan Gilbert
  2021-03-11 21:37   ` Peter Xu
  2021-03-08 15:05 ` [PATCH v3 03/12] softmmu/physmem: Fix qemu_ram_remap() " David Hildenbrand
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

We can create shared anonymous memory via
    "-object memory-backend-ram,share=on,..."
which is, for example, required by PVRDMA for mremap() to work.

Shared anonymous memory is weird, though: MADV_DONTNEED returns success
but does not actually discard the backing pages. We have to use
MADV_REMOVE instead.

Fixes: 06329ccecfa0 ("mem: add share parameter to memory-backend-ram")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/physmem.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 62ea4abbdd..2ba815fec6 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -3506,6 +3506,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
         /* The logic here is messy;
          *    madvise DONTNEED fails for hugepages
          *    fallocate works on hugepages and shmem
+         *    shared anonymous memory requires madvise REMOVE
          */
         need_madvise = (rb->page_size == qemu_host_page_size);
         need_fallocate = rb->fd != -1;
@@ -3539,7 +3540,11 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
              * fallocate'd away).
              */
 #if defined(CONFIG_MADVISE)
-            ret =  madvise(host_startaddr, length, MADV_DONTNEED);
+            if (qemu_ram_is_shared(rb) && rb->fd < 0) {
+                ret = madvise(host_startaddr, length, MADV_REMOVE);
+            } else {
+                ret = madvise(host_startaddr, length, MADV_DONTNEED);
+            }
             if (ret) {
                 ret = -errno;
                 error_report("ram_block_discard_range: Failed to discard range "
-- 
2.29.2




* [PATCH v3 03/12] softmmu/physmem: Fix qemu_ram_remap() to handle shared anonymous memory
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 01/12] softmmu/physmem: Mark shared anonymous memory RAM_SHARED David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory David Hildenbrand
@ 2021-03-08 15:05 ` David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 04/12] util/mmap-alloc: Factor out calculation of the pagesize for the guard page David Hildenbrand
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

RAM_SHARED now also properly indicates shared anonymous memory. Let's check
that flag for anonymous memory as well, to restore the proper mapping.

Fixes: 06329ccecfa0 ("mem: add share parameter to memory-backend-ram")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/physmem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 2ba815fec6..d0a0027a16 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2222,13 +2222,13 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 abort();
             } else {
                 flags = MAP_FIXED;
+                flags |= block->flags & RAM_SHARED ?
+                         MAP_SHARED : MAP_PRIVATE;
                 if (block->fd >= 0) {
-                    flags |= (block->flags & RAM_SHARED ?
-                              MAP_SHARED : MAP_PRIVATE);
                     area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
                                 flags, block->fd, offset);
                 } else {
-                    flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+                    flags |= MAP_ANONYMOUS;
                     area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
                                 flags, -1, 0);
                 }
-- 
2.29.2




* [PATCH v3 04/12] util/mmap-alloc: Factor out calculation of the pagesize for the guard page
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (2 preceding siblings ...)
  2021-03-08 15:05 ` [PATCH v3 03/12] softmmu/physmem: Fix qemu_ram_remap() " David Hildenbrand
@ 2021-03-08 15:05 ` David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 05/12] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve() David Hildenbrand
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

Let's factor out calculating the size of the guard page and rename the
variable to make it clearer that this pagesize only applies to the
guard page.

Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Cc: Igor Kotrasinski <i.kotrasinsk@partner.samsung.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index e6fa8b598b..24854064b4 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -82,6 +82,16 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
     return qemu_real_host_page_size;
 }
 
+static inline size_t mmap_guard_pagesize(int fd)
+{
+#if defined(__powerpc64__) && defined(__linux__)
+    /* Mappings in the same segment must share the same page size */
+    return qemu_fd_getpagesize(fd);
+#else
+    return qemu_real_host_page_size;
+#endif
+}
+
 void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
@@ -90,12 +100,12 @@ void *qemu_ram_mmap(int fd,
                     bool is_pmem,
                     off_t map_offset)
 {
+    const size_t guard_pagesize = mmap_guard_pagesize(fd);
     int prot;
     int flags;
     int map_sync_flags = 0;
     int guardfd;
     size_t offset;
-    size_t pagesize;
     size_t total;
     void *guardptr;
     void *ptr;
@@ -116,8 +126,7 @@ void *qemu_ram_mmap(int fd,
      * anonymous memory is OK.
      */
     flags = MAP_PRIVATE;
-    pagesize = qemu_fd_getpagesize(fd);
-    if (fd == -1 || pagesize == qemu_real_host_page_size) {
+    if (fd == -1 || guard_pagesize == qemu_real_host_page_size) {
         guardfd = -1;
         flags |= MAP_ANONYMOUS;
     } else {
@@ -126,7 +135,6 @@ void *qemu_ram_mmap(int fd,
     }
 #else
     guardfd = -1;
-    pagesize = qemu_real_host_page_size;
     flags = MAP_PRIVATE | MAP_ANONYMOUS;
 #endif
 
@@ -138,7 +146,7 @@ void *qemu_ram_mmap(int fd,
 
     assert(is_power_of_2(align));
     /* Always align to host page size */
-    assert(align >= pagesize);
+    assert(align >= guard_pagesize);
 
     flags = MAP_FIXED;
     flags |= fd == -1 ? MAP_ANONYMOUS : 0;
@@ -193,8 +201,8 @@ void *qemu_ram_mmap(int fd,
      * a guard page guarding against potential buffer overflows.
      */
     total -= offset;
-    if (total > size + pagesize) {
-        munmap(ptr + size + pagesize, total - size - pagesize);
+    if (total > size + guard_pagesize) {
+        munmap(ptr + size + guard_pagesize, total - size - guard_pagesize);
     }
 
     return ptr;
@@ -202,15 +210,8 @@ void *qemu_ram_mmap(int fd,
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size)
 {
-    size_t pagesize;
-
     if (ptr) {
         /* Unmap both the RAM block and the guard page */
-#if defined(__powerpc64__) && defined(__linux__)
-        pagesize = qemu_fd_getpagesize(fd);
-#else
-        pagesize = qemu_real_host_page_size;
-#endif
-        munmap(ptr, size + pagesize);
+        munmap(ptr, size + mmap_guard_pagesize(fd));
     }
 }
-- 
2.29.2




* [PATCH v3 05/12] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve()
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (3 preceding siblings ...)
  2021-03-08 15:05 ` [PATCH v3 04/12] util/mmap-alloc: Factor out calculation of the pagesize for the guard page David Hildenbrand
@ 2021-03-08 15:05 ` David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 06/12] util/mmap-alloc: Factor out activating of memory to mmap_activate() David Hildenbrand
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

We want to reserve a memory region without actually populating memory.
Let's factor that out.

Reviewed-by: Igor Kotrasinski <i.kotrasinsk@partner.samsung.com>
Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 58 +++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 24854064b4..223d66219c 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -82,6 +82,38 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
     return qemu_real_host_page_size;
 }
 
+/*
+ * Reserve a new memory region of the requested size to be used for mapping
+ * from the given fd (if any).
+ */
+static void *mmap_reserve(size_t size, int fd)
+{
+    int flags = MAP_PRIVATE;
+
+#if defined(__powerpc64__) && defined(__linux__)
+    /*
+     * On ppc64 mappings in the same segment (aka slice) must share the same
+     * page size. Since we will be re-allocating part of this segment
+     * from the supplied fd, we should make sure to use the same page size, to
+     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
+     * avoid allocating backing store memory.
+     * We do this unless we are using the system page size, in which case
+     * anonymous memory is OK.
+     */
+    if (fd == -1 || qemu_fd_getpagesize(fd) == qemu_real_host_page_size) {
+        fd = -1;
+        flags |= MAP_ANONYMOUS;
+    } else {
+        flags |= MAP_NORESERVE;
+    }
+#else
+    fd = -1;
+    flags |= MAP_ANONYMOUS;
+#endif
+
+    return mmap(0, size, PROT_NONE, flags, fd, 0);
+}
+
 static inline size_t mmap_guard_pagesize(int fd)
 {
 #if defined(__powerpc64__) && defined(__linux__)
@@ -104,7 +136,6 @@ void *qemu_ram_mmap(int fd,
     int prot;
     int flags;
     int map_sync_flags = 0;
-    int guardfd;
     size_t offset;
     size_t total;
     void *guardptr;
@@ -116,30 +147,7 @@ void *qemu_ram_mmap(int fd,
      */
     total = size + align;
 
-#if defined(__powerpc64__) && defined(__linux__)
-    /* On ppc64 mappings in the same segment (aka slice) must share the same
-     * page size. Since we will be re-allocating part of this segment
-     * from the supplied fd, we should make sure to use the same page size, to
-     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
-     * avoid allocating backing store memory.
-     * We do this unless we are using the system page size, in which case
-     * anonymous memory is OK.
-     */
-    flags = MAP_PRIVATE;
-    if (fd == -1 || guard_pagesize == qemu_real_host_page_size) {
-        guardfd = -1;
-        flags |= MAP_ANONYMOUS;
-    } else {
-        guardfd = fd;
-        flags |= MAP_NORESERVE;
-    }
-#else
-    guardfd = -1;
-    flags = MAP_PRIVATE | MAP_ANONYMOUS;
-#endif
-
-    guardptr = mmap(0, total, PROT_NONE, flags, guardfd, 0);
-
+    guardptr = mmap_reserve(total, fd);
     if (guardptr == MAP_FAILED) {
         return MAP_FAILED;
     }
-- 
2.29.2




* [PATCH v3 06/12] util/mmap-alloc: Factor out activating of memory to mmap_activate()
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (4 preceding siblings ...)
  2021-03-08 15:05 ` [PATCH v3 05/12] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve() David Hildenbrand
@ 2021-03-08 15:05 ` David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 07/12] softmmu/memory: Pass ram_flags into qemu_ram_alloc_from_fd() David Hildenbrand
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

We want to activate memory within a reserved memory region, to make it
accessible. Let's factor that out.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 94 +++++++++++++++++++++++++----------------------
 1 file changed, 50 insertions(+), 44 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 223d66219c..0e2bd7bc0e 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -114,6 +114,52 @@ static void *mmap_reserve(size_t size, int fd)
     return mmap(0, size, PROT_NONE, flags, fd, 0);
 }
 
+/*
+ * Activate memory in a reserved region from the given fd (if any), to make
+ * it accessible.
+ */
+static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly,
+                           bool shared, bool is_pmem, off_t map_offset)
+{
+    const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
+    int map_sync_flags = 0;
+    int flags = MAP_FIXED;
+    void *activated_ptr;
+
+    flags |= fd == -1 ? MAP_ANONYMOUS : 0;
+    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
+    if (shared && is_pmem) {
+        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
+    }
+
+    activated_ptr = mmap(ptr, size, prot, flags | map_sync_flags, fd,
+                         map_offset);
+    if (activated_ptr == MAP_FAILED && map_sync_flags) {
+        if (errno == ENOTSUP) {
+            char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
+            char *file_name = g_malloc0(PATH_MAX);
+            int len = readlink(proc_link, file_name, PATH_MAX - 1);
+
+            if (len < 0) {
+                len = 0;
+            }
+            file_name[len] = '\0';
+            fprintf(stderr, "Warning: requesting persistence across crashes "
+                    "for backend file %s failed. Proceeding without "
+                    "persistence, data might become corrupted in case of host "
+                    "crash.\n", file_name);
+            g_free(proc_link);
+            g_free(file_name);
+        }
+        /*
+         * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
+         * again without these flags to handle backwards compatibility.
+         */
+        activated_ptr = mmap(ptr, size, prot, flags, fd, map_offset);
+    }
+    return activated_ptr;
+}
+
 static inline size_t mmap_guard_pagesize(int fd)
 {
 #if defined(__powerpc64__) && defined(__linux__)
@@ -133,13 +179,8 @@ void *qemu_ram_mmap(int fd,
                     off_t map_offset)
 {
     const size_t guard_pagesize = mmap_guard_pagesize(fd);
-    int prot;
-    int flags;
-    int map_sync_flags = 0;
-    size_t offset;
-    size_t total;
-    void *guardptr;
-    void *ptr;
+    size_t offset, total;
+    void *ptr, *guardptr;
 
     /*
      * Note: this always allocates at least one extra page of virtual address
@@ -156,45 +197,10 @@ void *qemu_ram_mmap(int fd,
     /* Always align to host page size */
     assert(align >= guard_pagesize);
 
-    flags = MAP_FIXED;
-    flags |= fd == -1 ? MAP_ANONYMOUS : 0;
-    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
-    if (shared && is_pmem) {
-        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
-    }
-
     offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
 
-    prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
-
-    ptr = mmap(guardptr + offset, size, prot,
-               flags | map_sync_flags, fd, map_offset);
-
-    if (ptr == MAP_FAILED && map_sync_flags) {
-        if (errno == ENOTSUP) {
-            char *proc_link, *file_name;
-            int len;
-            proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
-            file_name = g_malloc0(PATH_MAX);
-            len = readlink(proc_link, file_name, PATH_MAX - 1);
-            if (len < 0) {
-                len = 0;
-            }
-            file_name[len] = '\0';
-            fprintf(stderr, "Warning: requesting persistence across crashes "
-                    "for backend file %s failed. Proceeding without "
-                    "persistence, data might become corrupted in case of host "
-                    "crash.\n", file_name);
-            g_free(proc_link);
-            g_free(file_name);
-        }
-        /*
-         * if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
-         * we will remove these flags to handle compatibility.
-         */
-        ptr = mmap(guardptr + offset, size, prot, flags, fd, map_offset);
-    }
-
+    ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem,
+                        map_offset);
     if (ptr == MAP_FAILED) {
         munmap(guardptr, total);
         return MAP_FAILED;
-- 
2.29.2




* [PATCH v3 07/12] softmmu/memory: Pass ram_flags into qemu_ram_alloc_from_fd()
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (5 preceding siblings ...)
  2021-03-08 15:05 ` [PATCH v3 06/12] util/mmap-alloc: Factor out activating of memory to mmap_activate() David Hildenbrand
@ 2021-03-08 15:05 ` David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 08/12] softmmu/memory: Pass ram_flags into memory_region_init_ram_shared_nomigrate() David Hildenbrand
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

Let's pass in ram flags just like we do with qemu_ram_alloc_from_file(),
to clean up and prepare for more flags.

Simplify the documentation of the passed ram flags: referring to our
documentation of RAM_SHARED and RAM_PMEM is sufficient; no need to be
repetitive.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem-memfd.c | 7 ++++---
 hw/misc/ivshmem.c        | 5 ++---
 include/exec/memory.h    | 9 +++------
 include/exec/ram_addr.h  | 6 +-----
 softmmu/memory.c         | 7 +++----
 5 files changed, 13 insertions(+), 21 deletions(-)

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 69b0ae30bb..93b5d1a4cf 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -36,6 +36,7 @@ static void
 memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
     HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend);
+    uint32_t ram_flags;
     char *name;
     int fd;
 
@@ -53,9 +54,9 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     name = host_memory_backend_get_name(backend);
-    memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
-                                   name, backend->size,
-                                   backend->share, fd, 0, errp);
+    ram_flags = backend->share ? RAM_SHARED : 0;
+    memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
+                                   backend->size, ram_flags, fd, 0, errp);
     g_free(name);
 }
 
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 603e992a7f..730669dfc5 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -494,9 +494,8 @@ static void process_msg_shmem(IVShmemState *s, int fd, Error **errp)
     size = buf.st_size;
 
     /* mmap the region and map into the BAR2 */
-    memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s),
-                                   "ivshmem.bar2", size, true, fd, 0,
-                                   &local_err);
+    memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s), "ivshmem.bar2",
+                                   size, RAM_SHARED, fd, 0, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
         return;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c6fb714e49..14004de685 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -967,10 +967,7 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  *         (getpagesize()) will be used.
- * @ram_flags: Memory region features:
- *             - RAM_SHARED: memory must be mmaped with the MAP_SHARED flag
- *             - RAM_PMEM: the memory is persistent memory
- *             Other bits are ignored now.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  * @path: the path in which to allocate the RAM.
  * @readonly: true to open @path for reading, false for read/write.
  * @errp: pointer to Error*, to store an error if it happens.
@@ -996,7 +993,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @owner: the object that tracks the region's reference count
  * @name: the name of the region.
  * @size: size of the region.
- * @share: %true if memory must be mmaped with the MAP_SHARED flag
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  * @fd: the fd to mmap.
  * @offset: offset within the file referenced by fd
  * @errp: pointer to Error*, to store an error if it happens.
@@ -1008,7 +1005,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
                                     struct Object *owner,
                                     const char *name,
                                     uint64_t size,
-                                    bool share,
+                                    uint32_t ram_flags,
                                     int fd,
                                     ram_addr_t offset,
                                     Error **errp);
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 3cb9791df3..a7e3378340 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -104,11 +104,7 @@ long qemu_maxrampagesize(void);
  * Parameters:
  *  @size: the size in bytes of the ram block
  *  @mr: the memory region where the ram block is
- *  @ram_flags: specify the properties of the ram block, which can be one
- *              or bit-or of following values
- *              - RAM_SHARED: mmap the backing file or device with MAP_SHARED
- *              - RAM_PMEM: the backend @mem_path or @fd is persistent memory
- *              Other bits are ignored.
+ *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  *  @mem_path or @fd: specify the backing file or device
  *  @readonly: true to open @path for reading, false for read/write.
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 874a8fccde..9f67c6c23c 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1610,7 +1610,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
                                     struct Object *owner,
                                     const char *name,
                                     uint64_t size,
-                                    bool share,
+                                    uint32_t ram_flags,
                                     int fd,
                                     ram_addr_t offset,
                                     Error **errp)
@@ -1620,9 +1620,8 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc_from_fd(size, mr,
-                                           share ? RAM_SHARED : 0,
-                                           fd, offset, false, &err);
+    mr->ram_block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, offset,
+                                           false, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 08/12] softmmu/memory: Pass ram_flags into memory_region_init_ram_shared_nomigrate()
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (6 preceding siblings ...)
  2021-03-08 15:05 ` [PATCH v3 07/12] softmmu/memory: Pass ram_flags into qemu_ram_alloc_from_fd() David Hildenbrand
@ 2021-03-08 15:05 ` David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 09/12] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap() David Hildenbrand
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

Let's forward ram_flags instead, renaming
memory_region_init_ram_shared_nomigrate() to
memory_region_init_ram_flags_nomigrate(). Forward flags to
qemu_ram_alloc() and qemu_ram_alloc_internal().

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem-ram.c                        |  6 +++--
 hw/m68k/next-cube.c                           |  4 ++--
 include/exec/memory.h                         | 24 +++++++++----------
 include/exec/ram_addr.h                       |  2 +-
 .../memory-region-housekeeping.cocci          |  8 +++----
 softmmu/memory.c                              | 20 ++++++++--------
 softmmu/physmem.c                             | 24 ++++++++-----------
 7 files changed, 43 insertions(+), 45 deletions(-)

diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index 5cc53e76c9..741e701062 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -19,6 +19,7 @@
 static void
 ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
+    uint32_t ram_flags;
     char *name;
 
     if (!backend->size) {
@@ -27,8 +28,9 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     name = host_memory_backend_get_name(backend);
-    memory_region_init_ram_shared_nomigrate(&backend->mr, OBJECT(backend), name,
-                           backend->size, backend->share, errp);
+    ram_flags = backend->share ? RAM_SHARED : 0;
+    memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
+                                           backend->size, ram_flags, errp);
     g_free(name);
 }
 
diff --git a/hw/m68k/next-cube.c b/hw/m68k/next-cube.c
index 92b45d760f..59ccae0d5e 100644
--- a/hw/m68k/next-cube.c
+++ b/hw/m68k/next-cube.c
@@ -986,8 +986,8 @@ static void next_cube_init(MachineState *machine)
     sysbus_mmio_map(SYS_BUS_DEVICE(pcdev), 1, 0x02100000);
 
     /* BMAP memory */
-    memory_region_init_ram_shared_nomigrate(bmapm1, NULL, "next.bmapmem", 64,
-                                            true, &error_fatal);
+    memory_region_init_ram_flags_nomigrate(bmapm1, NULL, "next.bmapmem", 64,
+                                           RAM_SHARED, &error_fatal);
     memory_region_add_subregion(sysmem, 0x020c0000, bmapm1);
     /* The Rev_2.5_v66.bin firmware accesses it at 0x820c0020, too */
     memory_region_init_alias(bmapm2, NULL, "next.bmapmem2", bmapm1, 0x0, 64);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 14004de685..2d97bdf59c 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -904,27 +904,27 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       Error **errp);
 
 /**
- * memory_region_init_ram_shared_nomigrate:  Initialize RAM memory region.
- *                                           Accesses into the region will
- *                                           modify memory directly.
+ * memory_region_init_ram_flags_nomigrate:  Initialize RAM memory region.
+ *                                          Accesses into the region will
+ *                                          modify memory directly.
  *
  * @mr: the #MemoryRegion to be initialized.
  * @owner: the object that tracks the region's reference count
  * @name: Region name, becomes part of RAMBlock name used in migration stream
  *        must be unique within any device
  * @size: size of the region.
- * @share: allow remapping RAM to different addresses
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED.
  * @errp: pointer to Error*, to store an error if it happens.
  *
- * Note that this function is similar to memory_region_init_ram_nomigrate.
- * The only difference is part of the RAM region can be remapped.
+ * Note that this function does not do anything to cause the data in the
+ * RAM memory region to be migrated; that is the responsibility of the caller.
  */
-void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr,
-                                             struct Object *owner,
-                                             const char *name,
-                                             uint64_t size,
-                                             bool share,
-                                             Error **errp);
+void memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
+                                            struct Object *owner,
+                                            const char *name,
+                                            uint64_t size,
+                                            uint32_t ram_flags,
+                                            Error **errp);
 
 /**
  * memory_region_init_resizeable_ram:  Initialize memory region with resizeable
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index a7e3378340..6d4513f8e2 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -122,7 +122,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
 
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                   MemoryRegion *mr, Error **errp);
-RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share, MemoryRegion *mr,
+RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags, MemoryRegion *mr,
                          Error **errp);
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t max_size,
                                     void (*resized)(const char*,
diff --git a/scripts/coccinelle/memory-region-housekeeping.cocci b/scripts/coccinelle/memory-region-housekeeping.cocci
index c768d8140a..29651ebde9 100644
--- a/scripts/coccinelle/memory-region-housekeeping.cocci
+++ b/scripts/coccinelle/memory-region-housekeeping.cocci
@@ -127,8 +127,8 @@ static void device_fn(DeviceState *dev, ...)
 - memory_region_init_rom(E1, NULL, E2, E3, E4);
 + memory_region_init_rom(E1, obj, E2, E3, E4);
 |
-- memory_region_init_ram_shared_nomigrate(E1, NULL, E2, E3, E4, E5);
-+ memory_region_init_ram_shared_nomigrate(E1, obj, E2, E3, E4, E5);
+- memory_region_init_ram_flags_nomigrate(E1, NULL, E2, E3, E4, E5);
++ memory_region_init_ram_flags_nomigrate(E1, obj, E2, E3, E4, E5);
 )
   ...+>
 }
@@ -152,8 +152,8 @@ static void device_fn(DeviceState *dev, ...)
 - memory_region_init_rom(E1, NULL, E2, E3, E4);
 + memory_region_init_rom(E1, OBJECT(dev), E2, E3, E4);
 |
-- memory_region_init_ram_shared_nomigrate(E1, NULL, E2, E3, E4, E5);
-+ memory_region_init_ram_shared_nomigrate(E1, OBJECT(dev), E2, E3, E4, E5);
+- memory_region_init_ram_flags_nomigrate(E1, NULL, E2, E3, E4, E5);
++ memory_region_init_ram_flags_nomigrate(E1, OBJECT(dev), E2, E3, E4, E5);
 )
   ...+>
 }
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 9f67c6c23c..03bf13a5e7 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1532,22 +1532,22 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
-    memory_region_init_ram_shared_nomigrate(mr, owner, name, size, false, errp);
+    memory_region_init_ram_flags_nomigrate(mr, owner, name, size, 0, errp);
 }
 
-void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr,
-                                             Object *owner,
-                                             const char *name,
-                                             uint64_t size,
-                                             bool share,
-                                             Error **errp)
+void memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
+                                            Object *owner,
+                                            const char *name,
+                                            uint64_t size,
+                                            uint32_t ram_flags,
+                                            Error **errp)
 {
     Error *err = NULL;
     memory_region_init(mr, owner, name, size);
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, share, mr, &err);
+    mr->ram_block = qemu_ram_alloc(size, ram_flags, mr, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
@@ -1683,7 +1683,7 @@ void memory_region_init_rom_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
-    memory_region_init_ram_shared_nomigrate(mr, owner, name, size, false, errp);
+    memory_region_init_ram_flags_nomigrate(mr, owner, name, size, 0, errp);
     mr->readonly = true;
 }
 
@@ -1703,7 +1703,7 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
     mr->terminates = true;
     mr->rom_device = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, false,  mr, &err);
+    mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index d0a0027a16..8f90cb4cd2 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2108,12 +2108,14 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                   void (*resized)(const char*,
                                                   uint64_t length,
                                                   void *host),
-                                  void *host, bool resizeable, bool share,
+                                  void *host, uint32_t ram_flags,
                                   MemoryRegion *mr, Error **errp)
 {
     RAMBlock *new_block;
     Error *local_err = NULL;
 
+    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE)) == 0);
+
     size = HOST_PAGE_ALIGN(size);
     max_size = HOST_PAGE_ALIGN(max_size);
     new_block = g_malloc0(sizeof(*new_block));
@@ -2125,15 +2127,10 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     new_block->fd = -1;
     new_block->page_size = qemu_real_host_page_size;
     new_block->host = host;
+    new_block->flags = ram_flags;
     if (host) {
         new_block->flags |= RAM_PREALLOC;
     }
-    if (share) {
-        new_block->flags |= RAM_SHARED;
-    }
-    if (resizeable) {
-        new_block->flags |= RAM_RESIZEABLE;
-    }
     ram_block_add(new_block, &local_err);
     if (local_err) {
         g_free(new_block);
@@ -2146,15 +2143,14 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                    MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, size, NULL, host, false,
-                                   false, mr, errp);
+    return qemu_ram_alloc_internal(size, size, NULL, host, 0, mr, errp);
 }
 
-RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share,
+RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                          MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, size, NULL, NULL, false,
-                                   share, mr, errp);
+    assert((ram_flags & ~RAM_SHARED) == 0);
+    return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
 }
 
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
@@ -2163,8 +2159,8 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
                                                      void *host),
                                      MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, maxsz, resized, NULL, true,
-                                   false, mr, errp);
+    return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
+                                   RAM_RESIZEABLE, mr, errp);
 }
 
 static void reclaim_ramblock(RAMBlock *block)
-- 
2.29.2




* [PATCH v3 09/12] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (7 preceding siblings ...)
  2021-03-08 15:05 ` [PATCH v3 08/12] softmmu/memory: Pass ram_flags into memory_region_init_ram_shared_nomigrate() David Hildenbrand
@ 2021-03-08 15:05 ` David Hildenbrand
  2021-03-09 20:04   ` Peter Xu
  2021-03-08 15:05   ` David Hildenbrand
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

Let's introduce a new set of flags that abstracts the mmap logic and
replaces our current set of bools, to prepare for another flag.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/qemu/mmap-alloc.h | 17 +++++++++++------
 softmmu/physmem.c         |  8 +++++---
 util/mmap-alloc.c         | 14 +++++++-------
 util/oslib-posix.c        |  3 ++-
 4 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 456ff87df1..55664ea9f3 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -6,6 +6,15 @@ size_t qemu_fd_getpagesize(int fd);
 
 size_t qemu_mempath_getpagesize(const char *mem_path);
 
+/* Map PROT_READ instead of PROT_READ|PROT_WRITE. */
+#define QEMU_RAM_MMAP_READONLY      (1 << 0)
+
+/* Map MAP_SHARED instead of MAP_PRIVATE. */
+#define QEMU_RAM_MMAP_SHARED        (1 << 1)
+
+/* Map MAP_SYNC|MAP_SHARED_VALIDATE if possible, fallback and warn otherwise. */
+#define QEMU_RAM_MMAP_PMEM          (1 << 2)
+
 /**
  * qemu_ram_mmap: mmap the specified file or device.
  *
@@ -14,9 +23,7 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
  *  @size: the number of bytes to be mmaped
  *  @align: if not zero, specify the alignment of the starting mapping address;
  *          otherwise, the alignment in use will be determined by QEMU.
- *  @readonly: true for a read-only mapping, false for read/write.
- *  @shared: map has RAM_SHARED flag.
- *  @is_pmem: map has RAM_PMEM flag.
+ *  @mmap_flags: QEMU_RAM_MMAP_* flags
  *  @map_offset: map starts at offset of map_offset from the start of fd
  *
  * Return:
@@ -26,9 +33,7 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
 void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
-                    bool readonly,
-                    bool shared,
-                    bool is_pmem,
+                    uint32_t mmap_flags,
                     off_t map_offset);
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size);
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 8f90cb4cd2..ec7a382ccd 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1533,6 +1533,7 @@ static void *file_ram_alloc(RAMBlock *block,
                             off_t offset,
                             Error **errp)
 {
+    uint32_t mmap_flags;
     void *area;
 
     block->page_size = qemu_fd_getpagesize(fd);
@@ -1580,9 +1581,10 @@ static void *file_ram_alloc(RAMBlock *block,
         perror("ftruncate");
     }
 
-    area = qemu_ram_mmap(fd, memory, block->mr->align, readonly,
-                         block->flags & RAM_SHARED, block->flags & RAM_PMEM,
-                         offset);
+    mmap_flags = readonly ? QEMU_RAM_MMAP_READONLY : 0;
+    mmap_flags |= (block->flags & RAM_SHARED) ? QEMU_RAM_MMAP_SHARED : 0;
+    mmap_flags |= (block->flags & RAM_PMEM) ? QEMU_RAM_MMAP_PMEM : 0;
+    area = qemu_ram_mmap(fd, memory, block->mr->align, mmap_flags, offset);
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
                          "unable to map backing store for guest RAM");
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 0e2bd7bc0e..bd8f7ab547 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -118,9 +118,12 @@ static void *mmap_reserve(size_t size, int fd)
  * Activate memory in a reserved region from the given fd (if any), to make
  * it accessible.
  */
-static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly,
-                           bool shared, bool is_pmem, off_t map_offset)
+static void *mmap_activate(void *ptr, size_t size, int fd, uint32_t mmap_flags,
+                           off_t map_offset)
 {
+    const bool readonly = mmap_flags & QEMU_RAM_MMAP_READONLY;
+    const bool shared = mmap_flags & QEMU_RAM_MMAP_SHARED;
+    const bool is_pmem = mmap_flags & QEMU_RAM_MMAP_PMEM;
     const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
     int map_sync_flags = 0;
     int flags = MAP_FIXED;
@@ -173,9 +176,7 @@ static inline size_t mmap_guard_pagesize(int fd)
 void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
-                    bool readonly,
-                    bool shared,
-                    bool is_pmem,
+                    uint32_t mmap_flags,
                     off_t map_offset)
 {
     const size_t guard_pagesize = mmap_guard_pagesize(fd);
@@ -199,8 +200,7 @@ void *qemu_ram_mmap(int fd,
 
     offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
 
-    ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem,
-                        map_offset);
+    ptr = mmap_activate(guardptr + offset, size, fd, mmap_flags, map_offset);
     if (ptr == MAP_FAILED) {
         munmap(guardptr, total);
         return MAP_FAILED;
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 36820fec16..1d250416f1 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -229,8 +229,9 @@ void *qemu_memalign(size_t alignment, size_t size)
 /* alloc shared memory pages */
 void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
 {
+    const uint32_t mmap_flags = shared ? QEMU_RAM_MMAP_SHARED : 0;
     size_t align = QEMU_VMALLOC_ALIGN;
-    void *ptr = qemu_ram_mmap(-1, size, align, false, shared, false, 0);
+    void *ptr = qemu_ram_mmap(-1, size, align, mmap_flags, 0);
 
     if (ptr == MAP_FAILED) {
         return NULL;
-- 
2.29.2




* [PATCH v3 10/12] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
@ 2021-03-08 15:05   ` David Hildenbrand
  2021-03-08 15:05 ` [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory David Hildenbrand
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Peter Xu, Michael S. Tsirkin, Eduardo Habkost,
	Dr. David Alan Gilbert, Richard Henderson, Paolo Bonzini,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Stefan Hajnoczi, Murilo Opsfelder Araujo, Greg Kurz,
	Liam Merwick, Marcel Apfelbaum, Christian Borntraeger,
	Cornelia Huck, Halil Pasic, Igor Kotrasinski, Juan Quintela,
	Stefan Weil, Thomas Huth, kvm, qemu-s390x

Let's introduce RAM_NORESERVE, allowing mmap'ing with MAP_NORESERVE. The
new flag has the following semantics:

  RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or
  huge pages on Linux) is skipped: QEMU will bail out if this is not
  supported. When not set, the OS might reserve swap space (or huge pages
  on Linux), depending on OS support.

Allow passing it into:
- memory_region_init_ram_nomigrate()
- memory_region_init_resizeable_ram()
- memory_region_init_ram_from_file()

... and teach qemu_ram_mmap() and qemu_anon_ram_alloc() about the flag.
Bail out if the flag is not supported, which is the case right now for
both POSIX and win32. We will add the POSIX mmap implementation next and
allow specifying RAM_NORESERVE via memory backends.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

Cc: Juan Quintela <quintela@redhat.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Stefan Weil <sw@weilnetz.de>
Cc: kvm@vger.kernel.org
Cc: qemu-s390x@nongnu.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/cpu-common.h |  1 +
 include/exec/memory.h     | 16 +++++++++++++---
 include/exec/ram_addr.h   |  3 ++-
 include/qemu/mmap-alloc.h |  3 +++
 include/qemu/osdep.h      |  3 ++-
 migration/ram.c           |  3 +--
 softmmu/physmem.c         | 15 +++++++++++----
 util/mmap-alloc.c         |  6 ++++++
 util/oslib-posix.c        |  6 ++++--
 util/oslib-win32.c        | 13 ++++++++++++-
 10 files changed, 55 insertions(+), 14 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 5a0a2d93e0..38a47ad4ac 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -58,6 +58,7 @@ void *qemu_ram_get_host_addr(RAMBlock *rb);
 ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
+bool qemu_ram_is_noreserve(RAMBlock *rb);
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
 void qemu_ram_set_uf_zeroable(RAMBlock *rb);
 bool qemu_ram_is_migratable(RAMBlock *rb);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2d97bdf59c..1369497415 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -155,6 +155,14 @@ typedef struct IOMMUTLBEvent {
  */
 #define RAM_UF_WRITEPROTECT (1 << 6)
 
+/*
+ * RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge
+ * pages on Linux) is skipped: will bail out if not supported. When not set,
+ * OS might reserve swap space (or huge pages on Linux), depending on OS
+ * support.
+ */
+#define RAM_NORESERVE (1 << 7)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
@@ -913,7 +921,7 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
  * @name: Region name, becomes part of RAMBlock name used in migration stream
  *        must be unique within any device
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_NORESERVE.
  * @errp: pointer to Error*, to store an error if it happens.
  *
  * Note that this function does not do anything to cause the data in the
@@ -967,7 +975,8 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  *         (getpagesize()) will be used.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *             RAM_NORESERVE.
  * @path: the path in which to allocate the RAM.
  * @readonly: true to open @path for reading, false for read/write.
  * @errp: pointer to Error*, to store an error if it happens.
@@ -993,7 +1002,8 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @owner: the object that tracks the region's reference count
  * @name: the name of the region.
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *             RAM_NORESERVE.
  * @fd: the fd to mmap.
  * @offset: offset within the file referenced by fd
  * @errp: pointer to Error*, to store an error if it happens.
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 6d4513f8e2..551876bed0 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -104,7 +104,8 @@ long qemu_maxrampagesize(void);
  * Parameters:
  *  @size: the size in bytes of the ram block
  *  @mr: the memory region where the ram block is
- *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *              RAM_NORESERVE.
  *  @mem_path or @fd: specify the backing file or device
  *  @readonly: true to open @path for reading, false for read/write.
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 55664ea9f3..6ac05d70d4 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -15,6 +15,9 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
 /* Map MAP_SYNC|MAP_SHARED_VALIDATE if possible, fallback and warn otherwise. */
 #define QEMU_RAM_MMAP_PMEM          (1 << 2)
 
+/* Map MAP_NORESERVE and fail if not effective. */
+#define QEMU_RAM_MMAP_NORESERVE     (1 << 3)
+
 /**
  * qemu_ram_mmap: mmap the specified file or device.
  *
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index ba15be9c56..d6d8ef0999 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -343,7 +343,8 @@ extern int daemon(int, int);
 int qemu_daemon(int nochdir, int noclose);
 void *qemu_try_memalign(size_t alignment, size_t size);
 void *qemu_memalign(size_t alignment, size_t size);
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared);
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
+                          bool noreserve);
 void qemu_vfree(void *ptr);
 void qemu_anon_ram_free(void *ptr, size_t size);
 
diff --git a/migration/ram.c b/migration/ram.c
index 72143da0ac..dd8daad386 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3322,8 +3322,7 @@ int colo_init_ram_cache(void)
     WITH_RCU_READ_LOCK_GUARD() {
         RAMBLOCK_FOREACH_NOT_IGNORED(block) {
             block->colo_cache = qemu_anon_ram_alloc(block->used_length,
-                                                    NULL,
-                                                    false);
+                                                    NULL, false, false);
             if (!block->colo_cache) {
                 error_report("%s: Can't alloc memory for COLO cache of block %s,"
                              "size 0x" RAM_ADDR_FMT, __func__, block->idstr,
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index ec7a382ccd..dcc1fb74aa 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1584,6 +1584,7 @@ static void *file_ram_alloc(RAMBlock *block,
     mmap_flags = readonly ? QEMU_RAM_MMAP_READONLY : 0;
     mmap_flags |= (block->flags & RAM_SHARED) ? QEMU_RAM_MMAP_SHARED : 0;
     mmap_flags |= (block->flags & RAM_PMEM) ? QEMU_RAM_MMAP_PMEM : 0;
+    mmap_flags |= (block->flags & RAM_NORESERVE) ? QEMU_RAM_MMAP_NORESERVE : 0;
     area = qemu_ram_mmap(fd, memory, block->mr->align, mmap_flags, offset);
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
@@ -1704,6 +1705,11 @@ bool qemu_ram_is_shared(RAMBlock *rb)
     return rb->flags & RAM_SHARED;
 }
 
+bool qemu_ram_is_noreserve(RAMBlock *rb)
+{
+    return rb->flags & RAM_NORESERVE;
+}
+
 /* Note: Only set at the start of postcopy */
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
 {
@@ -1932,6 +1938,7 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
 static void ram_block_add(RAMBlock *new_block, Error **errp)
 {
     const bool shared = qemu_ram_is_shared(new_block);
+    const bool noreserve = qemu_ram_is_noreserve(new_block);
     RAMBlock *block;
     RAMBlock *last_block = NULL;
     ram_addr_t old_ram_size, new_ram_size;
@@ -1954,7 +1961,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         } else {
             new_block->host = qemu_anon_ram_alloc(new_block->max_length,
                                                   &new_block->mr->align,
-                                                  shared);
+                                                  shared, noreserve);
             if (!new_block->host) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
@@ -2025,7 +2032,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     int64_t file_size, file_align;
 
     /* Just support these ram flags by now. */
-    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE)) == 0);
 
     if (xen_enabled()) {
         error_setg(errp, "-mem-path not supported with Xen");
@@ -2116,7 +2123,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     RAMBlock *new_block;
     Error *local_err = NULL;
 
-    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_NORESERVE)) == 0);
 
     size = HOST_PAGE_ALIGN(size);
     max_size = HOST_PAGE_ALIGN(max_size);
@@ -2151,7 +2158,7 @@ RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
 RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                          MemoryRegion *mr, Error **errp)
 {
-    assert((ram_flags & ~RAM_SHARED) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
     return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
 }
 
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index bd8f7ab547..ecace41ad5 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
+#include "qemu/error-report.h"
 
 #define HUGETLBFS_MAGIC       0x958458f6
 
@@ -183,6 +184,11 @@ void *qemu_ram_mmap(int fd,
     size_t offset, total;
     void *ptr, *guardptr;
 
+    if (mmap_flags & QEMU_RAM_MMAP_NORESERVE) {
+        error_report("Skipping reservation of swap space is not supported");
+        return MAP_FAILED;
+    }
+
     /*
      * Note: this always allocates at least one extra page of virtual address
      * space, even if size is already aligned.
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 1d250416f1..eab92fcafe 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -227,9 +227,11 @@ void *qemu_memalign(size_t alignment, size_t size)
 }
 
 /* alloc shared memory pages */
-void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared,
+                          bool noreserve)
 {
-    const uint32_t mmap_flags = shared ? QEMU_RAM_MMAP_SHARED : 0;
+    const uint32_t mmap_flags = (shared ? QEMU_RAM_MMAP_SHARED : 0) |
+                                (noreserve ? QEMU_RAM_MMAP_NORESERVE : 0);
     size_t align = QEMU_VMALLOC_ALIGN;
     void *ptr = qemu_ram_mmap(-1, size, align, mmap_flags, 0);
 
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index f68b8012bb..8cafe44179 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -39,6 +39,7 @@
 #include "trace.h"
 #include "qemu/sockets.h"
 #include "qemu/cutils.h"
+#include "qemu/error-report.h"
 #include <malloc.h>
 
 /* this must come after including "trace.h" */
@@ -77,10 +78,20 @@ static int get_allocation_granularity(void)
     return system_info.dwAllocationGranularity;
 }
 
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
+                          bool noreserve)
 {
     void *ptr;
 
+    if (noreserve) {
+        /*
+         * We need a MEM_COMMIT before accessing any memory in a MEM_RESERVE
+         * area; we cannot easily mimic POSIX MAP_NORESERVE semantics.
+         */
+        error_report("Skipping reservation of swap space is not supported");
+        return NULL;
+    }
+
     ptr = VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE);
     trace_qemu_anon_ram_alloc(size, ptr);
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 10/12] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
@ 2021-03-08 15:05   ` David Hildenbrand
  0 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Kotrasinski, kvm, Michael S. Tsirkin, Thomas Huth, Peter Xu,
	Paolo Bonzini, Juan Quintela, David Hildenbrand, Halil Pasic,
	Christian Borntraeger, Murilo Opsfelder Araujo,
	Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Eduardo Habkost, Stefan Weil,
	Richard Henderson, Dr. David Alan Gilbert, Greg Kurz, qemu-s390x,
	Stefan Hajnoczi, Igor Mammedov, Cornelia Huck

Let's introduce RAM_NORESERVE, allowing mmap'ing with MAP_NORESERVE. The
new flag has the following semantics:

  RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or
  huge pages on Linux) is skipped: will bail out if not supported. When not
  set, the OS might reserve swap space (or huge pages on Linux), depending
  on OS support.

Allow passing it into:
- memory_region_init_ram_nomigrate()
- memory_region_init_resizeable_ram()
- memory_region_init_ram_from_file()

... and teach qemu_ram_mmap() and qemu_anon_ram_alloc() about the flag.
Bail out if the flag is not supported, which is the case right now for
both POSIX and win32. We will add the POSIX mmap implementation next and
allow specifying RAM_NORESERVE via memory backends.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

Cc: Juan Quintela <quintela@redhat.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Stefan Weil <sw@weilnetz.de>
Cc: kvm@vger.kernel.org
Cc: qemu-s390x@nongnu.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/cpu-common.h |  1 +
 include/exec/memory.h     | 16 +++++++++++++---
 include/exec/ram_addr.h   |  3 ++-
 include/qemu/mmap-alloc.h |  3 +++
 include/qemu/osdep.h      |  3 ++-
 migration/ram.c           |  3 +--
 softmmu/physmem.c         | 15 +++++++++++----
 util/mmap-alloc.c         |  6 ++++++
 util/oslib-posix.c        |  6 ++++--
 util/oslib-win32.c        | 13 ++++++++++++-
 10 files changed, 55 insertions(+), 14 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 5a0a2d93e0..38a47ad4ac 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -58,6 +58,7 @@ void *qemu_ram_get_host_addr(RAMBlock *rb);
 ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
+bool qemu_ram_is_noreserve(RAMBlock *rb);
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
 void qemu_ram_set_uf_zeroable(RAMBlock *rb);
 bool qemu_ram_is_migratable(RAMBlock *rb);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2d97bdf59c..1369497415 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -155,6 +155,14 @@ typedef struct IOMMUTLBEvent {
  */
 #define RAM_UF_WRITEPROTECT (1 << 6)
 
+/*
+ * RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge
+ * pages on Linux) is skipped: will bail out if not supported. When not set,
+ * the OS might reserve swap space (or huge pages on Linux), depending on OS
+ * support.
+ */
+#define RAM_NORESERVE (1 << 7)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
@@ -913,7 +921,7 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
  * @name: Region name, becomes part of RAMBlock name used in migration stream
  *        must be unique within any device
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_NORESERVE.
  * @errp: pointer to Error*, to store an error if it happens.
  *
  * Note that this function does not do anything to cause the data in the
@@ -967,7 +975,8 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  *         (getpagesize()) will be used.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *             RAM_NORESERVE.
  * @path: the path in which to allocate the RAM.
  * @readonly: true to open @path for reading, false for read/write.
  * @errp: pointer to Error*, to store an error if it happens.
@@ -993,7 +1002,8 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @owner: the object that tracks the region's reference count
  * @name: the name of the region.
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *             RAM_NORESERVE.
  * @fd: the fd to mmap.
  * @offset: offset within the file referenced by fd
  * @errp: pointer to Error*, to store an error if it happens.
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 6d4513f8e2..551876bed0 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -104,7 +104,8 @@ long qemu_maxrampagesize(void);
  * Parameters:
  *  @size: the size in bytes of the ram block
  *  @mr: the memory region where the ram block is
- *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *              RAM_NORESERVE.
  *  @mem_path or @fd: specify the backing file or device
  *  @readonly: true to open @path for reading, false for read/write.
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 55664ea9f3..6ac05d70d4 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -15,6 +15,9 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
 /* Map MAP_SYNC|MAP_SHARED_VALIDATE if possible, fallback and warn otherwise. */
 #define QEMU_RAM_MMAP_PMEM          (1 << 2)
 
+/* Map MAP_NORESERVE and fail if not effective. */
+#define QEMU_RAM_MMAP_NORESERVE     (1 << 3)
+
 /**
  * qemu_ram_mmap: mmap the specified file or device.
  *
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index ba15be9c56..d6d8ef0999 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -343,7 +343,8 @@ extern int daemon(int, int);
 int qemu_daemon(int nochdir, int noclose);
 void *qemu_try_memalign(size_t alignment, size_t size);
 void *qemu_memalign(size_t alignment, size_t size);
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared);
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
+                          bool noreserve);
 void qemu_vfree(void *ptr);
 void qemu_anon_ram_free(void *ptr, size_t size);
 
diff --git a/migration/ram.c b/migration/ram.c
index 72143da0ac..dd8daad386 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3322,8 +3322,7 @@ int colo_init_ram_cache(void)
     WITH_RCU_READ_LOCK_GUARD() {
         RAMBLOCK_FOREACH_NOT_IGNORED(block) {
             block->colo_cache = qemu_anon_ram_alloc(block->used_length,
-                                                    NULL,
-                                                    false);
+                                                    NULL, false, false);
             if (!block->colo_cache) {
                 error_report("%s: Can't alloc memory for COLO cache of block %s,"
                              "size 0x" RAM_ADDR_FMT, __func__, block->idstr,
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index ec7a382ccd..dcc1fb74aa 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1584,6 +1584,7 @@ static void *file_ram_alloc(RAMBlock *block,
     mmap_flags = readonly ? QEMU_RAM_MMAP_READONLY : 0;
     mmap_flags |= (block->flags & RAM_SHARED) ? QEMU_RAM_MMAP_SHARED : 0;
     mmap_flags |= (block->flags & RAM_PMEM) ? QEMU_RAM_MMAP_PMEM : 0;
+    mmap_flags |= (block->flags & RAM_NORESERVE) ? QEMU_RAM_MMAP_NORESERVE : 0;
     area = qemu_ram_mmap(fd, memory, block->mr->align, mmap_flags, offset);
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
@@ -1704,6 +1705,11 @@ bool qemu_ram_is_shared(RAMBlock *rb)
     return rb->flags & RAM_SHARED;
 }
 
+bool qemu_ram_is_noreserve(RAMBlock *rb)
+{
+    return rb->flags & RAM_NORESERVE;
+}
+
 /* Note: Only set at the start of postcopy */
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
 {
@@ -1932,6 +1938,7 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
 static void ram_block_add(RAMBlock *new_block, Error **errp)
 {
     const bool shared = qemu_ram_is_shared(new_block);
+    const bool noreserve = qemu_ram_is_noreserve(new_block);
     RAMBlock *block;
     RAMBlock *last_block = NULL;
     ram_addr_t old_ram_size, new_ram_size;
@@ -1954,7 +1961,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         } else {
             new_block->host = qemu_anon_ram_alloc(new_block->max_length,
                                                   &new_block->mr->align,
-                                                  shared);
+                                                  shared, noreserve);
             if (!new_block->host) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
@@ -2025,7 +2032,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     int64_t file_size, file_align;
 
     /* Just support these ram flags by now. */
-    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE)) == 0);
 
     if (xen_enabled()) {
         error_setg(errp, "-mem-path not supported with Xen");
@@ -2116,7 +2123,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     RAMBlock *new_block;
     Error *local_err = NULL;
 
-    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_NORESERVE)) == 0);
 
     size = HOST_PAGE_ALIGN(size);
     max_size = HOST_PAGE_ALIGN(max_size);
@@ -2151,7 +2158,7 @@ RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
 RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                          MemoryRegion *mr, Error **errp)
 {
-    assert((ram_flags & ~RAM_SHARED) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
     return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
 }
 
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index bd8f7ab547..ecace41ad5 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
+#include "qemu/error-report.h"
 
 #define HUGETLBFS_MAGIC       0x958458f6
 
@@ -183,6 +184,11 @@ void *qemu_ram_mmap(int fd,
     size_t offset, total;
     void *ptr, *guardptr;
 
+    if (mmap_flags & QEMU_RAM_MMAP_NORESERVE) {
+        error_report("Skipping reservation of swap space is not supported");
+        return MAP_FAILED;
+    }
+
     /*
      * Note: this always allocates at least one extra page of virtual address
      * space, even if size is already aligned.
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 1d250416f1..eab92fcafe 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -227,9 +227,11 @@ void *qemu_memalign(size_t alignment, size_t size)
 }
 
 /* alloc shared memory pages */
-void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared,
+                          bool noreserve)
 {
-    const uint32_t mmap_flags = shared ? QEMU_RAM_MMAP_SHARED : 0;
+    const uint32_t mmap_flags = (shared ? QEMU_RAM_MMAP_SHARED : 0) |
+                                (noreserve ? QEMU_RAM_MMAP_NORESERVE : 0);
     size_t align = QEMU_VMALLOC_ALIGN;
     void *ptr = qemu_ram_mmap(-1, size, align, mmap_flags, 0);
 
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index f68b8012bb..8cafe44179 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -39,6 +39,7 @@
 #include "trace.h"
 #include "qemu/sockets.h"
 #include "qemu/cutils.h"
+#include "qemu/error-report.h"
 #include <malloc.h>
 
 /* this must come after including "trace.h" */
@@ -77,10 +78,20 @@ static int get_allocation_granularity(void)
     return system_info.dwAllocationGranularity;
 }
 
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
+                          bool noreserve)
 {
     void *ptr;
 
+    if (noreserve) {
+        /*
+         * We need a MEM_COMMIT before accessing any memory in a MEM_RESERVE
+         * area; we cannot easily mimic POSIX MAP_NORESERVE semantics.
+         */
+        error_report("Skipping reservation of swap space is not supported");
+        return NULL;
+    }
+
     ptr = VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE);
     trace_qemu_anon_ram_alloc(size, ptr);
 
-- 
2.29.2




* [PATCH v3 11/12] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (9 preceding siblings ...)
  2021-03-08 15:05   ` David Hildenbrand
@ 2021-03-08 15:05 ` David Hildenbrand
  2021-03-10 10:28   ` David Hildenbrand
  2021-03-08 15:06 ` [PATCH v3 12/12] hostmem: Wire up RAM_NORESERVE via "reserve" property David Hildenbrand
  11 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

Let's support RAM_NORESERVE via MAP_NORESERVE. At least on Linux,
the flag has no effect on most shared mappings - except for hugetlbfs
and anonymous memory.

Linux man page:
  "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
  space is reserved, one has the guarantee that it is possible to modify
  the mapping. When swap space is not reserved one might get SIGSEGV
  upon a write if no physical memory is available. See also the discussion
  of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
  2.6, this flag had effect only for private writable mappings."

Note that the "guarantee" part is wrong with memory overcommit in Linux.

Also, in Linux hugetlbfs is treated differently - we configure reservation
of huge pages from the pool, not reservation of swap space (huge pages
cannot be swapped).

The rough behavior is [1]:
a) !Hugetlbfs:

  1) Without MAP_NORESERVE *or* with memory overcommit under Linux
     disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
     accounting/reservation happens:
      For a file backed map
       SHARED or READ-only - 0 cost (the file is the map not swap)
       PRIVATE WRITABLE - size of mapping per instance

      For an anonymous or /dev/zero map
       SHARED   - size of mapping
       PRIVATE READ-only - 0 cost (but of little use)
       PRIVATE WRITABLE - size of mapping per instance

  2) With MAP_NORESERVE, no accounting/reservation happens.

b) Hugetlbfs:

  1) Without MAP_NORESERVE, huge pages are reserved.

  2) With MAP_NORESERVE, no huge pages are reserved.

Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
to configure it for !hugetlbfs globally; this toggle now allows
configuring it in a more fine-grained way, not just for the whole system.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

[1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/physmem.c |  1 +
 util/mmap-alloc.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index dcc1fb74aa..199c5a4985 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2229,6 +2229,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 flags = MAP_FIXED;
                 flags |= block->flags & RAM_SHARED ?
                          MAP_SHARED : MAP_PRIVATE;
+                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
                 if (block->fd >= 0) {
                     area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
                                 flags, block->fd, offset);
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index ecace41ad5..c511a68bbe 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
+#include "qemu/cutils.h"
 #include "qemu/error-report.h"
 
 #define HUGETLBFS_MAGIC       0x958458f6
@@ -125,6 +126,7 @@ static void *mmap_activate(void *ptr, size_t size, int fd, uint32_t mmap_flags,
     const bool readonly = mmap_flags & QEMU_RAM_MMAP_READONLY;
     const bool shared = mmap_flags & QEMU_RAM_MMAP_SHARED;
     const bool is_pmem = mmap_flags & QEMU_RAM_MMAP_PMEM;
+    const bool noreserve = mmap_flags & QEMU_RAM_MMAP_NORESERVE;
     const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
     int map_sync_flags = 0;
     int flags = MAP_FIXED;
@@ -132,6 +134,7 @@ static void *mmap_activate(void *ptr, size_t size, int fd, uint32_t mmap_flags,
 
     flags |= fd == -1 ? MAP_ANONYMOUS : 0;
     flags |= shared ? MAP_SHARED : MAP_PRIVATE;
+    flags |= noreserve ? MAP_NORESERVE : 0;
     if (shared && is_pmem) {
         map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
     }
@@ -174,6 +177,66 @@ static inline size_t mmap_guard_pagesize(int fd)
 #endif
 }
 
+#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory"
+static bool map_noreserve_effective(int fd, uint32_t mmap_flags)
+{
+#if defined(__linux__)
+    const bool readonly = mmap_flags & QEMU_RAM_MMAP_READONLY;
+    const bool shared = mmap_flags & QEMU_RAM_MMAP_SHARED;
+    gchar *content = NULL;
+    const char *endptr;
+    unsigned int tmp;
+
+    /*
+     * hugetlb accounting is different from ordinary swap reservation:
+     * a) Hugetlb pages from the pool are reserved for both private and
+     *    shared mappings. For shared mappings, reservations are tracked
+     *    per file -- all mappers have to specify MAP_NORESERVE.
+     * b) MAP_NORESERVE is not affected by /proc/sys/vm/overcommit_memory.
+     */
+    if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) {
+        return true;
+    }
+
+    /*
+     * Accountable mappings in the kernel that can be affected by MAP_NORESERVE
+     * are private writable mappings (see mm/mmap.c:accountable_mapping() in
+     * Linux). For all shared or readonly mappings, MAP_NORESERVE is always
+     * implicitly active -- no reservation; this includes shmem. The only
+     * exception is shared anonymous memory, it is accounted like private
+     * anonymous memory.
+     */
+    if (readonly || (shared && fd >= 0)) {
+        return true;
+    }
+
+    /*
+     * MAP_NORESERVE is globally ignored for private writable mappings when
+     * overcommit is set to "never". Sparse memory regions aren't really
+     * possible in this system configuration.
+     *
+     * Bail out now instead of silently committing way more memory than
+     * currently desired by the user.
+     */
+    if (g_file_get_contents(OVERCOMMIT_MEMORY_PATH, &content, NULL, NULL) &&
+        !qemu_strtoui(content, &endptr, 0, &tmp) &&
+        (!endptr || *endptr == '\n')) {
+        if (tmp == 2) {
+            error_report("Skipping reservation of swap space is not supported:"
+                         " \"" OVERCOMMIT_MEMORY_PATH "\" is \"2\"");
+            return false;
+        }
+        return true;
+    }
+    /* this interface has been around since Linux 2.6 */
+    error_report("Skipping reservation of swap space is not supported:"
+                 " Could not read: \"" OVERCOMMIT_MEMORY_PATH "\"");
+    return false;
+#else
+    return true;
+#endif
+}
+
 void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
@@ -184,7 +247,8 @@ void *qemu_ram_mmap(int fd,
     size_t offset, total;
     void *ptr, *guardptr;
 
-    if (mmap_flags & QEMU_RAM_MMAP_NORESERVE) {
+    if (mmap_flags & QEMU_RAM_MMAP_NORESERVE &&
+        !map_noreserve_effective(fd, mmap_flags)) {
         error_report("Skipping reservation of swap space is not supported");
         return MAP_FAILED;
     }
-- 
2.29.2




* [PATCH v3 12/12] hostmem: Wire up RAM_NORESERVE via "reserve" property
  2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
                   ` (10 preceding siblings ...)
  2021-03-08 15:05 ` [PATCH v3 11/12] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE David Hildenbrand
@ 2021-03-08 15:06 ` David Hildenbrand
  11 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-08 15:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, Greg Kurz, Halil Pasic, Christian Borntraeger,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski, Juan Quintela

Let's provide a way to control the use of RAM_NORESERVE via memory
backends using the "reserve" property which defaults to true (old
behavior).

Only POSIX supports setting the flag (and Linux support is checked at
runtime, depending on the setting of "/proc/sys/vm/overcommit_memory").
Windows will bail out.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM. This essentially allows
avoiding having to set "/proc/sys/vm/overcommit_memory == 1" when using
virtio-mem, and also supporting hugetlbfs in the future.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem-file.c  | 11 ++++++-----
 backends/hostmem-memfd.c |  1 +
 backends/hostmem-ram.c   |  1 +
 backends/hostmem.c       | 33 +++++++++++++++++++++++++++++++++
 include/sysemu/hostmem.h |  2 +-
 5 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index b683da9daf..9d550e53d4 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -40,6 +40,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
                object_get_typename(OBJECT(backend)));
 #else
     HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
+    uint32_t ram_flags;
     gchar *name;
 
     if (!backend->size) {
@@ -52,11 +53,11 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     name = host_memory_backend_get_name(backend);
-    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
-                                     name,
-                                     backend->size, fb->align,
-                                     (backend->share ? RAM_SHARED : 0) |
-                                     (fb->is_pmem ? RAM_PMEM : 0),
+    ram_flags = backend->share ? RAM_SHARED : 0;
+    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
+    ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
+    memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
+                                     backend->size, fb->align, ram_flags,
                                      fb->mem_path, fb->readonly, errp);
     g_free(name);
 #endif
diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 93b5d1a4cf..f3436b623d 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 
     name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
+    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
                                    backend->size, ram_flags, fd, 0, errp);
     g_free(name);
diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index 741e701062..b8e55cdbd0 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -29,6 +29,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 
     name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
+    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
                                            backend->size, ram_flags, errp);
     g_free(name);
diff --git a/backends/hostmem.c b/backends/hostmem.c
index c6c1ff5b99..4e80162915 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -217,6 +217,11 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
     Error *local_err = NULL;
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
 
+    if (!backend->reserve && value) {
+        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
+        return;
+    }
+
     if (!host_memory_backend_mr_inited(backend)) {
         backend->prealloc = value;
         return;
@@ -268,6 +273,7 @@ static void host_memory_backend_init(Object *obj)
     /* TODO: convert access to globals to compat properties */
     backend->merge = machine_mem_merge(machine);
     backend->dump = machine_dump_guest_core(machine);
+    backend->reserve = true;
     backend->prealloc_threads = 1;
 }
 
@@ -426,6 +432,28 @@ static void host_memory_backend_set_share(Object *o, bool value, Error **errp)
     backend->share = value;
 }
 
+static bool host_memory_backend_get_reserve(Object *o, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+    return backend->reserve;
+}
+
+static void host_memory_backend_set_reserve(Object *o, bool value, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+    if (host_memory_backend_mr_inited(backend)) {
+        error_setg(errp, "cannot change property value");
+        return;
+    }
+    if (backend->prealloc && !value) {
+        error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
+        return;
+    }
+    backend->reserve = value;
+}
+
 static bool
 host_memory_backend_get_use_canonical_path(Object *obj, Error **errp)
 {
@@ -494,6 +522,11 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
         host_memory_backend_get_share, host_memory_backend_set_share);
     object_class_property_set_description(oc, "share",
         "Mark the memory as private to QEMU or shared");
+    object_class_property_add_bool(oc, "reserve",
+        host_memory_backend_get_reserve, host_memory_backend_set_reserve);
+    object_class_property_set_description(oc, "reserve",
+        "Reserve swap space (or huge pages under Linux) for the whole memory"
+        " backend, if supported by the OS.");
     /*
      * Do not delete/rename option. This option must be considered stable
      * (as if it didn't have the 'x-' prefix including deprecation period) as
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index df5644723a..9ff5c16963 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -64,7 +64,7 @@ struct HostMemoryBackend {
     /* protected */
     uint64_t size;
     bool merge, dump, use_canonical_path;
-    bool prealloc, is_mapped, share;
+    bool prealloc, is_mapped, share, reserve;
     uint32_t prealloc_threads;
     DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
     HostMemPolicy policy;
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 09/12] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
  2021-03-08 15:05 ` [PATCH v3 09/12] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap() David Hildenbrand
@ 2021-03-09 20:04   ` Peter Xu
  2021-03-09 20:27     ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Xu @ 2021-03-09 20:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Mon, Mar 08, 2021 at 04:05:57PM +0100, David Hildenbrand wrote:
> Let's introduce a new set of flags that abstract mmap logic and replace
> our current set of bools, to prepare for another flag.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/qemu/mmap-alloc.h | 17 +++++++++++------
>  softmmu/physmem.c         |  8 +++++---
>  util/mmap-alloc.c         | 14 +++++++-------
>  util/oslib-posix.c        |  3 ++-
>  4 files changed, 25 insertions(+), 17 deletions(-)
> 
> diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
> index 456ff87df1..55664ea9f3 100644
> --- a/include/qemu/mmap-alloc.h
> +++ b/include/qemu/mmap-alloc.h
> @@ -6,6 +6,15 @@ size_t qemu_fd_getpagesize(int fd);
>  
>  size_t qemu_mempath_getpagesize(const char *mem_path);
>  
> +/* Map PROT_READ instead of PROT_READ|PROT_WRITE. */
> +#define QEMU_RAM_MMAP_READONLY      (1 << 0)
> +
> +/* Map MAP_SHARED instead of MAP_PRIVATE. */
> +#define QEMU_RAM_MMAP_SHARED        (1 << 1)
> +
> +/* Map MAP_SYNC|MAP_SHARED_VALIDATE if possible, fallback and warn otherwise. */
> +#define QEMU_RAM_MMAP_PMEM          (1 << 2)

Sorry to speak late - I just noticed that is_pmem can actually be converted too
with "MAP_SYNC | MAP_SHARED_VALIDATE".  We can even define MAP_PMEM_EXTRA for
use within qemu if we want.  Then we can avoid one layer of QEMU_RAM_* by
directly using MAP_*, I think?

-- 
Peter Xu




* Re: [PATCH v3 09/12] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
  2021-03-09 20:04   ` Peter Xu
@ 2021-03-09 20:27     ` David Hildenbrand
  2021-03-09 20:58       ` Peter Xu
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2021-03-09 20:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marcel Apfelbaum, Murilo Opsfelder Araujo, Cornelia Huck,
	Eduardo Habkost, Michael S. Tsirkin, Stefan Weil,
	David Hildenbrand, Richard Henderson, Dr. David Alan Gilbert,
	Juan Quintela, qemu-devel, Halil Pasic, Christian Borntraeger,
	Greg Kurz, Stefan Hajnoczi, Igor Mammedov, Thomas Huth,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski


> Am 09.03.2021 um 21:04 schrieb Peter Xu <peterx@redhat.com>:
> 
> On Mon, Mar 08, 2021 at 04:05:57PM +0100, David Hildenbrand wrote:
>> Let's introduce a new set of flags that abstract mmap logic and replace
>> our current set of bools, to prepare for another flag.
>> 
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> include/qemu/mmap-alloc.h | 17 +++++++++++------
>> softmmu/physmem.c         |  8 +++++---
>> util/mmap-alloc.c         | 14 +++++++-------
>> util/oslib-posix.c        |  3 ++-
>> 4 files changed, 25 insertions(+), 17 deletions(-)
>> 
>> diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
>> index 456ff87df1..55664ea9f3 100644
>> --- a/include/qemu/mmap-alloc.h
>> +++ b/include/qemu/mmap-alloc.h
>> @@ -6,6 +6,15 @@ size_t qemu_fd_getpagesize(int fd);
>> 
>> size_t qemu_mempath_getpagesize(const char *mem_path);
>> 
>> +/* Map PROT_READ instead of PROT_READ|PROT_WRITE. */
>> +#define QEMU_RAM_MMAP_READONLY      (1 << 0)
>> +
>> +/* Map MAP_SHARED instead of MAP_PRIVATE. */
>> +#define QEMU_RAM_MMAP_SHARED        (1 << 1)
>> +
>> +/* Map MAP_SYNC|MAP_SHARED_VALIDATE if possible, fallback and warn otherwise. */
>> +#define QEMU_RAM_MMAP_PMEM          (1 << 2)
> 
> Sorry to speak late - I just noticed that is_pmem can actually be converted too
> with "MAP_SYNC | MAP_SHARED_VALIDATE".  We can even define MAP_PMEM_EXTRA for
> use within qemu if we want.  Then we can avoid one layer of QEMU_RAM_* by
> directly using MAP_*, I think?
> 

No problem :) I don't think passing in random MAP_ flags is a good interface (we would at least need an allow list).

I like the abstraction / explicit semantics of QEMU_RAM_MMAP_PMEM as spelled out in the comment. Doing the fallback when passing in the mmap flags is a little ugly. We could do the fallback in the caller, I think I remember there is only a single call site.

PROT_READ won't be covered as well, not sure if passing in protections improves the interface.

Long story short, I like the abstraction provided by these flags, only exporting what we actually support/abstracting it, and setting some MAP_ flags automatically (MAP_PRIVATE, MAP_ANON) instead of having to spell that out in the caller.


> -- 
> Peter Xu
> 




* Re: [PATCH v3 09/12] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
  2021-03-09 20:27     ` David Hildenbrand
@ 2021-03-09 20:58       ` Peter Xu
  2021-03-10  8:41         ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Xu @ 2021-03-09 20:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Tue, Mar 09, 2021 at 09:27:10PM +0100, David Hildenbrand wrote:
> 
> > Am 09.03.2021 um 21:04 schrieb Peter Xu <peterx@redhat.com>:
> > 
> > On Mon, Mar 08, 2021 at 04:05:57PM +0100, David Hildenbrand wrote:
> >> Let's introduce a new set of flags that abstract mmap logic and replace
> >> our current set of bools, to prepare for another flag.
> >> 
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> >> ---
> >> include/qemu/mmap-alloc.h | 17 +++++++++++------
> >> softmmu/physmem.c         |  8 +++++---
> >> util/mmap-alloc.c         | 14 +++++++-------
> >> util/oslib-posix.c        |  3 ++-
> >> 4 files changed, 25 insertions(+), 17 deletions(-)
> >> 
> >> diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
> >> index 456ff87df1..55664ea9f3 100644
> >> --- a/include/qemu/mmap-alloc.h
> >> +++ b/include/qemu/mmap-alloc.h
> >> @@ -6,6 +6,15 @@ size_t qemu_fd_getpagesize(int fd);
> >> 
> >> size_t qemu_mempath_getpagesize(const char *mem_path);
> >> 
> >> +/* Map PROT_READ instead of PROT_READ|PROT_WRITE. */
> >> +#define QEMU_RAM_MMAP_READONLY      (1 << 0)
> >> +
> >> +/* Map MAP_SHARED instead of MAP_PRIVATE. */
> >> +#define QEMU_RAM_MMAP_SHARED        (1 << 1)
> >> +
> >> +/* Map MAP_SYNC|MAP_SHARED_VALIDATE if possible, fallback and warn otherwise. */
> >> +#define QEMU_RAM_MMAP_PMEM          (1 << 2)
> > 
> > Sorry to speak late - I just noticed that is_pmem can actually be converted too
> > with "MAP_SYNC | MAP_SHARED_VALIDATE".  We can even define MAP_PMEM_EXTRA for
> > use within qemu if we want.  Then we can avoid one layer of QEMU_RAM_* by
> > directly using MAP_*, I think?
> > 
> 
> No problem :) I don't think passing in random MAP_ flags is a good interface (we would at least need an allow list).
> 
>  I like the abstraction / explicit semantics of QEMU_RAM_MMAP_PMEM as spelled out in the comment. Doing the fallback when passing in the mmap flags is a little ugly. We could do the fallback in the caller, I think I remember there is only a single call site.
> 
> PROT_READ won't be covered as well, not sure if passing in protections improves the interface.
> 
> Long story short, I like the abstraction provided by these flags, only exporting what we actually support/abstracting it, and setting some MAP_ flags automatically (MAP_PRIVATE, MAP_ANON) instead of having to spell that out in the caller.

Yeh the READONLY flag would be special, it will need to be separated from the
rest flags.  I'd keep my own preference, but if you really like the current
way, maybe at least move it to qemu/osdep.h?  So at least when someone needs a
cross-platform flag they'll show up - while mmap-alloc.h looks still only for
the posix world, then it'll be odd to introduce these flags only for posix even
if posix defined most of them.

At the meantime, maybe rename QEMU_RAM_MMAP_* to QEMU_MMAP_* too?  All of them
look applicable to no-RAM-backends too.

Thanks,

-- 
Peter Xu




* Re: [PATCH v3 09/12] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
  2021-03-09 20:58       ` Peter Xu
@ 2021-03-10  8:41         ` David Hildenbrand
  2021-03-10 10:11           ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2021-03-10  8:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 09.03.21 21:58, Peter Xu wrote:
> On Tue, Mar 09, 2021 at 09:27:10PM +0100, David Hildenbrand wrote:
>>
>>> Am 09.03.2021 um 21:04 schrieb Peter Xu <peterx@redhat.com>:
>>>
>>> On Mon, Mar 08, 2021 at 04:05:57PM +0100, David Hildenbrand wrote:
>>>> Let's introduce a new set of flags that abstract mmap logic and replace
>>>> our current set of bools, to prepare for another flag.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>> include/qemu/mmap-alloc.h | 17 +++++++++++------
>>>> softmmu/physmem.c         |  8 +++++---
>>>> util/mmap-alloc.c         | 14 +++++++-------
>>>> util/oslib-posix.c        |  3 ++-
>>>> 4 files changed, 25 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
>>>> index 456ff87df1..55664ea9f3 100644
>>>> --- a/include/qemu/mmap-alloc.h
>>>> +++ b/include/qemu/mmap-alloc.h
>>>> @@ -6,6 +6,15 @@ size_t qemu_fd_getpagesize(int fd);
>>>>
>>>> size_t qemu_mempath_getpagesize(const char *mem_path);
>>>>
>>>> +/* Map PROT_READ instead of PROT_READ|PROT_WRITE. */
>>>> +#define QEMU_RAM_MMAP_READONLY      (1 << 0)
>>>> +
>>>> +/* Map MAP_SHARED instead of MAP_PRIVATE. */
>>>> +#define QEMU_RAM_MMAP_SHARED        (1 << 1)
>>>> +
>>>> +/* Map MAP_SYNC|MAP_SHARED_VALIDATE if possible, fallback and warn otherwise. */
>>>> +#define QEMU_RAM_MMAP_PMEM          (1 << 2)
>>>
>>> Sorry to speak late - I just noticed that is_pmem can actually be converted too
>>> with "MAP_SYNC | MAP_SHARED_VALIDATE".  We can even define MAP_PMEM_EXTRA for
>>> use within qemu if we want.  Then we can avoid one layer of QEMU_RAM_* by
>>> directly using MAP_*, I think?
>>>
>>
>> No problem :) I don't think passing in random MAP_ flags is a good interface (we would at least need an allow list).
>>
>>   I like the abstraction / explicit semantics of QEMU_RAM_MMAP_PMEM as spelled out in the comment. Doing the fallback when passing in the mmap flags is a little ugly. We could do the fallback in the caller, I think I remember there is only a single call site.
>>
>> PROT_READ won't be covered as well, not sure if passing in protections improves the interface.
>>
>> Long story short, I like the abstraction provided by these flags, only exporting what we actually support/abstracting it, and setting some MAP_ flags automatically (MAP_PRIVATE, MAP_ANON) instead of having to spell that out in the caller.
> 
> Yeh the READONLY flag would be special, it will need to be separated from the
> rest flags.  I'd keep my own preference, but if you really like the current
> way, maybe at least move it to qemu/osdep.h?  So at least when someone needs a
> cross-platform flag they'll show up - while mmap-alloc.h looks still only for
> the posix world, then it'll be odd to introduce these flags only for posix even
>> if posix defined most of them.

I'll give it another thought today. I certainly want to avoid moving all 
that MAP_ flag and PROT_ logic to the callers. E.g., MAP_SHARED implies 
!MAP_PRIVATE. MAP_SYNC implies that we want MAP_SHARED_VALIDATE. fd < 0 
implies MAP_ANONYMOUS.

Maybe something like

/*
  * QEMU's MMAP abstraction to map guest RAM, taking care of alignment
  * requirements and guard pages.
  *
  * Supported flags: MAP_SHARED, MAP_SYNC
  *
  * Implicitly set flags:
  * - MAP_PRIVATE: When !MAP_SHARED and !MAP_SYNC
  * - MAP_ANONYMOUS: When fd < 0
  * - MAP_SHARED_VALIDATE: When MAP_SYNC
  *
  * If mapping with MAP_SYNC|MAP_SHARED_VALIDATE fails, fallback to
  * !MAP_SYNC|MAP_SHARED and warn.
  */
  void *qemu_ram_mmap(int fd,
                      size_t size,
                      size_t align,
                      bool readonly,
                      uint32_t mmap_flags,
                      off_t map_offset);


I also thought about introducing
	QEMU_MAP_READONLY 0x100000000ul

and using "uint64_t mmap_flags" - thoughts?

> 
> At the meantime, maybe rename QEMU_RAM_MMAP_* to QEMU_MMAP_* too?  All of them
> look applicable to no-RAM-backends too.

Hm, I don't think this is a good idea unless we would have something 
like qemu_mmap() - which I don't think we'll have in the near future.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v3 09/12] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
  2021-03-10  8:41         ` David Hildenbrand
@ 2021-03-10 10:11           ` David Hildenbrand
  2021-03-10 10:55             ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2021-03-10 10:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 10.03.21 09:41, David Hildenbrand wrote:
> On 09.03.21 21:58, Peter Xu wrote:
>> On Tue, Mar 09, 2021 at 09:27:10PM +0100, David Hildenbrand wrote:
>>>
>>>> Am 09.03.2021 um 21:04 schrieb Peter Xu <peterx@redhat.com>:
>>>>
>>>> On Mon, Mar 08, 2021 at 04:05:57PM +0100, David Hildenbrand wrote:
>>>>> Let's introduce a new set of flags that abstract mmap logic and replace
>>>>> our current set of bools, to prepare for another flag.
>>>>>
>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>> ---
>>>>> include/qemu/mmap-alloc.h | 17 +++++++++++------
>>>>> softmmu/physmem.c         |  8 +++++---
>>>>> util/mmap-alloc.c         | 14 +++++++-------
>>>>> util/oslib-posix.c        |  3 ++-
>>>>> 4 files changed, 25 insertions(+), 17 deletions(-)
>>>>>
>>>>> diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
>>>>> index 456ff87df1..55664ea9f3 100644
>>>>> --- a/include/qemu/mmap-alloc.h
>>>>> +++ b/include/qemu/mmap-alloc.h
>>>>> @@ -6,6 +6,15 @@ size_t qemu_fd_getpagesize(int fd);
>>>>>
>>>>> size_t qemu_mempath_getpagesize(const char *mem_path);
>>>>>
>>>>> +/* Map PROT_READ instead of PROT_READ|PROT_WRITE. */
>>>>> +#define QEMU_RAM_MMAP_READONLY      (1 << 0)
>>>>> +
>>>>> +/* Map MAP_SHARED instead of MAP_PRIVATE. */
>>>>> +#define QEMU_RAM_MMAP_SHARED        (1 << 1)
>>>>> +
>>>>> +/* Map MAP_SYNC|MAP_SHARED_VALIDATE if possible, fallback and warn otherwise. */
>>>>> +#define QEMU_RAM_MMAP_PMEM          (1 << 2)
>>>>
>>>> Sorry to speak late - I just noticed that is_pmem can actually be converted too
>>>> with "MAP_SYNC | MAP_SHARED_VALIDATE".  We can even define MAP_PMEM_EXTRA for
>>>> use within qemu if we want.  Then we can avoid one layer of QEMU_RAM_* by
>>>> directly using MAP_*, I think?
>>>>
>>>
>>> No problem :) I don‘t think passing in random MAP_ flags is a good interface (we would at least need an allow list).
>>>
>>>    I like the abstraction / explicit semenatics of QEMU_RAM_MMAP_PMEM as spelled out in the comment. Doing the fallback when passing in the mmap flags is a little ugly. We could do the fallback in the caller, I think I remember there is only a single call site.
>>>
>>> PROT_READ won‘t be covered as well, not sure if passing in protections improves the interface.
>>>
>>> Long story short, I like the abstraction provided by these flags, only exporting what we actually support/abstracting it, and setting some MAP_ flags automatically (MAP_PRIVATE, MAP_ANON) instead of having to spell that put in the caller.
>>
>> Yeh the READONLY flag would be special, it will need to be separated from the
>> rest flags.  I'd keep my own preference, but if you really like the current
>> way, maybe at least move it to qemu/osdep.h?  So at least when someone needs a
>> cross-platform flag they'll show up - while mmap-alloc.h looks still only for
>> the posix world, then it'll be odd to introduce these flags only for posix even
>> if posix defined most of them.
> 
> I'll give it another thought today. I certainly want to avoid moving all
> that MAP_ flag and PROT_ logic to the callers. E.g., MAP_SHARED implies
> !MAP_PRIVATE. MAP_SYNC implies that we want MAP_SHARED_VALIDATE. fd < 0
> implies MAP_ANONYMOUS.
> 
> Maybe something like
> 
> /*
>    * QEMU's MMAP abstraction to map guest RAM, taking care of alignment
>    * requirements and guard pages.
>    *
>    * Supported flags: MAP_SHARED, MAP_SYNC
>    *
>    * Implicitly set flags:
>    * - MAP_PRIVATE: When !MAP_SHARED and !MAP_SYNC
>    * - MAP_ANONYMOUS: When fd < 0
>    * - MAP_SHARED_VALIDATE: When MAP_SYNC
>    *
>    * If mapping with MAP_SYNC|MAP_SHARED_VALIDATE fails, fallback to
>    * !MAP_SYNC|MAP_SHARED and warn.
>    */
>    void *qemu_ram_mmap(int fd,
>                        size_t size,
>                        size_t align,
>                        bool readonly,
>                        uint32_t mmap_flags,
>                        off_t map_offset);

What about this:


 From 13a59d404bb3edaed9e42c94432be28fb9a65c26 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Fri, 5 Mar 2021 17:20:37 +0100
Subject: [PATCH] util/mmap-alloc: Pass MAP_ flags instead of separate bools to
  qemu_ram_mmap()

Let's pass MAP_ flags instead of bools to prepare for passing other MAP_
flags and update the documentation of qemu_ram_mmap(). Only allow selected
MAP_ flags (MAP_SHARED, MAP_SYNC) to be passed and keep setting other
flags implicitly.

Keep the "readonly" flag, as it cannot be expressed via MAP_ flags.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  include/qemu/mmap-alloc.h | 19 ++++++++++++++-----
  softmmu/physmem.c         |  6 ++++--
  util/mmap-alloc.c         | 13 ++++++++-----
  util/oslib-posix.c        |  3 ++-
  4 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 456ff87df1..27ef374810 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -7,7 +7,10 @@ size_t qemu_fd_getpagesize(int fd);
  size_t qemu_mempath_getpagesize(const char *mem_path);
  
  /**
- * qemu_ram_mmap: mmap the specified file or device.
+ * qemu_ram_mmap: mmap anonymous memory, the specified file or device.
+ *
+ * QEMU's MMAP abstraction to map guest RAM, simplifying flag handling,
+ * taking care of alignment requirements and installing guard pages.
   *
   * Parameters:
   *  @fd: the file or the device to mmap
@@ -15,10 +18,17 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
   *  @align: if not zero, specify the alignment of the starting mapping address;
   *          otherwise, the alignment in use will be determined by QEMU.
   *  @readonly: true for a read-only mapping, false for read/write.
- *  @shared: map has RAM_SHARED flag.
- *  @is_pmem: map has RAM_PMEM flag.
+ *  @map_flags: supported MAP_* flags: MAP_SHARED, MAP_SYNC
   *  @map_offset: map starts at offset of map_offset from the start of fd
   *
+ * Implicitly handled map_flags:
+ * - MAP_PRIVATE: With !MAP_SHARED
+ * - MAP_ANONYMOUS: With fd < 0
+ * - MAP_SHARED_VALIDATE: With MAP_SYNC && MAP_SHARED
+ *
+ * MAP_SYNC is ignored without MAP_SHARED. If mapping via MAP_SYNC fails,
+ * warn and fallback to mapping without MAP_SYNC.
+ *
   * Return:
   *  On success, return a pointer to the mapped area.
   *  On failure, return MAP_FAILED.
@@ -27,8 +37,7 @@ void *qemu_ram_mmap(int fd,
                      size_t size,
                      size_t align,
                      bool readonly,
-                    bool shared,
-                    bool is_pmem,
+                    uint32_t map_flags,
                      off_t map_offset);
  
  void qemu_ram_munmap(int fd, void *ptr, size_t size);
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 8f3d286e12..1336884b51 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1533,6 +1533,7 @@ static void *file_ram_alloc(RAMBlock *block,
                              off_t offset,
                              Error **errp)
  {
+    uint32_t map_flags;
      void *area;
  
      block->page_size = qemu_fd_getpagesize(fd);
@@ -1580,9 +1581,10 @@ static void *file_ram_alloc(RAMBlock *block,
          perror("ftruncate");
      }
  
+    map_flags = (block->flags & RAM_SHARED) ? MAP_SHARED : 0;
+    map_flags |= (block->flags & RAM_PMEM) ? MAP_SYNC : 0;
      area = qemu_ram_mmap(fd, memory, block->mr->align, readonly,
-                         block->flags & RAM_SHARED, block->flags & RAM_PMEM,
-                         offset);
+                         map_flags, offset);
      if (area == MAP_FAILED) {
          error_setg_errno(errp, errno,
                           "unable to map backing store for guest RAM");
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 0e2bd7bc0e..b558f1675a 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -119,16 +119,20 @@ static void *mmap_reserve(size_t size, int fd)
   * it accessible.
   */
  static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly,
-                           bool shared, bool is_pmem, off_t map_offset)
+                           uint32_t map_flags, off_t map_offset)
  {
+    const bool shared = map_flags & MAP_SHARED;
+    const bool sync = map_flags & MAP_SYNC;
      const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
      int map_sync_flags = 0;
      int flags = MAP_FIXED;
      void *activated_ptr;
  
+    g_assert(!(map_flags & ~(MAP_SHARED | MAP_SYNC)));
+
      flags |= fd == -1 ? MAP_ANONYMOUS : 0;
      flags |= shared ? MAP_SHARED : MAP_PRIVATE;
-    if (shared && is_pmem) {
+    if (shared && sync) {
          map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
      }
  
@@ -174,8 +178,7 @@ void *qemu_ram_mmap(int fd,
                      size_t size,
                      size_t align,
                      bool readonly,
-                    bool shared,
-                    bool is_pmem,
+                    uint32_t map_flags,
                      off_t map_offset)
  {
      const size_t guard_pagesize = mmap_guard_pagesize(fd);
@@ -199,7 +202,7 @@ void *qemu_ram_mmap(int fd,
  
      offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
  
-    ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem,
+    ptr = mmap_activate(guardptr + offset, size, fd, readonly, map_flags,
                          map_offset);
      if (ptr == MAP_FAILED) {
          munmap(guardptr, total);
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 36820fec16..95e2b85279 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -229,8 +229,9 @@ void *qemu_memalign(size_t alignment, size_t size)
  /* alloc shared memory pages */
  void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
  {
+    const uint32_t map_flags = shared ? MAP_SHARED : 0;
      size_t align = QEMU_VMALLOC_ALIGN;
-    void *ptr = qemu_ram_mmap(-1, size, align, false, shared, false, 0);
+    void *ptr = qemu_ram_mmap(-1, size, align, false, map_flags, 0);
  
      if (ptr == MAP_FAILED) {
          return NULL;
-- 
2.29.2




-- 
Thanks,

David / dhildenb




* Re: [PATCH v3 11/12] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE
  2021-03-08 15:05 ` [PATCH v3 11/12] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE David Hildenbrand
@ 2021-03-10 10:28   ` David Hildenbrand
  0 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-10 10:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, Greg Kurz,
	Halil Pasic, Christian Borntraeger, Stefan Hajnoczi,
	Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 08.03.21 16:05, David Hildenbrand wrote:
> Let's support RAM_NORESERVE via MAP_NORESERVE. At least on Linux,
> the flag has no effect on most shared mappings - except for hugetlbfs
> and anonymous memory.
> 
> Linux man page:
>    "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
>    space is reserved, one has the guarantee that it is possible to modify
>    the mapping. When swap space is not reserved one might get SIGSEGV
>    upon a write if no physical memory is available. See also the discussion
>    of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
>    2.6, this flag had effect only for private writable mappings."
> 
> Note that the "guarantee" part is wrong with memory overcommit in Linux.
> 
> Also, in Linux hugetlbfs is treated differently - we configure reservation
> of huge pages from the pool, not reservation of swap space (huge pages
> cannot be swapped).
> 
> The rough behavior is [1]:
> a) !Hugetlbfs:
> 
>    1) Without MAP_NORESERVE *or* with memory overcommit under Linux
>       disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
>       accounting/reservation happens:
>        For a file backed map
>         SHARED or READ-only - 0 cost (the file is the map not swap)
>         PRIVATE WRITABLE - size of mapping per instance
> 
>        For an anonymous or /dev/zero map
>         SHARED   - size of mapping
>         PRIVATE READ-only - 0 cost (but of little use)
>         PRIVATE WRITABLE - size of mapping per instance
> 
>    2) With MAP_NORESERVE, no accounting/reservation happens.
> 
> b) Hugetlbfs:
> 
>    1) Without MAP_NORESERVE, huge pages are reserved.
> 
>    2) With MAP_NORESERVE, no huge pages are reserved.
> 
> Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
> to configure it for !hugetlbfs globally; this toggle now allows
> configuring it more fine-grained, not for the whole system.
> 
> The target use case is virtio-mem, which dynamically exposes memory
> inside a large, sparse memory area to the VM.
> 
> [1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   softmmu/physmem.c |  1 +
>   util/mmap-alloc.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 66 insertions(+), 1 deletion(-)
> 
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index dcc1fb74aa..199c5a4985 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -2229,6 +2229,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>                   flags = MAP_FIXED;
>                   flags |= block->flags & RAM_SHARED ?
>                            MAP_SHARED : MAP_PRIVATE;
> +                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
>                   if (block->fd >= 0) {
>                       area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
>                                   flags, block->fd, offset);
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index ecace41ad5..c511a68bbe 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -20,6 +20,7 @@
>   #include "qemu/osdep.h"
>   #include "qemu/mmap-alloc.h"
>   #include "qemu/host-utils.h"
> +#include "qemu/cutils.h"
>   #include "qemu/error-report.h"
>   
>   #define HUGETLBFS_MAGIC       0x958458f6
> @@ -125,6 +126,7 @@ static void *mmap_activate(void *ptr, size_t size, int fd, uint32_t mmap_flags,
>       const bool readonly = mmap_flags & QEMU_RAM_MMAP_READONLY;
>       const bool shared = mmap_flags & QEMU_RAM_MMAP_SHARED;
>       const bool is_pmem = mmap_flags & QEMU_RAM_MMAP_PMEM;
> +    const bool noreserve = mmap_flags & QEMU_RAM_MMAP_NORESERVE;
>       const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
>       int map_sync_flags = 0;
>       int flags = MAP_FIXED;
> @@ -132,6 +134,7 @@ static void *mmap_activate(void *ptr, size_t size, int fd, uint32_t mmap_flags,
>   
>       flags |= fd == -1 ? MAP_ANONYMOUS : 0;
>       flags |= shared ? MAP_SHARED : MAP_PRIVATE;
> +    flags |= noreserve ? MAP_NORESERVE : 0;
>       if (shared && is_pmem) {
>           map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
>       }
> @@ -174,6 +177,66 @@ static inline size_t mmap_guard_pagesize(int fd)
>   #endif
>   }
>   
> +#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory"
> +static bool map_noreserve_effective(int fd, uint32_t mmap_flags)
> +{
> +#if defined(__linux__)
> +    const bool readonly = mmap_flags & QEMU_RAM_MMAP_READONLY;
> +    const bool shared = mmap_flags & QEMU_RAM_MMAP_SHARED;
> +    gchar *content = NULL;
> +    const char *endptr;
> +    unsigned int tmp;
> +
> +    /*
> +     * hugetlb accounting is different from ordinary swap reservation:
> +     * a) Hugetlb pages from the pool are reserved for both private and
> +     *    shared mappings. For shared mappings, reservations are tracked
> +     *    per file -- all mappers have to specify MAP_NORESERVE.
> +     * b) MAP_NORESERVE is not affected by /proc/sys/vm/overcommit_memory.
> +     */
> +    if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) {
> +        return true;
> +    }
> +
> +    /*
> +     * Accountable mappings in the kernel that can be affected by MAP_NORESERVE
> +     * are private writable mappings (see mm/mmap.c:accountable_mapping() in
> +     * Linux). For all shared or readonly mappings, MAP_NORESERVE is always
> +     * implicitly active -- no reservation; this includes shmem. The only
> +     * exception is shared anonymous memory; it is accounted like private
> +     * anonymous memory.
> +     */
> +    if (readonly || (shared && fd >= 0)) {
> +        return true;
> +    }
> +
> +    /*
> +     * MAP_NORESERVE is globally ignored for private writable mappings when
> +     * overcommit is set to "never". Sparse memory regions aren't really
> +     * possible in this system configuration.
> +     *
> +     * Bail out now instead of silently committing way more memory than
> +     * currently desired by the user.
> +     */
> +    if (g_file_get_contents(OVERCOMMIT_MEMORY_PATH, &content, NULL, NULL) &&
> +        !qemu_strtoui(content, &endptr, 0, &tmp) &&
> +        (!endptr || *endptr == '\n')) {
> +        if (tmp == 2) {
> +            error_report("Skipping reservation of swap space is not supported:"
> +                         " \"" OVERCOMMIT_MEMORY_PATH "\" is \"2\"");
> +            return false;
> +        }
> +        return true;
> +    }
> +    /* this interface has been around since Linux 2.6 */
> +    error_report("Skipping reservation of swap space is not supported:"
> +                 " Could not read: \"" OVERCOMMIT_MEMORY_PATH "\"");
> +    return false;
> +#else

I'll return "false" here for now, after learning that, e.g., FreeBSD never 
implemented the flag and removed it a while ago:
	https://github.com/Clozure/ccl/issues/17

So I'll enable it only for Linux, which makes sense, because I only 
test there (and only care about Linux with MAP_NORESERVE)

> +    return true;


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 09/12] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
  2021-03-10 10:11           ` David Hildenbrand
@ 2021-03-10 10:55             ` David Hildenbrand
  2021-03-10 16:27               ` Peter Xu
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2021-03-10 10:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 10.03.21 11:11, David Hildenbrand wrote:
> On 10.03.21 09:41, David Hildenbrand wrote:
>> On 09.03.21 21:58, Peter Xu wrote:
>>> On Tue, Mar 09, 2021 at 09:27:10PM +0100, David Hildenbrand wrote:
>>>>
>>>>> Am 09.03.2021 um 21:04 schrieb Peter Xu <peterx@redhat.com>:
>>>>>
>>>>> On Mon, Mar 08, 2021 at 04:05:57PM +0100, David Hildenbrand wrote:
>>>>>> Let's introduce a new set of flags that abstract mmap logic and replace
>>>>>> our current set of bools, to prepare for another flag.
>>>>>>
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>> ---
>>>>>> include/qemu/mmap-alloc.h | 17 +++++++++++------
>>>>>> softmmu/physmem.c         |  8 +++++---
>>>>>> util/mmap-alloc.c         | 14 +++++++-------
>>>>>> util/oslib-posix.c        |  3 ++-
>>>>>> 4 files changed, 25 insertions(+), 17 deletions(-)
>>>>>>
>>>>>> diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
>>>>>> index 456ff87df1..55664ea9f3 100644
>>>>>> --- a/include/qemu/mmap-alloc.h
>>>>>> +++ b/include/qemu/mmap-alloc.h
>>>>>> @@ -6,6 +6,15 @@ size_t qemu_fd_getpagesize(int fd);
>>>>>>
>>>>>> size_t qemu_mempath_getpagesize(const char *mem_path);
>>>>>>
>>>>>> +/* Map PROT_READ instead of PROT_READ|PROT_WRITE. */
>>>>>> +#define QEMU_RAM_MMAP_READONLY      (1 << 0)
>>>>>> +
>>>>>> +/* Map MAP_SHARED instead of MAP_PRIVATE. */
>>>>>> +#define QEMU_RAM_MMAP_SHARED        (1 << 1)
>>>>>> +
>>>>>> +/* Map MAP_SYNC|MAP_SHARED_VALIDATE if possible, fallback and warn otherwise. */
>>>>>> +#define QEMU_RAM_MMAP_PMEM          (1 << 2)
>>>>>
>>>>> Sorry to speak late - I just noticed that is_pmem can actually be converted too
>>>>> with "MAP_SYNC | MAP_SHARED_VALIDATE".  We can even define MAP_PMEM_EXTRA for
>>>>> use within qemu if we want.  Then we can avoid one layer of QEMU_RAM_* by
>>>>> directly using MAP_*, I think?
>>>>>
>>>>
>>>> No problem :) I don't think passing in random MAP_ flags is a good interface (we would at least need an allow list).
>>>>
>>>>     I like the abstraction / explicit semantics of QEMU_RAM_MMAP_PMEM as spelled out in the comment. Doing the fallback when passing in the mmap flags is a little ugly. We could do the fallback in the caller; I think I remember there is only a single call site.
>>>>
>>>> PROT_READ won't be covered as well; not sure if passing in protections improves the interface.
>>>>
>>>> Long story short, I like the abstraction provided by these flags, only exporting what we actually support/abstracting it, and setting some MAP_ flags automatically (MAP_PRIVATE, MAP_ANON) instead of having to spell that out in the caller.
>>>
>>> Yeh the READONLY flag would be special, it will need to be separated from the
>>> rest flags.  I'd keep my own preference, but if you really like the current
>>> way, maybe at least move it to qemu/osdep.h?  So at least when someone needs a
>>> cross-platform flag they'll show up - while mmap-alloc.h looks still only for
>>> the posix world, then it'll be odd to introduce these flags only for posix even
>>> if posix defined most of them.
>>
>> I'll give it another thought today. I certainly want to avoid moving all
>> that MAP_ flag and PROT_ logic to the callers. E.g., MAP_SHARED implies
>> !MAP_PRIVATE. MAP_SYNC implies that we want MAP_SHARED_VALIDATE. fd < 0
>> implies MAP_ANONYMOUS.
>>
>> Maybe something like
>>
>> /*
>>     * QEMU's MMAP abstraction to map guest RAM, taking care of alignment
>>     * requirements and guard pages.
>>     *
>>     * Supported flags: MAP_SHARED, MAP_SYNC
>>     *
>>     * Implicitly set flags:
>>     * - MAP_PRIVATE: When !MAP_SHARED and !MAP_SYNC
>>     * - MAP_ANONYMOUS: When fd < 0
>>     * - MAP_SHARED_VALIDATE: When MAP_SYNC
>>     *
>>     * If mapping with MAP_SYNC|MAP_SHARED_VALIDATE fails, fallback to
>>     * !MAP_SYNC|MAP_SHARED and warn.
>>     */
>>     void *qemu_ram_mmap(int fd,
>>                         size_t size,
>>                         size_t align,
>>                         bool readonly,
>>                         uint32_t mmap_flags,
>>                         off_t map_offset);
> 
> What about this:
> 


The only ugly thing is that e.g., MAP_SYNC is only defined for Linux and 
MAP_NORESERVE is mostly only defined on Linux.

So we need something like we already have in mmap-alloc.c:

#ifdef CONFIG_LINUX
#include <linux/mman.h>
#else  /* !CONFIG_LINUX */
#define MAP_SYNC              0x0
#define MAP_SHARED_VALIDATE   0x0
#endif /* CONFIG_LINUX */


and for the noreserve part


#ifndef MAP_NORESERVE
#define MAP_NORESERVE 0x0
#endif


But then, I can no longer bail out if someone specifies a flag even though 
it is unsupported/not effective. Hmmmm ...


> 
>   From 13a59d404bb3edaed9e42c94432be28fb9a65c26 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Fri, 5 Mar 2021 17:20:37 +0100
> Subject: [PATCH] util/mmap-alloc: Pass MAP_ flags instead of separate bools to
>    qemu_ram_mmap()
> 
> Let's pass MAP_ flags instead of bools to prepare for passing other MAP_
> flags and update the documentation of qemu_ram_mmap(). Only allow selected
> MAP_ flags (MAP_SHARED, MAP_SYNC) to be passed and keep setting other
> flags implicitly.
> 
> Keep the "readonly" flag, as it cannot be expressed via MAP_ flags.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>    include/qemu/mmap-alloc.h | 19 ++++++++++++++-----
>    softmmu/physmem.c         |  6 ++++--
>    util/mmap-alloc.c         | 13 ++++++++-----
>    util/oslib-posix.c        |  3 ++-
>    4 files changed, 28 insertions(+), 13 deletions(-)
> 
> diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
> index 456ff87df1..27ef374810 100644
> --- a/include/qemu/mmap-alloc.h
> +++ b/include/qemu/mmap-alloc.h
> @@ -7,7 +7,10 @@ size_t qemu_fd_getpagesize(int fd);
>    size_t qemu_mempath_getpagesize(const char *mem_path);
>    
>    /**
> - * qemu_ram_mmap: mmap the specified file or device.
> + * qemu_ram_mmap: mmap anonymous memory, the specified file or device.
> + *
> + * QEMU's MMAP abstraction to map guest RAM, simplifying flag handling,
> + * taking care of alignment requirements and installing guard pages.
>     *
>     * Parameters:
>     *  @fd: the file or the device to mmap
> @@ -15,10 +18,17 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
>     *  @align: if not zero, specify the alignment of the starting mapping address;
>     *          otherwise, the alignment in use will be determined by QEMU.
>     *  @readonly: true for a read-only mapping, false for read/write.
> - *  @shared: map has RAM_SHARED flag.
> - *  @is_pmem: map has RAM_PMEM flag.
> + *  @map_flags: supported MAP_* flags: MAP_SHARED, MAP_SYNC
>     *  @map_offset: map starts at offset of map_offset from the start of fd
>     *
> + * Implicitly handled map_flags:
> + * - MAP_PRIVATE: With !MAP_SHARED
> + * - MAP_ANONYMOUS: With fd < 0
> + * - MAP_SHARED_VALIDATE: With MAP_SYNC && MAP_SHARED
> + *
> + * MAP_SYNC is ignored without MAP_SHARED. If mapping via MAP_SYNC fails,
> + * warn and fall back to mapping without MAP_SYNC.
> + *
>     * Return:
>     *  On success, return a pointer to the mapped area.
>     *  On failure, return MAP_FAILED.
> @@ -27,8 +37,7 @@ void *qemu_ram_mmap(int fd,
>                        size_t size,
>                        size_t align,
>                        bool readonly,
> -                    bool shared,
> -                    bool is_pmem,
> +                    uint32_t map_flags,
>                        off_t map_offset);
>    
>    void qemu_ram_munmap(int fd, void *ptr, size_t size);
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 8f3d286e12..1336884b51 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -1533,6 +1533,7 @@ static void *file_ram_alloc(RAMBlock *block,
>                                off_t offset,
>                                Error **errp)
>    {
> +    uint32_t map_flags;
>        void *area;
>    
>        block->page_size = qemu_fd_getpagesize(fd);
> @@ -1580,9 +1581,10 @@ static void *file_ram_alloc(RAMBlock *block,
>            perror("ftruncate");
>        }
>    
> +    map_flags = (block->flags & RAM_SHARED) ? MAP_SHARED : 0;
> +    map_flags |= (block->flags & RAM_PMEM) ? MAP_SYNC : 0;
>        area = qemu_ram_mmap(fd, memory, block->mr->align, readonly,
> -                         block->flags & RAM_SHARED, block->flags & RAM_PMEM,
> -                         offset);
> +                         map_flags, offset);
>        if (area == MAP_FAILED) {
>            error_setg_errno(errp, errno,
>                             "unable to map backing store for guest RAM");
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 0e2bd7bc0e..b558f1675a 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -119,16 +119,20 @@ static void *mmap_reserve(size_t size, int fd)
>     * it accessible.
>     */
>    static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly,
> -                           bool shared, bool is_pmem, off_t map_offset)
> +                           uint32_t map_flags, off_t map_offset)
>    {
> +    const bool shared = map_flags & MAP_SHARED;
> +    const bool sync = map_flags & MAP_SYNC;
>        const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
>        int map_sync_flags = 0;
>        int flags = MAP_FIXED;
>        void *activated_ptr;
>    
> +    g_assert(!(map_flags & ~(MAP_SHARED | MAP_SYNC)));
> +
>        flags |= fd == -1 ? MAP_ANONYMOUS : 0;
>        flags |= shared ? MAP_SHARED : MAP_PRIVATE;
> -    if (shared && is_pmem) {
> +    if (shared && sync) {
>            map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
>        }
>    
> @@ -174,8 +178,7 @@ void *qemu_ram_mmap(int fd,
>                        size_t size,
>                        size_t align,
>                        bool readonly,
> -                    bool shared,
> -                    bool is_pmem,
> +                    uint32_t map_flags,
>                        off_t map_offset)
>    {
>        const size_t guard_pagesize = mmap_guard_pagesize(fd);
> @@ -199,7 +202,7 @@ void *qemu_ram_mmap(int fd,
>    
>        offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
>    
> -    ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem,
> +    ptr = mmap_activate(guardptr + offset, size, fd, readonly, map_flags,
>                            map_offset);
>        if (ptr == MAP_FAILED) {
>            munmap(guardptr, total);
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 36820fec16..95e2b85279 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -229,8 +229,9 @@ void *qemu_memalign(size_t alignment, size_t size)
>    /* alloc shared memory pages */
>    void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
>    {
> +    const uint32_t map_flags = shared ? MAP_SHARED : 0;
>        size_t align = QEMU_VMALLOC_ALIGN;
> -    void *ptr = qemu_ram_mmap(-1, size, align, false, shared, false, 0);
> +    void *ptr = qemu_ram_mmap(-1, size, align, false, map_flags, 0);
>    
>        if (ptr == MAP_FAILED) {
>            return NULL;
> 


-- 
Thanks,

David / dhildenb




* Re: [PATCH v3 09/12] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
  2021-03-10 10:55             ` David Hildenbrand
@ 2021-03-10 16:27               ` Peter Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Xu @ 2021-03-10 16:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Wed, Mar 10, 2021 at 11:55:58AM +0100, David Hildenbrand wrote:
> On 10.03.21 11:11, David Hildenbrand wrote:
> > On 10.03.21 09:41, David Hildenbrand wrote:
> > > On 09.03.21 21:58, Peter Xu wrote:
> > > > On Tue, Mar 09, 2021 at 09:27:10PM +0100, David Hildenbrand wrote:
> > > > > 
> > > > > > Am 09.03.2021 um 21:04 schrieb Peter Xu <peterx@redhat.com>:
> > > > > > 
> > > > > > On Mon, Mar 08, 2021 at 04:05:57PM +0100, David Hildenbrand wrote:
> > > > > > > Let's introduce a new set of flags that abstract mmap logic and replace
> > > > > > > our current set of bools, to prepare for another flag.
> > > > > > > 
> > > > > > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > > > > > ---
> > > > > > > include/qemu/mmap-alloc.h | 17 +++++++++++------
> > > > > > > softmmu/physmem.c         |  8 +++++---
> > > > > > > util/mmap-alloc.c         | 14 +++++++-------
> > > > > > > util/oslib-posix.c        |  3 ++-
> > > > > > > 4 files changed, 25 insertions(+), 17 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
> > > > > > > index 456ff87df1..55664ea9f3 100644
> > > > > > > --- a/include/qemu/mmap-alloc.h
> > > > > > > +++ b/include/qemu/mmap-alloc.h
> > > > > > > @@ -6,6 +6,15 @@ size_t qemu_fd_getpagesize(int fd);
> > > > > > > 
> > > > > > > size_t qemu_mempath_getpagesize(const char *mem_path);
> > > > > > > 
> > > > > > > +/* Map PROT_READ instead of PROT_READ|PROT_WRITE. */
> > > > > > > +#define QEMU_RAM_MMAP_READONLY      (1 << 0)
> > > > > > > +
> > > > > > > +/* Map MAP_SHARED instead of MAP_PRIVATE. */
> > > > > > > +#define QEMU_RAM_MMAP_SHARED        (1 << 1)
> > > > > > > +
> > > > > > > +/* Map MAP_SYNC|MAP_SHARED_VALIDATE if possible, fallback and warn otherwise. */
> > > > > > > +#define QEMU_RAM_MMAP_PMEM          (1 << 2)
> > > > > > 
> > > > > > Sorry to speak late - I just noticed that is_pmem can actually be converted too
> > > > > > with "MAP_SYNC | MAP_SHARED_VALIDATE".  We can even define MAP_PMEM_EXTRA for
> > > > > > use within qemu if we want.  Then we can avoid one layer of QEMU_RAM_* by
> > > > > > directly using MAP_*, I think?
> > > > > > 
> > > > > 
> > > > > No problem :) I don't think passing in random MAP_ flags is a good interface (we would at least need an allow list).
> > > > > 
> > > > >     I like the abstraction / explicit semantics of QEMU_RAM_MMAP_PMEM as spelled out in the comment. Doing the fallback when passing in the mmap flags is a little ugly. We could do the fallback in the caller; I think I remember there is only a single call site.
> > > > > 
> > > > > PROT_READ won't be covered as well; not sure if passing in protections improves the interface.
> > > > > 
> > > > > Long story short, I like the abstraction provided by these flags, only exporting what we actually support/abstracting it, and setting some MAP_ flags automatically (MAP_PRIVATE, MAP_ANON) instead of having to spell that out in the caller.
> > > > 
> > > > Yeh the READONLY flag would be special, it will need to be separated from the
> > > > rest flags.  I'd keep my own preference, but if you really like the current
> > > > way, maybe at least move it to qemu/osdep.h?  So at least when someone needs a
> > > > cross-platform flag they'll show up - while mmap-alloc.h looks still only for
> > > > the posix world, then it'll be odd to introduce these flags only for posix even
> > > > if posix defined most of them.
> > > 
> > > I'll give it another thought today. I certainly want to avoid moving all
> > > that MAP_ flag and PROT_ logic to the callers. E.g., MAP_SHARED implies
> > > !MAP_PRIVATE. MAP_SYNC implies that we want MAP_SHARED_VALIDATE. fd < 0
> > > implies MAP_ANONYMOUS.
> > > 
> > > Maybe something like
> > > 
> > > /*
> > >     * QEMU's MMAP abstraction to map guest RAM, taking care of alignment
> > >     * requirements and guard pages.
> > >     *
> > >     * Supported flags: MAP_SHARED, MAP_SYNC
> > >     *
> > >     * Implicitly set flags:
>>>     * - MAP_PRIVATE: When !MAP_SHARED and !MAP_SYNC
> > >     * - MAP_ANONYMOUS: When fd < 0
> > >     * - MAP_SHARED_VALIDATE: When MAP_SYNC
> > >     *
> > >     * If mapping with MAP_SYNC|MAP_SHARED_VALIDATE fails, fallback to
> > >     * !MAP_SYNC|MAP_SHARED and warn.
> > >     */
> > >     void *qemu_ram_mmap(int fd,
> > >                         size_t size,
> > >                         size_t align,
> > >                         bool readonly,
> > >                         uint32_t mmap_flags,
> > >                         off_t map_offset);
> > 
> > What about this:
> > 
> 
> 
> The only ugly thing is that e.g., MAP_SYNC is only defined for Linux and
> MAP_NORESERVE is mostly only defined on Linux.
> 
> So we need something like we already have in mmap-alloc.c:
> 
> #ifdef CONFIG_LINUX
> #include <linux/mman.h>
> #else  /* !CONFIG_LINUX */
> #define MAP_SYNC              0x0
> #define MAP_SHARED_VALIDATE   0x0
> #endif /* CONFIG_LINUX */
> 
> 
> and for the noreserve part
> 
> 
> #ifndef MAP_NORESERVE
> #define MAP_NORESERVE 0x0
> #endif
> 
> 
> But then, I can no longer bail out if someone specifies a flag even though it
> is unsupported/not effective. Hmmmm ...

I see, indeed that'll be awkward too.  How about keeping your original proposal,
but just move it to osdep.h?  That seems to be the simplest and cleanest
approach so far.  Thanks,

-- 
Peter Xu




* Re: [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory
  2021-03-08 15:05 ` [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory David Hildenbrand
@ 2021-03-11 16:39   ` Dr. David Alan Gilbert
  2021-03-11 16:45     ` David Hildenbrand
  2021-03-11 21:37   ` Peter Xu
  1 sibling, 1 reply; 32+ messages in thread
From: Dr. David Alan Gilbert @ 2021-03-11 16:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Juan Quintela, Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, qemu-devel, Peter Xu, Greg Kurz, Halil Pasic,
	Christian Borntraeger, Stefan Hajnoczi, Igor Mammedov,
	Thomas Huth, Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

* David Hildenbrand (david@redhat.com) wrote:
> We can create shared anonymous memory via
>     "-object memory-backend-ram,share=on,..."
> which is, for example, required by PVRDMA for mremap() to work.
> 
> Shared anonymous memory is weird, though. Instead of MADV_DONTNEED, we
> have to use MADV_REMOVE. MADV_DONTNEED fails silently and does nothing.

OK, I wonder how stable these rules are; is it defined anywhere that
it's required?

Still,


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> Fixes: 06329ccecfa0 ("mem: add share parameter to memory-backend-ram")
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  softmmu/physmem.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 62ea4abbdd..2ba815fec6 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -3506,6 +3506,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>          /* The logic here is messy;
>           *    madvise DONTNEED fails for hugepages
>           *    fallocate works on hugepages and shmem
> +         *    shared anonymous memory requires madvise REMOVE
>           */
>          need_madvise = (rb->page_size == qemu_host_page_size);
>          need_fallocate = rb->fd != -1;
> @@ -3539,7 +3540,11 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>               * fallocate'd away).
>               */
>  #if defined(CONFIG_MADVISE)
> -            ret =  madvise(host_startaddr, length, MADV_DONTNEED);
> +            if (qemu_ram_is_shared(rb) && rb->fd < 0) {
> +                ret = madvise(host_startaddr, length, MADV_REMOVE);
> +            } else {
> +                ret = madvise(host_startaddr, length, MADV_DONTNEED);
> +            }
>              if (ret) {
>                  ret = -errno;
>                  error_report("ram_block_discard_range: Failed to discard range "
> -- 
> 2.29.2
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory
  2021-03-11 16:39   ` Dr. David Alan Gilbert
@ 2021-03-11 16:45     ` David Hildenbrand
  2021-03-11 17:11       ` Peter Xu
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2021-03-11 16:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, qemu-devel, Peter Xu, Greg Kurz, Halil Pasic,
	Christian Borntraeger, Stefan Hajnoczi, Igor Mammedov,
	Thomas Huth, Paolo Bonzini, Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 11.03.21 17:39, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> We can create shared anonymous memory via
>>      "-object memory-backend-ram,share=on,..."
>> which is, for example, required by PVRDMA for mremap() to work.
>>
>> Shared anonymous memory is weird, though. Instead of MADV_DONTNEED, we
>> have to use MADV_REMOVE. MADV_DONTNEED fails silently and does nothing.
> 
> OK, I wonder how stable these rules are; is it defined anywhere that
> it's required?
> 

I had a look at the Linux implementation: it's essentially shmem ... but 
we don't have an fd exposed, so we cannot use fallocate() ... :)

MADV_REMOVE documents (man):

"In the initial implementation, only tmpfs(5) supported MADV_REMOVE; 
but since Linux 3.5, any filesystem which supports the fallocate(2) 
FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE."


Thanks!

> Still,
> 
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
>> Fixes: 06329ccecfa0 ("mem: add share parameter to memory-backend-ram")
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   softmmu/physmem.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>> index 62ea4abbdd..2ba815fec6 100644
>> --- a/softmmu/physmem.c
>> +++ b/softmmu/physmem.c
>> @@ -3506,6 +3506,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>>           /* The logic here is messy;
>>            *    madvise DONTNEED fails for hugepages
>>            *    fallocate works on hugepages and shmem
>> +         *    shared anonymous memory requires madvise REMOVE
>>            */
>>           need_madvise = (rb->page_size == qemu_host_page_size);
>>           need_fallocate = rb->fd != -1;
>> @@ -3539,7 +3540,11 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>>                * fallocate'd away).
>>                */
>>   #if defined(CONFIG_MADVISE)
>> -            ret =  madvise(host_startaddr, length, MADV_DONTNEED);
>> +            if (qemu_ram_is_shared(rb) && rb->fd < 0) {
>> +                ret = madvise(host_startaddr, length, MADV_REMOVE);
>> +            } else {
>> +                ret = madvise(host_startaddr, length, MADV_DONTNEED);
>> +            }
>>               if (ret) {
>>                   ret = -errno;
>>                   error_report("ram_block_discard_range: Failed to discard range "
>> -- 
>> 2.29.2
>>


-- 
Thanks,

David / dhildenb




* Re: [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory
  2021-03-11 16:45     ` David Hildenbrand
@ 2021-03-11 17:11       ` Peter Xu
  2021-03-11 17:15         ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Xu @ 2021-03-11 17:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Thu, Mar 11, 2021 at 05:45:46PM +0100, David Hildenbrand wrote:
> On 11.03.21 17:39, Dr. David Alan Gilbert wrote:
> > * David Hildenbrand (david@redhat.com) wrote:
> > > We can create shared anonymous memory via
> > >      "-object memory-backend-ram,share=on,..."
> > > which is, for example, required by PVRDMA for mremap() to work.
> > > 
> > > Shared anonymous memory is weird, though. Instead of MADV_DONTNEED, we
> > > have to use MADV_REMOVE. MADV_DONTNEED fails silently and does nothing.
> > 
> > OK, I wonder how stable these rules are; is it defined anywhere that
> > it's required?
> > 
> 
> I had a look at the Linux implementation: it's essentially shmem ... but we
> don't have an fd exposed, so we cannot use fallocate() ... :)
> 
> MADV_REMOVE documents (man):
> 
> "In the initial implementation, only tmpfs(5) supported MADV_REMOVE; but
> since Linux 3.5, any filesystem which supports the fallocate(2)
> FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE."

Hmm, I see that MADV_DONTNEED will still tear down all mappings even for
anonymous shmem.. what did I miss?

-- 
Peter Xu




* Re: [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory
  2021-03-11 17:11       ` Peter Xu
@ 2021-03-11 17:15         ` David Hildenbrand
  2021-03-11 17:18           ` David Hildenbrand
  2021-03-11 17:22           ` Peter Xu
  0 siblings, 2 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-11 17:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 11.03.21 18:11, Peter Xu wrote:
> On Thu, Mar 11, 2021 at 05:45:46PM +0100, David Hildenbrand wrote:
>> On 11.03.21 17:39, Dr. David Alan Gilbert wrote:
>>> * David Hildenbrand (david@redhat.com) wrote:
>>>> We can create shared anonymous memory via
>>>>       "-object memory-backend-ram,share=on,..."
>>>> which is, for example, required by PVRDMA for mremap() to work.
>>>>
>>>> Shared anonymous memory is weird, though. Instead of MADV_DONTNEED, we
>>>> have to use MADV_REMOVE. MADV_DONTNEED fails silently and does nothing.
>>>
>>> OK, I wonder how stable these rules are; is it defined anywhere that
>>> it's required?
>>>
>>
>> I had a look at the Linux implementation: it's essentially shmem ... but we
>> don't have an fd exposed, so we cannot use fallocate() ... :)
>>
>> MADV_REMOVE documents (man):
>>
>> "In the initial implementation, only tmpfs(5) was supported MADV_REMOVE; but
>> since Linux 3.5, any filesystem which supports the fallocate(2)
>> FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE."
> 
> Hmm, I see that MADV_DONTNEED will still tear down all mappings even for
> anonymous shmem.. what did I miss?

Where did you see that?

> 

MADV_DONTNEED only invalidates private copies in the pagecache. It's 
essentially useless for any kind of shared mappings.

(I am 99.9% sure that we can replace fallocate()+MADV_DONTNEED by 
fallocate() for fd-based shared mappings, but that's a different story)

-- 
Thanks,

David / dhildenb




* Re: [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory
  2021-03-11 17:15         ` David Hildenbrand
@ 2021-03-11 17:18           ` David Hildenbrand
  2021-03-11 17:22           ` Peter Xu
  1 sibling, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-11 17:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 11.03.21 18:15, David Hildenbrand wrote:
> On 11.03.21 18:11, Peter Xu wrote:
>> On Thu, Mar 11, 2021 at 05:45:46PM +0100, David Hildenbrand wrote:
>>> On 11.03.21 17:39, Dr. David Alan Gilbert wrote:
>>>> * David Hildenbrand (david@redhat.com) wrote:
>>>>> We can create shared anonymous memory via
>>>>>        "-object memory-backend-ram,share=on,..."
>>>>> which is, for example, required by PVRDMA for mremap() to work.
>>>>>
>>>>> Shared anonymous memory is weird, though. Instead of MADV_DONTNEED, we
>>>>> have to use MADV_REMOVE. MADV_DONTNEED fails silently and does nothing.
>>>>
>>>> OK, I wonder how stable these rules are; is it defined anywhere that
>>>> it's required?
>>>>
>>>
>>> I had a look at the Linux implementation: it's essentially shmem ... but we
>>> don't have an fd exposed, so we cannot use fallocate() ... :)
>>>
>>> MADV_REMOVE documents (man):
>>>
>>> "In the initial implementation, only tmpfs(5) supported MADV_REMOVE; but
>>> since Linux 3.5, any filesystem which supports the fallocate(2)
>>> FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE."
>>
>> Hmm, I see that MADV_DONTNEED will still tear down all mappings even for
>> anonymous shmem.. what did I miss?
> 
> Where did you see that?
> 
>>
> 
> MADV_DONTNEED only invalidates private copies in the pagecache. It's
> essentially useless for any kind of shared mappings.


And to clarify, I think what you see is that the mapping gets torn down, 
but the backing storage is not released/freed.

Removing the backend storage (MADV_REMOVE/fallocate()) will implicitly 
tear down the mapping from what I can tell (and what my experiments show).

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory
  2021-03-11 17:15         ` David Hildenbrand
  2021-03-11 17:18           ` David Hildenbrand
@ 2021-03-11 17:22           ` Peter Xu
  2021-03-11 17:41             ` David Hildenbrand
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Xu @ 2021-03-11 17:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Thu, Mar 11, 2021 at 06:15:15PM +0100, David Hildenbrand wrote:
> On 11.03.21 18:11, Peter Xu wrote:
> > On Thu, Mar 11, 2021 at 05:45:46PM +0100, David Hildenbrand wrote:
> > > On 11.03.21 17:39, Dr. David Alan Gilbert wrote:
> > > > * David Hildenbrand (david@redhat.com) wrote:
> > > > > We can create shared anonymous memory via
> > > > >       "-object memory-backend-ram,share=on,..."
> > > > > which is, for example, required by PVRDMA for mremap() to work.
> > > > > 
> > > > > Shared anonymous memory is weird, though. Instead of MADV_DONTNEED, we
> > > > > have to use MADV_REMOVE. MADV_DONTNEED fails silently and does nothing.
> > > > 
> > > > OK, I wonder how stable these rules are; is it defined anywhere that
> > > > it's required?
> > > > 
> > > 
> > > I had a look at the Linux implementation: it's essentially shmem ... but we
> > > don't have an fd exposed, so we cannot use fallocate() ... :)
> > > 
> > > MADV_REMOVE documents (man):
> > > 
> > > "In the initial implementation, only tmpfs(5) supported MADV_REMOVE; but
> > > since Linux 3.5, any filesystem which supports the fallocate(2)
> > > FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE."
> > 
> > Hmm, I see that MADV_DONTNEED will still tear down all mappings even for
> > anonymous shmem.. what did I miss?
> 
> Where did you see that?

I see madvise_dontneed_free() calls zap_page_range().

> 
> > 
> 
> MADV_DONTNEED only invalidates private copies in the pagecache. It's
> essentially useless for any kind of shared mappings.

Since it's about zapping page tables, I don't understand why it wouldn't work
for shmem..

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory
  2021-03-11 17:22           ` Peter Xu
@ 2021-03-11 17:41             ` David Hildenbrand
  2021-03-11 21:25               ` Peter Xu
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2021-03-11 17:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 11.03.21 18:22, Peter Xu wrote:
> On Thu, Mar 11, 2021 at 06:15:15PM +0100, David Hildenbrand wrote:
>> On 11.03.21 18:11, Peter Xu wrote:
>>> On Thu, Mar 11, 2021 at 05:45:46PM +0100, David Hildenbrand wrote:
>>>> On 11.03.21 17:39, Dr. David Alan Gilbert wrote:
>>>>> * David Hildenbrand (david@redhat.com) wrote:
>>>>>> We can create shared anonymous memory via
>>>>>>        "-object memory-backend-ram,share=on,..."
>>>>>> which is, for example, required by PVRDMA for mremap() to work.
>>>>>>
>>>>>> Shared anonymous memory is weird, though. Instead of MADV_DONTNEED, we
>>>>>> have to use MADV_REMOVE. MADV_DONTNEED fails silently and does nothing.
>>>>>
>>>>> OK, I wonder how stable these rules are; is it defined anywhere that
>>>>> it's required?
>>>>>
>>>>
>>>> I had a look at the Linux implementation: it's essentially shmem ... but we
>>>> don't have an fd exposed, so we cannot use fallocate() ... :)
>>>>
>>>> MADV_REMOVE documents (man):
>>>>
>>>> "In the initial implementation, only tmpfs(5) supported MADV_REMOVE; but
>>>> since Linux 3.5, any filesystem which supports the fallocate(2)
>>>> FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE."
>>>
>>> Hmm, I see that MADV_DONTNEED will still tear down all mappings even for
>>> anonymous shmem.. what did I miss?
>>
>> Where did you see that?
> 
> I see madvise_dontneed_free() calls zap_page_range().
> 
>>
>>>
>>
>> MADV_DONTNEED only invalidates private copies in the pagecache. It's
>> essentially useless for any kind of shared mappings.

Let me rephrase because it was wrong: MADV_DONTNEED invalidates private 
COW pages referenced in the page tables :)

> 
> Since it's about zapping page tables, I don't understand why it wouldn't work
> for shmem..

It zaps the page tables but the shmem pages are still referenced (in the 
pagecache AFAIU). On next user space access, you would fill the page 
tables with the previous content.

That's why MADV_DONTNEED works properly on private anonymous memory, but 
not on shared anonymous memory - the only valid references are in the 
page tables in case of private mappings (well, unless we have other 
references like GUP etc.).


I did wonder, however, if there is benefit in doing both:

MADV_REMOVE followed by MADV_DONTNEED, or the other way around. Like, 
will the extra MADV_DONTNEED also remove the page tables and not just 
invalidate/zap the entries? It doesn't make a difference 
functionality-wise, but it might memory-consumption-wise.

I'll still have to have a look.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory
  2021-03-11 17:41             ` David Hildenbrand
@ 2021-03-11 21:25               ` Peter Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Xu @ 2021-03-11 21:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Thu, Mar 11, 2021 at 06:41:29PM +0100, David Hildenbrand wrote:
> It zaps the page tables but the shmem pages are still referenced (in the
> pagecache AFAIU). On next user space access, you would fill the page tables
> with the previous content.
> 
> That's why MADV_DONTNEED works properly on private anonymous memory, but not
> on shared anonymous memory - the only valid references are in the page
> tables in case of private mappings (well, unless we have other references
> like GUP etc.).

For some reason I thought anonymous shared memory could do auto-recycle, but
on second thought what you said makes perfect sense.

> 
> 
> I did wonder, however, if there is benefit in doing both:
> 
> MADV_REMOVE followed by MADV_DONTNEED or the other way around. Like, will
> the extra MADV_DONTNEED also remove page tables and not just invalidate/zap
> the entries. Doesn't make a difference functionality-wise, but
> memory-consumption-wise.
> 
> I'll still have to have a look.

I saw your other email - that'll be another topic of course.  For now I believe
it's not necessary, and your current patch looks valid.

I just hope that when QEMU decides to discard the range, we're sure the RDMA
mremap() region has been unmapped - IIUC that's the only use case of that.
Otherwise data would be corrupted.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory
  2021-03-08 15:05 ` [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory David Hildenbrand
  2021-03-11 16:39   ` Dr. David Alan Gilbert
@ 2021-03-11 21:37   ` Peter Xu
  2021-03-11 21:49     ` David Hildenbrand
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Xu @ 2021-03-11 21:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On Mon, Mar 08, 2021 at 04:05:50PM +0100, David Hildenbrand wrote:
> We can create shared anonymous memory via
>     "-object memory-backend-ram,share=on,..."
> which is, for example, required by PVRDMA for mremap() to work.
> 
> Shared anonymous memory is weird, though. Instead of MADV_DONTNEED, we
> have to use MADV_REMOVE. MADV_DONTNEED fails silently and does nothing.
> 
> Fixes: 06329ccecfa0 ("mem: add share parameter to memory-backend-ram")

I'm wondering whether we should keep this Fixes tag - it's valid, however it
could surface issues if those remapped ranges didn't get unmapped in time.
After all, "not releasing some memory" seems not a huge deal for stable.  No
strong opinion, just raising it as a pure question.

> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory
  2021-03-11 21:37   ` Peter Xu
@ 2021-03-11 21:49     ` David Hildenbrand
  0 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2021-03-11 21:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marcel Apfelbaum, Cornelia Huck, Eduardo Habkost,
	Michael S. Tsirkin, Stefan Weil, Murilo Opsfelder Araujo,
	Richard Henderson, Dr. David Alan Gilbert, Juan Quintela,
	qemu-devel, Halil Pasic, Christian Borntraeger, Greg Kurz,
	Stefan Hajnoczi, Igor Mammedov, Thomas Huth, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Igor Kotrasinski

On 11.03.21 22:37, Peter Xu wrote:
> On Mon, Mar 08, 2021 at 04:05:50PM +0100, David Hildenbrand wrote:
>> We can create shared anonymous memory via
>>      "-object memory-backend-ram,share=on,..."
>> which is, for example, required by PVRDMA for mremap() to work.
>>
>> Shared anonymous memory is weird, though. Instead of MADV_DONTNEED, we
>> have to use MADV_REMOVE. MADV_DONTNEED fails silently and does nothing.
>>
>> Fixes: 06329ccecfa0 ("mem: add share parameter to memory-backend-ram")
> 
> I'm wondering whether we should keep this Fixes tag - it's valid, however it
> could surface issues if those remapped ranges didn't get unmapped in time.
> After all, "not releasing some memory" seems not a huge deal for stable.  No
> strong opinion, just raising it as a pure question.
> 

If someone were using it along with postcopy (which should work apart 
from that issue), they could be in trouble. That's why I think at least 
the Fixes: tag is valid. CC: stable might be debatable, indeed.

>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 

Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-03-11 21:51 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-08 15:05 [PATCH v3 00/12] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property David Hildenbrand
2021-03-08 15:05 ` [PATCH v3 01/12] softmmu/physmem: Mark shared anonymous memory RAM_SHARED David Hildenbrand
2021-03-08 15:05 ` [PATCH v3 02/12] softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory David Hildenbrand
2021-03-11 16:39   ` Dr. David Alan Gilbert
2021-03-11 16:45     ` David Hildenbrand
2021-03-11 17:11       ` Peter Xu
2021-03-11 17:15         ` David Hildenbrand
2021-03-11 17:18           ` David Hildenbrand
2021-03-11 17:22           ` Peter Xu
2021-03-11 17:41             ` David Hildenbrand
2021-03-11 21:25               ` Peter Xu
2021-03-11 21:37   ` Peter Xu
2021-03-11 21:49     ` David Hildenbrand
2021-03-08 15:05 ` [PATCH v3 03/12] softmmu/physmem: Fix qemu_ram_remap() " David Hildenbrand
2021-03-08 15:05 ` [PATCH v3 04/12] util/mmap-alloc: Factor out calculation of the pagesize for the guard page David Hildenbrand
2021-03-08 15:05 ` [PATCH v3 05/12] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve() David Hildenbrand
2021-03-08 15:05 ` [PATCH v3 06/12] util/mmap-alloc: Factor out activating of memory to mmap_activate() David Hildenbrand
2021-03-08 15:05 ` [PATCH v3 07/12] softmmu/memory: Pass ram_flags into qemu_ram_alloc_from_fd() David Hildenbrand
2021-03-08 15:05 ` [PATCH v3 08/12] softmmu/memory: Pass ram_flags into memory_region_init_ram_shared_nomigrate() David Hildenbrand
2021-03-08 15:05 ` [PATCH v3 09/12] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap() David Hildenbrand
2021-03-09 20:04   ` Peter Xu
2021-03-09 20:27     ` David Hildenbrand
2021-03-09 20:58       ` Peter Xu
2021-03-10  8:41         ` David Hildenbrand
2021-03-10 10:11           ` David Hildenbrand
2021-03-10 10:55             ` David Hildenbrand
2021-03-10 16:27               ` Peter Xu
2021-03-08 15:05 ` [PATCH v3 10/12] memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap() David Hildenbrand
2021-03-08 15:05   ` David Hildenbrand
2021-03-08 15:05 ` [PATCH v3 11/12] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE David Hildenbrand
2021-03-10 10:28   ` David Hildenbrand
2021-03-08 15:06 ` [PATCH v3 12/12] hostmem: Wire up RAM_NORESERVE via "reserve" property David Hildenbrand
