* [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
@ 2020-02-03 18:31 David Hildenbrand
  2020-02-03 18:31 ` [PATCH v1 01/13] util: vfio-helpers: Factor out and fix processing of existings ram blocks David Hildenbrand
                   ` (14 more replies)
  0 siblings, 15 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Markus Armbruster,
	David Hildenbrand, Dr . David Alan Gilbert, Greg Kurz,
	Alex Williamson, Murilo Opsfelder Araujo, Paolo Bonzini,
	Stefan Weil, Richard Henderson

We already allow resizable ram blocks for anonymous memory; however, they
are never actually resized. All memory is mmap()ed R/W, including the
memory exceeding the used_length, up to the max_length.

When resizing, effectively only the boundary is moved. Implement actually
resizable anonymous allocations and make use of them in resizable ram
blocks when possible. Memory exceeding the used_length will be
inaccessible. Especially ram block notifiers require care.

Having actually resizable anonymous allocations (via mmap hackery) allows
us to reserve a big region in virtual address space and grow the
accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
is set to "never" under Linux, huge reservations will succeed. If there is
not enough memory when resizing (to populate parts of the reserved region),
the resize will fail. Only the actually used size is accounted for in the
OS.

E.g., virtio-mem [1] wants to reserve big resizable memory regions and
grow the usable part on demand. I think this change is worth sending out
individually, accompanied by a bunch of minor fixes and cleanups.

[1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/

David Hildenbrand (13):
  util: vfio-helpers: Factor out and fix processing of existings ram
    blocks
  exec: Factor out setting ram settings (madvise ...) into
    qemu_ram_apply_settings()
  exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap()
  exec: Drop "shared" parameter from ram_block_add()
  util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize()
  util/mmap-alloc: Factor out reserving of a memory region to
    mmap_reserve()
  util/mmap-alloc: Factor out populating of memory to mmap_populate()
  util/mmap-alloc: Prepare for resizable mmaps
  util/mmap-alloc: Implement resizable mmaps
  numa: Introduce ram_block_notify_resized() and
    ram_block_notifiers_support_resize()
  util: vfio-helpers: Implement ram_block_resized()
  util: oslib: Resizable anonymous allocations under POSIX
  exec: Ram blocks with resizable anonymous allocations under POSIX

 exec.c                    |  99 ++++++++++++++++++----
 hw/core/numa.c            |  39 +++++++++
 include/exec/cpu-common.h |   3 +
 include/exec/memory.h     |   8 ++
 include/exec/ramlist.h    |   4 +
 include/qemu/mmap-alloc.h |  21 +++--
 include/qemu/osdep.h      |   6 +-
 stubs/ram-block.c         |  20 -----
 util/mmap-alloc.c         | 168 ++++++++++++++++++++++++--------------
 util/oslib-posix.c        |  37 ++++++++-
 util/oslib-win32.c        |  14 ++++
 util/trace-events         |   5 +-
 util/vfio-helpers.c       |  33 ++++----
 13 files changed, 331 insertions(+), 126 deletions(-)

-- 
2.24.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v1 01/13] util: vfio-helpers: Factor out and fix processing of existings ram blocks
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-03 18:31 ` [PATCH v1 02/13] exec: Factor out setting ram settings (madvise ...) into qemu_ram_apply_settings() David Hildenbrand
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Alex Williamson, Paolo Bonzini,
	Richard Henderson

Factor out the processing of existing ram blocks into common code that runs
when a new notifier is registered, just as done with the memory region
notifier. This allows us to keep the logic about how to process existing
ram blocks in a central place (which will be extended soon).

Just like when adding a new ram block, we have to register the max_length.
We don't have a way to get notified about resizes yet, and some memory
would otherwise not be mapped when growing the ram block.

Note: Currently, ram blocks are only "fake resized". All memory
(max_length) is accessible.

We can get rid of a bunch of functions in stubs/ram-block.c. Print the
warning from inside qemu_vfio_ram_block_added().

Cc: Richard Henderson <rth@twiddle.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 exec.c                    |  5 +++++
 hw/core/numa.c            | 14 ++++++++++++++
 include/exec/cpu-common.h |  1 +
 stubs/ram-block.c         | 20 --------------------
 util/vfio-helpers.c       | 28 +++++++---------------------
 5 files changed, 27 insertions(+), 41 deletions(-)

diff --git a/exec.c b/exec.c
index 67e520d18e..05cfe868ab 100644
--- a/exec.c
+++ b/exec.c
@@ -2017,6 +2017,11 @@ ram_addr_t qemu_ram_get_used_length(RAMBlock *rb)
     return rb->used_length;
 }
 
+ram_addr_t qemu_ram_get_max_length(RAMBlock *rb)
+{
+    return rb->max_length;
+}
+
 bool qemu_ram_is_shared(RAMBlock *rb)
 {
     return rb->flags & RAM_SHARED;
diff --git a/hw/core/numa.c b/hw/core/numa.c
index 0d1b4be76a..fed4046680 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -899,9 +899,23 @@ void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms)
     }
 }
 
+static int ram_block_notify_add_single(RAMBlock *rb, void *opaque)
+{
+    ram_addr_t size = qemu_ram_get_max_length(rb);
+    void *host = qemu_ram_get_host_addr(rb);
+    RAMBlockNotifier *notifier = opaque;
+
+    if (host) {
+        notifier->ram_block_added(notifier, host, size);
+    }
+    return 0;
+}
+
 void ram_block_notifier_add(RAMBlockNotifier *n)
 {
     QLIST_INSERT_HEAD(&ram_list.ramblock_notifiers, n, next);
+    /* Notify about all existing ram blocks. */
+    qemu_ram_foreach_block(ram_block_notify_add_single, n);
 }
 
 void ram_block_notifier_remove(RAMBlockNotifier *n)
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 81753bbb34..9760ac9068 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -59,6 +59,7 @@ const char *qemu_ram_get_idstr(RAMBlock *rb);
 void *qemu_ram_get_host_addr(RAMBlock *rb);
 ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
+ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
 void qemu_ram_set_uf_zeroable(RAMBlock *rb);
diff --git a/stubs/ram-block.c b/stubs/ram-block.c
index 73c0a3ee08..10855b52dd 100644
--- a/stubs/ram-block.c
+++ b/stubs/ram-block.c
@@ -2,21 +2,6 @@
 #include "exec/ramlist.h"
 #include "exec/cpu-common.h"
 
-void *qemu_ram_get_host_addr(RAMBlock *rb)
-{
-    return 0;
-}
-
-ram_addr_t qemu_ram_get_offset(RAMBlock *rb)
-{
-    return 0;
-}
-
-ram_addr_t qemu_ram_get_used_length(RAMBlock *rb)
-{
-    return 0;
-}
-
 void ram_block_notifier_add(RAMBlockNotifier *n)
 {
 }
@@ -24,8 +9,3 @@ void ram_block_notifier_add(RAMBlockNotifier *n)
 void ram_block_notifier_remove(RAMBlockNotifier *n)
 {
 }
-
-int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
-{
-    return 0;
-}
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 813f7ec564..71e02e7f35 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -376,8 +376,13 @@ static void qemu_vfio_ram_block_added(RAMBlockNotifier *n,
                                       void *host, size_t size)
 {
     QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
+    int ret;
+
     trace_qemu_vfio_ram_block_added(s, host, size);
-    qemu_vfio_dma_map(s, host, size, false, NULL);
+    ret = qemu_vfio_dma_map(s, host, size, false, NULL);
+    if (ret) {
+        error_report("qemu_vfio_dma_map(%p, %zu) failed: %d", host, size, ret);
+    }
 }
 
 static void qemu_vfio_ram_block_removed(RAMBlockNotifier *n,
@@ -390,33 +395,14 @@ static void qemu_vfio_ram_block_removed(RAMBlockNotifier *n,
     }
 }
 
-static int qemu_vfio_init_ramblock(RAMBlock *rb, void *opaque)
-{
-    void *host_addr = qemu_ram_get_host_addr(rb);
-    ram_addr_t length = qemu_ram_get_used_length(rb);
-    int ret;
-    QEMUVFIOState *s = opaque;
-
-    if (!host_addr) {
-        return 0;
-    }
-    ret = qemu_vfio_dma_map(s, host_addr, length, false, NULL);
-    if (ret) {
-        fprintf(stderr, "qemu_vfio_init_ramblock: failed %p %" PRId64 "\n",
-                host_addr, (uint64_t)length);
-    }
-    return 0;
-}
-
 static void qemu_vfio_open_common(QEMUVFIOState *s)
 {
     qemu_mutex_init(&s->lock);
     s->ram_notifier.ram_block_added = qemu_vfio_ram_block_added;
     s->ram_notifier.ram_block_removed = qemu_vfio_ram_block_removed;
-    ram_block_notifier_add(&s->ram_notifier);
     s->low_water_mark = QEMU_VFIO_IOVA_MIN;
     s->high_water_mark = QEMU_VFIO_IOVA_MAX;
-    qemu_ram_foreach_block(qemu_vfio_init_ramblock, s);
+    ram_block_notifier_add(&s->ram_notifier);
 }
 
 /**
-- 
2.24.1




* [PATCH v1 02/13] exec: Factor out setting ram settings (madvise ...) into qemu_ram_apply_settings()
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
  2020-02-03 18:31 ` [PATCH v1 01/13] util: vfio-helpers: Factor out and fix processing of existings ram blocks David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-06 11:42   ` Richard Henderson
  2020-02-03 18:31 ` [PATCH v1 03/13] exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap() David Hildenbrand
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Paolo Bonzini, Richard Henderson

Factor all settings out into qemu_ram_apply_settings().

For memory_try_enable_merging(), the important bit is that it won't be
called under Xen, which is still the case after this change, as
new_block->host will remain NULL.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 exec.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/exec.c b/exec.c
index 05cfe868ab..31a462a7d3 100644
--- a/exec.c
+++ b/exec.c
@@ -2121,6 +2121,15 @@ static int memory_try_enable_merging(void *addr, size_t len)
     return qemu_madvise(addr, len, QEMU_MADV_MERGEABLE);
 }
 
+static void qemu_ram_apply_settings(void *host, size_t length)
+{
+    memory_try_enable_merging(host, length);
+    qemu_ram_setup_dump(host, length);
+    qemu_madvise(host, length, QEMU_MADV_HUGEPAGE);
+    /* MADV_DONTFORK is also needed by KVM in absence of synchronous MMU */
+    qemu_madvise(host, length, QEMU_MADV_DONTFORK);
+}
+
 /* Only legal before guest might have detected the memory size: e.g. on
  * incoming migration, or right after reset.
  *
@@ -2271,7 +2280,6 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
                 qemu_mutex_unlock_ramlist();
                 return;
             }
-            memory_try_enable_merging(new_block->host, new_block->max_length);
         }
     }
 
@@ -2309,10 +2317,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
                                         DIRTY_CLIENTS_ALL);
 
     if (new_block->host) {
-        qemu_ram_setup_dump(new_block->host, new_block->max_length);
-        qemu_madvise(new_block->host, new_block->max_length, QEMU_MADV_HUGEPAGE);
-        /* MADV_DONTFORK is also needed by KVM in absence of synchronous MMU */
-        qemu_madvise(new_block->host, new_block->max_length, QEMU_MADV_DONTFORK);
+        qemu_ram_apply_settings(new_block->host, new_block->max_length);
         ram_block_notify_add(new_block->host, new_block->max_length);
     }
 }
-- 
2.24.1




* [PATCH v1 03/13] exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap()
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
  2020-02-03 18:31 ` [PATCH v1 01/13] util: vfio-helpers: Factor out and fix processing of existings ram blocks David Hildenbrand
  2020-02-03 18:31 ` [PATCH v1 02/13] exec: Factor out setting ram settings (madvise ...) into qemu_ram_apply_settings() David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-06 11:43   ` Richard Henderson
  2020-02-03 18:31 ` [PATCH v1 04/13] exec: Drop "shared" parameter from ram_block_add() David Hildenbrand
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Paolo Bonzini, Richard Henderson

I don't see why we shouldn't apply all settings to make it look like the
surrounding RAM (and enable proper VMA merging).

Note: memory backend settings might have overridden these settings. We
would need a callback to let the memory backend fix that up.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 exec.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/exec.c b/exec.c
index 31a462a7d3..f7525867ec 100644
--- a/exec.c
+++ b/exec.c
@@ -2552,8 +2552,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                                  length, addr);
                     exit(1);
                 }
-                memory_try_enable_merging(vaddr, length);
-                qemu_ram_setup_dump(vaddr, length);
+                qemu_ram_apply_settings(vaddr, length);
             }
         }
     }
-- 
2.24.1




* [PATCH v1 04/13] exec: Drop "shared" parameter from ram_block_add()
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (2 preceding siblings ...)
  2020-02-03 18:31 ` [PATCH v1 03/13] exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap() David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-06 11:44   ` Richard Henderson
  2020-02-03 18:31 ` [PATCH v1 05/13] util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize() David Hildenbrand
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Paolo Bonzini, Richard Henderson

Instead, properly store it in the flags of the ram block; the RAM_SHARED
flag already exists and is used.

E.g., qemu_ram_is_shared() now properly succeeds on all ram blocks that are
actually shared.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 exec.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/exec.c b/exec.c
index f7525867ec..fc65c4f7ca 100644
--- a/exec.c
+++ b/exec.c
@@ -2249,7 +2249,7 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
     }
 }
 
-static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
+static void ram_block_add(RAMBlock *new_block, Error **errp)
 {
     RAMBlock *block;
     RAMBlock *last_block = NULL;
@@ -2272,7 +2272,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
             }
         } else {
             new_block->host = phys_mem_alloc(new_block->max_length,
-                                             &new_block->mr->align, shared);
+                                             &new_block->mr->align,
+                                             qemu_ram_is_shared(new_block));
             if (!new_block->host) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
@@ -2376,7 +2377,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
-    ram_block_add(new_block, &local_err, ram_flags & RAM_SHARED);
+    ram_block_add(new_block, &local_err);
     if (local_err) {
         g_free(new_block);
         error_propagate(errp, local_err);
@@ -2438,10 +2439,13 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     if (host) {
         new_block->flags |= RAM_PREALLOC;
     }
+    if (share) {
+        new_block->flags |= RAM_SHARED;
+    }
     if (resizeable) {
         new_block->flags |= RAM_RESIZEABLE;
     }
-    ram_block_add(new_block, &local_err, share);
+    ram_block_add(new_block, &local_err);
     if (local_err) {
         g_free(new_block);
         error_propagate(errp, local_err);
-- 
2.24.1




* [PATCH v1 05/13] util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize()
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (3 preceding siblings ...)
  2020-02-03 18:31 ` [PATCH v1 04/13] exec: Drop "shared" parameter from ram_block_add() David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-05 19:37   ` Murilo Opsfelder Araújo
  2020-02-06 11:46   ` Richard Henderson
  2020-02-03 18:31 ` [PATCH v1 06/13] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve() David Hildenbrand
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Greg Kurz, Murilo Opsfelder Araujo,
	Paolo Bonzini, Richard Henderson

Factor it out and add a comment.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Cc: Greg Kurz <groug@kaod.org>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 27dcccd8ec..82f02a2cec 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -82,17 +82,27 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
     return qemu_real_host_page_size;
 }
 
+static inline size_t mmap_pagesize(int fd)
+{
+#if defined(__powerpc64__) && defined(__linux__)
+    /* Mappings in the same segment must share the same page size */
+    return qemu_fd_getpagesize(fd);
+#else
+    return qemu_real_host_page_size;
+#endif
+}
+
 void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
                     bool shared,
                     bool is_pmem)
 {
+    const size_t pagesize = mmap_pagesize(fd);
     int flags;
     int map_sync_flags = 0;
     int guardfd;
     size_t offset;
-    size_t pagesize;
     size_t total;
     void *guardptr;
     void *ptr;
@@ -113,7 +123,6 @@ void *qemu_ram_mmap(int fd,
      * anonymous memory is OK.
      */
     flags = MAP_PRIVATE;
-    pagesize = qemu_fd_getpagesize(fd);
     if (fd == -1 || pagesize == qemu_real_host_page_size) {
         guardfd = -1;
         flags |= MAP_ANONYMOUS;
@@ -123,7 +132,6 @@ void *qemu_ram_mmap(int fd,
     }
 #else
     guardfd = -1;
-    pagesize = qemu_real_host_page_size;
     flags = MAP_PRIVATE | MAP_ANONYMOUS;
 #endif
 
@@ -198,15 +206,10 @@ void *qemu_ram_mmap(int fd,
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size)
 {
-    size_t pagesize;
+    const size_t pagesize = mmap_pagesize(fd);
 
     if (ptr) {
         /* Unmap both the RAM block and the guard page */
-#if defined(__powerpc64__) && defined(__linux__)
-        pagesize = qemu_fd_getpagesize(fd);
-#else
-        pagesize = qemu_real_host_page_size;
-#endif
         munmap(ptr, size + pagesize);
     }
 }
-- 
2.24.1




* [PATCH v1 06/13] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve()
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (4 preceding siblings ...)
  2020-02-03 18:31 ` [PATCH v1 05/13] util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize() David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-05 19:40   ` Murilo Opsfelder Araújo
  2020-02-06 11:55   ` Richard Henderson
  2020-02-03 18:31 ` [PATCH v1 07/13] util/mmap-alloc: Factor out populating of memory to mmap_populate() David Hildenbrand
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Greg Kurz, Murilo Opsfelder Araujo,
	Paolo Bonzini, Richard Henderson

We want to reserve a memory region without actually populating memory.
Let's factor that out.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Greg Kurz <groug@kaod.org>
Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 58 +++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 82f02a2cec..43a26f38a8 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -82,6 +82,38 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
     return qemu_real_host_page_size;
 }
 
+/*
+ * Reserve a new memory region of the requested size to be used for mapping
+ * from the given fd (if any).
+ */
+static void *mmap_reserve(size_t size, int fd)
+{
+    int flags = MAP_PRIVATE;
+
+#if defined(__powerpc64__) && defined(__linux__)
+    /*
+     * On ppc64 mappings in the same segment (aka slice) must share the same
+     * page size. Since we will be re-allocating part of this segment
+     * from the supplied fd, we should make sure to use the same page size, to
+     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
+     * avoid allocating backing store memory.
+     * We do this unless we are using the system page size, in which case
+     * anonymous memory is OK.
+     */
+    if (fd == -1 || qemu_fd_getpagesize(fd) == qemu_real_host_page_size) {
+        fd = -1;
+        flags |= MAP_ANONYMOUS;
+    } else {
+        flags |= MAP_NORESERVE;
+    }
+#else
+    fd = -1;
+    flags |= MAP_ANONYMOUS;
+#endif
+
+    return mmap(0, size, PROT_NONE, flags, fd, 0);
+}
+
 static inline size_t mmap_pagesize(int fd)
 {
 #if defined(__powerpc64__) && defined(__linux__)
@@ -101,7 +133,6 @@ void *qemu_ram_mmap(int fd,
     const size_t pagesize = mmap_pagesize(fd);
     int flags;
     int map_sync_flags = 0;
-    int guardfd;
     size_t offset;
     size_t total;
     void *guardptr;
@@ -113,30 +144,7 @@ void *qemu_ram_mmap(int fd,
      */
     total = size + align;
 
-#if defined(__powerpc64__) && defined(__linux__)
-    /* On ppc64 mappings in the same segment (aka slice) must share the same
-     * page size. Since we will be re-allocating part of this segment
-     * from the supplied fd, we should make sure to use the same page size, to
-     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
-     * avoid allocating backing store memory.
-     * We do this unless we are using the system page size, in which case
-     * anonymous memory is OK.
-     */
-    flags = MAP_PRIVATE;
-    if (fd == -1 || pagesize == qemu_real_host_page_size) {
-        guardfd = -1;
-        flags |= MAP_ANONYMOUS;
-    } else {
-        guardfd = fd;
-        flags |= MAP_NORESERVE;
-    }
-#else
-    guardfd = -1;
-    flags = MAP_PRIVATE | MAP_ANONYMOUS;
-#endif
-
-    guardptr = mmap(0, total, PROT_NONE, flags, guardfd, 0);
-
+    guardptr = mmap_reserve(total, fd);
     if (guardptr == MAP_FAILED) {
         return MAP_FAILED;
     }
-- 
2.24.1




* [PATCH v1 07/13] util/mmap-alloc: Factor out populating of memory to mmap_populate()
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (5 preceding siblings ...)
  2020-02-03 18:31 ` [PATCH v1 06/13] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve() David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-05 19:56   ` Murilo Opsfelder Araújo
  2020-02-06 11:59   ` Richard Henderson
  2020-02-03 18:31 ` [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps David Hildenbrand
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Greg Kurz, Murilo Opsfelder Araujo,
	Paolo Bonzini, Richard Henderson

We want to populate memory within a reserved memory region. Let's factor
that out.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Greg Kurz <groug@kaod.org>
Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 89 +++++++++++++++++++++++++----------------------
 1 file changed, 47 insertions(+), 42 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 43a26f38a8..f043ccb0ab 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -114,6 +114,50 @@ static void *mmap_reserve(size_t size, int fd)
     return mmap(0, size, PROT_NONE, flags, fd, 0);
 }
 
+/*
+ * Populate memory in a reserved region from the given fd (if any).
+ */
+static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
+                           bool is_pmem)
+{
+    int map_sync_flags = 0;
+    int flags = MAP_FIXED;
+    void *new_ptr;
+
+    flags |= fd == -1 ? MAP_ANONYMOUS : 0;
+    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
+    if (shared && is_pmem) {
+        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
+    }
+
+    new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags | map_sync_flags,
+                   fd, 0);
+    if (new_ptr == MAP_FAILED && map_sync_flags) {
+        if (errno == ENOTSUP) {
+            char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
+            char *file_name = g_malloc0(PATH_MAX);
+            int len = readlink(proc_link, file_name, PATH_MAX - 1);
+
+            if (len < 0) {
+                len = 0;
+            }
+            file_name[len] = '\0';
+            fprintf(stderr, "Warning: requesting persistence across crashes "
+                    "for backend file %s failed. Proceeding without "
+                    "persistence, data might become corrupted in case of host "
+                    "crash.\n", file_name);
+            g_free(proc_link);
+            g_free(file_name);
+        }
+        /*
+         * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
+         * again without these flags to handle backwards compatibility.
+         */
+        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, 0);
+    }
+    return new_ptr;
+}
+
 static inline size_t mmap_pagesize(int fd)
 {
 #if defined(__powerpc64__) && defined(__linux__)
@@ -131,12 +175,8 @@ void *qemu_ram_mmap(int fd,
                     bool is_pmem)
 {
     const size_t pagesize = mmap_pagesize(fd);
-    int flags;
-    int map_sync_flags = 0;
-    size_t offset;
-    size_t total;
-    void *guardptr;
-    void *ptr;
+    size_t offset, total;
+    void *ptr, *guardptr;
 
     /*
      * Note: this always allocates at least one extra page of virtual address
@@ -153,44 +193,9 @@ void *qemu_ram_mmap(int fd,
     /* Always align to host page size */
     assert(align >= pagesize);
 
-    flags = MAP_FIXED;
-    flags |= fd == -1 ? MAP_ANONYMOUS : 0;
-    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
-    if (shared && is_pmem) {
-        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
-    }
-
     offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
 
-    ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE,
-               flags | map_sync_flags, fd, 0);
-
-    if (ptr == MAP_FAILED && map_sync_flags) {
-        if (errno == ENOTSUP) {
-            char *proc_link, *file_name;
-            int len;
-            proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
-            file_name = g_malloc0(PATH_MAX);
-            len = readlink(proc_link, file_name, PATH_MAX - 1);
-            if (len < 0) {
-                len = 0;
-            }
-            file_name[len] = '\0';
-            fprintf(stderr, "Warning: requesting persistence across crashes "
-                    "for backend file %s failed. Proceeding without "
-                    "persistence, data might become corrupted in case of host "
-                    "crash.\n", file_name);
-            g_free(proc_link);
-            g_free(file_name);
-        }
-        /*
-         * if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
-         * we will remove these flags to handle compatibility.
-         */
-        ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE,
-                   flags, fd, 0);
-    }
-
+    ptr = mmap_populate(guardptr + offset, size, fd, shared, is_pmem);
     if (ptr == MAP_FAILED) {
         munmap(guardptr, total);
         return MAP_FAILED;
-- 
2.24.1




* [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (6 preceding siblings ...)
  2020-02-03 18:31 ` [PATCH v1 07/13] util/mmap-alloc: Factor out populating of memory to mmap_populate() David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-05 23:00   ` Murilo Opsfelder Araújo
  2020-02-06 12:02   ` Richard Henderson
  2020-02-03 18:31 ` [PATCH v1 09/13] util/mmap-alloc: Implement " David Hildenbrand
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Greg Kurz, Murilo Opsfelder Araujo,
	Paolo Bonzini, Richard Henderson

When shrinking an mmap, we want to re-reserve the already populated area.
When growing a memory region, we want to populate starting at a given
fd_offset. Prepare by allowing these parameters to be passed.

Also, let's make sure we always process full pages, to avoid
unmapping/remapping pages that are already in use when growing/shrinking.
(Existing callers seem to guarantee this, but that's not obvious.)

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Greg Kurz <groug@kaod.org>
Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/mmap-alloc.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index f043ccb0ab..63ad6893b7 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -83,12 +83,12 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
 }
 
 /*
- * Reserve a new memory region of the requested size to be used for mapping
- * from the given fd (if any).
+ * Reserve a new memory region of the requested size or re-reserve parts
+ * of an existing region to be used for mapping from the given fd (if any).
  */
-static void *mmap_reserve(size_t size, int fd)
+static void *mmap_reserve(void *ptr, size_t size, int fd)
 {
-    int flags = MAP_PRIVATE;
+    int flags = MAP_PRIVATE | (ptr ? MAP_FIXED : 0);
 
 #if defined(__powerpc64__) && defined(__linux__)
     /*
@@ -111,19 +111,23 @@ static void *mmap_reserve(size_t size, int fd)
     flags |= MAP_ANONYMOUS;
 #endif
 
-    return mmap(0, size, PROT_NONE, flags, fd, 0);
+    return mmap(ptr, size, PROT_NONE, flags, fd, 0);
 }
 
 /*
  * Populate memory in a reserved region from the given fd (if any).
  */
-static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
-                           bool is_pmem)
+static void *mmap_populate(void *ptr, size_t size, int fd, size_t fd_offset,
+                           bool shared, bool is_pmem)
 {
     int map_sync_flags = 0;
     int flags = MAP_FIXED;
     void *new_ptr;
 
+    if (fd == -1) {
+        fd_offset = 0;
+    }
+
     flags |= fd == -1 ? MAP_ANONYMOUS : 0;
     flags |= shared ? MAP_SHARED : MAP_PRIVATE;
     if (shared && is_pmem) {
@@ -131,7 +135,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
     }
 
     new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags | map_sync_flags,
-                   fd, 0);
+                   fd, fd_offset);
     if (new_ptr == MAP_FAILED && map_sync_flags) {
         if (errno == ENOTSUP) {
             char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
@@ -153,7 +157,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
          * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
          * again without these flags to handle backwards compatibility.
          */
-        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, 0);
+        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, fd_offset);
     }
     return new_ptr;
 }
@@ -178,13 +182,16 @@ void *qemu_ram_mmap(int fd,
     size_t offset, total;
     void *ptr, *guardptr;
 
+    /* we can only map whole pages */
+    size = QEMU_ALIGN_UP(size, pagesize);
+
     /*
      * Note: this always allocates at least one extra page of virtual address
      * space, even if size is already aligned.
      */
     total = size + align;
 
-    guardptr = mmap_reserve(total, fd);
+    guardptr = mmap_reserve(0, total, fd);
     if (guardptr == MAP_FAILED) {
         return MAP_FAILED;
     }
@@ -195,7 +202,7 @@ void *qemu_ram_mmap(int fd,
 
     offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
 
-    ptr = mmap_populate(guardptr + offset, size, fd, shared, is_pmem);
+    ptr = mmap_populate(guardptr + offset, size, fd, 0, shared, is_pmem);
     if (ptr == MAP_FAILED) {
         munmap(guardptr, total);
         return MAP_FAILED;
@@ -221,6 +228,9 @@ void qemu_ram_munmap(int fd, void *ptr, size_t size)
 {
     const size_t pagesize = mmap_pagesize(fd);
 
+    /* we can only map whole pages */
+    size = QEMU_ALIGN_UP(size, pagesize);
+
     if (ptr) {
         /* Unmap both the RAM block and the guard page */
         munmap(ptr, size + pagesize);
-- 
2.24.1




* [PATCH v1 09/13] util/mmap-alloc: Implement resizable mmaps
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (7 preceding siblings ...)
  2020-02-03 18:31 ` [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-06 12:08   ` Richard Henderson
  2020-02-07  0:29   ` Murilo Opsfelder Araújo
  2020-02-03 18:31 ` [PATCH v1 10/13] numa: Introduce ram_block_notify_resized() and ram_block_notifiers_support_resize() David Hildenbrand
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Greg Kurz, Murilo Opsfelder Araujo,
	Paolo Bonzini, Richard Henderson

Implement resizable mmaps. For now, the actual resizing is not wired up.
Introduce qemu_ram_mmap_resizable() and qemu_ram_mmap_resize(), and make
qemu_ram_mmap() a wrapper around qemu_ram_mmap_resizable().

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Greg Kurz <groug@kaod.org>
Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/qemu/mmap-alloc.h | 21 ++++++++++++-------
 util/mmap-alloc.c         | 44 ++++++++++++++++++++++++++++-----------
 2 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index e786266b92..70bc8e9637 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -7,11 +7,13 @@ size_t qemu_fd_getpagesize(int fd);
 size_t qemu_mempath_getpagesize(const char *mem_path);
 
 /**
- * qemu_ram_mmap: mmap the specified file or device.
+ * qemu_ram_mmap_resizable: reserve a memory region of @max_size to mmap the
+ *                          specified file or device and mmap @size of it.
  *
  * Parameters:
  *  @fd: the file or the device to mmap
  *  @size: the number of bytes to be mmaped
+ *  @max_size: the number of bytes to be reserved
  *  @align: if not zero, specify the alignment of the starting mapping address;
  *          otherwise, the alignment in use will be determined by QEMU.
  *  @shared: map has RAM_SHARED flag.
@@ -21,12 +23,15 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
  *  On success, return a pointer to the mapped area.
  *  On failure, return MAP_FAILED.
  */
-void *qemu_ram_mmap(int fd,
-                    size_t size,
-                    size_t align,
-                    bool shared,
-                    bool is_pmem);
-
-void qemu_ram_munmap(int fd, void *ptr, size_t size);
+void *qemu_ram_mmap_resizable(int fd, size_t size, size_t max_size,
+                              size_t align, bool shared, bool is_pmem);
+void *qemu_ram_mmap_resize(void *ptr, int fd, size_t old_size, size_t new_size,
+                           bool shared, bool is_pmem);
+static inline void *qemu_ram_mmap(int fd, size_t size, size_t align,
+                                  bool shared, bool is_pmem)
+{
+    return qemu_ram_mmap_resizable(fd, size, size, align, shared, is_pmem);
+}
+void qemu_ram_munmap(int fd, void *ptr, size_t max_size);
 
 #endif
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 63ad6893b7..2d562145e9 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -172,11 +172,8 @@ static inline size_t mmap_pagesize(int fd)
 #endif
 }
 
-void *qemu_ram_mmap(int fd,
-                    size_t size,
-                    size_t align,
-                    bool shared,
-                    bool is_pmem)
+void *qemu_ram_mmap_resizable(int fd, size_t size, size_t max_size,
+                              size_t align, bool shared, bool is_pmem)
 {
     const size_t pagesize = mmap_pagesize(fd);
     size_t offset, total;
@@ -184,12 +181,14 @@ void *qemu_ram_mmap(int fd,
 
     /* we can only map whole pages */
     size = QEMU_ALIGN_UP(size, pagesize);
+    max_size = QEMU_ALIGN_UP(max_size, pagesize);
 
     /*
      * Note: this always allocates at least one extra page of virtual address
-     * space, even if size is already aligned.
+     * space, even if the size is already aligned. We will reserve an area of
+     * at least max_size, but only populate the requested part of it.
      */
-    total = size + align;
+    total = max_size + align;
 
     guardptr = mmap_reserve(0, total, fd);
     if (guardptr == MAP_FAILED) {
@@ -217,22 +216,43 @@ void *qemu_ram_mmap(int fd,
      * a guard page guarding against potential buffer overflows.
      */
     total -= offset;
-    if (total > size + pagesize) {
-        munmap(ptr + size + pagesize, total - size - pagesize);
+    if (total > max_size + pagesize) {
+        munmap(ptr + max_size + pagesize, total - max_size - pagesize);
     }
 
     return ptr;
 }
 
-void qemu_ram_munmap(int fd, void *ptr, size_t size)
+void *qemu_ram_mmap_resize(void *ptr, int fd, size_t old_size, size_t new_size,
+                           bool shared, bool is_pmem)
 {
     const size_t pagesize = mmap_pagesize(fd);
 
     /* we can only map whole pages */
-    size = QEMU_ALIGN_UP(size, pagesize);
+    old_size = QEMU_ALIGN_UP(old_size, pagesize);
+    new_size = QEMU_ALIGN_UP(new_size, pagesize);
+
+    /* we support actually resizable memory regions only on Linux */
+    if (old_size < new_size) {
+        /* populate the missing piece into the reserved area */
+        ptr = mmap_populate(ptr + old_size, new_size - old_size, fd, old_size,
+                            shared, is_pmem);
+    } else if (old_size > new_size) {
+        /* discard this piece, keeping the area reserved (should never fail) */
+        ptr = mmap_reserve(ptr + new_size, old_size - new_size, fd);
+    }
+    return ptr;
+}
+
+void qemu_ram_munmap(int fd, void *ptr, size_t max_size)
+{
+    const size_t pagesize = mmap_pagesize(fd);
+
+    /* we can only map whole pages */
+    max_size = QEMU_ALIGN_UP(max_size, pagesize);
 
     if (ptr) {
         /* Unmap both the RAM block and the guard page */
-        munmap(ptr, size + pagesize);
+        munmap(ptr, max_size + pagesize);
     }
 }
-- 
2.24.1




* [PATCH v1 10/13] numa: Introduce ram_block_notify_resized() and ram_block_notifiers_support_resize()
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (8 preceding siblings ...)
  2020-02-03 18:31 ` [PATCH v1 09/13] util/mmap-alloc: Implement " David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-03 18:31 ` [PATCH v1 11/13] util: vfio-helpers: Implement ram_block_resized() David Hildenbrand
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Paolo Bonzini, Richard Henderson

We want to actually resize ram blocks (making everything between
used_length and max_length inaccessible); however, not all ram block
notifiers will support that.

So introduce a way to detect if any registered notifier does not support
it and add a way to notify all notifiers that support it.

Using ram_block_notifiers_support_resize(), we can keep the existing
handling for these special cases (e.g., SEV, HAX) until they implement
resize support.
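
The "all notifiers must opt in" check can be sketched with a stand-in
notifier list (struct and function names below are illustrative, not
QEMU's): resizing is only usable when every registered notifier provides
a resize hook.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for QEMU's RAMBlockNotifier list. */
struct sketch_notifier {
    void (*resized)(void *host, size_t oldsize, size_t newsize);
    struct sketch_notifier *next;
};

static void sketch_resized(void *host, size_t oldsize, size_t newsize)
{
    (void)host; (void)oldsize; (void)newsize;
}

/* Resize is supported only if no notifier lacks the hook. */
static int sketch_notifiers_support_resize(struct sketch_notifier *head)
{
    struct sketch_notifier *n;

    for (n = head; n; n = n->next) {
        if (!n->resized) {
            return 0;
        }
    }
    return 1;
}
```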

Cc: Richard Henderson <rth@twiddle.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/core/numa.c         | 21 +++++++++++++++++++++
 include/exec/ramlist.h |  4 ++++
 util/trace-events      |  1 +
 3 files changed, 26 insertions(+)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index fed4046680..5ccfcbcd41 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -940,3 +940,24 @@ void ram_block_notify_remove(void *host, size_t size)
         notifier->ram_block_removed(notifier, host, size);
     }
 }
+
+void ram_block_notify_resized(void *host, size_t oldsize, size_t newsize)
+{
+    RAMBlockNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+        notifier->ram_block_resized(notifier, host, oldsize, newsize);
+    }
+}
+
+bool ram_block_notifiers_support_resize(void)
+{
+    RAMBlockNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+        if (!notifier->ram_block_resized) {
+            return false;
+        }
+    }
+    return true;
+}
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index bc4faa1b00..33a380cbee 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -67,6 +67,8 @@ void qemu_mutex_unlock_ramlist(void);
 struct RAMBlockNotifier {
     void (*ram_block_added)(RAMBlockNotifier *n, void *host, size_t size);
     void (*ram_block_removed)(RAMBlockNotifier *n, void *host, size_t size);
+    void (*ram_block_resized)(RAMBlockNotifier *n, void *host, size_t oldsize,
+                              size_t newsize);
     QLIST_ENTRY(RAMBlockNotifier) next;
 };
 
@@ -74,6 +76,8 @@ void ram_block_notifier_add(RAMBlockNotifier *n);
 void ram_block_notifier_remove(RAMBlockNotifier *n);
 void ram_block_notify_add(void *host, size_t size);
 void ram_block_notify_remove(void *host, size_t size);
+void ram_block_notify_resized(void *host, size_t oldsize, size_t newsize);
+bool ram_block_notifiers_support_resize(void);
 
 void ram_block_dump(Monitor *mon);
 
diff --git a/util/trace-events b/util/trace-events
index 83b6639018..226f406c46 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -76,6 +76,7 @@ qemu_mutex_unlock(void *mutex, const char *file, const int line) "released mutex
 qemu_vfio_dma_reset_temporary(void *s) "s %p"
 qemu_vfio_ram_block_added(void *s, void *p, size_t size) "s %p host %p size 0x%zx"
 qemu_vfio_ram_block_removed(void *s, void *p, size_t size) "s %p host %p size 0x%zx"
+qemu_vfio_ram_block_resized(void *s, void *p, size_t oldsize, size_t newsize) "s %p host %p oldsize 0x%zx newsize 0x%zx"
 qemu_vfio_find_mapping(void *s, void *p) "s %p host %p"
 qemu_vfio_new_mapping(void *s, void *host, size_t size, int index, uint64_t iova) "s %p host %p size %zu index %d iova 0x%"PRIx64
 qemu_vfio_do_mapping(void *s, void *host, size_t size, uint64_t iova) "s %p host %p size %zu iova 0x%"PRIx64
-- 
2.24.1




* [PATCH v1 11/13] util: vfio-helpers: Implement ram_block_resized()
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (9 preceding siblings ...)
  2020-02-03 18:31 ` [PATCH v1 10/13] numa: Introduce ram_block_notify_resized() and ram_block_notifiers_support_resize() David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-10 13:41   ` David Hildenbrand
  2020-02-03 18:31 ` [PATCH v1 12/13] util: oslib: Resizable anonymous allocations under POSIX David Hildenbrand
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Markus Armbruster, Alex Williamson,
	Paolo Bonzini, Richard Henderson

Let's implement ram_block_resized().

Note: Resizing is currently only allowed during reboot or when migration
starts.

Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/vfio-helpers.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 71e02e7f35..57d77e9480 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -395,11 +395,24 @@ static void qemu_vfio_ram_block_removed(RAMBlockNotifier *n,
     }
 }
 
+static void qemu_vfio_ram_block_resized(RAMBlockNotifier *n, void *host,
+                                        size_t oldsize, size_t newsize)
+{
+    QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
+    if (host) {
+        trace_qemu_vfio_ram_block_resized(s, host, oldsize, newsize);
+        /* Note: Not atomic - we need a new ioctl for that. */
+        qemu_vfio_ram_block_removed(n, host, oldsize);
+        qemu_vfio_ram_block_added(n, host, newsize);
+    }
+}
+
 static void qemu_vfio_open_common(QEMUVFIOState *s)
 {
     qemu_mutex_init(&s->lock);
     s->ram_notifier.ram_block_added = qemu_vfio_ram_block_added;
     s->ram_notifier.ram_block_removed = qemu_vfio_ram_block_removed;
+    s->ram_notifier.ram_block_resized = qemu_vfio_ram_block_resized;
     s->low_water_mark = QEMU_VFIO_IOVA_MIN;
     s->high_water_mark = QEMU_VFIO_IOVA_MAX;
     ram_block_notifier_add(&s->ram_notifier);
-- 
2.24.1




* [PATCH v1 12/13] util: oslib: Resizable anonymous allocations under POSIX
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (10 preceding siblings ...)
  2020-02-03 18:31 ` [PATCH v1 11/13] util: vfio-helpers: Implement ram_block_resized() David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-03 18:31 ` [PATCH v1 13/13] exec: Ram blocks with resizable " David Hildenbrand
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Stefan Weil,
	David Hildenbrand, Dr . David Alan Gilbert, Paolo Bonzini,
	Richard Henderson

Introduce qemu_anon_ram_alloc_resizable() and qemu_anon_ram_resize().
Implement them under POSIX and make them return NULL under WIN32.

Under POSIX, we make use of resizable mmaps. An implementation under
WIN32 should be possible and can be added later.

In qemu_anon_ram_free(), rename the size parameter to max_size, to make
it clearer that we have to use the maximum size when freeing resizable
anonymous allocations.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Stefan Weil <sw@weilnetz.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/qemu/osdep.h |  6 +++++-
 util/oslib-posix.c   | 37 ++++++++++++++++++++++++++++++++++---
 util/oslib-win32.c   | 14 ++++++++++++++
 util/trace-events    |  4 +++-
 4 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 9bd3dcfd13..57b7f40f56 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -311,8 +311,12 @@ int qemu_daemon(int nochdir, int noclose);
 void *qemu_try_memalign(size_t alignment, size_t size);
 void *qemu_memalign(size_t alignment, size_t size);
 void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared);
+void *qemu_anon_ram_alloc_resizable(size_t size, size_t max_size,
+                                    uint64_t *align, bool shared);
+void *qemu_anon_ram_resize(void *ptr, size_t old_size, size_t new_size,
+                           bool shared);
 void qemu_vfree(void *ptr);
-void qemu_anon_ram_free(void *ptr, size_t size);
+void qemu_anon_ram_free(void *ptr, size_t max_size);
 
 #define QEMU_MADV_INVALID -1
 
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 5a291cc982..e487a0e2c2 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -219,16 +219,47 @@ void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
     return ptr;
 }
 
+void *qemu_anon_ram_alloc_resizable(size_t size, size_t max_size,
+                                    uint64_t *alignment, bool shared)
+{
+    size_t align = QEMU_VMALLOC_ALIGN;
+    void *ptr = qemu_ram_mmap_resizable(-1, size, max_size, align, shared,
+                                        false);
+
+    if (ptr == MAP_FAILED) {
+        return NULL;
+    }
+
+    if (alignment) {
+        *alignment = align;
+    }
+
+    trace_qemu_anon_ram_alloc_resizable(size, max_size, ptr);
+    return ptr;
+}
+
+void *qemu_anon_ram_resize(void *ptr, size_t old_size, size_t new_size,
+                           bool shared)
+{
+    ptr = qemu_ram_mmap_resize(ptr, -1, old_size, new_size, shared, false);
+    if (ptr == MAP_FAILED) {
+        return NULL;
+    }
+
+    trace_qemu_anon_ram_resize(old_size, new_size, ptr);
+    return ptr;
+}
+
 void qemu_vfree(void *ptr)
 {
     trace_qemu_vfree(ptr);
     free(ptr);
 }
 
-void qemu_anon_ram_free(void *ptr, size_t size)
+void qemu_anon_ram_free(void *ptr, size_t max_size)
 {
-    trace_qemu_anon_ram_free(ptr, size);
-    qemu_ram_munmap(-1, ptr, size);
+    trace_qemu_anon_ram_free(ptr, max_size);
+    qemu_ram_munmap(-1, ptr, max_size);
 }
 
 void qemu_set_block(int fd)
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index e9b14ab178..caec028041 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -90,6 +90,20 @@ void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared)
     return ptr;
 }
 
+void *qemu_anon_ram_alloc_resizable(size_t size, size_t max_size,
+                                    uint64_t *align, bool shared)
+{
+    /* resizable ram not implemented yet */
+    return NULL;
+}
+
+void *qemu_anon_ram_resize(void *ptr, size_t old_size, size_t new_size,
+                           bool shared)
+{
+    /* resizable ram not implemented yet */
+    return NULL;
+}
+
 void qemu_vfree(void *ptr)
 {
     trace_qemu_vfree(ptr);
diff --git a/util/trace-events b/util/trace-events
index 226f406c46..05ec1eb9f3 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -46,8 +46,10 @@ qemu_co_mutex_unlock_return(void *mutex, void *self) "mutex %p self %p"
 # oslib-posix.c
 qemu_memalign(size_t alignment, size_t size, void *ptr) "alignment %zu size %zu ptr %p"
 qemu_anon_ram_alloc(size_t size, void *ptr) "size %zu ptr %p"
+qemu_anon_ram_alloc_resizable(size_t size, size_t max_size, void *ptr) "size %zu max_size %zu ptr %p"
+qemu_anon_ram_resize(size_t old_size, size_t new_size, void *ptr) "old_size %zu new_size %zu ptr %p"
 qemu_vfree(void *ptr) "ptr %p"
-qemu_anon_ram_free(void *ptr, size_t size) "ptr %p size %zu"
+qemu_anon_ram_free(void *ptr, size_t max_size) "ptr %p max_size %zu"
 
 # hbitmap.c
 hbitmap_iter_skip_words(const void *hb, void *hbi, uint64_t pos, unsigned long cur) "hb %p hbi %p pos %"PRId64" cur 0x%lx"
-- 
2.24.1




* [PATCH v1 13/13] exec: Ram blocks with resizable anonymous allocations under POSIX
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (11 preceding siblings ...)
  2020-02-03 18:31 ` [PATCH v1 12/13] util: oslib: Resizable anonymous allocations under POSIX David Hildenbrand
@ 2020-02-03 18:31 ` David Hildenbrand
  2020-02-10 10:12   ` David Hildenbrand
  2020-02-06  9:27 ` [PATCH v1 00/13] " Michael S. Tsirkin
  2020-02-06 20:11 ` Dr. David Alan Gilbert
  14 siblings, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2020-02-03 18:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Stefan Weil,
	David Hildenbrand, Dr . David Alan Gilbert, Paolo Bonzini,
	Richard Henderson

We can now make use of resizable anonymous allocations to implement
actually resizable ram blocks. Resizable anonymous allocations are
not implemented under WIN32 yet and are not available when using
alternative allocators. Fall back to the existing handling.

We also have to fall back to the existing handling in case any ram block
notifier (esp., AMD SEV, HAX) does not support resizing yet. Remember
in RAM_RESIZEABLE_ALLOC whether we are using resizable anonymous
allocations.

As the mmap()-hackery will invalidate some madvise settings, we have to
re-apply them after resizing. After resizing, notify the ram block
notifiers.

The benefit of actually resizable ram blocks is that, e.g., under Linux,
only the actual size will be reserved (even if
"/proc/sys/vm/overcommit_memory" is set to "never"). Additional memory will
be reserved when resizing, which allows ram blocks that start small but
can theoretically grow very large.
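
The flag logic introduced here can be sketched in isolation. The names and
bit values below are illustrative (only RAM_RESIZEABLE_ALLOC's value
matches the patch): the new flag implies the old one and is set only when
the default allocator is in use and all notifiers can handle resizing.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits mirroring the RAM_RESIZEABLE /
 * RAM_RESIZEABLE_ALLOC relationship from the patch. */
#define SK_RAM_RESIZEABLE       (1 << 2)
#define SK_RAM_RESIZEABLE_ALLOC (1 << 6)

static unsigned sketch_alloc_flags(unsigned flags, bool default_allocator,
                                   bool notifiers_ok)
{
    if ((flags & SK_RAM_RESIZEABLE) && default_allocator && notifiers_ok) {
        flags |= SK_RAM_RESIZEABLE_ALLOC;
    }
    return flags;
}
```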

Cc: Richard Henderson <rth@twiddle.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Stefan Weil <sw@weilnetz.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 exec.c                    | 68 +++++++++++++++++++++++++++++++++++----
 hw/core/numa.c            | 10 ++++--
 include/exec/cpu-common.h |  2 ++
 include/exec/memory.h     |  8 +++++
 4 files changed, 79 insertions(+), 9 deletions(-)

diff --git a/exec.c b/exec.c
index fc65c4f7ca..a59d1efde3 100644
--- a/exec.c
+++ b/exec.c
@@ -2053,6 +2053,16 @@ void qemu_ram_unset_migratable(RAMBlock *rb)
     rb->flags &= ~RAM_MIGRATABLE;
 }
 
+bool qemu_ram_is_resizable(RAMBlock *rb)
+{
+    return rb->flags & RAM_RESIZEABLE;
+}
+
+bool qemu_ram_is_resizable_alloc(RAMBlock *rb)
+{
+    return rb->flags & RAM_RESIZEABLE_ALLOC;
+}
+
 /* Called with iothread lock held.  */
 void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
 {
@@ -2139,6 +2149,8 @@ static void qemu_ram_apply_settings(void *host, size_t length)
  */
 int qemu_ram_resize(RAMBlock *block, ram_addr_t newsize, Error **errp)
 {
+    const uint64_t oldsize = block->used_length;
+
     assert(block);
 
     newsize = HOST_PAGE_ALIGN(newsize);
@@ -2147,7 +2159,7 @@ int qemu_ram_resize(RAMBlock *block, ram_addr_t newsize, Error **errp)
         return 0;
     }
 
-    if (!(block->flags & RAM_RESIZEABLE)) {
+    if (!qemu_ram_is_resizable(block)) {
         error_setg_errno(errp, EINVAL,
                          "Length mismatch: %s: 0x" RAM_ADDR_FMT
                          " in != 0x" RAM_ADDR_FMT, block->idstr,
@@ -2163,10 +2175,26 @@ int qemu_ram_resize(RAMBlock *block, ram_addr_t newsize, Error **errp)
         return -EINVAL;
     }
 
+    if (qemu_ram_is_resizable_alloc(block)) {
+        g_assert(ram_block_notifiers_support_resize());
+        if (qemu_anon_ram_resize(block->host, block->used_length,
+                                 newsize, block->flags & RAM_SHARED) == NULL) {
+            error_setg_errno(errp, -ENOMEM,
+                             "Could not allocate enough memory.");
+            return -ENOMEM;
+        }
+    }
+
     cpu_physical_memory_clear_dirty_range(block->offset, block->used_length);
     block->used_length = newsize;
     cpu_physical_memory_set_dirty_range(block->offset, block->used_length,
                                         DIRTY_CLIENTS_ALL);
+    if (block->host && qemu_ram_is_resizable_alloc(block)) {
+        /* re-apply settings that might have been overridden by the resize */
+        qemu_ram_apply_settings(block->host, block->max_length);
+        ram_block_notify_resized(block->host, oldsize, block->used_length);
+    }
+
     memory_region_set_size(block->mr, newsize);
     if (block->resized) {
         block->resized(block->idstr, newsize, block->host);
@@ -2249,6 +2277,28 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
     }
 }
 
+static void ram_block_alloc_ram(RAMBlock *rb)
+{
+    const bool shared = qemu_ram_is_shared(rb);
+
+    /*
+     * If we can, try to allocate actually resizable ram. Will also fail
+     * if qemu_anon_ram_alloc_resizable() is not implemented.
+     */
+    if (phys_mem_alloc == qemu_anon_ram_alloc &&
+        qemu_ram_is_resizable(rb) &&
+        ram_block_notifiers_support_resize()) {
+        rb->host = qemu_anon_ram_alloc_resizable(rb->used_length,
+                                                 rb->max_length, &rb->mr->align,
+                                                 shared);
+        if (rb->host) {
+            rb->flags |= RAM_RESIZEABLE_ALLOC;
+            return;
+        }
+    }
+    rb->host = phys_mem_alloc(rb->max_length, &rb->mr->align, shared);
+}
+
 static void ram_block_add(RAMBlock *new_block, Error **errp)
 {
     RAMBlock *block;
@@ -2271,9 +2321,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                 return;
             }
         } else {
-            new_block->host = phys_mem_alloc(new_block->max_length,
-                                             &new_block->mr->align,
-                                             qemu_ram_is_shared(new_block));
+            ram_block_alloc_ram(new_block);
             if (!new_block->host) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
@@ -2319,7 +2367,11 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
 
     if (new_block->host) {
         qemu_ram_apply_settings(new_block->host, new_block->max_length);
-        ram_block_notify_add(new_block->host, new_block->max_length);
+        if (qemu_ram_is_resizable_alloc(new_block)) {
+            ram_block_notify_add(new_block->host, new_block->used_length);
+        } else {
+            ram_block_notify_add(new_block->host, new_block->max_length);
+        }
     }
 }
 
@@ -2502,7 +2554,11 @@ void qemu_ram_free(RAMBlock *block)
     }
 
     if (block->host) {
-        ram_block_notify_remove(block->host, block->max_length);
+        if (qemu_ram_is_resizable_alloc(block)) {
+            ram_block_notify_remove(block->host, block->used_length);
+        } else {
+            ram_block_notify_remove(block->host, block->max_length);
+        }
     }
 
     qemu_mutex_lock_ramlist();
diff --git a/hw/core/numa.c b/hw/core/numa.c
index 5ccfcbcd41..cb75097b26 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -901,12 +901,16 @@ void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms)
 
 static int ram_block_notify_add_single(RAMBlock *rb, void *opaque)
 {
-    ram_addr_t size = qemu_ram_get_max_length(rb);
     void *host = qemu_ram_get_host_addr(rb);
     RAMBlockNotifier *notifier = opaque;
 
-    if (host) {
-        notifier->ram_block_added(notifier, host, size);
+    if (!host) {
+        return 0;
+    }
+    if (qemu_ram_is_resizable_alloc(rb)) {
+        notifier->ram_block_added(notifier, host, qemu_ram_get_used_length(rb));
+    } else {
+        notifier->ram_block_added(notifier, host, qemu_ram_get_max_length(rb));
     }
     return 0;
 }
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 9760ac9068..a9c76bd5ef 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -66,6 +66,8 @@ void qemu_ram_set_uf_zeroable(RAMBlock *rb);
 bool qemu_ram_is_migratable(RAMBlock *rb);
 void qemu_ram_set_migratable(RAMBlock *rb);
 void qemu_ram_unset_migratable(RAMBlock *rb);
+bool qemu_ram_is_resizable(RAMBlock *rb);
+bool qemu_ram_is_resizable_alloc(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
 size_t qemu_ram_pagesize_largest(void);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index e85b7de99a..19417943a2 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -129,6 +129,14 @@ typedef struct IOMMUNotifier IOMMUNotifier;
 /* RAM is a persistent kind memory */
 #define RAM_PMEM (1 << 5)
 
+/*
+ * Implies RAM_RESIZEABLE. Memory beyond the used_length is inaccessible
+ * (esp. initially and after resizing). For such memory blocks, only the
+ * used_length is reserved in the OS - resizing might fail. Will only be
+ * used with host OS support and if all ram block notifiers support resizing.
+ */
+#define RAM_RESIZEABLE_ALLOC (1 << 6)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread
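[Archive note: the notifier change in the diff above boils down to choosing which length a block announces. A minimal sketch of that choice follows; the struct and names are simplified stand-ins for QEMU's RAMBlock, not the real API.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for QEMU's RAMBlock; not the real structure. */
typedef struct {
    size_t used_length;     /* currently accessible part */
    size_t max_length;      /* full reserved address-space size */
    bool resizable_alloc;   /* RAM_RESIZEABLE_ALLOC semantics */
} Block;

/*
 * Mirrors the logic in the diff above: blocks with resizable allocations
 * announce only the accessible used_length to ram block notifiers; all
 * other blocks keep announcing the full max_length.
 */
static size_t notify_size(const Block *b)
{
    return b->resizable_alloc ? b->used_length : b->max_length;
}
```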

* Re: [PATCH v1 05/13] util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize()
  2020-02-03 18:31 ` [PATCH v1 05/13] util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize() David Hildenbrand
@ 2020-02-05 19:37   ` Murilo Opsfelder Araújo
  2020-02-06 11:46   ` Richard Henderson
  1 sibling, 0 replies; 44+ messages in thread
From: Murilo Opsfelder Araújo @ 2020-02-05 19:37 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Paolo Bonzini, Richard Henderson

Hello, David.

On 2/3/20 3:31 PM, David Hildenbrand wrote:
> Factor it out and add a comment.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>

> ---
>   util/mmap-alloc.c | 21 ++++++++++++---------
>   1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 27dcccd8ec..82f02a2cec 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -82,17 +82,27 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
>       return qemu_real_host_page_size;
>   }
>   
> +static inline size_t mmap_pagesize(int fd)
> +{
> +#if defined(__powerpc64__) && defined(__linux__)
> +    /* Mappings in the same segment must share the same page size */
> +    return qemu_fd_getpagesize(fd);
> +#else
> +    return qemu_real_host_page_size;
> +#endif
> +}
> +
>   void *qemu_ram_mmap(int fd,
>                       size_t size,
>                       size_t align,
>                       bool shared,
>                       bool is_pmem)
>   {
> +    const size_t pagesize = mmap_pagesize(fd);
>       int flags;
>       int map_sync_flags = 0;
>       int guardfd;
>       size_t offset;
> -    size_t pagesize;
>       size_t total;
>       void *guardptr;
>       void *ptr;
> @@ -113,7 +123,6 @@ void *qemu_ram_mmap(int fd,
>        * anonymous memory is OK.
>        */
>       flags = MAP_PRIVATE;
> -    pagesize = qemu_fd_getpagesize(fd);
>       if (fd == -1 || pagesize == qemu_real_host_page_size) {
>           guardfd = -1;
>           flags |= MAP_ANONYMOUS;
> @@ -123,7 +132,6 @@ void *qemu_ram_mmap(int fd,
>       }
>   #else
>       guardfd = -1;
> -    pagesize = qemu_real_host_page_size;
>       flags = MAP_PRIVATE | MAP_ANONYMOUS;
>   #endif
>   
> @@ -198,15 +206,10 @@ void *qemu_ram_mmap(int fd,
>   
>   void qemu_ram_munmap(int fd, void *ptr, size_t size)
>   {
> -    size_t pagesize;
> +    const size_t pagesize = mmap_pagesize(fd);
>   
>       if (ptr) {
>           /* Unmap both the RAM block and the guard page */
> -#if defined(__powerpc64__) && defined(__linux__)
> -        pagesize = qemu_fd_getpagesize(fd);
> -#else
> -        pagesize = qemu_real_host_page_size;
> -#endif
>           munmap(ptr, size + pagesize);
>       }
>   }
> 

-- 
Murilo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 06/13] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve()
  2020-02-03 18:31 ` [PATCH v1 06/13] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve() David Hildenbrand
@ 2020-02-05 19:40   ` Murilo Opsfelder Araújo
  2020-02-06 11:55   ` Richard Henderson
  1 sibling, 0 replies; 44+ messages in thread
From: Murilo Opsfelder Araújo @ 2020-02-05 19:40 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Paolo Bonzini, Richard Henderson

Hello, David.

On 2/3/20 3:31 PM, David Hildenbrand wrote:
> We want to reserve a memory region without actually populating memory.
> Let's factor that out.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>

> ---
>   util/mmap-alloc.c | 58 +++++++++++++++++++++++++++--------------------
>   1 file changed, 33 insertions(+), 25 deletions(-)
> 
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 82f02a2cec..43a26f38a8 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -82,6 +82,38 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
>       return qemu_real_host_page_size;
>   }
>   
> +/*
> + * Reserve a new memory region of the requested size to be used for mapping
> + * from the given fd (if any).
> + */
> +static void *mmap_reserve(size_t size, int fd)
> +{
> +    int flags = MAP_PRIVATE;
> +
> +#if defined(__powerpc64__) && defined(__linux__)
> +    /*
> +     * On ppc64 mappings in the same segment (aka slice) must share the same
> +     * page size. Since we will be re-allocating part of this segment
> +     * from the supplied fd, we should make sure to use the same page size, to
> +     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
> +     * avoid allocating backing store memory.
> +     * We do this unless we are using the system page size, in which case
> +     * anonymous memory is OK.
> +     */
> +    if (fd == -1 || qemu_fd_getpagesize(fd) == qemu_real_host_page_size) {
> +        fd = -1;
> +        flags |= MAP_ANONYMOUS;
> +    } else {
> +        flags |= MAP_NORESERVE;
> +    }
> +#else
> +    fd = -1;
> +    flags |= MAP_ANONYMOUS;
> +#endif
> +
> +    return mmap(0, size, PROT_NONE, flags, fd, 0);
> +}
> +
>   static inline size_t mmap_pagesize(int fd)
>   {
>   #if defined(__powerpc64__) && defined(__linux__)
> @@ -101,7 +133,6 @@ void *qemu_ram_mmap(int fd,
>       const size_t pagesize = mmap_pagesize(fd);
>       int flags;
>       int map_sync_flags = 0;
> -    int guardfd;
>       size_t offset;
>       size_t total;
>       void *guardptr;
> @@ -113,30 +144,7 @@ void *qemu_ram_mmap(int fd,
>        */
>       total = size + align;
>   
> -#if defined(__powerpc64__) && defined(__linux__)
> -    /* On ppc64 mappings in the same segment (aka slice) must share the same
> -     * page size. Since we will be re-allocating part of this segment
> -     * from the supplied fd, we should make sure to use the same page size, to
> -     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
> -     * avoid allocating backing store memory.
> -     * We do this unless we are using the system page size, in which case
> -     * anonymous memory is OK.
> -     */
> -    flags = MAP_PRIVATE;
> -    if (fd == -1 || pagesize == qemu_real_host_page_size) {
> -        guardfd = -1;
> -        flags |= MAP_ANONYMOUS;
> -    } else {
> -        guardfd = fd;
> -        flags |= MAP_NORESERVE;
> -    }
> -#else
> -    guardfd = -1;
> -    flags = MAP_PRIVATE | MAP_ANONYMOUS;
> -#endif
> -
> -    guardptr = mmap(0, total, PROT_NONE, flags, guardfd, 0);
> -
> +    guardptr = mmap_reserve(total, fd);
>       if (guardptr == MAP_FAILED) {
>           return MAP_FAILED;
>       }
> 

-- 
Murilo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 07/13] util/mmap-alloc: Factor out populating of memory to mmap_populate()
  2020-02-03 18:31 ` [PATCH v1 07/13] util/mmap-alloc: Factor out populating of memory to mmap_populate() David Hildenbrand
@ 2020-02-05 19:56   ` Murilo Opsfelder Araújo
  2020-02-06  9:26     ` David Hildenbrand
  2020-02-06 11:59   ` Richard Henderson
  1 sibling, 1 reply; 44+ messages in thread
From: Murilo Opsfelder Araújo @ 2020-02-05 19:56 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Paolo Bonzini, Richard Henderson

Hello, David.

On 2/3/20 3:31 PM, David Hildenbrand wrote:
> We want to populate memory within a reserved memory region. Let's factor
> that out.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---

Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>

A minor comment below.

>   util/mmap-alloc.c | 89 +++++++++++++++++++++++++----------------------
>   1 file changed, 47 insertions(+), 42 deletions(-)
>
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 43a26f38a8..f043ccb0ab 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -114,6 +114,50 @@ static void *mmap_reserve(size_t size, int fd)
>       return mmap(0, size, PROT_NONE, flags, fd, 0);
>   }
>
> +/*
> + * Populate memory in a reserved region from the given fd (if any).
> + */
> +static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
> +                           bool is_pmem)
> +{
> +    int map_sync_flags = 0;
> +    int flags = MAP_FIXED;
> +    void *new_ptr;

Do you think another name would be welcome here, e.g.: "populated_ptr" or
"populated_memptr" or just "populated"?

> +
> +    flags |= fd == -1 ? MAP_ANONYMOUS : 0;
> +    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
> +    if (shared && is_pmem) {
> +        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
> +    }
> +
> +    new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags | map_sync_flags,
> +                   fd, 0);
> +    if (new_ptr == MAP_FAILED && map_sync_flags) {
> +        if (errno == ENOTSUP) {
> +            char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
> +            char *file_name = g_malloc0(PATH_MAX);
> +            int len = readlink(proc_link, file_name, PATH_MAX - 1);
> +
> +            if (len < 0) {
> +                len = 0;
> +            }
> +            file_name[len] = '\0';
> +            fprintf(stderr, "Warning: requesting persistence across crashes "
> +                    "for backend file %s failed. Proceeding without "
> +                    "persistence, data might become corrupted in case of host "
> +                    "crash.\n", file_name);
> +            g_free(proc_link);
> +            g_free(file_name);
> +        }
> +        /*
> +         * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
> +         * again without these flags to handle backwards compatibility.
> +         */
> +        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, 0);
> +    }
> +    return new_ptr;
> +}
> +
>   static inline size_t mmap_pagesize(int fd)
>   {
>   #if defined(__powerpc64__) && defined(__linux__)
> @@ -131,12 +175,8 @@ void *qemu_ram_mmap(int fd,
>                       bool is_pmem)
>   {
>       const size_t pagesize = mmap_pagesize(fd);
> -    int flags;
> -    int map_sync_flags = 0;
> -    size_t offset;
> -    size_t total;
> -    void *guardptr;
> -    void *ptr;
> +    size_t offset, total;
> +    void *ptr, *guardptr;
>
>       /*
>        * Note: this always allocates at least one extra page of virtual address
> @@ -153,44 +193,9 @@ void *qemu_ram_mmap(int fd,
>       /* Always align to host page size */
>       assert(align >= pagesize);
>
> -    flags = MAP_FIXED;
> -    flags |= fd == -1 ? MAP_ANONYMOUS : 0;
> -    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
> -    if (shared && is_pmem) {
> -        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
> -    }
> -
>       offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
>
> -    ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE,
> -               flags | map_sync_flags, fd, 0);
> -
> -    if (ptr == MAP_FAILED && map_sync_flags) {
> -        if (errno == ENOTSUP) {
> -            char *proc_link, *file_name;
> -            int len;
> -            proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
> -            file_name = g_malloc0(PATH_MAX);
> -            len = readlink(proc_link, file_name, PATH_MAX - 1);
> -            if (len < 0) {
> -                len = 0;
> -            }
> -            file_name[len] = '\0';
> -            fprintf(stderr, "Warning: requesting persistence across crashes "
> -                    "for backend file %s failed. Proceeding without "
> -                    "persistence, data might become corrupted in case of host "
> -                    "crash.\n", file_name);
> -            g_free(proc_link);
> -            g_free(file_name);
> -        }
> -        /*
> -         * if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
> -         * we will remove these flags to handle compatibility.
> -         */
> -        ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE,
> -                   flags, fd, 0);
> -    }
> -
> +    ptr = mmap_populate(guardptr + offset, size, fd, shared, is_pmem);
>       if (ptr == MAP_FAILED) {
>           munmap(guardptr, total);
>           return MAP_FAILED;
>
--
Murilo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps
  2020-02-03 18:31 ` [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps David Hildenbrand
@ 2020-02-05 23:00   ` Murilo Opsfelder Araújo
  2020-02-06  8:52     ` David Hildenbrand
  2020-02-06 15:13     ` David Hildenbrand
  2020-02-06 12:02   ` Richard Henderson
  1 sibling, 2 replies; 44+ messages in thread
From: Murilo Opsfelder Araújo @ 2020-02-05 23:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Greg Kurz, Paolo Bonzini,
	Richard Henderson

Hello, David.

On Monday, February 3, 2020 3:31:20 PM -03 David Hildenbrand wrote:
> When shrinking a mmap we want to re-reserve the already populated area.
> When growing a memory region, we want to populate starting with a given
> fd_offset. Prepare by allowing to pass these parameters.
> 
> Also, let's make sure we always process full pages, to avoid
> unmapping/remapping pages that are already in use when
> growing/shrinking. (existing callers seem to guarantee this, but that's
> not obvious)
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  util/mmap-alloc.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index f043ccb0ab..63ad6893b7 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -83,12 +83,12 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
>  }
> 
>  /*
> - * Reserve a new memory region of the requested size to be used for mapping
> - * from the given fd (if any).
> + * Reserve a new memory region of the requested size or re-reserve parts
> + * of an existing region to be used for mapping from the given fd (if any).
> */
> -static void *mmap_reserve(size_t size, int fd)
> +static void *mmap_reserve(void *ptr, size_t size, int fd)
>  {
> -    int flags = MAP_PRIVATE;
> +    int flags = MAP_PRIVATE | (ptr ? MAP_FIXED : 0);
> 
>  #if defined(__powerpc64__) && defined(__linux__)
>      /*
> @@ -111,19 +111,23 @@ static void *mmap_reserve(size_t size, int fd)
>      flags |= MAP_ANONYMOUS;
>  #endif
> 
> -    return mmap(0, size, PROT_NONE, flags, fd, 0);
> +    return mmap(ptr, size, PROT_NONE, flags, fd, 0);
>  }
> 
>  /*
>   * Populate memory in a reserved region from the given fd (if any).
>   */
> -static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
> -                           bool is_pmem)
> +static void *mmap_populate(void *ptr, size_t size, int fd, size_t fd_offset,
> +                           bool shared, bool is_pmem)
>  {
>      int map_sync_flags = 0;
>      int flags = MAP_FIXED;
>      void *new_ptr;
> 
> +    if (fd == -1) {
> +        fd_offset = 0;
> +    }
> +
>      flags |= fd == -1 ? MAP_ANONYMOUS : 0;
>      flags |= shared ? MAP_SHARED : MAP_PRIVATE;
>      if (shared && is_pmem) {
> @@ -131,7 +135,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>      }
> 
>      new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags | map_sync_flags,
> -                   fd, 0);
> +                   fd, fd_offset);
>      if (new_ptr == MAP_FAILED && map_sync_flags) {
>          if (errno == ENOTSUP) {
>              char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
> @@ -153,7 +157,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>           * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
>           * again without these flags to handle backwards compatibility.
>           */
> -        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, 0);
> +        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, fd_offset);
>      }
>      return new_ptr;
>  }
> @@ -178,13 +182,16 @@ void *qemu_ram_mmap(int fd,
>      size_t offset, total;
>      void *ptr, *guardptr;
> 
> +    /* we can only map whole pages */
> +    size = QEMU_ALIGN_UP(size, pagesize);
> +

The caller already rounds up size to block->page_size.

Why is this QEMU_ALIGN_UP necessary?

>      /*
>       * Note: this always allocates at least one extra page of virtual address
>       * space, even if size is already aligned.
>       */
>      total = size + align;

If size was already aligned to the pagesize boundary above, why would this
align be necessary?

Can the pagesize differ from the memory region alignment?

> 
> -    guardptr = mmap_reserve(total, fd);
> +    guardptr = mmap_reserve(0, total, fd);
>      if (guardptr == MAP_FAILED) {
>          return MAP_FAILED;
>      }
> @@ -195,7 +202,7 @@ void *qemu_ram_mmap(int fd,
> 
>      offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
> 
> -    ptr = mmap_populate(guardptr + offset, size, fd, shared, is_pmem);
> +    ptr = mmap_populate(guardptr + offset, size, fd, 0, shared, is_pmem);
>      if (ptr == MAP_FAILED) {
>          munmap(guardptr, total);
>          return MAP_FAILED;
> @@ -221,6 +228,9 @@ void qemu_ram_munmap(int fd, void *ptr, size_t size)
>  {
>      const size_t pagesize = mmap_pagesize(fd);
> 
> +    /* we can only map whole pages */
> +    size = QEMU_ALIGN_UP(size, pagesize);
> +

I'm trying to understand why this align is necessary, too.

>      if (ptr) {
>          /* Unmap both the RAM block and the guard page */
>          munmap(ptr, size + pagesize);

-- 
Murilo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps
  2020-02-05 23:00   ` Murilo Opsfelder Araújo
@ 2020-02-06  8:52     ` David Hildenbrand
  2020-02-06 12:31       ` Murilo Opsfelder Araújo
  2020-02-06 15:13     ` David Hildenbrand
  1 sibling, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2020-02-06  8:52 UTC (permalink / raw)
  To: Murilo Opsfelder Araújo
  Cc: Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Greg Kurz, Paolo Bonzini,
	Richard Henderson

On 06.02.20 00:00, Murilo Opsfelder Araújo wrote:
> Hello, David.
> 
> On Monday, February 3, 2020 3:31:20 PM -03 David Hildenbrand wrote:
>> When shrinking a mmap we want to re-reserve the already populated area.
>> When growing a memory region, we want to populate starting with a given
>> fd_offset. Prepare by allowing to pass these parameters.
>>
>> Also, let's make sure we always process full pages, to avoid
>> unmapping/remapping pages that are already in use when
>> growing/shrinking. (existing callers seem to guarantee this, but that's
>> not obvious)
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Greg Kurz <groug@kaod.org>
>> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
>> Cc: Eduardo Habkost <ehabkost@redhat.com>
>> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  util/mmap-alloc.c | 32 +++++++++++++++++++++-----------
>>  1 file changed, 21 insertions(+), 11 deletions(-)
>>
>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>> index f043ccb0ab..63ad6893b7 100644
>> --- a/util/mmap-alloc.c
>> +++ b/util/mmap-alloc.c
>> @@ -83,12 +83,12 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
>>  }
>>
>>  /*
>> - * Reserve a new memory region of the requested size to be used for mapping
>> - * from the given fd (if any).
>> + * Reserve a new memory region of the requested size or re-reserve parts
>> + * of an existing region to be used for mapping from the given fd (if any).
>> */
>> -static void *mmap_reserve(size_t size, int fd)
>> +static void *mmap_reserve(void *ptr, size_t size, int fd)
>>  {
>> -    int flags = MAP_PRIVATE;
>> +    int flags = MAP_PRIVATE | (ptr ? MAP_FIXED : 0);
>>
>>  #if defined(__powerpc64__) && defined(__linux__)
>>      /*
>> @@ -111,19 +111,23 @@ static void *mmap_reserve(size_t size, int fd)
>>      flags |= MAP_ANONYMOUS;
>>  #endif
>>
>> -    return mmap(0, size, PROT_NONE, flags, fd, 0);
>> +    return mmap(ptr, size, PROT_NONE, flags, fd, 0);
>>  }
>>
>>  /*
>>   * Populate memory in a reserved region from the given fd (if any).
>>   */
>> -static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>> -                           bool is_pmem)
>> +static void *mmap_populate(void *ptr, size_t size, int fd, size_t fd_offset,
>> +                           bool shared, bool is_pmem)
>>  {
>>      int map_sync_flags = 0;
>>      int flags = MAP_FIXED;
>>      void *new_ptr;
>>
>> +    if (fd == -1) {
>> +        fd_offset = 0;
>> +    }
>> +
>>      flags |= fd == -1 ? MAP_ANONYMOUS : 0;
>>      flags |= shared ? MAP_SHARED : MAP_PRIVATE;
>>      if (shared && is_pmem) {
>> @@ -131,7 +135,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>>      }
>>
>>      new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags | map_sync_flags,
>> -                   fd, 0);
>> +                   fd, fd_offset);
>>      if (new_ptr == MAP_FAILED && map_sync_flags) {
>>          if (errno == ENOTSUP) {
>>              char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
>> @@ -153,7 +157,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>>           * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
>>           * again without these flags to handle backwards compatibility.
>>           */
>> -        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, 0);
>> +        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, fd_offset);
>>      }
>>      return new_ptr;
>>  }
>> @@ -178,13 +182,16 @@ void *qemu_ram_mmap(int fd,
>>      size_t offset, total;
>>      void *ptr, *guardptr;
>>
>> +    /* we can only map whole pages */
>> +    size = QEMU_ALIGN_UP(size, pagesize);
>> +
> 
> The caller already rounds up size to block->page_size.
> 
> Why is this QEMU_ALIGN_UP necessary?

Thanks for having a look

I guess you read the patch description, right? :)

"(existing callers seem to guarantee this, but that's
  not obvious)"

Do you prefer a g_assert(IS_ALIGNED()) instead?

Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread
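[Archive note: for readers unfamiliar with the macro discussed above, QEMU_ALIGN_UP simply rounds a value up to the next multiple of an alignment. A hedged reimplementation for illustration only; the real macro lives in include/qemu/osdep.h.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative equivalent of QEMU_ALIGN_UP(n, m): round n up to the next
 * multiple of m. Assumes m > 0 and no overflow in n + m - 1.
 */
#define ALIGN_UP(n, m) ((((n) + (m) - 1) / (m)) * (m))
```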

* Re: [PATCH v1 07/13] util/mmap-alloc: Factor out populating of memory to mmap_populate()
  2020-02-05 19:56   ` Murilo Opsfelder Araújo
@ 2020-02-06  9:26     ` David Hildenbrand
  0 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-06  9:26 UTC (permalink / raw)
  To: muriloo, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Paolo Bonzini, Richard Henderson

On 05.02.20 20:56, Murilo Opsfelder Araújo wrote:
> Hello, David.
> 
> On 2/3/20 3:31 PM, David Hildenbrand wrote:
>> We want to populate memory within a reserved memory region. Let's factor
>> that out.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Greg Kurz <groug@kaod.org>
>> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
>> Cc: Eduardo Habkost <ehabkost@redhat.com>
>> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
> 
> Acked-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> 
> A minor comment below.
> 
>>   util/mmap-alloc.c | 89 +++++++++++++++++++++++++----------------------
>>   1 file changed, 47 insertions(+), 42 deletions(-)
>>
>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>> index 43a26f38a8..f043ccb0ab 100644
>> --- a/util/mmap-alloc.c
>> +++ b/util/mmap-alloc.c
>> @@ -114,6 +114,50 @@ static void *mmap_reserve(size_t size, int fd)
>>       return mmap(0, size, PROT_NONE, flags, fd, 0);
>>   }
>>
>> +/*
>> + * Populate memory in a reserved region from the given fd (if any).
>> + */
>> +static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>> +                           bool is_pmem)
>> +{
>> +    int map_sync_flags = 0;
>> +    int flags = MAP_FIXED;
>> +    void *new_ptr;
> 
> Do you think another name would be welcome here, e.g.: "populated_ptr" or
> "populated_memptr" or just "populated"?

I'll go with populated_ptr - thanks!


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (12 preceding siblings ...)
  2020-02-03 18:31 ` [PATCH v1 13/13] exec: Ram blocks with resizable " David Hildenbrand
@ 2020-02-06  9:27 ` Michael S. Tsirkin
  2020-02-06  9:45   ` David Hildenbrand
  2020-02-06 20:11 ` Dr. David Alan Gilbert
  14 siblings, 1 reply; 44+ messages in thread
From: Michael S. Tsirkin @ 2020-02-06  9:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Stefan Weil, Eduardo Habkost, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Alex Williamson,
	Murilo Opsfelder Araujo, Paolo Bonzini, Greg Kurz,
	Richard Henderson

On Mon, Feb 03, 2020 at 07:31:12PM +0100, David Hildenbrand wrote:
> We already allow resizable ram blocks for anonymous memory, however, they
> are not actually resized. All memory is mmaped() R/W, including the memory
> exceeding the used_length, up to the max_length.
> 
> When resizing, effectively only the boundary is moved. Implement actually
> resizable anonymous allocations and make use of them in resizable ram
> blocks when possible. Memory exceeding the used_length will be
> inaccessible. Especially ram block notifiers require care.
> 
> Having actually resizable anonymous allocations (via mmap-hackery) allows
> to reserve a big region in virtual address space and grow the
> accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
> is set to "never" under Linux, huge reservations will succeed. If there is
> not enough memory when resizing (to populate parts of the reserved region),
> trying to resize will fail. Only the actually used size is reserved in the
> OS.
> 
> E.g., virtio-mem [1] wants to reserve big resizable memory regions and
> grow the usable part on demand. I think this change is worth sending out
> individually. Accompanied by a bunch of minor fixes and cleanups.
> 
> [1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/

How does this interact with all the prealloc/mlock things designed
for realtime?

> David Hildenbrand (13):
>   util: vfio-helpers: Factor out and fix processing of existings ram
>     blocks
>   exec: Factor out setting ram settings (madvise ...) into
>     qemu_ram_apply_settings()
>   exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap()
>   exec: Drop "shared" parameter from ram_block_add()
>   util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize()
>   util/mmap-alloc: Factor out reserving of a memory region to
>     mmap_reserve()
>   util/mmap-alloc: Factor out populating of memory to mmap_populate()
>   util/mmap-alloc: Prepare for resizable mmaps
>   util/mmap-alloc: Implement resizable mmaps
>   numa: Introduce ram_block_notify_resized() and
>     ram_block_notifiers_support_resize()
>   util: vfio-helpers: Implement ram_block_resized()
>   util: oslib: Resizable anonymous allocations under POSIX
>   exec: Ram blocks with resizable anonymous allocations under POSIX
> 
>  exec.c                    |  99 ++++++++++++++++++----
>  hw/core/numa.c            |  39 +++++++++
>  include/exec/cpu-common.h |   3 +
>  include/exec/memory.h     |   8 ++
>  include/exec/ramlist.h    |   4 +
>  include/qemu/mmap-alloc.h |  21 +++--
>  include/qemu/osdep.h      |   6 +-
>  stubs/ram-block.c         |  20 -----
>  util/mmap-alloc.c         | 168 ++++++++++++++++++++++++--------------
>  util/oslib-posix.c        |  37 ++++++++-
>  util/oslib-win32.c        |  14 ++++
>  util/trace-events         |   5 +-
>  util/vfio-helpers.c       |  33 ++++----
>  13 files changed, 331 insertions(+), 126 deletions(-)
> 
> -- 
> 2.24.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
  2020-02-06  9:27 ` [PATCH v1 00/13] " Michael S. Tsirkin
@ 2020-02-06  9:45   ` David Hildenbrand
  0 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-06  9:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Weil, Eduardo Habkost, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Alex Williamson,
	Murilo Opsfelder Araujo, Paolo Bonzini, Greg Kurz,
	Richard Henderson

On 06.02.20 10:27, Michael S. Tsirkin wrote:
> On Mon, Feb 03, 2020 at 07:31:12PM +0100, David Hildenbrand wrote:
>> We already allow resizable ram blocks for anonymous memory, however, they
>> are not actually resized. All memory is mmaped() R/W, including the memory
>> exceeding the used_length, up to the max_length.
>>
>> When resizing, effectively only the boundary is moved. Implement actually
>> resizable anonymous allocations and make use of them in resizable ram
>> blocks when possible. Memory exceeding the used_length will be
>> inaccessible. Especially ram block notifiers require care.
>>
>> Having actually resizable anonymous allocations (via mmap-hackery) allows
>> to reserve a big region in virtual address space and grow the
>> accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
>> is set to "never" under Linux, huge reservations will succeed. If there is
>> not enough memory when resizing (to populate parts of the reserved region),
>> trying to resize will fail. Only the actually used size is reserved in the
>> OS.
>>
>> E.g., virtio-mem [1] wants to reserve big resizable memory regions and
>> grow the usable part on demand. I think this change is worth sending out
>> individually. Accompanied by a bunch of minor fixes and cleanups.
>>
>> [1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/
> 
> How does this inteact with all the prealloc/mlock things designed
> for realtime?

- Prealloc: we don't support resizable ram blocks with prealloc
-- qemu_ram_alloc_from_ptr() is the only way to get "real" preallocated
   ram blocks from a pointer. They are never resizable.
-- "prealloc" with memory backends (e.g., backends/hostmem.c): Memory
   backends don't support resizable ram blocks yet. I have patches to
   support that, and disallow prealloc for them.
-- file-based ram blocks are not resizable

- mlock
-- os_mlock() does a mlockall(MCL_CURRENT | MCL_FUTURE)
-- That should work just fine: an ordinary mmap() invalidates the old
   mapping and creates a new one - just like when hotplugging/unplugging
   a DIMM.

Resizing currently only happens during reboot/migration. If mlockall()
were to cause issues, we could fall back to the old handling (if
(enable_mlock) ...) - but I don't think this is necessary.

So I don't see an issue with that :) Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 02/13] exec: Factor out setting ram settings (madvise ...) into qemu_ram_apply_settings()
  2020-02-03 18:31 ` [PATCH v1 02/13] exec: Factor out setting ram settings (madvise ...) into qemu_ram_apply_settings() David Hildenbrand
@ 2020-02-06 11:42   ` Richard Henderson
  0 siblings, 0 replies; 44+ messages in thread
From: Richard Henderson @ 2020-02-06 11:42 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Dr . David Alan Gilbert, Michael S . Tsirkin

On 2/3/20 6:31 PM, David Hildenbrand wrote:
> Factor all settings out into qemu_ram_apply_settings().
> 
> For memory_try_enable_merging(), the important bit is that it won't be
> called with XEN - which is now still the case as new_block->host will
> remain NULL.
> 
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  exec.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 03/13] exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap()
  2020-02-03 18:31 ` [PATCH v1 03/13] exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap() David Hildenbrand
@ 2020-02-06 11:43   ` Richard Henderson
  0 siblings, 0 replies; 44+ messages in thread
From: Richard Henderson @ 2020-02-06 11:43 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Dr . David Alan Gilbert, Michael S . Tsirkin

On 2/3/20 6:31 PM, David Hildenbrand wrote:
> I don't see why we shouldn't apply all settings to make it look like the
> surrounding RAM (and enable proper VMA merging).
> 
> Note: memory backend settings might have overridden these settings. We
> would need a callback to let the memory backend fix that up.
> 
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  exec.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 04/13] exec: Drop "shared" parameter from ram_block_add()
  2020-02-03 18:31 ` [PATCH v1 04/13] exec: Drop "shared" parameter from ram_block_add() David Hildenbrand
@ 2020-02-06 11:44   ` Richard Henderson
  0 siblings, 0 replies; 44+ messages in thread
From: Richard Henderson @ 2020-02-06 11:44 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Dr . David Alan Gilbert, Michael S . Tsirkin

On 2/3/20 6:31 PM, David Hildenbrand wrote:
> Properly store it in the flags of the ram block instead (and the flag
> even already exists and is used).
> 
> E.g., qemu_ram_is_shared() now properly succeeds on all ram blocks that are
> actually shared.
> 
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  exec.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 05/13] util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize()
  2020-02-03 18:31 ` [PATCH v1 05/13] util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize() David Hildenbrand
  2020-02-05 19:37   ` Murilo Opsfelder Araújo
@ 2020-02-06 11:46   ` Richard Henderson
  1 sibling, 0 replies; 44+ messages in thread
From: Richard Henderson @ 2020-02-06 11:46 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Murilo Opsfelder Araujo, Paolo Bonzini,
	Richard Henderson

On 2/3/20 6:31 PM, David Hildenbrand wrote:
> Factor it out and add a comment.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  util/mmap-alloc.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 06/13] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve()
  2020-02-03 18:31 ` [PATCH v1 06/13] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve() David Hildenbrand
  2020-02-05 19:40   ` Murilo Opsfelder Araújo
@ 2020-02-06 11:55   ` Richard Henderson
  2020-02-06 13:16     ` David Hildenbrand
  1 sibling, 1 reply; 44+ messages in thread
From: Richard Henderson @ 2020-02-06 11:55 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Murilo Opsfelder Araujo, Paolo Bonzini,
	Richard Henderson

On 2/3/20 6:31 PM, David Hildenbrand wrote:
> We want to reserve a memory region without actually populating memory.
> Let's factor that out.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  util/mmap-alloc.c | 58 +++++++++++++++++++++++++++--------------------
>  1 file changed, 33 insertions(+), 25 deletions(-)
> 
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 82f02a2cec..43a26f38a8 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -82,6 +82,38 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
>      return qemu_real_host_page_size;
>  }
>  
> +/*
> + * Reserve a new memory region of the requested size to be used for mapping
> + * from the given fd (if any).
> + */
> +static void *mmap_reserve(size_t size, int fd)
> +{
> +    int flags = MAP_PRIVATE;
> +
> +#if defined(__powerpc64__) && defined(__linux__)
> +    /*
> +     * On ppc64 mappings in the same segment (aka slice) must share the same
> +     * page size. Since we will be re-allocating part of this segment
> +     * from the supplied fd, we should make sure to use the same page size, to
> +     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
> +     * avoid allocating backing store memory.
> +     * We do this unless we are using the system page size, in which case
> +     * anonymous memory is OK.
> +     */
> +    if (fd == -1 || qemu_fd_getpagesize(fd) == qemu_real_host_page_size) {
> +        fd = -1;
> +        flags |= MAP_ANONYMOUS;
> +    } else {
> +        flags |= MAP_NORESERVE;
> +    }
> +#else
> +    fd = -1;
> +    flags |= MAP_ANONYMOUS;
> +#endif

Because this is just code movement,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

But is there a reason not to add MAP_NORESERVE all of the time?
It seems to match intent, in that we're reserving vma but not planning to use
the memory, anonymous or not.


r~


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 07/13] util/mmap-alloc: Factor out populating of memory to mmap_populate()
  2020-02-03 18:31 ` [PATCH v1 07/13] util/mmap-alloc: Factor out populating of memory to mmap_populate() David Hildenbrand
  2020-02-05 19:56   ` Murilo Opsfelder Araújo
@ 2020-02-06 11:59   ` Richard Henderson
  1 sibling, 0 replies; 44+ messages in thread
From: Richard Henderson @ 2020-02-06 11:59 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Murilo Opsfelder Araujo, Paolo Bonzini,
	Richard Henderson

On 2/3/20 6:31 PM, David Hildenbrand wrote:
> We want to populate memory within a reserved memory region. Let's factor
> that out.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  util/mmap-alloc.c | 89 +++++++++++++++++++++++++----------------------
>  1 file changed, 47 insertions(+), 42 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps
  2020-02-03 18:31 ` [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps David Hildenbrand
  2020-02-05 23:00   ` Murilo Opsfelder Araújo
@ 2020-02-06 12:02   ` Richard Henderson
  1 sibling, 0 replies; 44+ messages in thread
From: Richard Henderson @ 2020-02-06 12:02 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Murilo Opsfelder Araujo, Paolo Bonzini,
	Richard Henderson

On 2/3/20 6:31 PM, David Hildenbrand wrote:
> When shrinking a mmap we want to re-reserve the already populated area.
> When growing a memory region, we want to populate starting with a given
> fd_offset. Prepare by allowing to pass these parameters.
> 
> Also, let's make sure we always process full pages, to avoid
> unmapping/remapping pages that are already in use when
> growing/shrinking. (existing callers seem to guarantee this, but that's
> not obvious)
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  util/mmap-alloc.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 09/13] util/mmap-alloc: Implement resizable mmaps
  2020-02-03 18:31 ` [PATCH v1 09/13] util/mmap-alloc: Implement " David Hildenbrand
@ 2020-02-06 12:08   ` Richard Henderson
  2020-02-06 13:22     ` David Hildenbrand
  2020-02-07  0:29   ` Murilo Opsfelder Araújo
  1 sibling, 1 reply; 44+ messages in thread
From: Richard Henderson @ 2020-02-06 12:08 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Murilo Opsfelder Araujo, Paolo Bonzini,
	Richard Henderson

On 2/3/20 6:31 PM, David Hildenbrand wrote:
> +void *qemu_ram_mmap_resize(void *ptr, int fd, size_t old_size, size_t new_size,
> +                           bool shared, bool is_pmem)
>  {
>      const size_t pagesize = mmap_pagesize(fd);
>  
>      /* we can only map whole pages */
> -    size = QEMU_ALIGN_UP(size, pagesize);
> +    old_size = QEMU_ALIGN_UP(old_size, pagesize);
> +    new_size = QEMU_ALIGN_UP(new_size, pagesize);
> +
> +    /* we support actually resizable memory regions only on Linux */
> +    if (old_size < new_size) {
> +        /* populate the missing piece into the reserved area */
> +        ptr = mmap_populate(ptr + old_size, new_size - old_size, fd, old_size,
> +                            shared, is_pmem);
> +    } else if (old_size > new_size) {
> +        /* discard this piece, keeping the area reserved (should never fail) */
> +        ptr = mmap_reserve(ptr + new_size, old_size - new_size, fd);
> +    }
> +    return ptr;
> +}

What does the return value indicate?
Is it just for != MAP_FAILED?

Would we be better off with an assert?  There's the comment re mmap_reserve,
but I can't see why mmap_populate should fail either.

Assuming an assert isn't viable, are we better off with a boolean return?  With
an Error **ptr?


r~


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps
  2020-02-06  8:52     ` David Hildenbrand
@ 2020-02-06 12:31       ` Murilo Opsfelder Araújo
  2020-02-06 13:16         ` David Hildenbrand
  0 siblings, 1 reply; 44+ messages in thread
From: Murilo Opsfelder Araújo @ 2020-02-06 12:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Greg Kurz, Paolo Bonzini,
	Richard Henderson

Hello, David.

On Thursday, February 6, 2020 5:52:26 AM -03 David Hildenbrand wrote:
> On 06.02.20 00:00, Murilo Opsfelder Araújo wrote:
> > Hello, David.
> >
> > On Monday, February 3, 2020 3:31:20 PM -03 David Hildenbrand wrote:
> >> When shrinking a mmap we want to re-reserve the already populated area.
> >> When growing a memory region, we want to populate starting with a given
> >> fd_offset. Prepare by allowing to pass these parameters.
> >>
> >> Also, let's make sure we always process full pages, to avoid
> >> unmapping/remapping pages that are already in use when
> >> growing/shrinking. (existing callers seem to guarantee this, but that's
> >> not obvious)
> >>
> >> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> >> Cc: Greg Kurz <groug@kaod.org>
> >> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> >> Cc: Eduardo Habkost <ehabkost@redhat.com>
> >> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> >> ---
> >>
> >>  util/mmap-alloc.c | 32 +++++++++++++++++++++-----------
> >>  1 file changed, 21 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> >> index f043ccb0ab..63ad6893b7 100644
> >> --- a/util/mmap-alloc.c
> >> +++ b/util/mmap-alloc.c
> >> @@ -83,12 +83,12 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
> >>
> >>  }
> >>
> >>  /*
> >>
> >> - * Reserve a new memory region of the requested size to be used for mapping
> >> - * from the given fd (if any).
> >> + * Reserve a new memory region of the requested size or re-reserve parts
> >> + * of an existing region to be used for mapping from the given fd (if any).
> >>  */
> >> -static void *mmap_reserve(size_t size, int fd)
> >> +static void *mmap_reserve(void *ptr, size_t size, int fd)
> >>
> >>  {
> >>
> >> -    int flags = MAP_PRIVATE;
> >> +    int flags = MAP_PRIVATE | (ptr ? MAP_FIXED : 0);
> >>
> >>  #if defined(__powerpc64__) && defined(__linux__)
> >>
> >>      /*
> >>
> >> @@ -111,19 +111,23 @@ static void *mmap_reserve(size_t size, int fd)
> >>
> >>      flags |= MAP_ANONYMOUS;
> >>
> >>  #endif
> >>
> >> -    return mmap(0, size, PROT_NONE, flags, fd, 0);
> >> +    return mmap(ptr, size, PROT_NONE, flags, fd, 0);
> >>
> >>  }
> >>
> >>  /*
> >>
> >>   * Populate memory in a reserved region from the given fd (if any).
> >>   */
> >>
> >> -static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
> >> -                           bool is_pmem)
> >> +static void *mmap_populate(void *ptr, size_t size, int fd, size_t fd_offset,
> >> +                           bool shared, bool is_pmem)
> >>
> >>  {
> >>
> >>      int map_sync_flags = 0;
> >>      int flags = MAP_FIXED;
> >>      void *new_ptr;
> >>
> >> +    if (fd == -1) {
> >> +        fd_offset = 0;
> >> +    }
> >> +
> >>
> >>      flags |= fd == -1 ? MAP_ANONYMOUS : 0;
> >>      flags |= shared ? MAP_SHARED : MAP_PRIVATE;
> >>      if (shared && is_pmem) {
> >>
> >> @@ -131,7 +135,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
> >>      }
> >>
> >>      new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags | map_sync_flags,
> >> -                   fd, 0);
> >> +                   fd, fd_offset);
> >>
> >>      if (new_ptr == MAP_FAILED && map_sync_flags) {
> >>
> >>          if (errno == ENOTSUP) {
> >>
> >>              char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
> >>
> >> @@ -153,7 +157,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
> >>           * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
> >>           * again without these flags to handle backwards compatibility.
> >>           */
> >> -        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, 0);
> >> +        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, fd_offset);
> >>      }
> >>
> >>      return new_ptr;
> >>
> >>  }
> >>
> >> @@ -178,13 +182,16 @@ void *qemu_ram_mmap(int fd,
> >>
> >>      size_t offset, total;
> >>      void *ptr, *guardptr;
> >>
> >> +    /* we can only map whole pages */
> >> +    size = QEMU_ALIGN_UP(size, pagesize);
> >> +
> >
> > Caller already rounds up size to block->page_size.
> >
> > Why this QEMU_ALIGN_UP is necessary?
>
> Thanks for having a look
>
> I guess you read the patch description, right? :)
>
> "(existing callers seem to guarantee this, but that's
>   not obvious)"
>
> Do you prefer a g_assert(IS_ALIGNED()) instead?

I guess you mean QEMU_IS_ALIGNED().  But yes, I think we could just check
the alignment here, so that callers do the alignment first.

We could have, for example, a new helper function mmap_align(size_t size)
that returns the size already aligned.

>
> Thanks!

--
Murilo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 06/13] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve()
  2020-02-06 11:55   ` Richard Henderson
@ 2020-02-06 13:16     ` David Hildenbrand
  0 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-06 13:16 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Murilo Opsfelder Araujo, Paolo Bonzini,
	Richard Henderson

On 06.02.20 12:55, Richard Henderson wrote:
> On 2/3/20 6:31 PM, David Hildenbrand wrote:
>> We want to reserve a memory region without actually populating memory.
>> Let's factor that out.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Greg Kurz <groug@kaod.org>
>> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
>> Cc: Eduardo Habkost <ehabkost@redhat.com>
>> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  util/mmap-alloc.c | 58 +++++++++++++++++++++++++++--------------------
>>  1 file changed, 33 insertions(+), 25 deletions(-)
>>
>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>> index 82f02a2cec..43a26f38a8 100644
>> --- a/util/mmap-alloc.c
>> +++ b/util/mmap-alloc.c
>> @@ -82,6 +82,38 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
>>      return qemu_real_host_page_size;
>>  }
>>  
>> +/*
>> + * Reserve a new memory region of the requested size to be used for mapping
>> + * from the given fd (if any).
>> + */
>> +static void *mmap_reserve(size_t size, int fd)
>> +{
>> +    int flags = MAP_PRIVATE;
>> +
>> +#if defined(__powerpc64__) && defined(__linux__)
>> +    /*
>> +     * On ppc64 mappings in the same segment (aka slice) must share the same
>> +     * page size. Since we will be re-allocating part of this segment
>> +     * from the supplied fd, we should make sure to use the same page size, to
>> +     * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
>> +     * avoid allocating backing store memory.
>> +     * We do this unless we are using the system page size, in which case
>> +     * anonymous memory is OK.
>> +     */
>> +    if (fd == -1 || qemu_fd_getpagesize(fd) == qemu_real_host_page_size) {
>> +        fd = -1;
>> +        flags |= MAP_ANONYMOUS;
>> +    } else {
>> +        flags |= MAP_NORESERVE;
>> +    }
>> +#else
>> +    fd = -1;
>> +    flags |= MAP_ANONYMOUS;
>> +#endif
> 
> Because this is just code movement,
> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
> 
> But is there a reason not to add MAP_NORESERVE all of the time?
> It seems to match intent, in that we're reserving vma but not planning to use
> the memory, anonymous or not.

AFAIK, if you mmap something PROT_NONE, it's an implicit MAP_NORESERVE.
I keep setting it conditionally, because I am not sure whether any POSIX
system (or older kernel) might choke on MAP_NORESERVE. For example,

"In kernels before 2.6, this flag had effect only for private writable
mappings." sounds like it would get ignored. But it also sounds like the
flag is somewhat Linux-specific.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps
  2020-02-06 12:31       ` Murilo Opsfelder Araújo
@ 2020-02-06 13:16         ` David Hildenbrand
  0 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-06 13:16 UTC (permalink / raw)
  To: Murilo Opsfelder Araújo
  Cc: Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Greg Kurz, Paolo Bonzini,
	Richard Henderson

On 06.02.20 13:31, Murilo Opsfelder Araújo wrote:
> Hello, David.
> 
> On Thursday, February 6, 2020 5:52:26 AM -03 David Hildenbrand wrote:
>> On 06.02.20 00:00, Murilo Opsfelder Araújo wrote:
>>> Hello, David.
>>>
>>> On Monday, February 3, 2020 3:31:20 PM -03 David Hildenbrand wrote:
>>>> When shrinking a mmap we want to re-reserve the already populated area.
>>>> When growing a memory region, we want to populate starting with a given
>>>> fd_offset. Prepare by allowing to pass these parameters.
>>>>
>>>> Also, let's make sure we always process full pages, to avoid
>>>> unmapping/remapping pages that are already in use when
>>>> growing/shrinking. (existing callers seem to guarantee this, but that's
>>>> not obvious)
>>>>
>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>>>> Cc: Greg Kurz <groug@kaod.org>
>>>> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
>>>> Cc: Eduardo Habkost <ehabkost@redhat.com>
>>>> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>
>>>>  util/mmap-alloc.c | 32 +++++++++++++++++++++-----------
>>>>  1 file changed, 21 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>>>> index f043ccb0ab..63ad6893b7 100644
>>>> --- a/util/mmap-alloc.c
>>>> +++ b/util/mmap-alloc.c
>>>> @@ -83,12 +83,12 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
>>>>
>>>>  }
>>>>
>>>>  /*
>>>>
>>>> - * Reserve a new memory region of the requested size to be used for mapping
>>>> - * from the given fd (if any).
>>>> + * Reserve a new memory region of the requested size or re-reserve parts
>>>> + * of an existing region to be used for mapping from the given fd (if any).
>>>>  */
>>>> -static void *mmap_reserve(size_t size, int fd)
>>>> +static void *mmap_reserve(void *ptr, size_t size, int fd)
>>>>
>>>>  {
>>>>
>>>> -    int flags = MAP_PRIVATE;
>>>> +    int flags = MAP_PRIVATE | (ptr ? MAP_FIXED : 0);
>>>>
>>>>  #if defined(__powerpc64__) && defined(__linux__)
>>>>
>>>>      /*
>>>>
>>>> @@ -111,19 +111,23 @@ static void *mmap_reserve(size_t size, int fd)
>>>>
>>>>      flags |= MAP_ANONYMOUS;
>>>>
>>>>  #endif
>>>>
>>>> -    return mmap(0, size, PROT_NONE, flags, fd, 0);
>>>> +    return mmap(ptr, size, PROT_NONE, flags, fd, 0);
>>>>
>>>>  }
>>>>
>>>>  /*
>>>>
>>>>   * Populate memory in a reserved region from the given fd (if any).
>>>>   */
>>>>
>>>> -static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>>>> -                           bool is_pmem)
>>>> +static void *mmap_populate(void *ptr, size_t size, int fd, size_t fd_offset,
>>>> +                           bool shared, bool is_pmem)
>>>>
>>>>  {
>>>>
>>>>      int map_sync_flags = 0;
>>>>      int flags = MAP_FIXED;
>>>>      void *new_ptr;
>>>>
>>>> +    if (fd == -1) {
>>>> +        fd_offset = 0;
>>>> +    }
>>>> +
>>>>
>>>>      flags |= fd == -1 ? MAP_ANONYMOUS : 0;
>>>>      flags |= shared ? MAP_SHARED : MAP_PRIVATE;
>>>>      if (shared && is_pmem) {
>>>>
>>>> @@ -131,7 +135,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>>>>      }
>>>>
>>>>      new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags | map_sync_flags,
>>>> -                   fd, 0);
>>>> +                   fd, fd_offset);
>>>>
>>>>      if (new_ptr == MAP_FAILED && map_sync_flags) {
>>>>
>>>>          if (errno == ENOTSUP) {
>>>>
>>>>              char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
>>>>
>>>> @@ -153,7 +157,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>>>>           * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
>>>>           * again without these flags to handle backwards compatibility.
>>>>           */
>>>> -        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, 0);
>>>> +        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, fd_offset);
>>>>      }
>>>>
>>>>      return new_ptr;
>>>>
>>>>  }
>>>>
>>>> @@ -178,13 +182,16 @@ void *qemu_ram_mmap(int fd,
>>>>
>>>>      size_t offset, total;
>>>>      void *ptr, *guardptr;
>>>>
>>>> +    /* we can only map whole pages */
>>>> +    size = QEMU_ALIGN_UP(size, pagesize);
>>>> +
>>>
>>> Caller already rounds up size to block->page_size.
>>>
>>> Why this QEMU_ALIGN_UP is necessary?
>>
>> Thanks for having a look
>>
>> I guess you read the patch description, right? :)
>>
>> "(existing callers seem to guarantee this, but that's
>>   not obvious)"
>>
>> Do you prefer a g_assert(IS_ALIGNED()) instead?
> 
> I guess you mean QEMU_IS_ALIGNED().  But yes, I think we could just check
> alignments here, so callers do the alignments first.

Yeh, you got the idea :)

I'll convert these to asserts for now. Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 09/13] util/mmap-alloc: Implement resizable mmaps
  2020-02-06 12:08   ` Richard Henderson
@ 2020-02-06 13:22     ` David Hildenbrand
  2020-02-06 15:27       ` David Hildenbrand
  0 siblings, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2020-02-06 13:22 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Murilo Opsfelder Araujo, Paolo Bonzini,
	Richard Henderson

On 06.02.20 13:08, Richard Henderson wrote:
> On 2/3/20 6:31 PM, David Hildenbrand wrote:
>> +void *qemu_ram_mmap_resize(void *ptr, int fd, size_t old_size, size_t new_size,
>> +                           bool shared, bool is_pmem)
>>  {
>>      const size_t pagesize = mmap_pagesize(fd);
>>  
>>      /* we can only map whole pages */
>> -    size = QEMU_ALIGN_UP(size, pagesize);
>> +    old_size = QEMU_ALIGN_UP(old_size, pagesize);
>> +    new_size = QEMU_ALIGN_UP(new_size, pagesize);
>> +
>> +    /* we support actually resizable memory regions only on Linux */
>> +    if (old_size < new_size) {
>> +        /* populate the missing piece into the reserved area */
>> +        ptr = mmap_populate(ptr + old_size, new_size - old_size, fd, old_size,
>> +                            shared, is_pmem);
>> +    } else if (old_size > new_size) {
>> +        /* discard this piece, keeping the area reserved (should never fail) */
>> +        ptr = mmap_reserve(ptr + new_size, old_size - new_size, fd);
>> +    }
>> +    return ptr;
>> +}
> 
> What does the return value indicate?
> Is it just for != MAP_FAILED?

It indicates whether resizing succeeded. In a previous version I returned
an int via

ptr == MAP_FAILED ? -errno : 0;


Populating will usually only fail because we're out of memory.

Populating and reserving *might* fail if we are out of VMAs in the
kernel. VMA merging will make sure that the number of VMAs will not
explode (usually 2-3 VMAs for one resizable region: populated VMA +
reserved VMA + guard page VMA). If we were ever close to the VMA
limit, it could happen - but that's highly unlikely.

> Assuming an assert isn't viable, are we better off with a boolean return?  With
> an Error **ptr?

Either that or an int. What do you prefer?

Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps
  2020-02-05 23:00   ` Murilo Opsfelder Araújo
  2020-02-06  8:52     ` David Hildenbrand
@ 2020-02-06 15:13     ` David Hildenbrand
  1 sibling, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-06 15:13 UTC (permalink / raw)
  To: Murilo Opsfelder Araújo
  Cc: Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Greg Kurz, Paolo Bonzini,
	Richard Henderson

On 06.02.20 00:00, Murilo Opsfelder Araújo wrote:
> Hello, David.
> 
> On Monday, February 3, 2020 3:31:20 PM -03 David Hildenbrand wrote:
>> When shrinking a mmap we want to re-reserve the already populated area.
>> When growing a memory region, we want to populate starting with a given
>> fd_offset. Prepare by allowing to pass these parameters.
>>
>> Also, let's make sure we always process full pages, to avoid
>> unmapping/remapping pages that are already in use when
>> growing/shrinking. (existing callers seem to guarantee this, but that's
>> not obvious)
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Greg Kurz <groug@kaod.org>
>> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
>> Cc: Eduardo Habkost <ehabkost@redhat.com>
>> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  util/mmap-alloc.c | 32 +++++++++++++++++++++-----------
>>  1 file changed, 21 insertions(+), 11 deletions(-)
>>
>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>> index f043ccb0ab..63ad6893b7 100644
>> --- a/util/mmap-alloc.c
>> +++ b/util/mmap-alloc.c
>> @@ -83,12 +83,12 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
>>  }
>>
>>  /*
>> - * Reserve a new memory region of the requested size to be used for mapping
>> - * from the given fd (if any).
>> + * Reserve a new memory region of the requested size or re-reserve parts
>> + * of an existing region to be used for mapping from the given fd (if any).
>>  */
>> -static void *mmap_reserve(size_t size, int fd)
>> +static void *mmap_reserve(void *ptr, size_t size, int fd)
>>  {
>> -    int flags = MAP_PRIVATE;
>> +    int flags = MAP_PRIVATE | (ptr ? MAP_FIXED : 0);
>>
>>  #if defined(__powerpc64__) && defined(__linux__)
>>      /*
>> @@ -111,19 +111,23 @@ static void *mmap_reserve(size_t size, int fd)
>>      flags |= MAP_ANONYMOUS;
>>  #endif
>>
>> -    return mmap(0, size, PROT_NONE, flags, fd, 0);
>> +    return mmap(ptr, size, PROT_NONE, flags, fd, 0);
>>  }
>>
>>  /*
>>   * Populate memory in a reserved region from the given fd (if any).
>>   */
>> -static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>> -                           bool is_pmem)
>> +static void *mmap_populate(void *ptr, size_t size, int fd, size_t fd_offset,
>> +                           bool shared, bool is_pmem)
>>  {
>>      int map_sync_flags = 0;
>>      int flags = MAP_FIXED;
>>      void *new_ptr;
>>
>> +    if (fd == -1) {
>> +        fd_offset = 0;
>> +    }
>> +
>>      flags |= fd == -1 ? MAP_ANONYMOUS : 0;
>>      flags |= shared ? MAP_SHARED : MAP_PRIVATE;
>>      if (shared && is_pmem) {
>> @@ -131,7 +135,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>>      }
>>
>>      new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags |
>> map_sync_flags, -                   fd, 0);
>> +                   fd, fd_offset);
>>      if (new_ptr == MAP_FAILED && map_sync_flags) {
>>          if (errno == ENOTSUP) {
>>              char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
>> @@ -153,7 +157,7 @@ static void *mmap_populate(void *ptr, size_t size, int fd, bool shared,
>>           * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
>>           * again without these flags to handle backwards compatibility.
>>           */
>> -        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, 0);
>> +        new_ptr = mmap(ptr, size, PROT_READ | PROT_WRITE, flags, fd, fd_offset);
>>      }
>>      return new_ptr;
>>  }
>> @@ -178,13 +182,16 @@ void *qemu_ram_mmap(int fd,
>>      size_t offset, total;
>>      void *ptr, *guardptr;
>>
>> +    /* we can only map whole pages */
>> +    size = QEMU_ALIGN_UP(size, pagesize);
>> +
> 
> The caller already rounds size up to block->page_size.
> 
> Why is this QEMU_ALIGN_UP necessary?
> 
>>      /*
>>       * Note: this always allocates at least one extra page of virtual
>>       * address space, even if size is already aligned.
>>       */
>>      total = size + align;
> 
> If size was already aligned to the pagesize boundary above, why would this
> align be necessary?
> 
> Can the pagesize differ from the memory region alignment?

Sorry, skipped this comment.

Yes, e.g., we want to align ram blocks for KVM to hugepage size, to
allow for transparent huge pages. So the comment still holds.


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 09/13] util/mmap-alloc: Implement resizable mmaps
  2020-02-06 13:22     ` David Hildenbrand
@ 2020-02-06 15:27       ` David Hildenbrand
  0 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-06 15:27 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Greg Kurz,
	Dr . David Alan Gilbert, Murilo Opsfelder Araujo, Paolo Bonzini,
	Richard Henderson

On 06.02.20 14:22, David Hildenbrand wrote:
> On 06.02.20 13:08, Richard Henderson wrote:
>> On 2/3/20 6:31 PM, David Hildenbrand wrote:
>>> +void *qemu_ram_mmap_resize(void *ptr, int fd, size_t old_size, size_t new_size,
>>> +                           bool shared, bool is_pmem)
>>>  {
>>>      const size_t pagesize = mmap_pagesize(fd);
>>>  
>>>      /* we can only map whole pages */
>>> -    size = QEMU_ALIGN_UP(size, pagesize);
>>> +    old_size = QEMU_ALIGN_UP(old_size, pagesize);
>>> +    new_size = QEMU_ALIGN_UP(new_size, pagesize);
>>> +
>>> +    /* we support actually resizable memory regions only on Linux */
>>> +    if (old_size < new_size) {
>>> +        /* populate the missing piece into the reserved area */
>>> +        ptr = mmap_populate(ptr + old_size, new_size - old_size, fd, old_size,
>>> +                            shared, is_pmem);
>>> +    } else if (old_size > new_size) {
>>> +        /* discard this piece, keeping the area reserved (should never fail) */
>>> +        ptr = mmap_reserve(ptr + new_size, old_size - new_size, fd);
>>> +    }
>>> +    return ptr;
>>> +}
>>
>> What does the return value indicate?
>> Is it just for != MAP_FAILED?
> 
> It indicates if resizing succeeded. In a previous version I returned an
> int via
> 
> ptr == MAP_FAILED ? -errno : 0;
> 
> 
> Populating will usually only fail because we're out of memory.
> 
> Populating and reserving *might* fail if we are out of VMAs in the
> kernel. VMA merging will make sure that the number of VMAs will not
> explode (usually 2-3 VMAs for one resizable region: populated VMA +
> Reserved VMA + Guard page VMA). But once we would be close to the VMA
> limit, it could happen - but it's highly unlikely.
> 
>> Assuming an assert isn't viable, are we better off with a boolean return?  With
>> an Error **ptr?
> 
> either that or an int. What do you prefer?

I'll go with a bool for now. "return ptr != MAP_FAILED;"

The actual error will usually be -ENOMEM, so there is no real benefit in
returning it.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
  2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
                   ` (13 preceding siblings ...)
  2020-02-06  9:27 ` [PATCH v1 00/13] " Michael S. Tsirkin
@ 2020-02-06 20:11 ` Dr. David Alan Gilbert
  2020-02-06 20:31   ` David Hildenbrand
  14 siblings, 1 reply; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-02-06 20:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S . Tsirkin, Markus Armbruster,
	qemu-devel, Greg Kurz, Alex Williamson, Murilo Opsfelder Araujo,
	Paolo Bonzini, Stefan Weil, Richard Henderson

* David Hildenbrand (david@redhat.com) wrote:
> We already allow resizable ram blocks for anonymous memory, however, they
> are not actually resized. All memory is mmaped() R/W, including the memory
> exceeding the used_length, up to the max_length.
> 
> When resizing, effectively only the boundary is moved. Implement actually
> resizable anonymous allocations and make use of them in resizable ram
> blocks when possible. Memory exceeding the used_length will be
> inaccessible. Especially ram block notifiers require care.
> 
> Having actually resizable anonymous allocations (via mmap-hackery) allows
> to reserve a big region in virtual address space and grow the
> accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
> is set to "never" under Linux, huge reservations will succeed. If there is
> not enough memory when resizing (to populate parts of the reserved region),
> trying to resize will fail. Only the actually used size is reserved in the
> OS.
> 
> E.g., virtio-mem [1] wants to reserve big resizable memory regions and
> grow the usable part on demand. I think this change is worth sending out
> individually. Accompanied by a bunch of minor fixes and cleanups.
> 
> [1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/

There are a few bits I've not understood from skimming the patches:

  a) Am I correct in thinking you PROT_NONE the extra space so you can
grow/shrink it?
  b) What does kvm see - does it have a slot for the whole space or for
only the used space?
     I ask because we found with virtiofs/DAX experiments that on Power,
kvm gets upset if you give it a mapping with PROT_NONE.
     (That may be less of an issue if you change the mapping after the
slot is created).

  c) It's interesting this is keyed off the RAMBlock notifiers - do
     memory_listeners on the address space the block is mapped into get
     triggered?  I'm wondering how vhost (and vhost-user) in particular
     see this.

Dave

> 
> David Hildenbrand (13):
>   util: vfio-helpers: Factor out and fix processing of existings ram
>     blocks
>   exec: Factor out setting ram settings (madvise ...) into
>     qemu_ram_apply_settings()
>   exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap()
>   exec: Drop "shared" parameter from ram_block_add()
>   util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize()
>   util/mmap-alloc: Factor out reserving of a memory region to
>     mmap_reserve()
>   util/mmap-alloc: Factor out populating of memory to mmap_populate()
>   util/mmap-alloc: Prepare for resizable mmaps
>   util/mmap-alloc: Implement resizable mmaps
>   numa: Introduce ram_block_notify_resized() and
>     ram_block_notifiers_support_resize()
>   util: vfio-helpers: Implement ram_block_resized()
>   util: oslib: Resizable anonymous allocations under POSIX
>   exec: Ram blocks with resizable anonymous allocations under POSIX
> 
>  exec.c                    |  99 ++++++++++++++++++----
>  hw/core/numa.c            |  39 +++++++++
>  include/exec/cpu-common.h |   3 +
>  include/exec/memory.h     |   8 ++
>  include/exec/ramlist.h    |   4 +
>  include/qemu/mmap-alloc.h |  21 +++--
>  include/qemu/osdep.h      |   6 +-
>  stubs/ram-block.c         |  20 -----
>  util/mmap-alloc.c         | 168 ++++++++++++++++++++++++--------------
>  util/oslib-posix.c        |  37 ++++++++-
>  util/oslib-win32.c        |  14 ++++
>  util/trace-events         |   5 +-
>  util/vfio-helpers.c       |  33 ++++----
>  13 files changed, 331 insertions(+), 126 deletions(-)
> 
> -- 
> 2.24.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
  2020-02-06 20:11 ` Dr. David Alan Gilbert
@ 2020-02-06 20:31   ` David Hildenbrand
  2020-02-07 15:28     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2020-02-06 20:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Eduardo Habkost, Michael S . Tsirkin, Markus Armbruster,
	David Hildenbrand, qemu-devel, Greg Kurz, Alex Williamson,
	Murilo Opsfelder Araujo, Paolo Bonzini, Stefan Weil,
	Richard Henderson



> Am 06.02.2020 um 21:11 schrieb Dr. David Alan Gilbert <dgilbert@redhat.com>:
> 
> * David Hildenbrand (david@redhat.com) wrote:
>> We already allow resizable ram blocks for anonymous memory, however, they
>> are not actually resized. All memory is mmaped() R/W, including the memory
>> exceeding the used_length, up to the max_length.
>> 
>> When resizing, effectively only the boundary is moved. Implement actually
>> resizable anonymous allocations and make use of them in resizable ram
>> blocks when possible. Memory exceeding the used_length will be
>> inaccessible. Especially ram block notifiers require care.
>> 
>> Having actually resizable anonymous allocations (via mmap-hackery) allows
>> to reserve a big region in virtual address space and grow the
>> accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
>> is set to "never" under Linux, huge reservations will succeed. If there is
>> not enough memory when resizing (to populate parts of the reserved region),
>> trying to resize will fail. Only the actually used size is reserved in the
>> OS.
>> 
>> E.g., virtio-mem [1] wants to reserve big resizable memory regions and
>> grow the usable part on demand. I think this change is worth sending out
>> individually. Accompanied by a bunch of minor fixes and cleanups.
>> 
>> [1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/
> 
> There's a few bits I've not understood from skimming the patches:
> 

Thanks for having a look!

>  a) Am I correct in thinking you PROT_NONE the extra space so you can
> gkrow/shrink it?

Yes!

>  b) What does kvm see - does it have a slot for the whole space or for
> only the used space?

Only the used space. Resizing triggers a resize of the memory region. That triggers memory notifiers, which remove the old kvm memslot and re-add the new kvm memslot. (That‘s existing handling, so nothing new).

So KVM will not see PROT_NONE when creating a slot.

>     I ask because we found with virtiofs/DAX experiments that on Power,
> kvm gets upset if you give it a mapping with PROT_NONE.
>     (That maybe less of an issue if you change the mapping after the
> slot is created).

That should work as expected. Resizing *while* kvm is running is tricky, but that‘s not part of this series and a different story :) right now, resizing is only valid on reboot/incoming migration.

> 
>  c) It's interesting this is keyed off the RAMBlock notifiers - do
>     memory_listener's on the address space the block is mapped into get
>    triggered?  I'm wondering how vhost (and vhost-user) in particular
>    see this.

Yes, memory listeners get triggered. Old region is removed, new one is added. Nothing changed on that front.

The issue with ram block notifiers is that they did not do a „remove old, add new“ on resizes. They only added the full ram block. Bad. E.g., vfio wants to pin all memory - which would fail on PROT_NONE.

E.g., for HAX, there is no kernel ioctl to remove a ram block ... for SEV there is, but I am not sure about the implications when converting back and forth between encrypted/unencrypted. So SEV and HAX require legacy handling.

Cheers!




* Re: [PATCH v1 09/13] util/mmap-alloc: Implement resizable mmaps
  2020-02-03 18:31 ` [PATCH v1 09/13] util/mmap-alloc: Implement " David Hildenbrand
  2020-02-06 12:08   ` Richard Henderson
@ 2020-02-07  0:29   ` Murilo Opsfelder Araújo
  2020-02-10  9:39     ` David Hildenbrand
  1 sibling, 1 reply; 44+ messages in thread
From: Murilo Opsfelder Araújo @ 2020-02-07  0:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Greg Kurz, Paolo Bonzini,
	Richard Henderson

Hello, David.

On Monday, February 3, 2020 3:31:21 PM -03 David Hildenbrand wrote:
> Implement resizable mmaps. For now, the actual resizing is not wired up.
> Introduce qemu_ram_mmap_resizable() and qemu_ram_mmap_resize(). Make
> qemu_ram_mmap() a wrapper of qemu_ram_mmap_resizable().
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/qemu/mmap-alloc.h | 21 ++++++++++++-------
>  util/mmap-alloc.c         | 44 ++++++++++++++++++++++++++++-----------
>  2 files changed, 45 insertions(+), 20 deletions(-)
>
> diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
> index e786266b92..70bc8e9637 100644
> --- a/include/qemu/mmap-alloc.h
> +++ b/include/qemu/mmap-alloc.h
> @@ -7,11 +7,13 @@ size_t qemu_fd_getpagesize(int fd);
>  size_t qemu_mempath_getpagesize(const char *mem_path);
>
>  /**
> - * qemu_ram_mmap: mmap the specified file or device.
> + * qemu_ram_mmap_resizable: reserve a memory region of @max_size to mmap the
> + *                          specified file or device and mmap @size of it.
>   *
>   * Parameters:
>   *  @fd: the file or the device to mmap
>   *  @size: the number of bytes to be mmaped
> + *  @max_size: the number of bytes to be reserved
>   *  @align: if not zero, specify the alignment of the starting mapping address;
>   *          otherwise, the alignment in use will be determined by QEMU.
>   *  @shared: map has RAM_SHARED flag.
> @@ -21,12 +23,15 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
>   *  On success, return a pointer to the mapped area.
>   *  On failure, return MAP_FAILED.
>   */
> -void *qemu_ram_mmap(int fd,
> -                    size_t size,
> -                    size_t align,
> -                    bool shared,
> -                    bool is_pmem);
> -
> -void qemu_ram_munmap(int fd, void *ptr, size_t size);
> +void *qemu_ram_mmap_resizable(int fd, size_t size, size_t max_size,
> +                              size_t align, bool shared, bool is_pmem);
> +void *qemu_ram_mmap_resize(void *ptr, int fd, size_t old_size, size_t new_size,
> +                           bool shared, bool is_pmem);
> +static inline void *qemu_ram_mmap(int fd, size_t size, size_t align,
> +                                  bool shared, bool is_pmem)
> +{
> +    return qemu_ram_mmap_resizable(fd, size, size, align, shared, is_pmem);
> +}
> +void qemu_ram_munmap(int fd, void *ptr, size_t max_size);
>
>  #endif
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 63ad6893b7..2d562145e9 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -172,11 +172,8 @@ static inline size_t mmap_pagesize(int fd)
>  #endif
>  }
>
> -void *qemu_ram_mmap(int fd,
> -                    size_t size,
> -                    size_t align,
> -                    bool shared,
> -                    bool is_pmem)
> +void *qemu_ram_mmap_resizable(int fd, size_t size, size_t max_size,
> +                              size_t align, bool shared, bool is_pmem)
>  {
>      const size_t pagesize = mmap_pagesize(fd);
>      size_t offset, total;
> @@ -184,12 +181,14 @@ void *qemu_ram_mmap(int fd,
>
>      /* we can only map whole pages */
>      size = QEMU_ALIGN_UP(size, pagesize);
> +    max_size = QEMU_ALIGN_UP(max_size, pagesize);
>
>      /*
>       * Note: this always allocates at least one extra page of virtual
> address -     * space, even if size is already aligned.
> +     * space, even if the size is already aligned. We will reserve an area
> of +     * at least max_size, but only populate the requested part of it.
> */
> -    total = size + align;
> +    total = max_size + align;
>
>      guardptr = mmap_reserve(0, total, fd);
>      if (guardptr == MAP_FAILED) {
> @@ -217,22 +216,43 @@ void *qemu_ram_mmap(int fd,
>       * a guard page guarding against potential buffer overflows.
>       */
>      total -= offset;
> -    if (total > size + pagesize) {
> -        munmap(ptr + size + pagesize, total - size - pagesize);
> +    if (total > max_size + pagesize) {
> +        munmap(ptr + max_size + pagesize, total - max_size - pagesize);
>      }
>
>      return ptr;
>  }
>
> -void qemu_ram_munmap(int fd, void *ptr, size_t size)
> +void *qemu_ram_mmap_resize(void *ptr, int fd, size_t old_size, size_t new_size,
> +                           bool shared, bool is_pmem)
>  {
>      const size_t pagesize = mmap_pagesize(fd);
>
>      /* we can only map whole pages */
> -    size = QEMU_ALIGN_UP(size, pagesize);
> +    old_size = QEMU_ALIGN_UP(old_size, pagesize);
> +    new_size = QEMU_ALIGN_UP(new_size, pagesize);

Shouldn't we just assert old_size and new_size are aligned with
pagesize?

> +
> +    /* we support actually resizable memory regions only on Linux */
> +    if (old_size < new_size) {
> +        /* populate the missing piece into the reserved area */
> +        ptr = mmap_populate(ptr + old_size, new_size - old_size, fd, old_size,
> +                            shared, is_pmem);
> +    } else if (old_size > new_size) {
> +        /* discard this piece, keeping the area reserved (should never fail) */
> +        ptr = mmap_reserve(ptr + new_size, old_size - new_size, fd);
> +    }

I find the behaviour of this function somewhat confusing.  Perhaps I'm
missing something and need your help to clarify.  Please bear with me.

For the case where we want to grow in size, it returns a populated area
(PROT_READ | PROT_WRITE flags).

And for the case where we want to shrink in size, it returns a reserved
area (PROT_NONE flag), requiring the caller to call mmap_populate()
again to be able to use that memory.

I believe the behaviour should be consistent regardless if we want to
grow or shrink in size.  Either return a reserved or an already
populated area.  Not both.

Would "old_size == new_size" situation be possible?  In this case, ptr
would be returned without changing protection flags of the mapping.

Shouldn't we also assert that new_size <= max_size?

> +    return ptr;
> +}
> +
> +void qemu_ram_munmap(int fd, void *ptr, size_t max_size)
> +{
> +    const size_t pagesize = mmap_pagesize(fd);
> +
> +    /* we can only map whole pages */
> +    max_size = QEMU_ALIGN_UP(max_size, pagesize);

Shouldn't we just assert this and leave the alignment to the caller?

>
>      if (ptr) {
>          /* Unmap both the RAM block and the guard page */
> -        munmap(ptr, size + pagesize);
> +        munmap(ptr, max_size + pagesize);
>      }
>  }

--
Murilo



* Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
  2020-02-06 20:31   ` David Hildenbrand
@ 2020-02-07 15:28     ` Dr. David Alan Gilbert
  2020-02-10  9:47       ` David Hildenbrand
  0 siblings, 1 reply; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-02-07 15:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S . Tsirkin, Markus Armbruster,
	qemu-devel, Greg Kurz, Alex Williamson, Murilo Opsfelder Araujo,
	Paolo Bonzini, Stefan Weil, Richard Henderson

* David Hildenbrand (david@redhat.com) wrote:
> 
> 
> > Am 06.02.2020 um 21:11 schrieb Dr. David Alan Gilbert <dgilbert@redhat.com>:
> > 
> > * David Hildenbrand (david@redhat.com) wrote:
> >> We already allow resizable ram blocks for anonymous memory, however, they
> >> are not actually resized. All memory is mmaped() R/W, including the memory
> >> exceeding the used_length, up to the max_length.
> >> 
> >> When resizing, effectively only the boundary is moved. Implement actually
> >> resizable anonymous allocations and make use of them in resizable ram
> >> blocks when possible. Memory exceeding the used_length will be
> >> inaccessible. Especially ram block notifiers require care.
> >> 
> >> Having actually resizable anonymous allocations (via mmap-hackery) allows
> >> to reserve a big region in virtual address space and grow the
> >> accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
> >> is set to "never" under Linux, huge reservations will succeed. If there is
> >> not enough memory when resizing (to populate parts of the reserved region),
> >> trying to resize will fail. Only the actually used size is reserved in the
> >> OS.
> >> 
> >> E.g., virtio-mem [1] wants to reserve big resizable memory regions and
> >> grow the usable part on demand. I think this change is worth sending out
> >> individually. Accompanied by a bunch of minor fixes and cleanups.
> >> 
> >> [1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/
> > 
> > There's a few bits I've not understood from skimming the patches:
> > 
> 
> Thanks for having a look!
> 
> >  a) Am I correct in thinking you PROT_NONE the extra space so you can
> > grow/shrink it?
> 
> Yes!
> 
> >  b) What does kvm see - does it have a slot for the whole space or for
> > only the used space?
> 
> Only the used space. Resizing triggers a resize of the memory region. That triggers memory notifiers, which remove the old kvm memslot and re-add the new kvm memslot. (That‘s existing handling, so nothing new).
> 
> So KVM will not see PROT_NONE when creating a slot.

OK, that's easy then.

> >     I ask because we found with virtiofs/DAX experiments that on Power,
> > kvm gets upset if you give it a mapping with PROT_NONE.
> >     (That may be less of an issue if you change the mapping after the
> > slot is created).
> 
> That should work as expected. Resizing *while* kvm is running is tricky, but that‘s not part of this series and a different story :) right now, resizing is only valid on reboot/incoming migration.

Hmm 'when' during an incoming migration; I ask because of userfaultfd
setup for postcopy.  Also note those things can combine - i.e. a reboot
that happens during a migration (we've already got a pile of related
bugs).

> > 
> >  c) It's interesting this is keyed off the RAMBlock notifiers - do
> >     memory_listener's on the address space the block is mapped into get
> >    triggered?  I'm wondering how vhost (and vhost-user) in particular
> >    see this.
> 
> Yes, memory listeners get triggered. Old region is removed, new one is added. Nothing changed on that front.
> 
> The issue with ram block notifiers is that they did not do a „remove old, add new“ on resizes. They only added the full ram block. Bad. E.g., vfio wants to pin all memory - which would fail on PROT_NONE.
> 
> E.g., for HAX, there is no kernel ioctl to remove a ram block ... for SEV there is, but I am not sure about the implications when converting back and forth between encrypted/unencrypted. So SEV and HAX require legacy handling.

I guess for a memory listener it just sees a new layout after the commit
and then can figure out what changed.

Dave

> Cheers!
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v1 09/13] util/mmap-alloc: Implement resizable mmaps
  2020-02-07  0:29   ` Murilo Opsfelder Araújo
@ 2020-02-10  9:39     ` David Hildenbrand
  0 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-10  9:39 UTC (permalink / raw)
  To: Murilo Opsfelder Araújo
  Cc: Eduardo Habkost, Michael S . Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Greg Kurz, Paolo Bonzini,
	Richard Henderson

>>      /* we can only map whole pages */
>> -    size = QEMU_ALIGN_UP(size, pagesize);
>> +    old_size = QEMU_ALIGN_UP(old_size, pagesize);
>> +    new_size = QEMU_ALIGN_UP(new_size, pagesize);
> 
> Shouldn't we just assert old_size and new_size are aligned with
> pagesize?
> 

Already reworked :)

>> +
>> +    /* we support actually resizable memory regions only on Linux */
>> +    if (old_size < new_size) {
>> +        /* populate the missing piece into the reserved area */
>> +        ptr = mmap_populate(ptr + old_size, new_size - old_size, fd, old_size,
>> +                            shared, is_pmem);
>> +    } else if (old_size > new_size) {
>> +        /* discard this piece, keeping the area reserved (should never fail) */
>> +        ptr = mmap_reserve(ptr + new_size, old_size - new_size, fd);
>> +    }
> 
> I find the behaviour of this function somewhat confusing.  Perhaps I'm
> missing something and need your help to clarify.  Please bear with me.
> 
> For the case where we want to grow in size, it returns a populated area
> (PROT_READ | PROT_WRITE flags).
> 
> And for the case where we want to shrink in size, it returns a reserved
> area (PROT_NONE flag), requiring the caller to call mmap_populate()
> again to be able to use that memory.
> 
> I believe the behaviour should be consistent regardless if we want to
> grow or shrink in size.  Either return a reserved or an already
> populated area.  Not both.

The return value is only used to differentiate between success (!=
MAP_FAILED) and failure (== MAP_FAILED), nothing else.

For now, I switched to returning a bool instead (resized vs. not
resized), that avoids this inconsistency. See my reply to Richard's comment.

> 
> Would "old_size == new_size" situation be possible?  In this case, ptr
> would be returned without changing protection flags of the mapping.

It could, although in this patch set, it never would :) The result would
be a NOP, which is the right thing to do.

> 
> Shouldn't we also assert that new_size <= max_size?

Then we would have to pass max_size, just for asserting. I'll leave that
to the callers for now.

> 
>> +    return ptr;
>> +}
>> +
>> +void qemu_ram_munmap(int fd, void *ptr, size_t max_size)
>> +{
>> +    const size_t pagesize = mmap_pagesize(fd);
>> +
>> +    /* we can only map whole pages */
>> +    max_size = QEMU_ALIGN_UP(max_size, pagesize);
> 
> Shouldn't we just assert this and leave the alignment to the caller?

Already reworked.

Thanks!

-- 
Thanks,

David / dhildenb




* Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
  2020-02-07 15:28     ` Dr. David Alan Gilbert
@ 2020-02-10  9:47       ` David Hildenbrand
  0 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-10  9:47 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Eduardo Habkost, Michael S . Tsirkin, Markus Armbruster,
	qemu-devel, Greg Kurz, Alex Williamson, Murilo Opsfelder Araujo,
	Paolo Bonzini, Stefan Weil, Richard Henderson

>> That should work as expected. Resizing *while* kvm is running is tricky, but that‘s not part of this series and a different story :) right now, resizing is only valid on reboot/incoming migration.
> 
> Hmm 'when' during an incoming migration; I ask because of userfaultfd
> setup for postcopy.  Also note those things can combine - i.e. a reboot
> that happens during a migration (we've already got a pile of related
> bugs).

Yes, resizing while migration is in progress (the source already sent the ram block sizes,
or the receiving side is in postcopy), or while dumping is active, is bad. In general,
resizing while *somebody* assumes the used_length of a ram block won't change is bad.

Incoming migration: migration/ram.c: ram_load_precopy(): RAM_SAVE_FLAG_MEM_SIZE

So in preload, before the guest is running and before postcopy has started.
At that point, performing the resize is fine.

> 
>>>
>>>  c) It's interesting this is keyed off the RAMBlock notifiers - do
>>>     memory_listener's on the address space the block is mapped into get
>>>    triggered?  I'm wondering how vhost (and vhost-user) in particular
>>>    see this.
>>
>> Yes, memory listeners get triggered. Old region is removed, new one is added. Nothing changed on that front.
>>
>> The issue with ram block notifiers is that they did not do a „remove old, add new“ on resizes. They only added the full ram block. Bad. E.g., vfio wants to pin all memory - which would fail on PROT_NONE.
>>
>> E.g., for HAX, there is no kernel ioctl to remove a ram block ... for SEV there is, but I am not sure about the implications when converting back and forth between encrypted/unencrypted. So SEV and HAX require legacy handling.
> 
> I guess for a memory listener it just sees a new layout after the commit
> and then can figure out what changed.

Exactly.


-- 
Thanks,

David / dhildenb




* Re: [PATCH v1 13/13] exec: Ram blocks with resizable anonymous allocations under POSIX
  2020-02-03 18:31 ` [PATCH v1 13/13] exec: Ram blocks with resizable " David Hildenbrand
@ 2020-02-10 10:12   ` David Hildenbrand
  0 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-10 10:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Stefan Weil,
	Dr . David Alan Gilbert, Paolo Bonzini, Richard Henderson

On 03.02.20 19:31, David Hildenbrand wrote:
> We can now make use of resizable anonymous allocations to implement
> actually resizable ram blocks. Resizable anonymous allocations are
> not implemented under WIN32 yet and are not available when using
> alternative allocators. Fall back to the existing handling.
> 
> We also have to fallback to the existing handling in case any ram block
> notifier does not support resizing (esp., AMD SEV, HAX) yet. Remember
> in RAM_RESIZEABLE_ALLOC if we are using resizable anonymous allocations.
> 
> As the mmap()-hackery will invalidate some madvise settings, we have to
> re-apply them after resizing. After resizing, notify the ram block
> notifiers.
> 
> The benefit of actually resizable ram blocks is that e.g., under Linux,
> only the actual size will be reserved (even if
> "/proc/sys/vm/overcommit_memory" is set to "never"). Additional memory will
> be reserved when trying to resize, which allows to have ram blocks that
> start small but can theoretically grow very large.
> 
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: Stefan Weil <sw@weilnetz.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  exec.c                    | 68 +++++++++++++++++++++++++++++++++++----
>  hw/core/numa.c            | 10 ++++--
>  include/exec/cpu-common.h |  2 ++
>  include/exec/memory.h     |  8 +++++
>  4 files changed, 79 insertions(+), 9 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index fc65c4f7ca..a59d1efde3 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2053,6 +2053,16 @@ void qemu_ram_unset_migratable(RAMBlock *rb)
>      rb->flags &= ~RAM_MIGRATABLE;
>  }
>  
> +bool qemu_ram_is_resizable(RAMBlock *rb)
> +{
> +    return rb->flags & RAM_RESIZEABLE;
> +}
> +
> +bool qemu_ram_is_resizable_alloc(RAMBlock *rb)
> +{
> +    return rb->flags & RAM_RESIZEABLE_ALLOC;
> +}
> +
>  /* Called with iothread lock held.  */
>  void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
>  {
> @@ -2139,6 +2149,8 @@ static void qemu_ram_apply_settings(void *host, size_t length)
>   */
>  int qemu_ram_resize(RAMBlock *block, ram_addr_t newsize, Error **errp)
>  {
> +    const uint64_t oldsize = block->used_length;
> +
>      assert(block);
>  
>      newsize = HOST_PAGE_ALIGN(newsize);
> @@ -2147,7 +2159,7 @@ int qemu_ram_resize(RAMBlock *block, ram_addr_t newsize, Error **errp)
>          return 0;
>      }
>  
> -    if (!(block->flags & RAM_RESIZEABLE)) {
> +    if (!qemu_ram_is_resizable(block)) {
>          error_setg_errno(errp, EINVAL,
>                           "Length mismatch: %s: 0x" RAM_ADDR_FMT
>                           " in != 0x" RAM_ADDR_FMT, block->idstr,
> @@ -2163,10 +2175,26 @@ int qemu_ram_resize(RAMBlock *block, ram_addr_t newsize, Error **errp)
>          return -EINVAL;
>      }
>  
> +    if (qemu_ram_is_resizable_alloc(block)) {
> +        g_assert(ram_block_notifiers_support_resize());
> +        if (qemu_anon_ram_resize(block->host, block->used_length,
> +                                 newsize, block->flags & RAM_SHARED) == NULL) {
> +            error_setg_errno(errp, -ENOMEM,
> +                             "Could not allocate enough memory.");
> +            return -ENOMEM;
> +        }
> +    }
> +

I'll most probably rework to have separate paths when growing/shrinking,
with a different sequence of steps. (e.g., perform actual shrinking as
last step, so all mappings remain valid, and ignore errors (which are
unlikely either way)).


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 11/13] util: vfio-helpers: Implement ram_block_resized()
  2020-02-03 18:31 ` [PATCH v1 11/13] util: vfio-helpers: Implement ram_block_resized() David Hildenbrand
@ 2020-02-10 13:41   ` David Hildenbrand
  0 siblings, 0 replies; 44+ messages in thread
From: David Hildenbrand @ 2020-02-10 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Markus Armbruster,
	Dr . David Alan Gilbert, Alex Williamson, Paolo Bonzini,
	Richard Henderson

On 03.02.20 19:31, David Hildenbrand wrote:
> Let's implement ram_block_resized().
> 
> Note: Resizing is currently only allowed during reboot or when migration
> starts.
> 
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  util/vfio-helpers.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index 71e02e7f35..57d77e9480 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -395,11 +395,24 @@ static void qemu_vfio_ram_block_removed(RAMBlockNotifier *n,
>      }
>  }
>  
> +static void qemu_vfio_ram_block_resized(RAMBlockNotifier *n, void *host,
> +                                        size_t oldsize, size_t newsize)
> +{
> +    QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
> +    if (host) {
> +        trace_qemu_vfio_ram_block_resized(s, host, oldsize, newsize);
> +        /* Note: Not atomic - we need a new ioctl for that. */
> +        qemu_vfio_ram_block_removed(n, host, oldsize);
> +        qemu_vfio_ram_block_added(n, host, newsize);
> +    }
> +}
> +
>  static void qemu_vfio_open_common(QEMUVFIOState *s)
>  {
>      qemu_mutex_init(&s->lock);
>      s->ram_notifier.ram_block_added = qemu_vfio_ram_block_added;
>      s->ram_notifier.ram_block_removed = qemu_vfio_ram_block_removed;
> +    s->ram_notifier.ram_block_resized = qemu_vfio_ram_block_resized;
>      s->low_water_mark = QEMU_VFIO_IOVA_MIN;
>      s->high_water_mark = QEMU_VFIO_IOVA_MAX;
>      ram_block_notifier_add(&s->ram_notifier);
> 

I'll most probably change the handling, to reserve the IOVA for max_size
of the ram block, but only map the usable size. Addresses in the IOVA
won't be reused.

If I am not wrong, hotplugging+unplugging DIMMs a couple of times can
easily eat up the whole IOVA. Same would be true on every resize.

At this point, I really detest ram block notifiers. :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2020-02-10 13:42 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-03 18:31 [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX David Hildenbrand
2020-02-03 18:31 ` [PATCH v1 01/13] util: vfio-helpers: Factor out and fix processing of existings ram blocks David Hildenbrand
2020-02-03 18:31 ` [PATCH v1 02/13] exec: Factor out setting ram settings (madvise ...) into qemu_ram_apply_settings() David Hildenbrand
2020-02-06 11:42   ` Richard Henderson
2020-02-03 18:31 ` [PATCH v1 03/13] exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap() David Hildenbrand
2020-02-06 11:43   ` Richard Henderson
2020-02-03 18:31 ` [PATCH v1 04/13] exec: Drop "shared" parameter from ram_block_add() David Hildenbrand
2020-02-06 11:44   ` Richard Henderson
2020-02-03 18:31 ` [PATCH v1 05/13] util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize() David Hildenbrand
2020-02-05 19:37   ` Murilo Opsfelder Araújo
2020-02-06 11:46   ` Richard Henderson
2020-02-03 18:31 ` [PATCH v1 06/13] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve() David Hildenbrand
2020-02-05 19:40   ` Murilo Opsfelder Araújo
2020-02-06 11:55   ` Richard Henderson
2020-02-06 13:16     ` David Hildenbrand
2020-02-03 18:31 ` [PATCH v1 07/13] util/mmap-alloc: Factor out populating of memory to mmap_populate() David Hildenbrand
2020-02-05 19:56   ` Murilo Opsfelder Araújo
2020-02-06  9:26     ` David Hildenbrand
2020-02-06 11:59   ` Richard Henderson
2020-02-03 18:31 ` [PATCH v1 08/13] util/mmap-alloc: Prepare for resizable mmaps David Hildenbrand
2020-02-05 23:00   ` Murilo Opsfelder Araújo
2020-02-06  8:52     ` David Hildenbrand
2020-02-06 12:31       ` Murilo Opsfelder Araújo
2020-02-06 13:16         ` David Hildenbrand
2020-02-06 15:13     ` David Hildenbrand
2020-02-06 12:02   ` Richard Henderson
2020-02-03 18:31 ` [PATCH v1 09/13] util/mmap-alloc: Implement " David Hildenbrand
2020-02-06 12:08   ` Richard Henderson
2020-02-06 13:22     ` David Hildenbrand
2020-02-06 15:27       ` David Hildenbrand
2020-02-07  0:29   ` Murilo Opsfelder Araújo
2020-02-10  9:39     ` David Hildenbrand
2020-02-03 18:31 ` [PATCH v1 10/13] numa: Introduce ram_block_notify_resized() and ram_block_notifiers_support_resize() David Hildenbrand
2020-02-03 18:31 ` [PATCH v1 11/13] util: vfio-helpers: Implement ram_block_resized() David Hildenbrand
2020-02-10 13:41   ` David Hildenbrand
2020-02-03 18:31 ` [PATCH v1 12/13] util: oslib: Resizable anonymous allocations under POSIX David Hildenbrand
2020-02-03 18:31 ` [PATCH v1 13/13] exec: Ram blocks with resizable " David Hildenbrand
2020-02-10 10:12   ` David Hildenbrand
2020-02-06  9:27 ` [PATCH v1 00/13] " Michael S. Tsirkin
2020-02-06  9:45   ` David Hildenbrand
2020-02-06 20:11 ` Dr. David Alan Gilbert
2020-02-06 20:31   ` David Hildenbrand
2020-02-07 15:28     ` Dr. David Alan Gilbert
2020-02-10  9:47       ` David Hildenbrand

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.