* [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
@ 2016-06-28  9:01 Peter Lieven
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 01/15] coroutine-ucontext: mmap stack memory Peter Lieven
                   ` (16 more replies)
  0 siblings, 17 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

I recently found that Qemu is using several hundred megabytes of RSS memory
more than older versions such as Qemu 2.2.0. So I started tracing
memory allocation and found 2 major reasons for this.

1) We changed the qemu coroutine pool to have a per-thread allocation pool and a
   global release pool. The chosen pool size and the changed algorithm can lead
   to up to 192 free coroutines with just a single iothread, each of them
   holding 1MB of stack memory (see the rough figure after this list).

2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced, which led to delayed freeing
   of memory. This led to higher heap allocations which could not effectively
   be returned to the kernel (most likely due to fragmentation).
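
   As a rough figure for 1), using the constants from util/qemu-coroutine.c: the
   shared release pool is capped at 2 * POOL_BATCH_SIZE = 128 coroutines and the
   iothread's allocation pool at another POOL_BATCH_SIZE = 64, so up to 192 idle
   coroutines times 1MB of stack is roughly 192MB kept allocated.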

The following series is what I came up with. Besides the coroutine patches I changed
some allocations to forcibly use mmap. None of these allocations is made repeatedly
during runtime, so the impact of using mmap should be negligible.

There are still some big malloc'ed allocations left which cannot easily be changed
(e.g. the pixman buffers in VNC). So it might be an idea to set a lower mmap threshold
for malloc, since this threshold seems to be on the order of several megabytes on
modern systems.
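
One way to experiment with that without touching individual call sites would be
glibc's mallopt(); a minimal sketch (the 128kB value is only an example, not a
tuned recommendation):

#include <malloc.h>

int main(void)
{
    /* serve allocations of 128kB and larger via mmap; setting the threshold
     * explicitly also disables glibc's dynamic threshold adjustment */
    mallopt(M_MMAP_THRESHOLD, 128 * 1024);
    /* ... rest of the program ... */
    return 0;
}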

Peter Lieven (15):
  coroutine-ucontext: mmap stack memory
  coroutine-ucontext: add a switch to monitor maximum stack size
  coroutine-ucontext: reduce stack size to 64kB
  coroutine: add a knob to disable the shared release pool
  util: add a helper to mmap private anonymous memory
  exec: use mmap for subpages
  qapi: use mmap for QmpInputVisitor
  virtio: use mmap for VirtQueue
  loader: use mmap for ROMs
  vmware_svga: use mmap for scratch pad
  qom: use mmap for bigger Objects
  util: add a function to realloc mmapped memory
  exec: use mmap for PhysPageMap->nodes
  vnc-tight: make the encoding palette static
  vnc: use mmap for VncState

 configure                 | 33 ++++++++++++++++++--
 exec.c                    | 11 ++++---
 hw/core/loader.c          | 16 +++++-----
 hw/display/vmware_vga.c   |  3 +-
 hw/virtio/virtio.c        |  5 +--
 include/qemu/mmap-alloc.h |  7 +++++
 include/qom/object.h      |  1 +
 qapi/qmp-input-visitor.c  |  5 +--
 qom/object.c              | 20 ++++++++++--
 ui/vnc-enc-tight.c        | 21 ++++++-------
 ui/vnc.c                  |  5 +--
 ui/vnc.h                  |  1 +
 util/coroutine-ucontext.c | 66 +++++++++++++++++++++++++++++++++++++--
 util/mmap-alloc.c         | 27 ++++++++++++++++
 util/qemu-coroutine.c     | 79 ++++++++++++++++++++++++++---------------------
 15 files changed, 225 insertions(+), 75 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 01/15] coroutine-ucontext: mmap stack memory
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28 10:02   ` Peter Maydell
  2016-06-28 11:04   ` Paolo Bonzini
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 02/15] coroutine-ucontext: add a switch to monitor maximum stack size Peter Lieven
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

coroutine-ucontext currently allocates stack memory from the heap because on most systems the
stack size lies below the threshold for mmapping memory. This patch forces mmapping
of stacks to avoid large holes on the heap when a coroutine is deleted. It additionally
allows us to add a guard page at the bottom of the stack to avoid overflows.

Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Lieven <pl@kamp.de>
---
 util/coroutine-ucontext.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/util/coroutine-ucontext.c b/util/coroutine-ucontext.c
index 2bb7e10..841e7db 100644
--- a/util/coroutine-ucontext.c
+++ b/util/coroutine-ucontext.c
@@ -80,9 +80,10 @@ static void coroutine_trampoline(int i0, int i1)
     }
 }
 
+#define COROUTINE_STACK_SIZE (1 << 20)
+
 Coroutine *qemu_coroutine_new(void)
 {
-    const size_t stack_size = 1 << 20;
     CoroutineUContext *co;
     ucontext_t old_uc, uc;
     sigjmp_buf old_env;
@@ -101,17 +102,32 @@ Coroutine *qemu_coroutine_new(void)
     }
 
     co = g_malloc0(sizeof(*co));
+
+#ifdef MAP_GROWSDOWN
+    co->stack = mmap(NULL, COROUTINE_STACK_SIZE, PROT_READ | PROT_WRITE,
+                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
+    if (co->stack == MAP_FAILED) {
+        abort();
+    }
+    /* add a guard page at bottom of the stack */
+    if (mmap(co->stack, getpagesize(), PROT_NONE,
+        MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0) == MAP_FAILED) {
+        abort();
+    }
+#else
     co->stack = g_malloc(COROUTINE_STACK_SIZE);
+#endif
+
     co->base.entry_arg = &old_env; /* stash away our jmp_buf */
 
     uc.uc_link = &old_uc;
     uc.uc_stack.ss_sp = co->stack;
-    uc.uc_stack.ss_size = stack_size;
+    uc.uc_stack.ss_size = COROUTINE_STACK_SIZE;
     uc.uc_stack.ss_flags = 0;
 
 #ifdef CONFIG_VALGRIND_H
     co->valgrind_stack_id =
-        VALGRIND_STACK_REGISTER(co->stack, co->stack + stack_size);
+        VALGRIND_STACK_REGISTER(co->stack, co->stack + COROUTINE_STACK_SIZE);
 #endif
 
     arg.p = co;
@@ -149,7 +165,11 @@ void qemu_coroutine_delete(Coroutine *co_)
     valgrind_stack_deregister(co);
 #endif
 
+#ifdef MAP_GROWSDOWN
+    munmap(co->stack, COROUTINE_STACK_SIZE);
+#else
     g_free(co->stack);
+#endif
     g_free(co);
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 02/15] coroutine-ucontext: add a switch to monitor maximum stack size
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 01/15] coroutine-ucontext: mmap stack memory Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB Peter Lieven
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

this adds a debug configure switch to enable monitoring of the maximum
stack size used by all coroutines.
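
With the switch enabled, the destructor added below reports a line of the
following form via error_report() at process exit (the value is only
illustrative):

  coroutine-ucontext: max stack usage was less or equal to 4096 bytes.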

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 configure                 | 18 ++++++++++++++++++
 util/coroutine-ucontext.c | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/configure b/configure
index 5929aba..82bcc25 100755
--- a/configure
+++ b/configure
@@ -296,6 +296,7 @@ libiscsi=""
 libnfs=""
 coroutine=""
 coroutine_pool=""
+coroutine_stack_size_debug="no"
 seccomp=""
 glusterfs=""
 glusterfs_xlator_opt="no"
@@ -1004,6 +1005,8 @@ for opt do
   ;;
   --enable-coroutine-pool) coroutine_pool="yes"
   ;;
+  --enable-coroutine-stack-size-debug) coroutine_stack_size_debug="yes"
+  ;;
   --disable-docs) docs="no"
   ;;
   --enable-docs) docs="yes"
@@ -1361,6 +1364,8 @@ disabled with --disable-FEATURE, default is enabled if available:
                   (for reading bzip2-compressed dmg images)
   seccomp         seccomp support
   coroutine-pool  coroutine freelist (better performance)
+  coroutine-stack-size-debug
+                  report coroutine max stack usage (only for debugging)
   glusterfs       GlusterFS backend
   archipelago     Archipelago backend
   tpm             TPM support
@@ -4298,6 +4303,15 @@ fi
 if test "$coroutine" = "gthread" -a "$coroutine_pool" = "yes"; then
   error_exit "'gthread' coroutine backend does not support pool (use --disable-coroutine-pool)"
 fi
+if test "$coroutine_stack_size_debug" = "yes"; then
+  if test "$coroutine" != "ucontext"; then
+    error_exit "coroutine stack size debugging currently only works with ucontext"
+  fi
+  if test "$coroutine_pool" = "yes"; then
+    echo "WARN: disabling coroutine pool for stack size debugging"
+    coroutine_pool=no
+  fi
+fi
 
 ##########################################
 # check if we have open_by_handle_at
@@ -4866,6 +4880,7 @@ echo "QGA MSI support   $guest_agent_msi"
 echo "seccomp support   $seccomp"
 echo "coroutine backend $coroutine"
 echo "coroutine pool    $coroutine_pool"
+echo "coroutine stack size debug $coroutine_stack_size_debug"
 echo "GlusterFS support $glusterfs"
 echo "Archipelago support $archipelago"
 echo "gcov              $gcov_tool"
@@ -5335,6 +5350,9 @@ if test "$coroutine_pool" = "yes" ; then
 else
   echo "CONFIG_COROUTINE_POOL=0" >> $config_host_mak
 fi
+if test "$coroutine_stack_size_debug" = "yes" ; then
+  echo "CONFIG_COROUTINE_STACK_SIZE_DEBUG=y" >> $config_host_mak
+fi
 
 if test "$open_by_handle_at" = "yes" ; then
   echo "CONFIG_OPEN_BY_HANDLE=y" >> $config_host_mak
diff --git a/util/coroutine-ucontext.c b/util/coroutine-ucontext.c
index 841e7db..27c61f3 100644
--- a/util/coroutine-ucontext.c
+++ b/util/coroutine-ucontext.c
@@ -31,6 +31,10 @@
 #include <valgrind/valgrind.h>
 #endif
 
+#ifdef CONFIG_COROUTINE_STACK_SIZE_DEBUG
+#include "qemu/error-report.h"
+#endif
+
 typedef struct {
     Coroutine base;
     void *stack;
@@ -48,6 +52,10 @@ typedef struct {
 static __thread CoroutineUContext leader;
 static __thread Coroutine *current;
 
+#ifdef CONFIG_COROUTINE_STACK_SIZE_DEBUG
+static uint32_t max_stack_usage;
+#endif
+
 /*
  * va_args to makecontext() must be type 'int', so passing
  * the pointer we need may require several int args. This
@@ -88,6 +96,9 @@ Coroutine *qemu_coroutine_new(void)
     ucontext_t old_uc, uc;
     sigjmp_buf old_env;
     union cc_arg arg = {0};
+#ifdef CONFIG_COROUTINE_STACK_SIZE_DEBUG
+    void *ptr;
+#endif
 
     /* The ucontext functions preserve signal masks which incurs a
      * system call overhead.  sigsetjmp(buf, 0)/siglongjmp() does not
@@ -118,6 +129,13 @@ Coroutine *qemu_coroutine_new(void)
     co->stack = g_malloc(COROUTINE_STACK_SIZE);
 #endif
 
+#ifdef CONFIG_COROUTINE_STACK_SIZE_DEBUG
+    for (ptr = co->stack + getpagesize();
+         ptr < co->stack + COROUTINE_STACK_SIZE; ptr += sizeof(u_int32_t)) {
+        *(u_int32_t *)ptr = 0xdeadbeaf;
+    }
+#endif
+
     co->base.entry_arg = &old_env; /* stash away our jmp_buf */
 
     uc.uc_link = &old_uc;
@@ -161,6 +179,20 @@ void qemu_coroutine_delete(Coroutine *co_)
 {
     CoroutineUContext *co = DO_UPCAST(CoroutineUContext, base, co_);
 
+#ifdef CONFIG_COROUTINE_STACK_SIZE_DEBUG
+    void *ptr;
+    for (ptr = co->stack + getpagesize();
+         ptr < co->stack + COROUTINE_STACK_SIZE; ptr += sizeof(u_int32_t)) {
+        if (*(u_int32_t *)ptr != 0xdeadbeaf) {
+            break;
+        }
+    }
+    /* we only want to estimate the max stack usage, the OR will overestimate
+     * the stack usage, but this is ok here and avoids the usage of a mutex */
+    atomic_or(&max_stack_usage,
+              COROUTINE_STACK_SIZE - (uintptr_t) (ptr - co->stack));
+#endif
+
 #ifdef CONFIG_VALGRIND_H
     valgrind_stack_deregister(co);
 #endif
@@ -210,3 +242,11 @@ bool qemu_in_coroutine(void)
 {
     return current && current->caller;
 }
+
+#ifdef CONFIG_COROUTINE_STACK_SIZE_DEBUG
+static void __attribute__((destructor)) print_max_stack_usage(void)
+{
+   error_report("coroutine-ucontext: max stack usage was less or equal to "
+                "%"PRIu32" bytes.", max_stack_usage);
+}
+#endif
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 01/15] coroutine-ucontext: mmap stack memory Peter Lieven
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 02/15] coroutine-ucontext: add a switch to monitor maximum stack size Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28 10:54   ` Paolo Bonzini
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 04/15] coroutine: add a knob to disable the shared release pool Peter Lieven
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

evaluation with the recently introduced maximum stack size monitoring revealed
that the actually used stack size was never above 4kB, so allocating a 1MB stack
for each coroutine wastes a lot of memory. So reduce the stack size to
64kB, which should still give enough headroom.
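
A rough estimate of the effect, based on the pool figures from the cover letter:
with up to 192 pooled coroutines the reserved stack memory drops from about
192MB (192 * 1MB) to about 12MB (192 * 64kB). Only pages that are actually
touched count towards RSS, but the smaller reservation also bounds how much a
single coroutine can ever dirty before hitting the guard page.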

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 util/coroutine-ucontext.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/coroutine-ucontext.c b/util/coroutine-ucontext.c
index 27c61f3..7f1d541 100644
--- a/util/coroutine-ucontext.c
+++ b/util/coroutine-ucontext.c
@@ -88,7 +88,7 @@ static void coroutine_trampoline(int i0, int i1)
     }
 }
 
-#define COROUTINE_STACK_SIZE (1 << 20)
+#define COROUTINE_STACK_SIZE (1 << 16)
 
 Coroutine *qemu_coroutine_new(void)
 {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 04/15] coroutine: add a knob to disable the shared release pool
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (2 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28 10:41   ` Paolo Bonzini
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 05/15] util: add a helper to mmap private anonymous memory Peter Lieven
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

the current coroutine freelist implementation has 2 kinds of pools:
one release pool shared between all threads and additionally one
allocation pool per thread. The release pool is especially necessary
if a coroutine is created in a different thread than the one in which
it is released. This is e.g. the case if an IDE interface is used.

But in times of virtio and dataplane the release pool adds costs
which are not entirely necessary. First, if virtio is used the release
pool tends to fill up to 100% because all coroutines are first handed
back to the release pool. On coroutine creation a thread can steal the
release pool and make it its local allocation pool, but with mixed
I/O patterns the release pool ends up full of unused coroutines
while the alloc_pool has also grown to its maximum size.

So this patch introduces a knob to disable the release pool to avoid
this behaviour. If this switch is used it should be made sure that
all fast block devices use virtio and that each virtio device
has its own thread (dataplane).

An IDE cdrom might still be used, but coroutine creation for it will be
slow; a CDROM is considered slow anyway.
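
A setup like this would then be configured with something along the lines of
"./configure --disable-coroutine-release-pool" (the per-thread allocation pool
stays enabled unless --disable-coroutine-pool is given as well); both switches
are handled in the configure hunk below.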

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 configure             | 15 ++++++++--
 util/qemu-coroutine.c | 79 ++++++++++++++++++++++++++++-----------------------
 2 files changed, 56 insertions(+), 38 deletions(-)

diff --git a/configure b/configure
index 82bcc25..fb29034 100755
--- a/configure
+++ b/configure
@@ -296,6 +296,7 @@ libiscsi=""
 libnfs=""
 coroutine=""
 coroutine_pool=""
+coroutine_release_pool="yes"
 coroutine_stack_size_debug="no"
 seccomp=""
 glusterfs=""
@@ -1001,10 +1002,14 @@ for opt do
   ;;
   --with-coroutine=*) coroutine="$optarg"
   ;;
-  --disable-coroutine-pool) coroutine_pool="no"
+  --disable-coroutine-pool)
+      coroutine_pool="no"
+      coroutine_release_pool="no"
   ;;
   --enable-coroutine-pool) coroutine_pool="yes"
   ;;
+  --disable-coroutine-release-pool) coroutine_release_pool="no"
+  ;;
   --enable-coroutine-stack-size-debug) coroutine_stack_size_debug="yes"
   ;;
   --disable-docs) docs="no"
@@ -1364,6 +1369,7 @@ disabled with --disable-FEATURE, default is enabled if available:
                   (for reading bzip2-compressed dmg images)
   seccomp         seccomp support
   coroutine-pool  coroutine freelist (better performance)
+  coroutine-release-pool  coroutine freelist is shared between threads
   coroutine-stack-size-debug
                   report coroutine max stack usage (only for debugging)
   glusterfs       GlusterFS backend
@@ -4310,6 +4316,7 @@ if test "$coroutine_stack_size_debug" = "yes"; then
   if test "$coroutine_pool" = "yes"; then
     echo "WARN: disabling coroutine pool for stack size debugging"
     coroutine_pool=no
+    coroutine_release_pool=no
   fi
 fi
 
@@ -4880,6 +4887,7 @@ echo "QGA MSI support   $guest_agent_msi"
 echo "seccomp support   $seccomp"
 echo "coroutine backend $coroutine"
 echo "coroutine pool    $coroutine_pool"
+echo "coroutine release pool    $coroutine_release_pool"
 echo "coroutine stack size debug $coroutine_stack_size_debug"
 echo "GlusterFS support $glusterfs"
 echo "Archipelago support $archipelago"
@@ -5347,12 +5355,13 @@ fi
 echo "CONFIG_COROUTINE_BACKEND=$coroutine" >> $config_host_mak
 if test "$coroutine_pool" = "yes" ; then
   echo "CONFIG_COROUTINE_POOL=1" >> $config_host_mak
-else
-  echo "CONFIG_COROUTINE_POOL=0" >> $config_host_mak
 fi
 if test "$coroutine_stack_size_debug" = "yes" ; then
   echo "CONFIG_COROUTINE_STACK_SIZE_DEBUG=y" >> $config_host_mak
 fi
+if test "$coroutine_release_pool" = "yes"; then
+  echo "CONFIG_COROUTINE_RELEASE_POOL=y" >> $config_host_mak
+fi
 
 if test "$open_by_handle_at" = "yes" ; then
   echo "CONFIG_OPEN_BY_HANDLE=y" >> $config_host_mak
diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5816702..7dda0ca 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -20,13 +20,12 @@
 #include "qemu/coroutine.h"
 #include "qemu/coroutine_int.h"
 
+#ifdef CONFIG_COROUTINE_POOL
+/* per thread free list to speed up creation */
 enum {
     POOL_BATCH_SIZE = 64,
 };
 
-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
 static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
@@ -41,35 +40,43 @@ static void coroutine_pool_cleanup(Notifier *n, void *value)
         qemu_coroutine_delete(co);
     }
 }
+#endif
+#ifdef CONFIG_COROUTINE_RELEASE_POOL
+/* add an additional shared release pool */
+static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
+static unsigned int release_pool_size;
+#endif
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 {
     Coroutine *co = NULL;
 
-    if (CONFIG_COROUTINE_POOL) {
-        co = QSLIST_FIRST(&alloc_pool);
-        if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
-                }
-
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
-                co = QSLIST_FIRST(&alloc_pool);
-            }
+#ifdef CONFIG_COROUTINE_POOL
+    co = QSLIST_FIRST(&alloc_pool);
+    if (!co) {
+        /* Slow path; a good place to register the destructor, too.  */
+        if (!coroutine_pool_cleanup_notifier.notify) {
+            coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+            qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
         }
-        if (co) {
-            QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
-            alloc_pool_size--;
+#ifdef CONFIG_COROUTINE_RELEASE_POOL
+        if (release_pool_size > POOL_BATCH_SIZE) {
+
+            /* This is not exact; there could be a little skew between
+             * release_pool_size and the actual size of release_pool.  But
+             * it is just a heuristic, it does not need to be perfect.
+             */
+            alloc_pool_size = atomic_xchg(&release_pool_size, 0);
+            QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
+            co = QSLIST_FIRST(&alloc_pool);
         }
+#endif
+    }
+    if (co) {
+        QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
+        alloc_pool_size--;
     }
+#endif
 
     if (!co) {
         co = qemu_coroutine_new();
@@ -84,18 +91,20 @@ static void coroutine_delete(Coroutine *co)
 {
     co->caller = NULL;
 
-    if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
-        if (alloc_pool_size < POOL_BATCH_SIZE) {
-            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
-            alloc_pool_size++;
-            return;
-        }
+#ifdef CONFIG_COROUTINE_RELEASE_POOL
+    if (release_pool_size < POOL_BATCH_SIZE * 2) {
+        QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
+        atomic_inc(&release_pool_size);
+        return;
+    }
+#endif
+#ifdef CONFIG_COROUTINE_POOL
+    if (alloc_pool_size < POOL_BATCH_SIZE) {
+        QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+        alloc_pool_size++;
+        return;
     }
+#endif
 
     qemu_coroutine_delete(co);
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 05/15] util: add a helper to mmap private anonymous memory
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (3 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 04/15] coroutine: add a knob to disable the shared release pool Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-10-16  2:10   ` Michael S. Tsirkin
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 06/15] exec: use mmap for subpages Peter Lieven
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 include/qemu/mmap-alloc.h |  6 ++++++
 util/mmap-alloc.c         | 17 +++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 0899b2f..a457721 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -9,4 +9,10 @@ void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared);
 
 void qemu_ram_munmap(void *ptr, size_t size);
 
+/* qemu_anon_ram_mmap maps private anonymous memory using mmap and
+ * aborts if the allocation fails. It's meant to act as a replacement
+ * for g_malloc0 and friends. */
+void *qemu_anon_ram_mmap(size_t size);
+void qemu_anon_ram_munmap(void *ptr, size_t size);
+
 #endif
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 629d97a..c099858 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -107,3 +107,20 @@ void qemu_ram_munmap(void *ptr, size_t size)
         munmap(ptr, size + getpagesize());
     }
 }
+
+void *qemu_anon_ram_mmap(size_t size)
+{
+    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (ptr == MAP_FAILED) {
+        abort();
+    }
+    return ptr;
+}
+
+void qemu_anon_ram_munmap(void *ptr, size_t size)
+{
+    if (ptr) {
+        munmap(ptr, size);
+    }
+}
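
For illustration, the later patches in this series use the pair roughly like
this ("Foo" stands in for any large, long-lived structure):

    Foo *f = qemu_anon_ram_mmap(sizeof(*f)); /* zero-filled, aborts on failure */
    /* ... use f for its whole lifetime ... */
    qemu_anon_ram_munmap(f, sizeof(*f));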
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 06/15] exec: use mmap for subpages
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (4 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 05/15] util: add a helper to mmap private anonymous memory Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28 10:48   ` Paolo Bonzini
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor Peter Lieven
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

a lot of subpages are created and freed at startup, but RCU delays
the freeing so the heap gets fragmented.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 exec.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/exec.c b/exec.c
index 0122ef7..1b7be2a 100644
--- a/exec.c
+++ b/exec.c
@@ -49,6 +49,7 @@
 #include "exec/cpu-all.h"
 #include "qemu/rcu_queue.h"
 #include "qemu/main-loop.h"
+#include "qemu/mmap-alloc.h"
 #include "translate-all.h"
 #include "sysemu/replay.h"
 
@@ -1150,7 +1151,7 @@ static void phys_section_destroy(MemoryRegion *mr)
     if (have_sub_page) {
         subpage_t *subpage = container_of(mr, subpage_t, iomem);
         object_unref(OBJECT(&subpage->iomem));
-        g_free(subpage);
+        qemu_anon_ram_munmap(subpage, sizeof(subpage_t));
     }
 }
 
@@ -2270,7 +2271,7 @@ static subpage_t *subpage_init(AddressSpace *as, hwaddr base)
 {
     subpage_t *mmio;
 
-    mmio = g_malloc0(sizeof(subpage_t));
+    mmio = qemu_anon_ram_mmap(sizeof(subpage_t));
 
     mmio->as = as;
     mmio->base = base;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (5 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 06/15] exec: use mmap for subpages Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28  9:29   ` Dr. David Alan Gilbert
                     ` (2 more replies)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 08/15] virtio: use mmap for VirtQueue Peter Lieven
                   ` (9 subsequent siblings)
  16 siblings, 3 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

this struct is approx 75kB

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 qapi/qmp-input-visitor.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/qapi/qmp-input-visitor.c b/qapi/qmp-input-visitor.c
index aea90a1..b6f5dfd 100644
--- a/qapi/qmp-input-visitor.c
+++ b/qapi/qmp-input-visitor.c
@@ -17,6 +17,7 @@
 #include "qapi/qmp-input-visitor.h"
 #include "qapi/visitor-impl.h"
 #include "qemu/queue.h"
+#include "qemu/mmap-alloc.h"
 #include "qemu-common.h"
 #include "qapi/qmp/types.h"
 #include "qapi/qmp/qerror.h"
@@ -378,14 +379,14 @@ Visitor *qmp_input_get_visitor(QmpInputVisitor *v)
 void qmp_input_visitor_cleanup(QmpInputVisitor *v)
 {
     qobject_decref(v->root);
-    g_free(v);
+    qemu_anon_ram_munmap(v, sizeof(*v));
 }
 
 QmpInputVisitor *qmp_input_visitor_new(QObject *obj, bool strict)
 {
     QmpInputVisitor *v;
 
-    v = g_malloc0(sizeof(*v));
+    v = qemu_anon_ram_mmap(sizeof(*v));
 
     v->visitor.type = VISITOR_INPUT;
     v->visitor.start_struct = qmp_input_start_struct;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 08/15] virtio: use mmap for VirtQueue
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (6 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 09/15] loader: use mmap for ROMs Peter Lieven
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

a VirtQueue is approx. 128kB in size.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 hw/virtio/virtio.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 7ed06ea..bf4bc4a 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -20,6 +20,7 @@
 #include "qemu/error-report.h"
 #include "hw/virtio/virtio.h"
 #include "qemu/atomic.h"
+#include "qemu/mmap-alloc.h"
 #include "hw/virtio/virtio-bus.h"
 #include "migration/migration.h"
 #include "hw/virtio/virtio-access.h"
@@ -1612,7 +1613,7 @@ void virtio_cleanup(VirtIODevice *vdev)
 {
     qemu_del_vm_change_state_handler(vdev->vmstate);
     g_free(vdev->config);
-    g_free(vdev->vq);
+    qemu_anon_ram_munmap(vdev->vq, sizeof(VirtQueue) * VIRTIO_QUEUE_MAX);
     g_free(vdev->vector_queues);
 }
 
@@ -1666,7 +1667,7 @@ void virtio_init(VirtIODevice *vdev, const char *name,
     vdev->isr = 0;
     vdev->queue_sel = 0;
     vdev->config_vector = VIRTIO_NO_VECTOR;
-    vdev->vq = g_malloc0(sizeof(VirtQueue) * VIRTIO_QUEUE_MAX);
+    vdev->vq = qemu_anon_ram_mmap(sizeof(VirtQueue) * VIRTIO_QUEUE_MAX);
     vdev->vm_running = runstate_is_running();
     for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
         vdev->vq[i].vector = VIRTIO_NO_VECTOR;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 09/15] loader: use mmap for ROMs
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (7 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 08/15] virtio: use mmap for VirtQueue Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28 10:41   ` Paolo Bonzini
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 10/15] vmware_svga: use mmap for scratch pad Peter Lieven
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

a classic use for mmap here.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 hw/core/loader.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/hw/core/loader.c b/hw/core/loader.c
index 53e0e41..f217edc 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -55,6 +55,7 @@
 #include "exec/address-spaces.h"
 #include "hw/boards.h"
 #include "qemu/cutils.h"
+#include "qemu/mmap-alloc.h"
 
 #include <zlib.h>
 
@@ -837,7 +838,7 @@ int rom_add_file(const char *file, const char *fw_dir,
 {
     MachineClass *mc = MACHINE_GET_CLASS(qdev_get_machine());
     Rom *rom;
-    int rc, fd = -1;
+    int fd = -1;
     char devpath[100];
 
     rom = g_malloc0(sizeof(*rom));
@@ -867,12 +868,9 @@ int rom_add_file(const char *file, const char *fw_dir,
     }
 
     rom->datasize = rom->romsize;
-    rom->data     = g_malloc0(rom->datasize);
-    lseek(fd, 0, SEEK_SET);
-    rc = read(fd, rom->data, rom->datasize);
-    if (rc != rom->datasize) {
-        fprintf(stderr, "rom: file %-20s: read error: rc=%d (expected %zd)\n",
-                rom->name, rc, rom->datasize);
+    rom->data     = mmap(NULL, rom->datasize, PROT_READ, MAP_SHARED, fd, 0);
+    if (rom->data == MAP_FAILED) {
+        fprintf(stderr, "rom: file %-20s: mmap error\n", rom->name);
         goto err;
     }
     close(fd);
@@ -915,7 +913,7 @@ err:
     if (fd != -1)
         close(fd);
 
-    g_free(rom->data);
+    qemu_anon_ram_munmap(rom->data, rom->romsize);
     g_free(rom->path);
     g_free(rom->name);
     if (fw_dir) {
@@ -1013,7 +1011,7 @@ static void rom_reset(void *unused)
         }
         if (rom->isrom) {
             /* rom needs to be written only once */
-            g_free(rom->data);
+            qemu_anon_ram_munmap(rom->data, rom->datasize);
             rom->data = NULL;
         }
         /*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 10/15] vmware_svga: use mmap for scratch pad
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (8 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 09/15] loader: use mmap for ROMs Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects Peter Lieven
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

the scratch pad is 256kB

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 hw/display/vmware_vga.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/display/vmware_vga.c b/hw/display/vmware_vga.c
index e51a05e..9942b2d 100644
--- a/hw/display/vmware_vga.c
+++ b/hw/display/vmware_vga.c
@@ -22,6 +22,7 @@
  * THE SOFTWARE.
  */
 #include "qemu/osdep.h"
+#include "qemu/mmap-alloc.h"
 #include "qapi/error.h"
 #include "hw/hw.h"
 #include "hw/loader.h"
@@ -1247,7 +1248,7 @@ static void vmsvga_init(DeviceState *dev, struct vmsvga_state_s *s,
                         MemoryRegion *address_space, MemoryRegion *io)
 {
     s->scratch_size = SVGA_SCRATCH_SIZE;
-    s->scratch = g_malloc(s->scratch_size * 4);
+    s->scratch = qemu_anon_ram_mmap(s->scratch_size * 4);
 
     s->vga.con = graphic_console_init(dev, 0, &vmsvga_ops, s);
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (9 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 10/15] vmware_svga: use mmap for scratch pad Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28 10:08   ` Daniel P. Berrange
                     ` (2 more replies)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 12/15] util: add a function to realloc mmapped memory Peter Lieven
                   ` (5 subsequent siblings)
  16 siblings, 3 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 include/qom/object.h |  1 +
 qom/object.c         | 20 +++++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/qom/object.h b/include/qom/object.h
index 2f8ac47..c612f3a 100644
--- a/include/qom/object.h
+++ b/include/qom/object.h
@@ -400,6 +400,7 @@ struct Object
     GHashTable *properties;
     uint32_t ref;
     Object *parent;
+    size_t instance_size;
 };
 
 /**
diff --git a/qom/object.c b/qom/object.c
index 9743ea4..203162b 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -15,6 +15,7 @@
 #include "qom/object.h"
 #include "qom/object_interfaces.h"
 #include "qemu/cutils.h"
+#include "qemu/mmap-alloc.h"
 #include "qapi/visitor.h"
 #include "qapi-visit.h"
 #include "qapi/string-input-visitor.h"
@@ -453,6 +454,12 @@ static void object_deinit(Object *obj, TypeImpl *type)
     }
 }
 
+static void object_munmap(void *opaque)
+{
+    Object *obj = opaque;
+    qemu_anon_ram_munmap(obj, obj->instance_size);
+}
+
 static void object_finalize(void *data)
 {
     Object *obj = data;
@@ -467,16 +474,23 @@ static void object_finalize(void *data)
     }
 }
 
+#define OBJECT_MMAP_THRESH 4096
+
 Object *object_new_with_type(Type type)
 {
     Object *obj;
 
     g_assert(type != NULL);
     type_initialize(type);
-
-    obj = g_malloc(type->instance_size);
+    if (type->instance_size < OBJECT_MMAP_THRESH) {
+        obj = g_malloc(type->instance_size);
+        obj->free = g_free;
+    } else {
+        obj = qemu_anon_ram_mmap(type->instance_size);
+        obj->free = object_munmap;
+    }
+    obj->instance_size = type->instance_size;
     object_initialize_with_type(obj, type->instance_size, type);
-    obj->free = g_free;
 
     return obj;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 12/15] util: add a function to realloc mmapped memory
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (10 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes Peter Lieven
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 include/qemu/mmap-alloc.h |  1 +
 util/mmap-alloc.c         | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index a457721..935a907 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -14,5 +14,6 @@ void qemu_ram_munmap(void *ptr, size_t size);
  * for g_malloc0 and friends. */
 void *qemu_anon_ram_mmap(size_t size);
 void qemu_anon_ram_munmap(void *ptr, size_t size);
+void *qemu_anon_ram_remap(void *old_ptr, size_t old_size, size_t new_size);
 
 #endif
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index c099858..5cbe1c5 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -124,3 +124,13 @@ void qemu_anon_ram_munmap(void *ptr, size_t size)
         munmap(ptr, size);
     }
 }
+
+void *qemu_anon_ram_remap(void *old_ptr, size_t old_size, size_t new_size)
+{
+    void *ptr = qemu_anon_ram_mmap(new_size);
+    if (old_ptr) {
+        memcpy(ptr, old_ptr, old_size);
+        qemu_anon_ram_munmap(old_ptr, old_size);
+    }
+    return ptr;
+}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (11 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 12/15] util: add a function to realloc mmapped memory Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28 10:43   ` Paolo Bonzini
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 14/15] vnc-tight: make the encoding palette static Peter Lieven
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

this was causing serious fragmentation in conjunction with the
subpages since RCU was introduced. The node space was allocated
at approx 32kB, then reallocated to approx 75kB, and this happened a few
hundred times at startup. And thanks to RCU the freeing was delayed.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 exec.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/exec.c b/exec.c
index 1b7be2a..b4bcf47 100644
--- a/exec.c
+++ b/exec.c
@@ -189,9 +189,11 @@ struct CPUAddressSpace {
 static void phys_map_node_reserve(PhysPageMap *map, unsigned nodes)
 {
     if (map->nodes_nb + nodes > map->nodes_nb_alloc) {
+        size_t old_size = map->nodes_nb_alloc * sizeof(Node);
         map->nodes_nb_alloc = MAX(map->nodes_nb_alloc * 2, 16);
         map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, map->nodes_nb + nodes);
-        map->nodes = g_renew(Node, map->nodes, map->nodes_nb_alloc);
+        map->nodes = qemu_anon_ram_remap(map->nodes, old_size,
+                                         sizeof(Node) * map->nodes_nb_alloc);
     }
 }
 
@@ -1162,7 +1164,7 @@ static void phys_sections_free(PhysPageMap *map)
         phys_section_destroy(section->mr);
     }
     g_free(map->sections);
-    g_free(map->nodes);
+    qemu_anon_ram_munmap(map->nodes, map->nodes_nb_alloc * sizeof(Node));
 }
 
 static void register_subpage(AddressSpaceDispatch *d, MemoryRegionSection *section)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 14/15] vnc-tight: make the encoding palette static
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (12 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28 11:12   ` Paolo Bonzini
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 15/15] vnc: use mmap for VncState Peter Lieven
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

for the calculation of the number of subcolors of each subrect a new palette was
allocated, memset to zero and then destroyed. Use a static palette for this instead.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 ui/vnc-enc-tight.c | 21 ++++++++++-----------
 ui/vnc.h           |  1 +
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/ui/vnc-enc-tight.c b/ui/vnc-enc-tight.c
index e5cba0e..d3a9cc5 100644
--- a/ui/vnc-enc-tight.c
+++ b/ui/vnc-enc-tight.c
@@ -349,7 +349,7 @@ tight_detect_smooth_image(VncState *vs, int w, int h)
     tight_fill_palette##bpp(VncState *vs, int x, int y,                 \
                             int max, size_t count,                      \
                             uint32_t *bg, uint32_t *fg,                 \
-                            VncPalette **palette) {                     \
+                            VncPalette *palette) {                      \
         uint##bpp##_t *data;                                            \
         uint##bpp##_t c0, c1, ci;                                       \
         int i, n0, n1;                                                  \
@@ -396,23 +396,23 @@ tight_detect_smooth_image(VncState *vs, int w, int h)
             return 0;                                                   \
         }                                                               \
                                                                         \
-        *palette = palette_new(max, bpp);                               \
-        palette_put(*palette, c0);                                      \
-        palette_put(*palette, c1);                                      \
-        palette_put(*palette, ci);                                      \
+        palette_init(palette, max, bpp);                                \
+        palette_put(palette, c0);                                       \
+        palette_put(palette, c1);                                       \
+        palette_put(palette, ci);                                       \
                                                                         \
         for (i++; i < count; i++) {                                     \
             if (data[i] == ci) {                                        \
                 continue;                                               \
             } else {                                                    \
                 ci = data[i];                                           \
-                if (!palette_put(*palette, (uint32_t)ci)) {             \
+                if (!palette_put(palette, (uint32_t)ci)) {              \
                     return 0;                                           \
                 }                                                       \
             }                                                           \
         }                                                               \
                                                                         \
-        return palette_size(*palette);                                  \
+        return palette_size(palette);                                   \
     }
 
 DEFINE_FILL_PALETTE_FUNCTION(8)
@@ -421,7 +421,7 @@ DEFINE_FILL_PALETTE_FUNCTION(32)
 
 static int tight_fill_palette(VncState *vs, int x, int y,
                               size_t count, uint32_t *bg, uint32_t *fg,
-                              VncPalette **palette)
+                              VncPalette *palette)
 {
     int max;
 
@@ -1459,7 +1459,7 @@ static int send_sub_rect_jpeg(VncState *vs, int x, int y, int w, int h,
 
 static int send_sub_rect(VncState *vs, int x, int y, int w, int h)
 {
-    VncPalette *palette = NULL;
+    VncPalette *palette = &vs->tight.palette;
     uint32_t bg = 0, fg = 0;
     int colors;
     int ret = 0;
@@ -1488,7 +1488,7 @@ static int send_sub_rect(VncState *vs, int x, int y, int w, int h)
     }
 #endif
 
-    colors = tight_fill_palette(vs, x, y, w * h, &bg, &fg, &palette);
+    colors = tight_fill_palette(vs, x, y, w * h, &bg, &fg, palette);
 
 #ifdef CONFIG_VNC_JPEG
     if (allow_jpeg && vs->tight.quality != (uint8_t)-1) {
@@ -1501,7 +1501,6 @@ static int send_sub_rect(VncState *vs, int x, int y, int w, int h)
     ret = send_sub_rect_nojpeg(vs, x, y, w, h, bg, fg, colors, palette);
 #endif
 
-    palette_destroy(palette);
     return ret;
 }
 
diff --git a/ui/vnc.h b/ui/vnc.h
index 6568bca..e24e2cc 100644
--- a/ui/vnc.h
+++ b/ui/vnc.h
@@ -201,6 +201,7 @@ typedef struct VncTight {
 #endif
     int levels[4];
     z_stream stream[4];
+    VncPalette palette;
 } VncTight;
 
 typedef struct VncHextile {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [Qemu-devel] [PATCH 15/15] vnc: use mmap for VncState
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (13 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 14/15] vnc-tight: make the encoding palette static Peter Lieven
@ 2016-06-28  9:01 ` Peter Lieven
  2016-06-28 11:37 ` [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Paolo Bonzini
  2016-10-12 21:18 ` Michael R. Hines
  16 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, mreitz, pbonzini, mst, dgilbert, peter.maydell, kraxel,
	Peter Lieven

the VncState is approx. 85kB

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 ui/vnc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/ui/vnc.c b/ui/vnc.c
index 95e4db7..bf87135 100644
--- a/ui/vnc.c
+++ b/ui/vnc.c
@@ -45,6 +45,7 @@
 #include "crypto/tlscredsx509.h"
 #include "qom/object_interfaces.h"
 #include "qemu/cutils.h"
+#include "qemu/mmap-alloc.h"
 
 #define VNC_REFRESH_INTERVAL_BASE GUI_REFRESH_INTERVAL_DEFAULT
 #define VNC_REFRESH_INTERVAL_INC  50
@@ -1234,7 +1235,7 @@ void vnc_disconnect_finish(VncState *vs)
     vs->ioc = NULL;
     object_unref(OBJECT(vs->sioc));
     vs->sioc = NULL;
-    g_free(vs);
+    qemu_anon_ram_munmap(vs, sizeof(VncState));
 }
 
 ssize_t vnc_client_io_error(VncState *vs, ssize_t ret, Error **errp)
@@ -2956,7 +2957,7 @@ static void vnc_refresh(DisplayChangeListener *dcl)
 static void vnc_connect(VncDisplay *vd, QIOChannelSocket *sioc,
                         bool skipauth, bool websocket)
 {
-    VncState *vs = g_new0(VncState, 1);
+    VncState *vs = qemu_anon_ram_mmap(sizeof(VncState));
     int i;
 
     vs->sioc = sioc;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor Peter Lieven
@ 2016-06-28  9:29   ` Dr. David Alan Gilbert
  2016-06-28  9:39     ` Peter Lieven
  2016-06-28 11:36   ` Paolo Bonzini
  2016-06-30 14:12   ` Markus Armbruster
  2 siblings, 1 reply; 78+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-28  9:29 UTC (permalink / raw)
  To: Peter Lieven
  Cc: qemu-devel, kwolf, mreitz, pbonzini, mst, peter.maydell, kraxel

* Peter Lieven (pl@kamp.de) wrote:
> this struct is approx 75kB

I wonder why it's so large.

The stack size in QmpInputVisitor; it's got a 1024 element stack
(QIV_STACK_SIZE) and I bet we never use anywhere near that.

But even then that's 1024 * a 3 pointer stack object, 24 bytes - 
I don't see where the rest of that 75kB comes from.
I'm a little wary about turning all these malloc's into mmap's
because we do seem to use things like input visitors for small
things; don't the cost of doing the mmap's add up in time
instead of space?

Dave


> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  qapi/qmp-input-visitor.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/qapi/qmp-input-visitor.c b/qapi/qmp-input-visitor.c
> index aea90a1..b6f5dfd 100644
> --- a/qapi/qmp-input-visitor.c
> +++ b/qapi/qmp-input-visitor.c
> @@ -17,6 +17,7 @@
>  #include "qapi/qmp-input-visitor.h"
>  #include "qapi/visitor-impl.h"
>  #include "qemu/queue.h"
> +#include "qemu/mmap-alloc.h"
>  #include "qemu-common.h"
>  #include "qapi/qmp/types.h"
>  #include "qapi/qmp/qerror.h"
> @@ -378,14 +379,14 @@ Visitor *qmp_input_get_visitor(QmpInputVisitor *v)
>  void qmp_input_visitor_cleanup(QmpInputVisitor *v)
>  {
>      qobject_decref(v->root);
> -    g_free(v);
> +    qemu_anon_ram_munmap(v, sizeof(*v));
>  }
>  
>  QmpInputVisitor *qmp_input_visitor_new(QObject *obj, bool strict)
>  {
>      QmpInputVisitor *v;
>  
> -    v = g_malloc0(sizeof(*v));
> +    v = qemu_anon_ram_mmap(sizeof(*v));
>  
>      v->visitor.type = VISITOR_INPUT;
>      v->visitor.start_struct = qmp_input_start_struct;
> -- 
> 1.9.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-06-28  9:29   ` Dr. David Alan Gilbert
@ 2016-06-28  9:39     ` Peter Lieven
  2016-06-28 10:10       ` Daniel P. Berrange
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28  9:39 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, kwolf, mreitz, pbonzini, mst, peter.maydell, kraxel

Am 28.06.2016 um 11:29 schrieb Dr. David Alan Gilbert:
> * Peter Lieven (pl@kamp.de) wrote:
>> this struct is approx 75kB
> I wonder why it's so large.
>
> The stack size in QmpInputVisitor; it's got a 1024 element stack
> (QIV_STACK_SIZE) and I bet we never use anywhere near that.
>
> But even then that's 1024 * a 3 pointer stack object, 24 bytes -
> I don't see where the rest of that 75kB comes from.

Sorry, I had a wrong size in mind. It's 24736 bytes. But that's
still larger than expected, right?

> I'm a little wary about turning all these malloc's into mmap's
> because we do seem to use things like input visitors for small
> things; don't the cost of doing the mmap's add up in time
> instead of space?

Sure, we should discuss this. The series should act as a base for
discussion.

None of the things I changed to mmap are continuously called
during VM runtime. So it most likely only happens at setup of the
vServer.

It seems that the PhysPageMap in exec.c had the worst impact.

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 01/15] coroutine-ucontext: mmap stack memory
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 01/15] coroutine-ucontext: mmap stack memory Peter Lieven
@ 2016-06-28 10:02   ` Peter Maydell
  2016-06-28 10:21     ` Peter Lieven
  2016-06-28 11:04   ` Paolo Bonzini
  1 sibling, 1 reply; 78+ messages in thread
From: Peter Maydell @ 2016-06-28 10:02 UTC (permalink / raw)
  To: Peter Lieven
  Cc: QEMU Developers, Kevin Wolf, Max Reitz, Paolo Bonzini,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Gerd Hoffmann

On 28 June 2016 at 10:01, Peter Lieven <pl@kamp.de> wrote:
> coroutine-ucontext currently allocates stack memory from the heap because on most systems the
> stack size lies below the threshold for mmapping memory. This patch forces mmapping
> of stacks to avoid large holes on the heap when a coroutine is deleted. It additionally
> allows us to add a guard page at the bottom of the stack to avoid overflows.
>
> Suggested-by: Peter Maydell <peter.maydell@linaro.org>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  util/coroutine-ucontext.c | 26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/util/coroutine-ucontext.c b/util/coroutine-ucontext.c
> index 2bb7e10..841e7db 100644
> --- a/util/coroutine-ucontext.c
> +++ b/util/coroutine-ucontext.c
> @@ -80,9 +80,10 @@ static void coroutine_trampoline(int i0, int i1)
>      }
>  }
>
> +#define COROUTINE_STACK_SIZE (1 << 20)
> +
>  Coroutine *qemu_coroutine_new(void)
>  {
> -    const size_t stack_size = 1 << 20;
>      CoroutineUContext *co;
>      ucontext_t old_uc, uc;
>      sigjmp_buf old_env;
> @@ -101,17 +102,32 @@ Coroutine *qemu_coroutine_new(void)
>      }
>
>      co = g_malloc0(sizeof(*co));
> +
> +#ifdef MAP_GROWSDOWN
> +    co->stack = mmap(NULL, COROUTINE_STACK_SIZE, PROT_READ | PROT_WRITE,
> +                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
> +    if (co->stack == MAP_FAILED) {
> +        abort();
> +    }
> +    /* add a guard page at bottom of the stack */
> +    if (mmap(co->stack, getpagesize(), PROT_NONE,
> +        MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0) == MAP_FAILED) {
> +        abort();
> +    }
> +#else
>      co->stack = g_malloc(COROUTINE_STACK_SIZE);

I would just mmap() always; then we get the benefit of the
guard page even if there's no MAP_GROWSDOWN.

Also, does MAP_GROWSDOWN help with the RSS issues? I
noticed that glibc itself doesn't use it for pthread
stacks as far as I can tell, so maybe it's obsolete?
(Ulrich Drepper apparently thought so in 2008:
https://lwn.net/Articles/294001/ )

> +#endif

Can we abstract this out into an alloc/dealloc function, please?

/**
 * qemu_alloc_stack:
 * @sz: size of required stack in bytes
 *
 * Allocate memory that can be used as a stack, for instance for
 * coroutines. If the memory cannot be allocated, this function
 * will abort (like g_malloc()). The allocated stack should be
 * freed with qemu_free_stack().
 *
 * Returns: pointer to (the lowest address of) the stack memory.
 */
void *qemu_alloc_stack(size_t sz);

/**
 * qemu_free_stack:
 * @stack: stack to free
 *
 * Free a stack allocated via qemu_alloc_stack().
 */
void qemu_free_stack(void *stack);

util/coroutine-sigaltstack.c can then use the same function
for stack allocation.

I would put the implementation in util/oslib-posix.c and the
header in include/sysemu/os-posix.h, unless somebody has a
better idea.
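
A minimal sketch of what the implementation could look like (untested, and the
free function takes the size explicitly here, which differs from the prototype
above):

void *qemu_alloc_stack(size_t sz)
{
    /* reserve one extra page and turn it into a guard page at the lowest
     * address, since the stack grows downwards on the hosts we support */
    char *ptr = mmap(NULL, sz + getpagesize(), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED) {
        abort();
    }
    if (mprotect(ptr, getpagesize(), PROT_NONE) != 0) {
        abort();
    }
    return ptr + getpagesize();
}

void qemu_free_stack(void *stack, size_t sz)
{
    munmap((char *)stack - getpagesize(), sz + getpagesize());
}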

> +
>      co->base.entry_arg = &old_env; /* stash away our jmp_buf */
>
>      uc.uc_link = &old_uc;
>      uc.uc_stack.ss_sp = co->stack;
> -    uc.uc_stack.ss_size = stack_size;
> +    uc.uc_stack.ss_size = COROUTINE_STACK_SIZE;

Because of the guard page, your code above isn't actually
allocating this much stack.

>      uc.uc_stack.ss_flags = 0;
>
>  #ifdef CONFIG_VALGRIND_H
>      co->valgrind_stack_id =
> -        VALGRIND_STACK_REGISTER(co->stack, co->stack + stack_size);
> +        VALGRIND_STACK_REGISTER(co->stack, co->stack + COROUTINE_STACK_SIZE);
>  #endif
>
>      arg.p = co;
> @@ -149,7 +165,11 @@ void qemu_coroutine_delete(Coroutine *co_)
>      valgrind_stack_deregister(co);
>  #endif
>
> +#ifdef MAP_GROWSDOWN
> +    munmap(co->stack, COROUTINE_STACK_SIZE);
> +#else
>      g_free(co->stack);
> +#endif
>      g_free(co);
>  }
>
> --
> 1.9.1

thanks
-- PMM

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects Peter Lieven
@ 2016-06-28 10:08   ` Daniel P. Berrange
  2016-06-28 10:10   ` Peter Maydell
  2016-06-28 10:42   ` Paolo Bonzini
  2 siblings, 0 replies; 78+ messages in thread
From: Daniel P. Berrange @ 2016-06-28 10:08 UTC (permalink / raw)
  To: Peter Lieven
  Cc: qemu-devel, kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel,
	pbonzini

On Tue, Jun 28, 2016 at 11:01:35AM +0200, Peter Lieven wrote:
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  include/qom/object.h |  1 +
>  qom/object.c         | 20 +++++++++++++++++---
>  2 files changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/include/qom/object.h b/include/qom/object.h
> index 2f8ac47..c612f3a 100644
> --- a/include/qom/object.h
> +++ b/include/qom/object.h
> @@ -400,6 +400,7 @@ struct Object
>      GHashTable *properties;
>      uint32_t ref;
>      Object *parent;
> +    size_t instance_size;

This is not required....

>  };
>  
>  /**
> diff --git a/qom/object.c b/qom/object.c
> index 9743ea4..203162b 100644
> --- a/qom/object.c
> +++ b/qom/object.c
> @@ -15,6 +15,7 @@
>  #include "qom/object.h"
>  #include "qom/object_interfaces.h"
>  #include "qemu/cutils.h"
> +#include "qemu/mmap-alloc.h"
>  #include "qapi/visitor.h"
>  #include "qapi-visit.h"
>  #include "qapi/string-input-visitor.h"
> @@ -453,6 +454,12 @@ static void object_deinit(Object *obj, TypeImpl *type)
>      }
>  }
>  
> +static void object_munmap(void *opaque)
> +{
> +    Object *obj = opaque;
> +    qemu_anon_ram_munmap(obj, obj->instance_size);

...since you have an Object pointer, you can get the corresponding
Type,  obj->class->type, and thus directly access type->instance_size
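
For instance, something like this (sketch only, assuming object_munmap
stays in qom/object.c where TypeImpl is visible):

static void object_munmap(void *opaque)
{
    Object *obj = opaque;

    qemu_anon_ram_munmap(obj, obj->class->type->instance_size);
}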

> +}
> +
>  static void object_finalize(void *data)
>  {
>      Object *obj = data;
> @@ -467,16 +474,23 @@ static void object_finalize(void *data)
>      }
>  }
>  
> +#define OBJECT_MMAP_THRESH 4096
> +
>  Object *object_new_with_type(Type type)
>  {
>      Object *obj;
>  
>      g_assert(type != NULL);
>      type_initialize(type);
> -
> -    obj = g_malloc(type->instance_size);
> +    if (type->instance_size < OBJECT_MMAP_THRESH) {
> +        obj = g_malloc(type->instance_size);
> +        obj->free = g_free;
> +    } else {
> +        obj = qemu_anon_ram_mmap(type->instance_size);
> +        obj->free = object_munmap;
> +    }
> +    obj->instance_size = type->instance_size;
>      object_initialize_with_type(obj, type->instance_size, type);
> -    obj->free = g_free;
>  
>      return obj;


Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects Peter Lieven
  2016-06-28 10:08   ` Daniel P. Berrange
@ 2016-06-28 10:10   ` Peter Maydell
  2016-06-28 10:19     ` Peter Lieven
  2016-06-28 10:42   ` Paolo Bonzini
  2 siblings, 1 reply; 78+ messages in thread
From: Peter Maydell @ 2016-06-28 10:10 UTC (permalink / raw)
  To: Peter Lieven
  Cc: QEMU Developers, Kevin Wolf, Max Reitz, Paolo Bonzini,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Gerd Hoffmann

On 28 June 2016 at 10:01, Peter Lieven <pl@kamp.de> wrote:
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  include/qom/object.h |  1 +
>  qom/object.c         | 20 +++++++++++++++++---
>  2 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/include/qom/object.h b/include/qom/object.h
> index 2f8ac47..c612f3a 100644
> --- a/include/qom/object.h
> +++ b/include/qom/object.h
> @@ -400,6 +400,7 @@ struct Object
>      GHashTable *properties;
>      uint32_t ref;
>      Object *parent;
> +    size_t instance_size;
>  };
>
>  /**
> diff --git a/qom/object.c b/qom/object.c
> index 9743ea4..203162b 100644
> --- a/qom/object.c
> +++ b/qom/object.c
> @@ -15,6 +15,7 @@
>  #include "qom/object.h"
>  #include "qom/object_interfaces.h"
>  #include "qemu/cutils.h"
> +#include "qemu/mmap-alloc.h"
>  #include "qapi/visitor.h"
>  #include "qapi-visit.h"
>  #include "qapi/string-input-visitor.h"
> @@ -453,6 +454,12 @@ static void object_deinit(Object *obj, TypeImpl *type)
>      }
>  }
>
> +static void object_munmap(void *opaque)
> +{
> +    Object *obj = opaque;
> +    qemu_anon_ram_munmap(obj, obj->instance_size);
> +}
> +
>  static void object_finalize(void *data)
>  {
>      Object *obj = data;
> @@ -467,16 +474,23 @@ static void object_finalize(void *data)
>      }
>  }
>
> +#define OBJECT_MMAP_THRESH 4096
> +
>  Object *object_new_with_type(Type type)
>  {
>      Object *obj;
>
>      g_assert(type != NULL);
>      type_initialize(type);
> -
> -    obj = g_malloc(type->instance_size);
> +    if (type->instance_size < OBJECT_MMAP_THRESH) {
> +        obj = g_malloc(type->instance_size);
> +        obj->free = g_free;
> +    } else {
> +        obj = qemu_anon_ram_mmap(type->instance_size);
> +        obj->free = object_munmap;
> +    }
> +    obj->instance_size = type->instance_size;
>      object_initialize_with_type(obj, type->instance_size, type);
> -    obj->free = g_free;

This one I disagree with. We should trust the platform malloc
implementation (or g_malloc() in this case), or get it fixed
if it is broken somehow.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-06-28  9:39     ` Peter Lieven
@ 2016-06-28 10:10       ` Daniel P. Berrange
  2016-06-28 10:17         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 78+ messages in thread
From: Daniel P. Berrange @ 2016-06-28 10:10 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Dr. David Alan Gilbert, kwolf, peter.maydell, mst, qemu-devel,
	mreitz, kraxel, pbonzini

On Tue, Jun 28, 2016 at 11:39:03AM +0200, Peter Lieven wrote:
> Am 28.06.2016 um 11:29 schrieb Dr. David Alan Gilbert:
> > * Peter Lieven (pl@kamp.de) wrote:
> > > this struct is approx 75kB
> > I wonder why it's so large.
> > 
> > The stack size in QmpInputVisitor; it's got a 1024 element stack
> > (QIV_STACK_SIZE) and I bet we never use anywhere near that.
> > 
> > But even then that's 1024 * a 3 pointer stack object, 24 bytes -
> > I don't see where the rest of that 75kB comes from.
> 
> Sorry, I had a wrong size in mind. It's 24736 bytes. But that's
> still larger than expected, right?
> 
> > I'm a little wary about turning all these malloc's into mmap's
> > because we do seem to use things like input visitors for small
> > things; don't the cost of doing the mmap's add up in time
> > instead of space?
> 
> Sure, we should discuss this. The series should act as a base for
> discussion.
> 
> None of the things I changed into mmap are continuously called
> during VM runtime. So it most likely only happens at setup of the
> vServer.

QmpInputVisitor is used to parse all QMP monitor commands, so will
be used continuously throughout the life of QEMU, often very frequently,
e.g. when migration is running many monitor commands per second are
expected.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-06-28 10:10       ` Daniel P. Berrange
@ 2016-06-28 10:17         ` Dr. David Alan Gilbert
  2016-06-28 10:21           ` Daniel P. Berrange
  2016-06-28 14:10           ` Eric Blake
  0 siblings, 2 replies; 78+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-28 10:17 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Peter Lieven, kwolf, peter.maydell, mst, qemu-devel, mreitz,
	kraxel, pbonzini

* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Tue, Jun 28, 2016 at 11:39:03AM +0200, Peter Lieven wrote:
> > Am 28.06.2016 um 11:29 schrieb Dr. David Alan Gilbert:
> > > * Peter Lieven (pl@kamp.de) wrote:
> > > > this struct is approx 75kB
> > > I wonder why it's so large.
> > > 
> > > The stack size in QmpInputVisitor; it's got a 1024 element stack
> > > (QIV_STACK_SIZE) and I bet we never use anywhere near that.
> > > 
> > > But even then that's 1024 * a 3 pointer stack object, 24 bytes -
> > > I don't see where the rest of that 75kB comes from.
> > 
> > Sorry, I had a wrong size in mind. It's 24736 bytes. But that's
> > still larger than expected, right?
> > 
> > > I'm a little wary about turning all these malloc's into mmap's
> > > because we do seem to use things like input visitors for small
> > > things; don't the cost of doing the mmap's add up in time
> > > instead of space?
> > 
> > Sure, we should discuss this. The series should act as a base for
> > discussion.
> > 
> > None of the things I changed into mmap are continuously called
> > during VM runtime. So it most likely only happens at setup of the
> > vServer.
> 
> QmpInputVisitor is used to parse all QMP monitor commands, so will
> be used continuously throughout life of QEMU, often very frequently.
> eg When migration is running many monitor commands per second are
> expected

Does the same input visitor get reused by each command?

Dave

> 
> Regards,
> Daniel
> -- 
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects
  2016-06-28 10:10   ` Peter Maydell
@ 2016-06-28 10:19     ` Peter Lieven
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 10:19 UTC (permalink / raw)
  To: Peter Maydell
  Cc: QEMU Developers, Kevin Wolf, Max Reitz, Paolo Bonzini,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Gerd Hoffmann

Am 28.06.2016 um 12:10 schrieb Peter Maydell:
> On 28 June 2016 at 10:01, Peter Lieven <pl@kamp.de> wrote:
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>>   include/qom/object.h |  1 +
>>   qom/object.c         | 20 +++++++++++++++++---
>>   2 files changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/qom/object.h b/include/qom/object.h
>> index 2f8ac47..c612f3a 100644
>> --- a/include/qom/object.h
>> +++ b/include/qom/object.h
>> @@ -400,6 +400,7 @@ struct Object
>>       GHashTable *properties;
>>       uint32_t ref;
>>       Object *parent;
>> +    size_t instance_size;
>>   };
>>
>>   /**
>> diff --git a/qom/object.c b/qom/object.c
>> index 9743ea4..203162b 100644
>> --- a/qom/object.c
>> +++ b/qom/object.c
>> @@ -15,6 +15,7 @@
>>   #include "qom/object.h"
>>   #include "qom/object_interfaces.h"
>>   #include "qemu/cutils.h"
>> +#include "qemu/mmap-alloc.h"
>>   #include "qapi/visitor.h"
>>   #include "qapi-visit.h"
>>   #include "qapi/string-input-visitor.h"
>> @@ -453,6 +454,12 @@ static void object_deinit(Object *obj, TypeImpl *type)
>>       }
>>   }
>>
>> +static void object_munmap(void *opaque)
>> +{
>> +    Object *obj = opaque;
>> +    qemu_anon_ram_munmap(obj, obj->instance_size);
>> +}
>> +
>>   static void object_finalize(void *data)
>>   {
>>       Object *obj = data;
>> @@ -467,16 +474,23 @@ static void object_finalize(void *data)
>>       }
>>   }
>>
>> +#define OBJECT_MMAP_THRESH 4096
>> +
>>   Object *object_new_with_type(Type type)
>>   {
>>       Object *obj;
>>
>>       g_assert(type != NULL);
>>       type_initialize(type);
>> -
>> -    obj = g_malloc(type->instance_size);
>> +    if (type->instance_size < OBJECT_MMAP_THRESH) {
>> +        obj = g_malloc(type->instance_size);
>> +        obj->free = g_free;
>> +    } else {
>> +        obj = qemu_anon_ram_mmap(type->instance_size);
>> +        obj->free = object_munmap;
>> +    }
>> +    obj->instance_size = type->instance_size;
>>       object_initialize_with_type(obj, type->instance_size, type);
>> -    obj->free = g_free;
> This one I disagree with. We should trust the platform malloc
> implementation (or g_malloc() in this case), or get it fixed
> if it is broken somehow.

The ptmalloc implementation can only return memory back to the
kernel if there is unallocated memory at the end of the heap. Holes
in between cannot be returned.

But I agree, I would appreciate it if malloc handled this.

Some Objects we allocate are very large. E.g. some are 70kB as far as
I remember.
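
(A related workaround, not part of this series, might be to lower glibc's
mmap threshold so that such large allocations are mmapped by malloc itself,
e.g. early in main():

#include <malloc.h>

mallopt(M_MMAP_THRESHOLD, 32 * 1024);   /* allocations >= 32kB go via mmap */

at the cost of more mmap/munmap syscalls for large short-lived allocations.)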

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 01/15] coroutine-ucontext: mmap stack memory
  2016-06-28 10:02   ` Peter Maydell
@ 2016-06-28 10:21     ` Peter Lieven
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 10:21 UTC (permalink / raw)
  To: Peter Maydell
  Cc: QEMU Developers, Kevin Wolf, Max Reitz, Paolo Bonzini,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Gerd Hoffmann

Am 28.06.2016 um 12:02 schrieb Peter Maydell:
> On 28 June 2016 at 10:01, Peter Lieven <pl@kamp.de> wrote:
>> coroutine-ucontext currently allocates stack memory from the heap as on most systems the
>> stack size lies below the threshold for mmapping memory. This patch forces mmapping
>> of stacks to avoid large holes on the heap when a coroutine is deleted. It additionally
>> allows us to add a guard page at the bottom of the stack to avoid overflows.
>>
>> Suggested-by: Peter Maydell <peter.maydell@linaro.org>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>>   util/coroutine-ucontext.c | 26 +++++++++++++++++++++++---
>>   1 file changed, 23 insertions(+), 3 deletions(-)
>>
>> diff --git a/util/coroutine-ucontext.c b/util/coroutine-ucontext.c
>> index 2bb7e10..841e7db 100644
>> --- a/util/coroutine-ucontext.c
>> +++ b/util/coroutine-ucontext.c
>> @@ -80,9 +80,10 @@ static void coroutine_trampoline(int i0, int i1)
>>       }
>>   }
>>
>> +#define COROUTINE_STACK_SIZE (1 << 20)
>> +
>>   Coroutine *qemu_coroutine_new(void)
>>   {
>> -    const size_t stack_size = 1 << 20;
>>       CoroutineUContext *co;
>>       ucontext_t old_uc, uc;
>>       sigjmp_buf old_env;
>> @@ -101,17 +102,32 @@ Coroutine *qemu_coroutine_new(void)
>>       }
>>
>>       co = g_malloc0(sizeof(*co));
>> +
>> +#ifdef MAP_GROWSDOWN
>> +    co->stack = mmap(NULL, COROUTINE_STACK_SIZE, PROT_READ | PROT_WRITE,
>> +                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
>> +    if (co->stack == MAP_FAILED) {
>> +        abort();
>> +    }
>> +    /* add a guard page at bottom of the stack */
>> +    if (mmap(co->stack, getpagesize(), PROT_NONE,
>> +        MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0) == MAP_FAILED) {
>> +        abort();
>> +    }
>> +#else
>>       co->stack = g_malloc(stack_size);
> I would just mmap() always; then we get the benefit of the
> guard page even if there's no MAP_GROWSDOWN.
>
> Also, does MAP_GROWSDOWN help with the RSS issues? I
> noticed that glibc itself doesn't use it for pthread
> stacks as far as I can tell, so maybe it's obsolete?
> (Ulrich Drepper apparently thought so in 2008:
> https://lwn.net/Articles/294001/ )

I have seen this thread. MAP_GROWSDOWN does not seem to help
at all. Only reducing the stack size does.

>
>> +#endif
> Can we abstract this out into an alloc/dealloc function, please?
>
> /**
>   * qemu_alloc_stack:
>   * @sz: size of required stack in bytes
>   *
>   * Allocate memory that can be used as a stack, for instance for
>   * coroutines. If the memory cannot be allocated, this function
>   * will abort (like g_malloc()). The allocated stack should be
>   * freed with qemu_free_stack().
>   *
>   * Returns: pointer to (the lowest address of) the stack memory.
>   */
> void *qemu_alloc_stack(size_t sz);
>
> /**
>   * qemu_free_stack:
>   * @stack: stack to free
>   *
>   * Free a stack allocated via qemu_alloc_stack().
>   */
> void qemu_free_stack(void *stack);

We also need to pass the size for munmap.

>
> util/coroutine-sigaltstack.c can then use the same function
> for stack allocation.
>
> I would put the implementation in util/oslib-posix.c and the
> header in include/sysemu/os-posix.h, unless somebody has a
> better idea.
>
>> +
>>       co->base.entry_arg = &old_env; /* stash away our jmp_buf */
>>
>>       uc.uc_link = &old_uc;
>>       uc.uc_stack.ss_sp = co->stack;
>> -    uc.uc_stack.ss_size = stack_size;
>> +    uc.uc_stack.ss_size = COROUTINE_STACK_SIZE;
> Because of the guard page, your code above isn't actually
> allocating this much stack.

oh yes, you are right.

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-06-28 10:17         ` Dr. David Alan Gilbert
@ 2016-06-28 10:21           ` Daniel P. Berrange
  2016-06-28 14:10           ` Eric Blake
  1 sibling, 0 replies; 78+ messages in thread
From: Daniel P. Berrange @ 2016-06-28 10:21 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Lieven, kwolf, peter.maydell, mst, qemu-devel, mreitz,
	kraxel, pbonzini

On Tue, Jun 28, 2016 at 11:17:59AM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > On Tue, Jun 28, 2016 at 11:39:03AM +0200, Peter Lieven wrote:
> > > Am 28.06.2016 um 11:29 schrieb Dr. David Alan Gilbert:
> > > > * Peter Lieven (pl@kamp.de) wrote:
> > > > > this struct is approx 75kB
> > > > I wonder why it's so large.
> > > > 
> > > > The stack size in QmpInputVisitor; it's got a 1024 element stack
> > > > (QIV_STACK_SIZE) and I bet we never use anywhere near that.
> > > > 
> > > > But even then that's 1024 * a 3 pointer stack object, 24 bytes -
> > > > I don't see where the rest of that 75kB comes from.
> > > 
> > > Sorry, I had a wrong size in mind. It's 24736 bytes. But that's
> > > still larger than expected, right?
> > > 
> > > > I'm a little wary about turning all these malloc's into mmap's
> > > > because we do seem to use things like input visitors for small
> > > > things; don't the cost of doing the mmap's add up in time
> > > > instead of space?
> > > 
> > > Sure, we should discuss this. The series should act as a base for
> > > discussion.
> > > 
> > > None of the things I changed into mmap are continuously called
> > > during VM runtime. So it most likely only happens at setup of the
> > > vServer.
> > 
> > QmpInputVisitor is used to parse all QMP monitor commands, so will
> > be used continuously throughout life of QEMU, often very frequently.
> > eg When migration is running many monitor commands per second are
> > expected
> 
> Does the same input visitor get reused by each command?

Oops, actually no, I'm wrong, it's only used by the 'object_add' command
but that does allocate a new one for each invocation.


Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 04/15] coroutine: add a knob to disable the shared release pool
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 04/15] coroutine: add a knob to disable the shared release pool Peter Lieven
@ 2016-06-28 10:41   ` Paolo Bonzini
  2016-06-28 10:47     ` Peter Lieven
  0 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 10:41 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 28/06/2016 11:01, Peter Lieven wrote:
> the current coroutine freelist implementation has 2 kinds of pools.
> One shared release pool between all threads and additionally one
> allocation pool per thread. The release pool is especially necessary
> if the coroutine is created in a different thread than it is released.
> This is e.g. the case if an IDE interface is used.
> 
> But in times of virtio and dataplane the release pool adds costs
> which are not entirely necessary. At first if virtio is used the release
> pool tends to fill up to 100% because all coroutines are first handed
> back to the release pool. On coroutine create a thread can steal this
> release pool and make it its local allocation pool, but during mixed
> I/O pattern at the end the release pool is full of useless coroutines
> and the alloc_pool has also filled to maximum size.

The cost is 2^16 bytes * 2^6 coroutines, that is 4 MB.  I don't think
this is worth an extra knob that no one will use...

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 09/15] loader: use mmap for ROMs
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 09/15] loader: use mmap for ROMs Peter Lieven
@ 2016-06-28 10:41   ` Paolo Bonzini
  2016-06-28 11:26     ` Peter Lieven
  2016-07-04  7:30     ` Peter Lieven
  0 siblings, 2 replies; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 10:41 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 28/06/2016 11:01, Peter Lieven wrote:
> a classic use for mmap here.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>

They are never freed, why does mmap help?

Paolo

> ---
>  hw/core/loader.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/core/loader.c b/hw/core/loader.c
> index 53e0e41..f217edc 100644
> --- a/hw/core/loader.c
> +++ b/hw/core/loader.c
> @@ -55,6 +55,7 @@
>  #include "exec/address-spaces.h"
>  #include "hw/boards.h"
>  #include "qemu/cutils.h"
> +#include "qemu/mmap-alloc.h"
>  
>  #include <zlib.h>
>  
> @@ -837,7 +838,7 @@ int rom_add_file(const char *file, const char *fw_dir,
>  {
>      MachineClass *mc = MACHINE_GET_CLASS(qdev_get_machine());
>      Rom *rom;
> -    int rc, fd = -1;
> +    int fd = -1;
>      char devpath[100];
>  
>      rom = g_malloc0(sizeof(*rom));
> @@ -867,12 +868,9 @@ int rom_add_file(const char *file, const char *fw_dir,
>      }
>  
>      rom->datasize = rom->romsize;
> -    rom->data     = g_malloc0(rom->datasize);
> -    lseek(fd, 0, SEEK_SET);
> -    rc = read(fd, rom->data, rom->datasize);
> -    if (rc != rom->datasize) {
> -        fprintf(stderr, "rom: file %-20s: read error: rc=%d (expected %zd)\n",
> -                rom->name, rc, rom->datasize);
> +    rom->data     = mmap(NULL, rom->datasize, PROT_READ, MAP_SHARED, fd, 0);
> +    if (rom->data == MAP_FAILED) {
> +        fprintf(stderr, "rom: file %-20s: mmap error\n", rom->name);
>          goto err;
>      }
>      close(fd);
> @@ -915,7 +913,7 @@ err:
>      if (fd != -1)
>          close(fd);
>  
> -    g_free(rom->data);
> +    qemu_anon_ram_munmap(rom->data, rom->romsize);
>      g_free(rom->path);
>      g_free(rom->name);
>      if (fw_dir) {
> @@ -1013,7 +1011,7 @@ static void rom_reset(void *unused)
>          }
>          if (rom->isrom) {
>              /* rom needs to be written only once */
> -            g_free(rom->data);
> +            qemu_anon_ram_munmap(rom->data, rom->datasize);
>              rom->data = NULL;
>          }
>          /*
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects Peter Lieven
  2016-06-28 10:08   ` Daniel P. Berrange
  2016-06-28 10:10   ` Peter Maydell
@ 2016-06-28 10:42   ` Paolo Bonzini
  2016-06-28 10:49     ` Peter Lieven
  2 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 10:42 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 28/06/2016 11:01, Peter Lieven wrote:
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  include/qom/object.h |  1 +
>  qom/object.c         | 20 +++++++++++++++++---
>  2 files changed, 18 insertions(+), 3 deletions(-)

No, please---glibc should be fixed instead.

Paolo

> diff --git a/include/qom/object.h b/include/qom/object.h
> index 2f8ac47..c612f3a 100644
> --- a/include/qom/object.h
> +++ b/include/qom/object.h
> @@ -400,6 +400,7 @@ struct Object
>      GHashTable *properties;
>      uint32_t ref;
>      Object *parent;
> +    size_t instance_size;
>  };
>  
>  /**
> diff --git a/qom/object.c b/qom/object.c
> index 9743ea4..203162b 100644
> --- a/qom/object.c
> +++ b/qom/object.c
> @@ -15,6 +15,7 @@
>  #include "qom/object.h"
>  #include "qom/object_interfaces.h"
>  #include "qemu/cutils.h"
> +#include "qemu/mmap-alloc.h"
>  #include "qapi/visitor.h"
>  #include "qapi-visit.h"
>  #include "qapi/string-input-visitor.h"
> @@ -453,6 +454,12 @@ static void object_deinit(Object *obj, TypeImpl *type)
>      }
>  }
>  
> +static void object_munmap(void *opaque)
> +{
> +    Object *obj = opaque;
> +    qemu_anon_ram_munmap(obj, obj->instance_size);
> +}
> +
>  static void object_finalize(void *data)
>  {
>      Object *obj = data;
> @@ -467,16 +474,23 @@ static void object_finalize(void *data)
>      }
>  }
>  
> +#define OBJECT_MMAP_THRESH 4096
> +
>  Object *object_new_with_type(Type type)
>  {
>      Object *obj;
>  
>      g_assert(type != NULL);
>      type_initialize(type);
> -
> -    obj = g_malloc(type->instance_size);
> +    if (type->instance_size < OBJECT_MMAP_THRESH) {
> +        obj = g_malloc(type->instance_size);
> +        obj->free = g_free;
> +    } else {
> +        obj = qemu_anon_ram_mmap(type->instance_size);
> +        obj->free = object_munmap;
> +    }
> +    obj->instance_size = type->instance_size;
>      object_initialize_with_type(obj, type->instance_size, type);
> -    obj->free = g_free;
>  
>      return obj;
>  }
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes Peter Lieven
@ 2016-06-28 10:43   ` Paolo Bonzini
  2016-06-28 10:48     ` Peter Lieven
  2016-07-11  9:31     ` Peter Lieven
  0 siblings, 2 replies; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 10:43 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 28/06/2016 11:01, Peter Lieven wrote:
> this was causing serious fragmentation in conjunction with the
> subpages since RCU was introduced. The node space was allocated
> at approx 32kB then reallocated to approx 75kB and this happened a few hundred
> times at startup. And thanks to RCU the freeing was delayed.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>

The size of the node from the previous as->dispatch could be used as a
hint for the new one perhaps, avoiding the reallocation?
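
Roughly something like this (sketch only, hypothetical helper name, field
names as in exec.c from memory):

static void phys_map_init(PhysPageMap *map, PhysPageMap *prev)
{
    /* start out with as many nodes as the previous dispatch ended up
     * using, so the array is not grown (and reallocated) step by step
     * again on every rebuild */
    unsigned hint = prev ? prev->nodes_nb : 0;

    map->nodes_nb_alloc = hint;
    map->nodes = hint ? g_new(Node, hint) : NULL;
    map->nodes_nb = 0;
}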

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 04/15] coroutine: add a knob to disable the shared release pool
  2016-06-28 10:41   ` Paolo Bonzini
@ 2016-06-28 10:47     ` Peter Lieven
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 10:47 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 28.06.2016 um 12:41 schrieb Paolo Bonzini:
>
> On 28/06/2016 11:01, Peter Lieven wrote:
>> the current coroutine freelist implementation has 2 kinds of pools.
>> One shared release pool between all threads and additionally one
>> allocation pool per thread. The release pool is especially necessary
>> if the coroutine is created in a different thread than it is released.
>> This is e.g. the case if an IDE interface is used.
>>
>> But in times of virtio and dataplane the release pool adds costs
>> which are not entirely necessary. At first if virtio is used the release
>> pool tends to fill up to 100% because all coroutines are first handed
>> back to the release pool. On coroutine create a thread can steal this
>> release pool and make it its local allocation pool, but during mixed
>> I/O pattern at the end the release pool is full of useless coroutines
>> and the alloc_pool has also filled to maximum size.
> The cost is 2^16 bytes * 2^6 coroutines, that is 4 MB.  I don't think
> this is worth an extra knob that no one will use...

You are right, I had this patch prior to reducing the stack size.
In fact it's 2 * 2^6 coroutines. But maybe it's worth thinking about making
the release_pool the same max size as the alloc_pool? Otherwise when
the release_pool is stolen it could make the alloc_pool also that big.

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 06/15] exec: use mmap for subpages
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 06/15] exec: use mmap for subpages Peter Lieven
@ 2016-06-28 10:48   ` Paolo Bonzini
  0 siblings, 0 replies; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 10:48 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 28/06/2016 11:01, Peter Lieven wrote:
> a lot of subpages are created and freed at startup, but RCU delays
> the freeing so the heap gets fragmented.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>

I agree that subpages are bad for malloc because they are large (>
4KiB).  It is worth doing something special about them.  However, on
32-bit systems mmap-ing them has the same risk of fragmenting the
process address space, as malloc has of fragmenting the brk heap.

Allocation and freeing of subpages always happens under the BQL, so
perhaps a simple freelist is better?  Another interesting (but harder)
possibility could be to build the radix tree lazily.
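
A BQL-protected freelist could be as simple as this (sketch only; assumes
a new freelist_next pointer is added to subpage_t, and that the recycle
call replaces g_free() in the RCU reclaim path):

static subpage_t *subpage_freelist;            /* protected by the BQL */

static subpage_t *subpage_alloc(void)
{
    subpage_t *sp = subpage_freelist;

    if (sp) {
        subpage_freelist = sp->freelist_next;
        memset(sp, 0, sizeof(*sp));
    } else {
        sp = g_malloc0(sizeof(*sp));
    }
    return sp;
}

static void subpage_recycle(subpage_t *sp)
{
    sp->freelist_next = subpage_freelist;
    subpage_freelist = sp;
}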

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes
  2016-06-28 10:43   ` Paolo Bonzini
@ 2016-06-28 10:48     ` Peter Lieven
  2016-07-11  9:31     ` Peter Lieven
  1 sibling, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 10:48 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 28.06.2016 um 12:43 schrieb Paolo Bonzini:
>
> On 28/06/2016 11:01, Peter Lieven wrote:
>> this was causing serious fragmentation in conjunction with the
>> subpages since RCU was introduced. The node space was allocated
>> at approx 32kB then reallocated to approx 75kB and this happened a few hundred
>> times at startup. And thanks to RCU the freeing was delayed.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> The size of the node from the previous as->dispatch could be used as a
> hint for the new one perhaps, avoiding the reallocation?

I will figure that out. But still all the PhysPageMaps are allocated before
they are freed. Or are you fine with using mmap here?

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects
  2016-06-28 10:42   ` Paolo Bonzini
@ 2016-06-28 10:49     ` Peter Lieven
  2016-06-30 14:15       ` Markus Armbruster
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 10:49 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 28.06.2016 um 12:42 schrieb Paolo Bonzini:
>
> On 28/06/2016 11:01, Peter Lieven wrote:
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>>   include/qom/object.h |  1 +
>>   qom/object.c         | 20 +++++++++++++++++---
>>   2 files changed, 18 insertions(+), 3 deletions(-)
> No, please---glibc should be fixed instead.

The objects we allocate are sometimes as big as 70kB...

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB Peter Lieven
@ 2016-06-28 10:54   ` Paolo Bonzini
  2016-06-28 10:57     ` Dr. David Alan Gilbert
  2016-06-28 11:13     ` Peter Lieven
  0 siblings, 2 replies; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 10:54 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 28/06/2016 11:01, Peter Lieven wrote:
> evaluation with the recently introduced maximum stack size monitoring revealed
> that the actual used stack size was never above 4kB so allocating 1MB stack
> for each coroutine is a lot of wasted memory. So reduce the stack size to
> 64kB which should still give enough head room.

If we make the stack this much smaller, there is a non-zero chance of
smashing it.  You must add a guard page if you do this (actually more
than one because QEMU will happily have stack frames as big as 16 KB).
The stack counts for RSS but it's not actually allocated memory, so why
does it matter?

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB
  2016-06-28 10:54   ` Paolo Bonzini
@ 2016-06-28 10:57     ` Dr. David Alan Gilbert
  2016-06-28 11:17       ` Peter Lieven
  2016-06-28 11:13     ` Peter Lieven
  1 sibling, 1 reply; 78+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-28 10:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Lieven, qemu-devel, kwolf, peter.maydell, mst, mreitz, kraxel

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 28/06/2016 11:01, Peter Lieven wrote:
> > evaluation with the recently introduced maximum stack size monitoring revealed
> > that the actual used stack size was never above 4kB so allocating 1MB stack
> > for each coroutine is a lot of wasted memory. So reduce the stack size to
> > 64kB which should still give enough head room.
> 
> If we make the stack this much smaller, there is a non-zero chance of
> smashing it.  You must add a guard page if you do this (actually more
> than one because QEMU will happily have stack frames as big as 16 KB).
> The stack counts for RSS but it's not actually allocated memory, so why
> does it matter?

I think I'd be interested in seeing the /proc/.../smaps before and after this
change to see if anything is visible and if we can see the difference
in rss etc.

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 01/15] coroutine-ucontext: mmap stack memory
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 01/15] coroutine-ucontext: mmap stack memory Peter Lieven
  2016-06-28 10:02   ` Peter Maydell
@ 2016-06-28 11:04   ` Paolo Bonzini
  1 sibling, 0 replies; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 11:04 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 28/06/2016 11:01, Peter Lieven wrote:
> +#ifdef MAP_GROWSDOWN
> +    co->stack = mmap(NULL, COROUTINE_STACK_SIZE, PROT_READ | PROT_WRITE,
> +                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
> +    if (co->stack == MAP_FAILED) {
> +        abort();
> +    }
> +    /* add a guard page at bottom of the stack */
> +    if (mmap(co->stack, getpagesize(), PROT_NONE,
> +        MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0) == MAP_FAILED) {
> +        abort();
> +    }

Nevermind, you added a guard page!  Good! :)  And actually it looks like
the stack usage has been mostly tamed, at least for the block layer.

On the other hand MAP_GROWSDOWN automatically adds a guard page since
Linux 3.9 (see commit 09884964335e, "mm: do not grow the stack vma just
because of an overrun on preceding vma", 2013-02-27), so as it turns out
you don't even need the guard page!

Paolo

> +#else
>      co->stack = g_malloc(stack_size);
> +#endif

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 14/15] vnc-tight: make the encoding palette static
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 14/15] vnc-tight: make the encoding palette static Peter Lieven
@ 2016-06-28 11:12   ` Paolo Bonzini
  2016-06-28 11:18     ` Peter Lieven
  0 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 11:12 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 28/06/2016 11:01, Peter Lieven wrote:
> @@ -201,6 +201,7 @@ typedef struct VncTight {
>  #endif
>      int levels[4];
>      z_stream stream[4];
> +    VncPalette palette;
>  } VncTight;

VncTight is copied back and forth in vnc_async_encoding_start and
vnc_async_encoding_end, so this should not be included in VncTight.
Perhaps however if you include a VncPalette* it allows reuse and avoids
fragmentation?  Or perhaps you can use a thread-local static variable?
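
The thread-local variant could be as small as (sketch; only valid if the
palette really is touched exclusively by the encoding thread):

/* reused across updates, neither copied with VncTight nor malloc'd
 * per framebuffer update */
static __thread VncPalette color_count_palette;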

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB
  2016-06-28 10:54   ` Paolo Bonzini
  2016-06-28 10:57     ` Dr. David Alan Gilbert
@ 2016-06-28 11:13     ` Peter Lieven
  2016-06-28 11:26       ` Paolo Bonzini
  1 sibling, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 11:13 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 28.06.2016 um 12:54 schrieb Paolo Bonzini:
>
> On 28/06/2016 11:01, Peter Lieven wrote:
>> evaluation with the recently introduced maximum stack size monitoring revealed
>> that the actual used stack size was never above 4kB so allocating 1MB stack
>> for each coroutine is a lot of wasted memory. So reduce the stack size to
>> 64kB which should still give enough head room.
> If we make the stack this much smaller, there is a non-zero chance of
> smashing it.  You must add a guard page if you do this (actually more
> than one because QEMU will happily have stack frames as big as 16 KB).
> The stack counts for RSS but it's not actually allocated memory, so why
> does it matter?

Is there an easy way to determine how much of the RSS is actually allocated?
I erroneously thought it was all allocated...

So as for the stack, is MAP_GROWSDOWN really important? Will the kernel
otherwise allocate all pages of the stack if the last page is written?

I am asking because I don't know if MAP_GROWSDOWN is a good idea, as Peter
mentioned there were discussions about deprecating it.

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB
  2016-06-28 10:57     ` Dr. David Alan Gilbert
@ 2016-06-28 11:17       ` Peter Lieven
  2016-06-28 11:35         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 11:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Paolo Bonzini
  Cc: qemu-devel, kwolf, peter.maydell, mst, mreitz, kraxel

Am 28.06.2016 um 12:57 schrieb Dr. David Alan Gilbert:
> * Paolo Bonzini (pbonzini@redhat.com) wrote:
>>
>> On 28/06/2016 11:01, Peter Lieven wrote:
>>> evaluation with the recently introduced maximum stack size monitoring revealed
>>> that the actual used stack size was never above 4kB so allocating 1MB stack
>>> for each coroutine is a lot of wasted memory. So reduce the stack size to
>>> 64kB which should still give enough head room.
>> If we make the stack this much smaller, there is a non-zero chance of
>> smashing it.  You must add a guard page if you do this (actually more
>> than one because QEMU will happily have stack frames as big as 16 KB).
>> The stack counts for RSS but it's not actually allocated memory, so why
>> does it matter?
> I think I'd be interested in seeing the /proc/.../smaps before and after this
> change to see if anything is visible and if we can see the difference
> in rss etc.

Can you advise what in smaps should especially be looked at?

As for RSS I can report that the long-term usage is significantly lower.
I had the strange observation that when the VM has been running for some minutes
the RSS suddenly increases to the whole stack size.

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 14/15] vnc-tight: make the encoding palette static
  2016-06-28 11:12   ` Paolo Bonzini
@ 2016-06-28 11:18     ` Peter Lieven
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 11:18 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 28.06.2016 um 13:12 schrieb Paolo Bonzini:
>
> On 28/06/2016 11:01, Peter Lieven wrote:
>> @@ -201,6 +201,7 @@ typedef struct VncTight {
>>   #endif
>>       int levels[4];
>>       z_stream stream[4];
>> +    VncPalette palette;
>>   } VncTight;
> VncTight is copied back and forth in vnc_async_encoding_start and
> vnc_async_encoding_end, so this should not be included in VncTight.
> Perhaps however if you include a VncPalette* it allows reuse and avoids
> fragmentation?  Or perhaps you can use a thread-local static variable?

I will have a look. I missed the copying. However, this palette is only
used to count the number of distinct colors. So it should only be
used in the encoding thread.

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB
  2016-06-28 11:13     ` Peter Lieven
@ 2016-06-28 11:26       ` Paolo Bonzini
  0 siblings, 0 replies; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 11:26 UTC (permalink / raw)
  To: Peter Lieven
  Cc: qemu-devel, kwolf, peter maydell, mst, dgilbert, mreitz, kraxel



----- Original Message -----
> From: "Peter Lieven" <pl@kamp.de>
> To: "Paolo Bonzini" <pbonzini@redhat.com>, qemu-devel@nongnu.org
> Cc: kwolf@redhat.com, "peter maydell" <peter.maydell@linaro.org>, mst@redhat.com, dgilbert@redhat.com,
> mreitz@redhat.com, kraxel@redhat.com
> Sent: Tuesday, June 28, 2016 1:13:26 PM
> Subject: Re: [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB
> 
> Am 28.06.2016 um 12:54 schrieb Paolo Bonzini:
> >
> > On 28/06/2016 11:01, Peter Lieven wrote:
> >> evaluation with the recently introduced maximum stack size monitoring
> >> revealed
> >> that the actual used stack size was never above 4kB so allocating 1MB
> >> stack
> >> for each coroutine is a lot of wasted memory. So reduce the stack size to
> >> 64kB which should still give enough head room.
> > If we make the stack this much smaller, there is a non-zero chance of
> > smashing it.  You must add a guard page if you do this (actually more
> > than one because QEMU will happily have stack frames as big as 16 KB).
> > The stack counts for RSS but it's not actually allocated memory, so why
> > does it matter?
> 
> Is there an easy way to determine how much of the RSS is actually
> allocated? I erroneously thought it was all allocated...
> 
> So as for the stack, the MAP_GROWSDOWN is it really important? Will the
> kernel
> allocate all pages of the stack otherwise if the last page is written?
> 
> I am asking because I don't know if MAP_GROWSDOWN is a good idea as Peter
> mentioned there were discussions to deprecate it.

I don't know, I found those discussions too.  However I've also seen
an interesting patch to ensure a guard page is kept at the bottom of the
VMA.

But thinking more about it, if you use MAP_GROWSDOWN you don't know anymore
where the bottom of the stack is, and you cannot free it correctly, can you?
Or am I completely misunderstanding the purpose of the flag?

I guess it's better to steer clear of it unless we're ready to look at
kernel code for a while...

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 09/15] loader: use mmap for ROMs
  2016-06-28 10:41   ` Paolo Bonzini
@ 2016-06-28 11:26     ` Peter Lieven
  2016-07-04  7:30     ` Peter Lieven
  1 sibling, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 11:26 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 28.06.2016 um 12:41 schrieb Paolo Bonzini:
>
> On 28/06/2016 11:01, Peter Lieven wrote:
>> a classic use for mmap here.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> They are never freed, why does mmap help?

Actually, it is freed. Adding some debug output to rom_add_file and rom_reset reveals the following:

rom load: /home/lieven/git/qemu/pc-bios/bios-256k.bin
rom load: /home/lieven/git/qemu/pc-bios/kvmvapic.bin
rom_reset

I think the rom is copied to system memory?

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB
  2016-06-28 11:17       ` Peter Lieven
@ 2016-06-28 11:35         ` Dr. David Alan Gilbert
  2016-06-28 12:09           ` Peter Lieven
  0 siblings, 1 reply; 78+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-28 11:35 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, qemu-devel, kwolf, peter.maydell, mst, mreitz, kraxel

* Peter Lieven (pl@kamp.de) wrote:
> Am 28.06.2016 um 12:57 schrieb Dr. David Alan Gilbert:
> > * Paolo Bonzini (pbonzini@redhat.com) wrote:
> > > 
> > > On 28/06/2016 11:01, Peter Lieven wrote:
> > > > evaluation with the recently introduced maximum stack size monitoring revealed
> > > > that the actual used stack size was never above 4kB so allocating 1MB stack
> > > > for each coroutine is a lot of wasted memory. So reduce the stack size to
> > > > 64kB which should still give enough head room.
> > > If we make the stack this much smaller, there is a non-zero chance of
> > > smashing it.  You must add a guard page if you do this (actually more
> > > than one because QEMU will happily have stack frames as big as 16 KB).
> > > The stack counts for RSS but it's not actually allocated memory, so why
> > > does it matter?
> > I think I'd be interested in seeing the /proc/.../smaps before and after this
> > change to see if anything is visible and if we can see the difference
> > in rss etc.
> 
> Can you advise what in smaps should be especially looked at.
> 
> As for RSS I can report that the long-term usage is significantly lower.
> I had the strange observation that when the VM has been running for some minutes
> the RSS suddenly increases to the whole stack size.

You can see the Rss of each mapping; if you knew where your stacks were
it would be easy to see if it was the stacks that were Rss and if
there was anything else odd about them.
If you set the mapping as growsdown then you can see the area that has a 'gd'
in its VmFlags.

Dave

> 
> Peter
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor Peter Lieven
  2016-06-28  9:29   ` Dr. David Alan Gilbert
@ 2016-06-28 11:36   ` Paolo Bonzini
  2016-06-28 14:14     ` Eric Blake
  2016-06-30 14:12   ` Markus Armbruster
  2 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 11:36 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 28/06/2016 11:01, Peter Lieven wrote:
> this struct is approx 75kB
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  qapi/qmp-input-visitor.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)

Can you change the stack to a QSLIST instead?  That's where most of the
waste comes from.
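
Something along these lines, perhaps (sketch only; entry contents guessed,
error handling omitted):

typedef struct StackEntry {
    QObject *obj;
    QSLIST_ENTRY(StackEntry) node;
} StackEntry;

/* in QmpInputVisitor, replacing the fixed stack[QIV_STACK_SIZE] array: */
QSLIST_HEAD(, StackEntry) stack;

static void qmp_input_push(QmpInputVisitor *qiv, QObject *obj)
{
    StackEntry *e = g_new0(StackEntry, 1);

    e->obj = obj;
    QSLIST_INSERT_HEAD(&qiv->stack, e, node);
}

static void qmp_input_pop(QmpInputVisitor *qiv)
{
    StackEntry *e = QSLIST_FIRST(&qiv->stack);

    QSLIST_REMOVE_HEAD(&qiv->stack, node);
    g_free(e);
}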

Thanks,

Paolo

> diff --git a/qapi/qmp-input-visitor.c b/qapi/qmp-input-visitor.c
> index aea90a1..b6f5dfd 100644
> --- a/qapi/qmp-input-visitor.c
> +++ b/qapi/qmp-input-visitor.c
> @@ -17,6 +17,7 @@
>  #include "qapi/qmp-input-visitor.h"
>  #include "qapi/visitor-impl.h"
>  #include "qemu/queue.h"
> +#include "qemu/mmap-alloc.h"
>  #include "qemu-common.h"
>  #include "qapi/qmp/types.h"
>  #include "qapi/qmp/qerror.h"
> @@ -378,14 +379,14 @@ Visitor *qmp_input_get_visitor(QmpInputVisitor *v)
>  void qmp_input_visitor_cleanup(QmpInputVisitor *v)
>  {
>      qobject_decref(v->root);
> -    g_free(v);
> +    qemu_anon_ram_munmap(v, sizeof(*v));
>  }
>  
>  QmpInputVisitor *qmp_input_visitor_new(QObject *obj, bool strict)
>  {
>      QmpInputVisitor *v;
>  
> -    v = g_malloc0(sizeof(*v));
> +    v = qemu_anon_ram_mmap(sizeof(*v));
>  
>      v->visitor.type = VISITOR_INPUT;
>      v->visitor.start_struct = qmp_input_start_struct;
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (14 preceding siblings ...)
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 15/15] vnc: use mmap for VncState Peter Lieven
@ 2016-06-28 11:37 ` Paolo Bonzini
  2016-06-28 12:14   ` Peter Lieven
  2016-10-12 21:18 ` Michael R. Hines
  16 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 11:37 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 28/06/2016 11:01, Peter Lieven wrote:
> I recently found that Qemu is using several hundred megabytes of RSS memory
> more than older versions such as Qemu 2.2.0. So I started tracing
> memory allocation and found 2 major reasons for this.
> 
> 1) We changed the qemu coroutine pool to have a per thread and a global release
>    pool. The choosen poolsize and the changed algorithm could lead to up to
>    192 free coroutines with just a single iothread. Each of the coroutines
>    in the pool each having 1MB of stack memory.

But the fix, as you correctly note, is to reduce the stack size.  It
would be nice to compile block-obj-y with -Wstack-usage=2048 too.
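
For anyone who wants to try that locally, something like

  ./configure --extra-cflags='-Wstack-usage=2048' ...
  make 2>&1 | grep 'stack usage'

should list every function whose frame exceeds 2 kB (-Wstack-usage is a
GCC warning option that triggers per function once the given frame size
is exceeded).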

> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which lead to delayed freeing
>    of memory. This lead to higher heap allocations which could not effectively
>    be returned to kernel (most likely due to fragmentation).

I agree that some of the exec.c allocations need some care, but I would
prefer to use a custom free list or lazy allocation instead of mmap.

Changing allocations to use mmap also is not really useful if you do it
for objects that are never freed (as in patches 8-9-10-15 at least, and
probably 11 too which is one of the most contentious).

In other words, the effort tracking down the allocation is really,
really appreciated.  But the patches look like you only had a hammer at
hand, and everything looked like a nail. :)

Paolo

> The following series is what I came up with. Beside the coroutine patches I changed
> some allocations to forcibly use mmap. All these allocations are not repeatly made
> during runtime so the impact of using mmap should be neglectible.
> 
> There are still some big malloced allocations left which cannot be easily changed
> (e.g. the pixman buffers in VNC). So it might an idea to set a lower mmap threshold for
> malloc since this threshold seems to be in the order of several Megabytes on modern systems.
> 
> Peter Lieven (15):
>   coroutine-ucontext: mmap stack memory
>   coroutine-ucontext: add a switch to monitor maximum stack size
>   coroutine-ucontext: reduce stack size to 64kB
>   coroutine: add a knob to disable the shared release pool
>   util: add a helper to mmap private anonymous memory
>   exec: use mmap for subpages
>   qapi: use mmap for QmpInputVisitor
>   virtio: use mmap for VirtQueue
>   loader: use mmap for ROMs
>   vmware_svga: use mmap for scratch pad
>   qom: use mmap for bigger Objects
>   util: add a function to realloc mmapped memory
>   exec: use mmap for PhysPageMap->nodes
>   vnc-tight: make the encoding palette static
>   vnc: use mmap for VncState
> 
>  configure                 | 33 ++++++++++++++++++--
>  exec.c                    | 11 ++++---
>  hw/core/loader.c          | 16 +++++-----
>  hw/display/vmware_vga.c   |  3 +-
>  hw/virtio/virtio.c        |  5 +--
>  include/qemu/mmap-alloc.h |  7 +++++
>  include/qom/object.h      |  1 +
>  qapi/qmp-input-visitor.c  |  5 +--
>  qom/object.c              | 20 ++++++++++--
>  ui/vnc-enc-tight.c        | 21 ++++++-------
>  ui/vnc.c                  |  5 +--
>  ui/vnc.h                  |  1 +
>  util/coroutine-ucontext.c | 66 +++++++++++++++++++++++++++++++++++++--
>  util/mmap-alloc.c         | 27 ++++++++++++++++
>  util/qemu-coroutine.c     | 79 ++++++++++++++++++++++++++---------------------
>  15 files changed, 225 insertions(+), 75 deletions(-)
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB
  2016-06-28 11:35         ` Dr. David Alan Gilbert
@ 2016-06-28 12:09           ` Peter Lieven
  2016-06-28 14:20             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 12:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Paolo Bonzini, qemu-devel, kwolf, peter.maydell, mst, mreitz, kraxel

Am 28.06.2016 um 13:35 schrieb Dr. David Alan Gilbert:
> * Peter Lieven (pl@kamp.de) wrote:
>> Am 28.06.2016 um 12:57 schrieb Dr. David Alan Gilbert:
>>> * Paolo Bonzini (pbonzini@redhat.com) wrote:
>>>> On 28/06/2016 11:01, Peter Lieven wrote:
>>>>> evaluation with the recently introduced maximum stack size monitoring revealed
>>>>> that the actual used stack size was never above 4kB so allocating 1MB stack
>>>>> for each coroutine is a lot of wasted memory. So reduce the stack size to
>>>>> 64kB which should still give enough head room.
>>>> If we make the stack this much smaller, there is a non-zero chance of
>>>> smashing it.  You must add a guard page if you do this (actually more
>>>> than one because QEMU will happily have stack frames as big as 16 KB).
>>>> The stack counts for RSS but it's not actually allocated memory, so why
>>>> does it matter?
>>> I think I'd be interested in seeing the /proc/.../smaps before and after this
>>> change to see if anything is visible and if we can see the difference
>>> in rss etc.
>> Can you advise what in smaps should be especially looked at.
>>
>> As for RSS I can report that the long-term usage is significantly lower.
>> I had the strange observation that when the VM has been running for some minutes
>> the RSS suddenly increases to the whole stack size.
> You can see the Rss of each mapping; if you knew where your stacks were
> it would be easy to see if it was the stacks that were Rss and if
> there was anything else odd about them.
> If you set hte mapping as growsdown then you can see the area that has a 'gd'
> in it's VmFlags.

Would you expect to see each 1MB allocation in smaps or is it possible that
the kernel merges some mappings to bigger ones?

And more importantly, if the regions are merged, Paolo's comment that we
do not need a guard page would not hold, because a coroutine stack could
grow into another coroutine's stack. Looking at the commit from Linus, it
would also be good if that guard page did not have the gd flag.

Some of the regions above 1024kB have an RSS of exactly 4kB * (Size / 1024kB),
which leads to the assumption that these are coroutine stacks where exactly one page
per stack has been allocated.

I am asking because this is, for example, what I see for a Qemu VM for mappings with the "gd" flag:

cat /proc/5031/smaps | grep -B18 gd
7f808aee7000-7f808b9e6000 rw-p 00000000 00:00 0
Size:              11264 kB
Rss:                  44 kB
Pss:                  44 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        44 kB
Referenced:           44 kB
Anonymous:            44 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
--
7f808bb01000-7f8090000000 rw-p 00000000 00:00 0
Size:              70656 kB
Rss:                 276 kB
Pss:                 276 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:       276 kB
Referenced:          276 kB
Anonymous:           276 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
--
7f80940ff000-7f80943fe000 rw-p 00000000 00:00 0
Size:               3072 kB
Rss:                  12 kB
Pss:                  12 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        12 kB
Referenced:           12 kB
Anonymous:            12 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
--
7f8095700000-7f80957ff000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
--
7f8097301000-7f8097400000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
--
7f80974df000-7f80975de000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
7f809760c000-7f809770b000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
--
7f8097901000-7f8097a00000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
--
7f8097b01000-7f8097c00000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
--
7f8097d01000-7f8097e00000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
--
7f8197f01000-7f8198000000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
--
7f81b4001000-7f81b4200000 rw-p 00000000 00:00 0
Size:               2048 kB
Rss:                  20 kB
Pss:                  20 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        20 kB
Referenced:           20 kB
Anonymous:            20 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac sd
--
7ffd337e2000-7ffd33805000 rw-p 00000000 00:00 0                          [stack]
Size:                144 kB
Rss:                  64 kB
Pss:                  64 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        64 kB
Referenced:           64 kB
Anonymous:            64 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-06-28 11:37 ` [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Paolo Bonzini
@ 2016-06-28 12:14   ` Peter Lieven
  2016-06-28 12:29     ` Paolo Bonzini
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 12:14 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
>
> On 28/06/2016 11:01, Peter Lieven wrote:
>> I recently found that Qemu is using several hundred megabytes of RSS memory
>> more than older versions such as Qemu 2.2.0. So I started tracing
>> memory allocation and found 2 major reasons for this.
>>
>> 1) We changed the qemu coroutine pool to have a per thread and a global release
>>     pool. The choosen poolsize and the changed algorithm could lead to up to
>>     192 free coroutines with just a single iothread. Each of the coroutines
>>     in the pool each having 1MB of stack memory.
> But the fix, as you correctly note, is to reduce the stack size.  It
> would be nice to compile block-obj-y with -Wstack-usage=2048 too.

To reveal if there are any big stack allocations in the block layer?

It seems that reducing to 64kB breaks live migration in some (non-reproducible) cases.
The question is which way to go: reduce the stack size and fix the big stack allocations,
or keep the stack size at 1MB?

>
>> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which lead to delayed freeing
>>     of memory. This lead to higher heap allocations which could not effectively
>>     be returned to kernel (most likely due to fragmentation).
> I agree that some of the exec.c allocations need some care, but I would
> prefer to use a custom free list or lazy allocation instead of mmap.

Wouldn't this only help if the elements from the free list were allocated using
mmap? The issue is that RCU delays the freeing, so the number of concurrent
allocations is high and then a bunch is freed at once. If the memory were malloced
it would still have caused trouble.

>
> Changing allocations to use mmap also is not really useful if you do it
> for objects that are never freed (as in patches 8-9-10-15 at least, and
> probably 11 too which is one of the most contentious).

9 actually frees the memory ;-)
15 frees the memory as soon as the vnc client disconnects.

For the others I agree. Whether the objects in Patch 11 are freed needs to be checked.

>
> In other words, the effort tracking down the allocation is really,
> really appreciated.  But the patches look like you only had a hammer at
> hand, and everything looked like a nail. :)

I just have observed that forcing ptmalloc to use mmap for everything
above 4kB significantly reduced the RSS usage.
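
For reference (not part of this series), that threshold can be forced either
via the environment, e.g.

MALLOC_MMAP_THRESHOLD_=4096 qemu-system-x86_64 ...

or programmatically with glibc's mallopt(3); a minimal sketch:

#include <malloc.h>

static void lower_mmap_threshold(void)
{
    /* Serve every allocation above 4kB with mmap so it is returned to the
     * kernel on free.  Setting the threshold explicitly also disables
     * glibc's dynamic threshold adjustment. */
    mallopt(M_MMAP_THRESHOLD, 4096);
}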

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-06-28 12:14   ` Peter Lieven
@ 2016-06-28 12:29     ` Paolo Bonzini
  2016-06-28 12:33       ` Peter Lieven
  0 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 12:29 UTC (permalink / raw)
  To: Peter Lieven
  Cc: qemu-devel, kwolf, peter maydell, mst, dgilbert, mreitz, kraxel

> Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
> > On 28/06/2016 11:01, Peter Lieven wrote:
> >> I recently found that Qemu is using several hundred megabytes of RSS
> >> memory
> >> more than older versions such as Qemu 2.2.0. So I started tracing
> >> memory allocation and found 2 major reasons for this.
> >>
> >> 1) We changed the qemu coroutine pool to have a per thread and a global
> >> release
> >>     pool. The choosen poolsize and the changed algorithm could lead to up
> >>     to
> >>     192 free coroutines with just a single iothread. Each of the
> >>     coroutines
> >>     in the pool each having 1MB of stack memory.
> > But the fix, as you correctly note, is to reduce the stack size.  It
> > would be nice to compile block-obj-y with -Wstack-usage=2048 too.
> 
> To reveal if there are any big stack allocations in the block layer?

Yes.  Most should be fixed by now, but a handful are probably still there.
(definitely one in vvfat.c).
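
For what it's worth, one quick way to get such warnings tree-wide rather than
just for block-obj-y (exact invocation illustrative) would be:

./configure --extra-cflags="-Wstack-usage=2048"
make 2>&1 | grep 'stack usage'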

> As it seems reducing to 64kB breaks live migration in some (non reproducible) cases.

Does it hit the guard page?

> >> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which lead to delayed
> >> freeing
> >>     of memory. This lead to higher heap allocations which could not
> >>     effectively
> >>     be returned to kernel (most likely due to fragmentation).
> > I agree that some of the exec.c allocations need some care, but I would
> > prefer to use a custom free list or lazy allocation instead of mmap.
> 
> This would only help if the elements from the free list would be allocated
> using mmap? The issue is that RCU delays the freeing so that the number of
> concurrent allocations is high and then a bunch is freed at once. If the memory
> was malloced it would still have caused trouble.

The free list should improve reuse and fragmentation.  I'll take a look at
lazy allocation of subpages, too.
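
A minimal sketch of the free-list idea (illustrative only; the names are made
up and callers are assumed to serialize access, e.g. under the big QEMU lock):

typedef struct FreeNode {
    struct FreeNode *next;
} FreeNode;

static FreeNode *free_list;

/* All nodes are assumed to have the same fixed size >= sizeof(FreeNode). */
static void *node_alloc(size_t size)
{
    FreeNode *n = free_list;

    if (n) {
        free_list = n->next;
        return n;
    }
    return g_malloc(size);
}

static void node_free(void *p)
{
    FreeNode *n = p;

    n->next = free_list;
    free_list = n;
}

That way a burst of RCU-deferred frees just refills the list instead of
churning the heap.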

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-06-28 12:29     ` Paolo Bonzini
@ 2016-06-28 12:33       ` Peter Lieven
  2016-06-28 12:56         ` Paolo Bonzini
  2016-06-28 12:56         ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 12:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, kwolf, peter maydell, mst, dgilbert, mreitz, kraxel

Am 28.06.2016 um 14:29 schrieb Paolo Bonzini:
>> Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
>>> On 28/06/2016 11:01, Peter Lieven wrote:
>>>> I recently found that Qemu is using several hundred megabytes of RSS
>>>> memory
>>>> more than older versions such as Qemu 2.2.0. So I started tracing
>>>> memory allocation and found 2 major reasons for this.
>>>>
>>>> 1) We changed the qemu coroutine pool to have a per thread and a global
>>>> release
>>>>      pool. The choosen poolsize and the changed algorithm could lead to up
>>>>      to
>>>>      192 free coroutines with just a single iothread. Each of the
>>>>      coroutines
>>>>      in the pool each having 1MB of stack memory.
>>> But the fix, as you correctly note, is to reduce the stack size.  It
>>> would be nice to compile block-obj-y with -Wstack-usage=2048 too.
>> To reveal if there are any big stack allocations in the block layer?
> Yes.  Most should be fixed by now, but a handful are probably still there.
> (definitely one in vvfat.c).
>
>> As it seems reducing to 64kB breaks live migration in some (non reproducible) cases.
> Does it hit the guard page?

How would that look like? I get segfaults like this:

segfault at 7f91aa642b78 ip 0000555ab714ef7d sp 00007f91aa642b50 error 6 in qemu-system-x86_64[555ab6f2c000+794000]

most of the time error 6. Sometimes error 7. segfault is near the sp.
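
(For reference, the "error" value in these messages is the x86 page-fault
error code: bit 0 = protection violation vs. page not present, bit 1 = write
access, bit 2 = user mode.  So error 6 is a user-mode write to an address
that is not mapped at all, which is what running off the end of the stack
mapping would look like, while error 7 is a user-mode write blocked by page
protections, e.g. hitting a PROT_NONE guard page.)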


>
>>>> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which lead to delayed
>>>> freeing
>>>>      of memory. This lead to higher heap allocations which could not
>>>>      effectively
>>>>      be returned to kernel (most likely due to fragmentation).
>>> I agree that some of the exec.c allocations need some care, but I would
>>> prefer to use a custom free list or lazy allocation instead of mmap.
>> This would only help if the elements from the free list would be allocated
>> using mmap? The issue is that RCU delays the freeing so that the number of
>> concurrent allocations is high and then a bunch is freed at once. If the memory
>> was malloced it would still have caused trouble.
> The free list should improve reuse and fragmentation.  I'll take a look at
> lazy allocation of subpages, too.

Ok, that would be good. And for the PhysPageMap we use mmap and try to avoid
the realloc?

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-06-28 12:33       ` Peter Lieven
@ 2016-06-28 12:56         ` Paolo Bonzini
  2016-06-28 12:56         ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 78+ messages in thread
From: Paolo Bonzini @ 2016-06-28 12:56 UTC (permalink / raw)
  To: Peter Lieven
  Cc: qemu-devel, kwolf, peter maydell, mst, dgilbert, mreitz, kraxel



----- Original Message -----
> From: "Peter Lieven" <pl@kamp.de>
> To: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: qemu-devel@nongnu.org, kwolf@redhat.com, "peter maydell" <peter.maydell@linaro.org>, mst@redhat.com,
> dgilbert@redhat.com, mreitz@redhat.com, kraxel@redhat.com
> Sent: Tuesday, June 28, 2016 2:33:02 PM
> Subject: Re: [PATCH 00/15] optimize Qemu RSS usage
> 
> Am 28.06.2016 um 14:29 schrieb Paolo Bonzini:
> >> Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
> >>> On 28/06/2016 11:01, Peter Lieven wrote:
> >>>> I recently found that Qemu is using several hundred megabytes of RSS
> >>>> memory
> >>>> more than older versions such as Qemu 2.2.0. So I started tracing
> >>>> memory allocation and found 2 major reasons for this.
> >>>>
> >>>> 1) We changed the qemu coroutine pool to have a per thread and a global
> >>>> release
> >>>>      pool. The choosen poolsize and the changed algorithm could lead to
> >>>>      up
> >>>>      to
> >>>>      192 free coroutines with just a single iothread. Each of the
> >>>>      coroutines
> >>>>      in the pool each having 1MB of stack memory.
> >>> But the fix, as you correctly note, is to reduce the stack size.  It
> >>> would be nice to compile block-obj-y with -Wstack-usage=2048 too.
> >> To reveal if there are any big stack allocations in the block layer?
> > Yes.  Most should be fixed by now, but a handful are probably still there.
> > (definitely one in vvfat.c).
> >
> >> As it seems reducing to 64kB breaks live migration in some (non
> >> reproducible) cases.
> > Does it hit the guard page?
> 
> How would that look like? I get segfaults like this:
> 
> segfault at 7f91aa642b78 ip 0000555ab714ef7d sp 00007f91aa642b50 error 6 in
> qemu-system-x86_64[555ab6f2c000+794000]
> 
> most of the time error 6. Sometimes error 7. segfault is near the sp.

You can use "p ((CoroutineUContext*)current)->stack" from gdb
to check the stack base of the currently running coroutine (do it in the thread
that received the segfault).

You can also check the instruction with that ip and try to get a backtrace.
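
A concrete way to do that check from the faulting thread (the stack size here
is assumed to be the 64kB from patch 3):

(gdb) p ((CoroutineUContext *)current)->stack
(gdb) p/x (char *)$sp - (char *)((CoroutineUContext *)current)->stack

If the second value is not below 0x10000 (64kB), the stack pointer is already
outside the coroutine stack.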

Paolo


> >>>> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which lead to delayed
> >>>> freeing
> >>>>      of memory. This lead to higher heap allocations which could not
> >>>>      effectively
> >>>>      be returned to kernel (most likely due to fragmentation).
> >>> I agree that some of the exec.c allocations need some care, but I would
> >>> prefer to use a custom free list or lazy allocation instead of mmap.
> >> This would only help if the elements from the free list would be allocated
> >> using mmap? The issue is that RCU delays the freeing so that the number of
> >> concurrent allocations is high and then a bunch is freed at once. If the
> >> memory
> >> was malloced it would still have caused trouble.
> > The free list should improve reuse and fragmentation.  I'll take a look at
> > lazy allocation of subpages, too.
> 
> Ok, that would be good. And for the PhsyPageMap we use mmap and try to avoid
> the realloc?

I think that with lazy allocation of subpages the PhysPageMap will be much
smaller, but I need to check.

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-06-28 12:33       ` Peter Lieven
  2016-06-28 12:56         ` Paolo Bonzini
@ 2016-06-28 12:56         ` Dr. David Alan Gilbert
  2016-06-28 14:43           ` Peter Lieven
  1 sibling, 1 reply; 78+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-28 12:56 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, qemu-devel, kwolf, peter maydell, mst, mreitz, kraxel

* Peter Lieven (pl@kamp.de) wrote:
> Am 28.06.2016 um 14:29 schrieb Paolo Bonzini:
> > > Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
> > > > On 28/06/2016 11:01, Peter Lieven wrote:
> > > > > I recently found that Qemu is using several hundred megabytes of RSS
> > > > > memory
> > > > > more than older versions such as Qemu 2.2.0. So I started tracing
> > > > > memory allocation and found 2 major reasons for this.
> > > > > 
> > > > > 1) We changed the qemu coroutine pool to have a per thread and a global
> > > > > release
> > > > >      pool. The choosen poolsize and the changed algorithm could lead to up
> > > > >      to
> > > > >      192 free coroutines with just a single iothread. Each of the
> > > > >      coroutines
> > > > >      in the pool each having 1MB of stack memory.
> > > > But the fix, as you correctly note, is to reduce the stack size.  It
> > > > would be nice to compile block-obj-y with -Wstack-usage=2048 too.
> > > To reveal if there are any big stack allocations in the block layer?
> > Yes.  Most should be fixed by now, but a handful are probably still there.
> > (definitely one in vvfat.c).
> > 
> > > As it seems reducing to 64kB breaks live migration in some (non reproducible) cases.
> > Does it hit the guard page?
> 
> How would that look like? I get segfaults like this:
> 
> segfault at 7f91aa642b78 ip 0000555ab714ef7d sp 00007f91aa642b50 error 6 in qemu-system-x86_64[555ab6f2c000+794000]
> 
> most of the time error 6. Sometimes error 7. segfault is near the sp.

A backtrace would be good.

Dave

> 
> 
> > 
> > > > > 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which lead to delayed
> > > > > freeing
> > > > >      of memory. This lead to higher heap allocations which could not
> > > > >      effectively
> > > > >      be returned to kernel (most likely due to fragmentation).
> > > > I agree that some of the exec.c allocations need some care, but I would
> > > > prefer to use a custom free list or lazy allocation instead of mmap.
> > > This would only help if the elements from the free list would be allocated
> > > using mmap? The issue is that RCU delays the freeing so that the number of
> > > concurrent allocations is high and then a bunch is freed at once. If the memory
> > > was malloced it would still have caused trouble.
> > The free list should improve reuse and fragmentation.  I'll take a look at
> > lazy allocation of subpages, too.
> 
> Ok, that would be good. And for the PhsyPageMap we use mmap and try to avoid
> the realloc?
> 
> Peter
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-06-28 10:17         ` Dr. David Alan Gilbert
  2016-06-28 10:21           ` Daniel P. Berrange
@ 2016-06-28 14:10           ` Eric Blake
  1 sibling, 0 replies; 78+ messages in thread
From: Eric Blake @ 2016-06-28 14:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Daniel P. Berrange
  Cc: kwolf, peter.maydell, mst, Peter Lieven, qemu-devel, mreitz,
	kraxel, pbonzini, Markus Armbruster

[-- Attachment #1: Type: text/plain, Size: 813 bytes --]

On 06/28/2016 04:17 AM, Dr. David Alan Gilbert wrote:

>> QmpInputVisitor is used to parse all QMP monitor commands, so will
>> be used continuously throughout life of QEMU, often very frequently.
>> eg When migration is running many monitor commands per second are
>> expected
> 
> Does the same input visitor get reused by each command?

No; in fact commit f2ff429 changed things to intentionally prevent reuse
of a visitor (on the argument that it was easier to do that than to
think about corner cases of reset after a partial visit encountered
errors).  But we can revisit that decision if reusing a static
QmpInputVisitor would be wiser than allocating a fresh one for every visit.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-06-28 11:36   ` Paolo Bonzini
@ 2016-06-28 14:14     ` Eric Blake
  0 siblings, 0 replies; 78+ messages in thread
From: Eric Blake @ 2016-06-28 14:14 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel, Markus Armbruster

[-- Attachment #1: Type: text/plain, Size: 1315 bytes --]

On 06/28/2016 05:36 AM, Paolo Bonzini wrote:
> 
> 
> On 28/06/2016 11:01, Peter Lieven wrote:
>> this struct is approx 75kB
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>>  qapi/qmp-input-visitor.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> Can you change the stack to a QSLIST instead?  That's where most of the
> waste comes from.

QmpInputVisitor has:

struct QmpInputVisitor
{
    Visitor visitor;

    /* Root of visit at visitor creation. */
    QObject *root;

    /* Stack of objects being visited (all entries will be either
     * QDict or QList). */
    StackObject stack[QIV_STACK_SIZE];
...

while QmpOutputVisitor has:

struct QmpOutputVisitor
{
    Visitor visitor;
    QStack stack; /* Stack of containers that haven't yet been finished */
    QObject *root; /* Root of the output visit */
    QObject **result; /* User's storage location for result */
};

The extra layer of indirection to a QStack vs. a direct array has
tradeoffs, but both Markus and I have commented in the past that both
files' stacks are rather wasteful, and we just have not had a reason to
improve them.  This thread may be the reason :)
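
To make Paolo's QSLIST suggestion concrete, a rough sketch for the input
visitor could look like this (field and helper names are illustrative, not
taken from an actual patch):

#include "qemu/queue.h"

typedef struct StackObject {
    QObject *obj;                    /* the QDict or QList being visited */
    const QListEntry *entry;         /* next entry, if obj is a QList */
    QSLIST_ENTRY(StackObject) node;  /* link to the enclosing container */
} StackObject;

struct QmpInputVisitor {
    Visitor visitor;
    QObject *root;
    QSLIST_HEAD(, StackObject) stack;   /* replaces the fixed-size array */
};

static void qmp_input_push(QmpInputVisitor *qiv, QObject *obj)
{
    StackObject *tos = g_new0(StackObject, 1);

    tos->obj = obj;
    QSLIST_INSERT_HEAD(&qiv->stack, tos, node);
}

static void qmp_input_pop(QmpInputVisitor *qiv)
{
    StackObject *tos = QSLIST_FIRST(&qiv->stack);

    QSLIST_REMOVE_HEAD(&qiv->stack, node);
    g_free(tos);
}

The visitor itself then shrinks to a handful of pointers, at the cost of one
small allocation per nested container during a visit.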

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB
  2016-06-28 12:09           ` Peter Lieven
@ 2016-06-28 14:20             ` Dr. David Alan Gilbert
  2016-06-30  6:34               ` Peter Lieven
  0 siblings, 1 reply; 78+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-28 14:20 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, qemu-devel, kwolf, peter.maydell, mst, mreitz, kraxel

* Peter Lieven (pl@kamp.de) wrote:
> Am 28.06.2016 um 13:35 schrieb Dr. David Alan Gilbert:
> > * Peter Lieven (pl@kamp.de) wrote:
> > > Am 28.06.2016 um 12:57 schrieb Dr. David Alan Gilbert:
> > > > * Paolo Bonzini (pbonzini@redhat.com) wrote:
> > > > > On 28/06/2016 11:01, Peter Lieven wrote:
> > > > > > evaluation with the recently introduced maximum stack size monitoring revealed
> > > > > > that the actual used stack size was never above 4kB so allocating 1MB stack
> > > > > > for each coroutine is a lot of wasted memory. So reduce the stack size to
> > > > > > 64kB which should still give enough head room.
> > > > > If we make the stack this much smaller, there is a non-zero chance of
> > > > > smashing it.  You must add a guard page if you do this (actually more
> > > > > than one because QEMU will happily have stack frames as big as 16 KB).
> > > > > The stack counts for RSS but it's not actually allocated memory, so why
> > > > > does it matter?
> > > > I think I'd be interested in seeing the /proc/.../smaps before and after this
> > > > change to see if anything is visible and if we can see the difference
> > > > in rss etc.
> > > Can you advise what in smaps should be especially looked at.
> > > 
> > > As for RSS I can report hat the long term usage is significantly lower.
> > > I had the strange observation that when the VM is running for some minutes
> > > the RSS suddenly increases to the whole stack size.
> > You can see the Rss of each mapping; if you knew where your stacks were
> > it would be easy to see if it was the stacks that were Rss and if
> > there was anything else odd about them.
> > If you set hte mapping as growsdown then you can see the area that has a 'gd'
> > in it's VmFlags.
> 
> Would you expect to see each 1MB allocation in smaps or is it possible that
> the kernel merges some mappings to bigger ones?
> 
> And more importantly if the regions are merged Paolos comment about we
> do not need a guard page would not be true because a coroutine stack could
> grow into annother coroutines stack. Looking at the commit from Linus it
> would also be good to have that guard page not having the gd flag.

Hmm I'm not sure; one for Paolo.

> Some of the regions above 1024kB have an RSS of exactly 4kB * (Size / 1024kB)
> which leads to the assumption that it is a corouine stack where exactly one page
> has been allocated.
> 
> I am asking because this is what I e.g. see for a Qemu VM with flags "gd":

However, what that does show is that if you add up all the Rss, it's still
near-enough nothing worth worrying about.

Maybe it looks different in the old world before you mmap'd it; you could
try going back to the g_malloc'd version but printf'ing the
address you get, then comparing that with smaps to see what the malloc'd
world ended up with mapped.
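
For the "add up all the Rss" part, extending the grep already used for the
dump does the trick, e.g.:

cat /proc/5031/smaps | grep -B18 gd | awk '$1 == "Rss:" { sum += $2 } END { print sum " kB" }'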

Dave

> cat /proc/5031/smaps | grep -B18 gd
> 7f808aee7000-7f808b9e6000 rw-p 00000000 00:00 0
> Size:              11264 kB
> Rss:                  44 kB
> Pss:                  44 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:        44 kB
> Referenced:           44 kB
> Anonymous:            44 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> --
> 7f808bb01000-7f8090000000 rw-p 00000000 00:00 0
> Size:              70656 kB
> Rss:                 276 kB
> Pss:                 276 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:       276 kB
> Referenced:          276 kB
> Anonymous:           276 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> --
> 7f80940ff000-7f80943fe000 rw-p 00000000 00:00 0
> Size:               3072 kB
> Rss:                  12 kB
> Pss:                  12 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:        12 kB
> Referenced:           12 kB
> Anonymous:            12 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> --
> 7f8095700000-7f80957ff000 rw-p 00000000 00:00 0
> Size:               1024 kB
> Rss:                   4 kB
> Pss:                   4 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:         4 kB
> Referenced:            4 kB
> Anonymous:             4 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> --
> 7f8097301000-7f8097400000 rw-p 00000000 00:00 0
> Size:               1024 kB
> Rss:                   4 kB
> Pss:                   4 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:         4 kB
> Referenced:            4 kB
> Anonymous:             4 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> --
> 7f80974df000-7f80975de000 rw-p 00000000 00:00 0
> Size:               1024 kB
> Rss:                   4 kB
> Pss:                   4 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:         4 kB
> Referenced:            4 kB
> Anonymous:             4 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> 7f809760c000-7f809770b000 rw-p 00000000 00:00 0
> Size:               1024 kB
> Rss:                   4 kB
> Pss:                   4 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:         4 kB
> Referenced:            4 kB
> Anonymous:             4 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> --
> 7f8097901000-7f8097a00000 rw-p 00000000 00:00 0
> Size:               1024 kB
> Rss:                   4 kB
> Pss:                   4 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:         4 kB
> Referenced:            4 kB
> Anonymous:             4 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> --
> 7f8097b01000-7f8097c00000 rw-p 00000000 00:00 0
> Size:               1024 kB
> Rss:                   4 kB
> Pss:                   4 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:         4 kB
> Referenced:            4 kB
> Anonymous:             4 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> --
> 7f8097d01000-7f8097e00000 rw-p 00000000 00:00 0
> Size:               1024 kB
> Rss:                   4 kB
> Pss:                   4 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:         4 kB
> Referenced:            4 kB
> Anonymous:             4 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> --
> 7f8197f01000-7f8198000000 rw-p 00000000 00:00 0
> Size:               1024 kB
> Rss:                   4 kB
> Pss:                   4 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:         4 kB
> Referenced:            4 kB
> Anonymous:             4 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> --
> 7f81b4001000-7f81b4200000 rw-p 00000000 00:00 0
> Size:               2048 kB
> Rss:                  20 kB
> Pss:                  20 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:        20 kB
> Referenced:           20 kB
> Anonymous:            20 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac sd
> --
> 7ffd337e2000-7ffd33805000 rw-p 00000000 00:00 0                          [stack]
> Size:                144 kB
> Rss:                  64 kB
> Pss:                  64 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:        64 kB
> Referenced:           64 kB
> Anonymous:            64 kB
> AnonHugePages:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> VmFlags: rd wr mr mw me gd ac
> 
> Peter
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-06-28 12:56         ` Dr. David Alan Gilbert
@ 2016-06-28 14:43           ` Peter Lieven
  2016-06-28 14:52             ` Peter Lieven
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 14:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Paolo Bonzini, qemu-devel, kwolf, peter maydell, mst, mreitz, kraxel

Am 28.06.2016 um 14:56 schrieb Dr. David Alan Gilbert:
> * Peter Lieven (pl@kamp.de) wrote:
>> Am 28.06.2016 um 14:29 schrieb Paolo Bonzini:
>>>> Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
>>>>> On 28/06/2016 11:01, Peter Lieven wrote:
>>>>>> I recently found that Qemu is using several hundred megabytes of RSS
>>>>>> memory
>>>>>> more than older versions such as Qemu 2.2.0. So I started tracing
>>>>>> memory allocation and found 2 major reasons for this.
>>>>>>
>>>>>> 1) We changed the qemu coroutine pool to have a per thread and a global
>>>>>> release
>>>>>>       pool. The choosen poolsize and the changed algorithm could lead to up
>>>>>>       to
>>>>>>       192 free coroutines with just a single iothread. Each of the
>>>>>>       coroutines
>>>>>>       in the pool each having 1MB of stack memory.
>>>>> But the fix, as you correctly note, is to reduce the stack size.  It
>>>>> would be nice to compile block-obj-y with -Wstack-usage=2048 too.
>>>> To reveal if there are any big stack allocations in the block layer?
>>> Yes.  Most should be fixed by now, but a handful are probably still there.
>>> (definitely one in vvfat.c).
>>>
>>>> As it seems reducing to 64kB breaks live migration in some (non reproducible) cases.
>>> Does it hit the guard page?
>> How would that look like? I get segfaults like this:
>>
>> segfault at 7f91aa642b78 ip 0000555ab714ef7d sp 00007f91aa642b50 error 6 in qemu-system-x86_64[555ab6f2c000+794000]
>>
>> most of the time error 6. Sometimes error 7. segfault is near the sp.
> A backtrace would be good.

Here we go. My old friend nc_sendv_compat ;-)

Again the question: would you go for reducing the stack size and eliminating all stack eaters?

The static netbuf in nc_sendv_compat is no problem.

And: I would go for adding the guard page without MAP_GROWSDOWN and mmapping the rest of the
stack with this flag if available. That way we are safe on non-Linux systems, on Linux before 3.9, and with merged memory regions.

Peter

---

Program received signal SIGSEGV, Segmentation fault.
0x0000555555a2ee35 in nc_sendv_compat (nc=0x0, iov=0x0, iovcnt=0, flags=0)
     at net/net.c:701
(gdb) bt full
#0  0x0000555555a2ee35 in nc_sendv_compat (nc=0x0, iov=0x0, iovcnt=0, flags=0)
     at net/net.c:701
         buf = '\000' <repeats 65890 times>...
         buffer = 0x0
         offset = 0
#1  0x0000555555a2f058 in qemu_deliver_packet_iov (sender=0x5555565a46b0,
     flags=0, iov=0x7ffff7e98d20, iovcnt=1, opaque=0x555557802370)
     at net/net.c:745
         nc = 0x555557802370
         ret = 21845
#2  0x0000555555a3132d in qemu_net_queue_deliver (queue=0x555557802590,
     sender=0x5555565a46b0, flags=0, data=0x55555659e2a8 "", size=74)
     at net/queue.c:163
         ret = -1
         iov = {iov_base = 0x55555659e2a8, iov_len = 74}
#3  0x0000555555a3178b in qemu_net_queue_flush (queue=0x555557802590)
     at net/queue.c:260
         packet = 0x55555659e280
         ret = 21845
#4  0x0000555555a2eb7a in qemu_flush_or_purge_queued_packets (
     nc=0x555557802370, purge=false) at net/net.c:629
No locals.
#5  0x0000555555a2ebe4 in qemu_flush_queued_packets (nc=0x555557802370)
     at net/net.c:642
No locals.
#6  0x00005555557747b7 in virtio_net_set_status (vdev=0x555556fb32a8,
     status=7 '\a') at /usr/src/qemu-2.5.0/hw/net/virtio-net.c:178
         ncs = 0x555557802370
         queue_started = true
         n = 0x555556fb32a8
         __func__ = "virtio_net_set_status"
         q = 0x555557308b50
         i = 0
         queue_status = 7 '\a'
#7  0x0000555555795501 in virtio_set_status (vdev=0x555556fb32a8, val=7 '\a')
     at /usr/src/qemu-2.5.0/hw/virtio/virtio.c:618
         k = 0x55555657eb40
         __func__ = "virtio_set_status"
#8  0x00005555557985e6 in virtio_vmstate_change (opaque=0x555556fb32a8,
     running=1, state=RUN_STATE_RUNNING)
     at /usr/src/qemu-2.5.0/hw/virtio/virtio.c:1539
         vdev = 0x555556fb32a8
         qbus = 0x555556fb3240
         __func__ = "virtio_vmstate_change"
         k = 0x555556570420
         backend_run = true
#9  0x00005555558592ae in vm_state_notify (running=1, state=RUN_STATE_RUNNING)
     at vl.c:1601
         e = 0x555557320cf0
         next = 0x555557af4c40
#10 0x000055555585737d in vm_start () at vl.c:756
         requested = RUN_STATE_MAX
#11 0x0000555555a209ec in process_incoming_migration_co (opaque=0x5555566a1600)
     at migration/migration.c:392
         f = 0x5555566a1600
         local_err = 0x0
         mis = 0x5555575ab0e0
         ps = POSTCOPY_INCOMING_NONE
         ret = 0
#12 0x0000555555b61efd in coroutine_trampoline (i0=1465036928, i1=21845)
     at util/coroutine-ucontext.c:80
         arg = {p = 0x55555752b080, i = {1465036928, 21845}}
         self = 0x55555752b080
         co = 0x55555752b080
#13 0x00007ffff5cb7800 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#14 0x00007fffffffcb40 in ?? ()
No symbol table info available.
#15 0x0000000000000000 in ?? ()
No symbol table info available.


>
> Dave
>
>>
>>>>>> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which lead to delayed
>>>>>> freeing
>>>>>>       of memory. This lead to higher heap allocations which could not
>>>>>>       effectively
>>>>>>       be returned to kernel (most likely due to fragmentation).
>>>>> I agree that some of the exec.c allocations need some care, but I would
>>>>> prefer to use a custom free list or lazy allocation instead of mmap.
>>>> This would only help if the elements from the free list would be allocated
>>>> using mmap? The issue is that RCU delays the freeing so that the number of
>>>> concurrent allocations is high and then a bunch is freed at once. If the memory
>>>> was malloced it would still have caused trouble.
>>> The free list should improve reuse and fragmentation.  I'll take a look at
>>> lazy allocation of subpages, too.
>> Ok, that would be good. And for the PhsyPageMap we use mmap and try to avoid
>> the realloc?
>>
>> Peter
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


-- 

Mit freundlichen Grüßen

Peter Lieven

...........................................................

   KAMP Netzwerkdienste GmbH
   Vestische Str. 89-91 | 46117 Oberhausen
   Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
   pl@kamp.de | http://www.kamp.de

   Geschäftsführer: Heiner Lante | Michael Lante
   Amtsgericht Duisburg | HRB Nr. 12154
   USt-Id-Nr.: DE 120607556

...........................................................

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-06-28 14:43           ` Peter Lieven
@ 2016-06-28 14:52             ` Peter Lieven
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-28 14:52 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Paolo Bonzini, qemu-devel, kwolf, peter maydell, mst, mreitz, kraxel

Am 28.06.2016 um 16:43 schrieb Peter Lieven:
> Am 28.06.2016 um 14:56 schrieb Dr. David Alan Gilbert:
>> * Peter Lieven (pl@kamp.de) wrote:
>>> Am 28.06.2016 um 14:29 schrieb Paolo Bonzini:
>>>>> Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
>>>>>> On 28/06/2016 11:01, Peter Lieven wrote:
>>>>>>> I recently found that Qemu is using several hundred megabytes of RSS
>>>>>>> memory
>>>>>>> more than older versions such as Qemu 2.2.0. So I started tracing
>>>>>>> memory allocation and found 2 major reasons for this.
>>>>>>>
>>>>>>> 1) We changed the qemu coroutine pool to have a per thread and a global
>>>>>>> release
>>>>>>>       pool. The choosen poolsize and the changed algorithm could lead to up
>>>>>>>       to
>>>>>>>       192 free coroutines with just a single iothread. Each of the
>>>>>>>       coroutines
>>>>>>>       in the pool each having 1MB of stack memory.
>>>>>> But the fix, as you correctly note, is to reduce the stack size.  It
>>>>>> would be nice to compile block-obj-y with -Wstack-usage=2048 too.
>>>>> To reveal if there are any big stack allocations in the block layer?
>>>> Yes.  Most should be fixed by now, but a handful are probably still there.
>>>> (definitely one in vvfat.c).
>>>>
>>>>> As it seems reducing to 64kB breaks live migration in some (non reproducible) cases.
>>>> Does it hit the guard page?
>>> How would that look like? I get segfaults like this:
>>>
>>> segfault at 7f91aa642b78 ip 0000555ab714ef7d sp 00007f91aa642b50 error 6 in qemu-system-x86_64[555ab6f2c000+794000]
>>>
>>> most of the time error 6. Sometimes error 7. segfault is near the sp.
>> A backtrace would be good.
>
> Here we go. My old friend nc_senv_compat ;-)

This has already been fixed in master. My test systems use an older Qemu ;-)

Peter

>
> Again the question: Would you go for reducing the stack size an eliminating all stack eaters ?
>
> The static netbuf in nc_sendv_compat is no problem.
>
> And: I would go for adding the guard page without MAP_GROWSDOWN and mmaping the rest of the
> stack with this flag if availble. So we are save on non Linux systems or Linux before 3.9 or merged memory regions.
>
> Peter
>
> ---
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000555555a2ee35 in nc_sendv_compat (nc=0x0, iov=0x0, iovcnt=0, flags=0)
>     at net/net.c:701
> (gdb) bt full
> #0  0x0000555555a2ee35 in nc_sendv_compat (nc=0x0, iov=0x0, iovcnt=0, flags=0)
>     at net/net.c:701
>         buf = '\000' <repeats 65890 times>...
>         buffer = 0x0
>         offset = 0
> #1  0x0000555555a2f058 in qemu_deliver_packet_iov (sender=0x5555565a46b0,
>     flags=0, iov=0x7ffff7e98d20, iovcnt=1, opaque=0x555557802370)
>     at net/net.c:745
>         nc = 0x555557802370
>         ret = 21845
> #2  0x0000555555a3132d in qemu_net_queue_deliver (queue=0x555557802590,
>     sender=0x5555565a46b0, flags=0, data=0x55555659e2a8 "", size=74)
>     at net/queue.c:163
>         ret = -1
>         iov = {iov_base = 0x55555659e2a8, iov_len = 74}
> #3  0x0000555555a3178b in qemu_net_queue_flush (queue=0x555557802590)
>     at net/queue.c:260
>         packet = 0x55555659e280
>         ret = 21845
> #4  0x0000555555a2eb7a in qemu_flush_or_purge_queued_packets (
>     nc=0x555557802370, purge=false) at net/net.c:629
> No locals.
> #5  0x0000555555a2ebe4 in qemu_flush_queued_packets (nc=0x555557802370)
>     at net/net.c:642
> No locals.
> #6  0x00005555557747b7 in virtio_net_set_status (vdev=0x555556fb32a8,
>     status=7 '\a') at /usr/src/qemu-2.5.0/hw/net/virtio-net.c:178
>         ncs = 0x555557802370
>         queue_started = true
>         n = 0x555556fb32a8
>         __func__ = "virtio_net_set_status"
>         q = 0x555557308b50
>         i = 0
>         queue_status = 7 '\a'
> #7  0x0000555555795501 in virtio_set_status (vdev=0x555556fb32a8, val=7 '\a')
>     at /usr/src/qemu-2.5.0/hw/virtio/virtio.c:618
>         k = 0x55555657eb40
>         __func__ = "virtio_set_status"
> #8  0x00005555557985e6 in virtio_vmstate_change (opaque=0x555556fb32a8,
>     running=1, state=RUN_STATE_RUNNING)
>     at /usr/src/qemu-2.5.0/hw/virtio/virtio.c:1539
>         vdev = 0x555556fb32a8
>         qbus = 0x555556fb3240
>         __func__ = "virtio_vmstate_change"
>         k = 0x555556570420
>         backend_run = true
> #9  0x00005555558592ae in vm_state_notify (running=1, state=RUN_STATE_RUNNING)
>     at vl.c:1601
>         e = 0x555557320cf0
>         next = 0x555557af4c40
> #10 0x000055555585737d in vm_start () at vl.c:756
>         requested = RUN_STATE_MAX
> #11 0x0000555555a209ec in process_incoming_migration_co (opaque=0x5555566a1600)
>     at migration/migration.c:392
>         f = 0x5555566a1600
>         local_err = 0x0
>         mis = 0x5555575ab0e0
>         ps = POSTCOPY_INCOMING_NONE
>         ret = 0
> #12 0x0000555555b61efd in coroutine_trampoline (i0=1465036928, i1=21845)
>     at util/coroutine-ucontext.c:80
>         arg = {p = 0x55555752b080, i = {1465036928, 21845}}
>         self = 0x55555752b080
>         co = 0x55555752b080
> #13 0x00007ffff5cb7800 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #14 0x00007fffffffcb40 in ?? ()
> No symbol table info available.
> #15 0x0000000000000000 in ?? ()
> No symbol table info available.
>
>
>>
>> Dave
>>
>>>
>>>>>>> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which lead to delayed
>>>>>>> freeing
>>>>>>>       of memory. This lead to higher heap allocations which could not
>>>>>>>       effectively
>>>>>>>       be returned to kernel (most likely due to fragmentation).
>>>>>> I agree that some of the exec.c allocations need some care, but I would
>>>>>> prefer to use a custom free list or lazy allocation instead of mmap.
>>>>> This would only help if the elements from the free list would be allocated
>>>>> using mmap? The issue is that RCU delays the freeing so that the number of
>>>>> concurrent allocations is high and then a bunch is freed at once. If the memory
>>>>> was malloced it would still have caused trouble.
>>>> The free list should improve reuse and fragmentation.  I'll take a look at
>>>> lazy allocation of subpages, too.
>>> Ok, that would be good. And for the PhsyPageMap we use mmap and try to avoid
>>> the realloc?
>>>
>>> Peter
>>>
>> -- 
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>


-- 

Mit freundlichen Grüßen

Peter Lieven

...........................................................

   KAMP Netzwerkdienste GmbH
   Vestische Str. 89-91 | 46117 Oberhausen
   Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
   pl@kamp.de | http://www.kamp.de

   Geschäftsführer: Heiner Lante | Michael Lante
   Amtsgericht Duisburg | HRB Nr. 12154
   USt-Id-Nr.: DE 120607556

...........................................................

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB
  2016-06-28 14:20             ` Dr. David Alan Gilbert
@ 2016-06-30  6:34               ` Peter Lieven
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-06-30  6:34 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Paolo Bonzini, qemu-devel, kwolf, peter.maydell, mst, mreitz, kraxel

Am 28.06.2016 um 16:20 schrieb Dr. David Alan Gilbert:
> * Peter Lieven (pl@kamp.de) wrote:
>> Am 28.06.2016 um 13:35 schrieb Dr. David Alan Gilbert:
>>> * Peter Lieven (pl@kamp.de) wrote:
>>>> Am 28.06.2016 um 12:57 schrieb Dr. David Alan Gilbert:
>>>>> * Paolo Bonzini (pbonzini@redhat.com) wrote:
>>>>>> On 28/06/2016 11:01, Peter Lieven wrote:
>>>>>>> evaluation with the recently introduced maximum stack size monitoring revealed
>>>>>>> that the actual used stack size was never above 4kB so allocating 1MB stack
>>>>>>> for each coroutine is a lot of wasted memory. So reduce the stack size to
>>>>>>> 64kB which should still give enough head room.
>>>>>> If we make the stack this much smaller, there is a non-zero chance of
>>>>>> smashing it.  You must add a guard page if you do this (actually more
>>>>>> than one because QEMU will happily have stack frames as big as 16 KB).
>>>>>> The stack counts for RSS but it's not actually allocated memory, so why
>>>>>> does it matter?
>>>>> I think I'd be interested in seeing the /proc/.../smaps before and after this
>>>>> change to see if anything is visible and if we can see the difference
>>>>> in rss etc.
>>>> Can you advise what in smaps should be especially looked at.
>>>>
>>>> As for RSS I can report hat the long term usage is significantly lower.
>>>> I had the strange observation that when the VM is running for some minutes
>>>> the RSS suddenly increases to the whole stack size.
>>> You can see the Rss of each mapping; if you knew where your stacks were
>>> it would be easy to see if it was the stacks that were Rss and if
>>> there was anything else odd about them.
>>> If you set hte mapping as growsdown then you can see the area that has a 'gd'
>>> in it's VmFlags.
>> Would you expect to see each 1MB allocation in smaps or is it possible that
>> the kernel merges some mappings to bigger ones?
>>
>> And more importantly if the regions are merged Paolos comment about we
>> do not need a guard page would not be true because a coroutine stack could
>> grow into annother coroutines stack. Looking at the commit from Linus it
>> would also be good to have that guard page not having the gd flag.
> Hmm I'm not sure; one for Paolo.

My fault. The second mmap call with the pointer to the stack must carry
the MAP_FIXED flag.
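
For the record, a minimal sketch of that layout (illustrative only, not the
code from the series; error handling and the optional MAP_GROWSDOWN handling
are simplified):

#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate stack_size bytes of coroutine stack with one PROT_NONE guard page
 * below it; returns the usable stack base. */
static void *alloc_coroutine_stack(size_t stack_size)
{
    size_t pagesz = sysconf(_SC_PAGESIZE);
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *guard, *stack;

    /* Reserve guard page plus stack in one mapping so nothing else can be
     * placed in between. */
    guard = mmap(NULL, pagesz + stack_size, PROT_NONE, flags, -1, 0);
    if (guard == MAP_FAILED) {
        abort();
    }

    /* Remap the usable part read/write.  MAP_FIXED is what makes the kernel
     * put this mapping exactly on top of the reservation; without it the
     * second mmap may be placed anywhere (the bug described above).
     * MAP_GROWSDOWN could additionally be passed here where available. */
    stack = mmap((char *)guard + pagesz, stack_size, PROT_READ | PROT_WRITE,
                 flags | MAP_FIXED, -1, 0);
    if (stack == MAP_FAILED) {
        abort();
    }
    return stack;
}

Freeing is then a single munmap of pagesz + stack_size starting one page below
the returned pointer.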

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor Peter Lieven
  2016-06-28  9:29   ` Dr. David Alan Gilbert
  2016-06-28 11:36   ` Paolo Bonzini
@ 2016-06-30 14:12   ` Markus Armbruster
  2016-07-04  9:02     ` Paolo Bonzini
  2 siblings, 1 reply; 78+ messages in thread
From: Markus Armbruster @ 2016-06-30 14:12 UTC (permalink / raw)
  To: Peter Lieven
  Cc: qemu-devel, kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel,
	pbonzini

Peter Lieven <pl@kamp.de> writes:

> this struct is approx 75kB

32KiB on x86_64, almost entirely eaten by member stack.

As Dan observed, we use many short-lived QmpInputVisitors.  However, we
should not be using many simultaneously.  That would be necessary for
32KiB objects to have an impact on memory use.  Are you sure these have
an impact?

Implementing a stack as "big enough" array can be wasteful.
Implementing it as dynamically allocated list is differently wasteful.
Saving several mallocs and frees can be worth "wasting" a few pages of
memory for a short time.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects
  2016-06-28 10:49     ` Peter Lieven
@ 2016-06-30 14:15       ` Markus Armbruster
  0 siblings, 0 replies; 78+ messages in thread
From: Markus Armbruster @ 2016-06-30 14:15 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, qemu-devel, kwolf, peter.maydell, mst, dgilbert,
	mreitz, kraxel

Peter Lieven <pl@kamp.de> writes:

> Am 28.06.2016 um 12:42 schrieb Paolo Bonzini:
>>
>> On 28/06/2016 11:01, Peter Lieven wrote:
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>> ---
>>>   include/qom/object.h |  1 +
>>>   qom/object.c         | 20 +++++++++++++++++---
>>>   2 files changed, 18 insertions(+), 3 deletions(-)
>> No, please---glibc should be fixed instead.
>
> The objects we allocate are sometimes as big as 70kB...

That's a rather pedestrian size, isn't it?  If malloc() sucks at
managing such sizes, it needs fixing very badly.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 09/15] loader: use mmap for ROMs
  2016-06-28 10:41   ` Paolo Bonzini
  2016-06-28 11:26     ` Peter Lieven
@ 2016-07-04  7:30     ` Peter Lieven
  1 sibling, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-07-04  7:30 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 28.06.2016 um 12:41 schrieb Paolo Bonzini:
>
> On 28/06/2016 11:01, Peter Lieven wrote:
>> a classic use for mmap here.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> They are never freed, why does mmap help?

Checked again: qemu_system_reset (which calls rom_reset) is invoked after pc_machine_init.
So the memory is indeed freed again.

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-06-30 14:12   ` Markus Armbruster
@ 2016-07-04  9:02     ` Paolo Bonzini
  2016-07-04 11:18       ` Markus Armbruster
  0 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2016-07-04  9:02 UTC (permalink / raw)
  To: Markus Armbruster, Peter Lieven
  Cc: kwolf, peter.maydell, mst, qemu-devel, mreitz, kraxel, dgilbert



On 30/06/2016 16:12, Markus Armbruster wrote:
> Implementing a stack as "big enough" array can be wasteful.
> Implementing it as dynamically allocated list is differently wasteful.
> Saving several mallocs and frees can be worth "wasting" a few pages of
> memory for a short time.

Most usage of QmpInputVisitor at startup comes from
object_property_set_qobject, which only sets small scalar objects.  The
stack is entirely unused in this case.

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-07-04  9:02     ` Paolo Bonzini
@ 2016-07-04 11:18       ` Markus Armbruster
  2016-07-04 11:36         ` Peter Lieven
  2016-07-04 11:42         ` Paolo Bonzini
  0 siblings, 2 replies; 78+ messages in thread
From: Markus Armbruster @ 2016-07-04 11:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Lieven, kwolf, peter.maydell, mst, qemu-devel, mreitz,
	kraxel, dgilbert

Paolo Bonzini <pbonzini@redhat.com> writes:

> On 30/06/2016 16:12, Markus Armbruster wrote:
>> Implementing a stack as "big enough" array can be wasteful.
>> Implementing it as dynamically allocated list is differently wasteful.
>> Saving several mallocs and frees can be worth "wasting" a few pages of
>> memory for a short time.
>
> Most usage of QmpInputVisitor at startup comes from
> object_property_set_qobject, which only sets small scalar objects.  The
> stack is entirely unused in this case.

A quick test run shows ~300 qmp_input_visitor_new() calls during
startup, with at most two alive at the same time.

Why would it matter whether these are in the order of 150 bytes or 25000
bytes each?  How could this materially impact RSS?

There's one type of waste here that I understand: we zero the whole
QmpInputVisitor on allocation.

I'm not opposed to changing how the stack is implemented; I just want to
first understand why the current implementation behaves badly (assuming
it does).
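
For concreteness, the two shapes under discussion look roughly like this;
the field names and the 1024-entry depth are assumptions for illustration,
not the actual QmpInputVisitor definition.

#include <stddef.h>

typedef struct StackEntry {
    void *obj;                   /* the value being visited */
    struct StackEntry *next;     /* used only by the list variant */
} StackEntry;

/* (a) "big enough" array: a single allocation, but the visitor itself
 * becomes tens of kilobytes, and allocating it with g_malloc0() zeroes
 * all of it even when the stack stays empty (the small-scalar case). */
typedef struct ArrayStackVisitor {
    StackEntry stack[1024];
    size_t nb_stack;
} ArrayStackVisitor;

/* (b) dynamically allocated list: the visitor stays around a hundred
 * bytes, but every push costs a malloc and every pop a free, which is
 * the "differently wasteful" option. */
typedef struct ListStackVisitor {
    StackEntry *top;
} ListStackVisitor;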

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-07-04 11:18       ` Markus Armbruster
@ 2016-07-04 11:36         ` Peter Lieven
  2016-07-04 11:42         ` Paolo Bonzini
  1 sibling, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-07-04 11:36 UTC (permalink / raw)
  To: Markus Armbruster, Paolo Bonzini
  Cc: kwolf, peter.maydell, mst, qemu-devel, mreitz, kraxel, dgilbert

Am 04.07.2016 um 13:18 schrieb Markus Armbruster:
> Paolo Bonzini <pbonzini@redhat.com> writes:
>
>> On 30/06/2016 16:12, Markus Armbruster wrote:
>>> Implementing a stack as "big enough" array can be wasteful.
>>> Implementing it as dynamically allocated list is differently wasteful.
>>> Saving several mallocs and frees can be worth "wasting" a few pages of
>>> memory for a short time.
>> Most usage of QmpInputVisitor at startup comes from
>> object_property_set_qobject, which only sets small scalar objects.  The
>> stack is entirely unused in this case.
> A quick test run shows ~300 qmp_input_visitor_new() calls during
> startup, with at most two alive at the same time.
>
> Why would it matter whether these are in the order of 150 bytes or 25000
> bytes each?  How could this materially impact RSS?
>
> There's one type of waste here that I understand: we zero the whole
> QmpInputVisitor on allocation.
>
> I'm not opposed to changing how the stack is implemented, I just want to
> first understand why the current implmementation behaves badly (assuming
> it does).

The history behind this is that I observed that the RSS usage of Qemu
has dramatically increased between Qemu 2.2.0 and 2.5.0. I could see this
very clearly since we use hugetlbfs everywhere, so I can clearly
distinguish Qemu memory from VM memory. After having bisected one increase
in RSS usage to the introduction of RCU, the theory came up that the memory
gets fragmented because alloc and dealloc patterns have changed. So I started
to trace all malloc calls above 4kB and started to use mmap wherever it
was possible.

To give you an idea of the difference I observed, here is an example.
I have a blade with 22 vServers running on it. Including the OS, the
allocated memory with current master is approximately 6.5GB. With current
master and the following environment variable set:

MALLOC_MMAP_THRESHOLD_=32768

the allocated memory stays at approximately 2GB.
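
For completeness: the same threshold can also be set from inside the
program with glibc's mallopt(). This is only a sketch of that alternative
(it assumes glibc and is not part of this series); note that setting the
threshold explicitly also switches off glibc's dynamic adjustment of it.

#include <malloc.h>

static void lower_mmap_threshold(void)
{
    /* requests of 32kB and above that cannot be satisfied from the free
     * list are then serviced with mmap instead of growing the brk heap */
    mallopt(M_MMAP_THRESHOLD, 32 * 1024);
}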

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor
  2016-07-04 11:18       ` Markus Armbruster
  2016-07-04 11:36         ` Peter Lieven
@ 2016-07-04 11:42         ` Paolo Bonzini
  1 sibling, 0 replies; 78+ messages in thread
From: Paolo Bonzini @ 2016-07-04 11:42 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Peter Lieven, kwolf, peter.maydell, mst, qemu-devel, mreitz,
	kraxel, dgilbert



On 04/07/2016 13:18, Markus Armbruster wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
> 
>> On 30/06/2016 16:12, Markus Armbruster wrote:
>>> Implementing a stack as "big enough" array can be wasteful.
>>> Implementing it as dynamically allocated list is differently wasteful.
>>> Saving several mallocs and frees can be worth "wasting" a few pages of
>>> memory for a short time.
>>
>> Most usage of QmpInputVisitor at startup comes from
>> object_property_set_qobject, which only sets small scalar objects.  The
>> stack is entirely unused in this case.
> 
> A quick test run shows ~300 qmp_input_visitor_new() calls during
> startup, with at most two alive at the same time.
> 
> Why would it matter whether these are in the order of 150 bytes or 25000
> bytes each?  How could this materially impact RSS?

I think we agree that it doesn't.  The question in the subthread is
whether we can improve QmpInputVisitor in general.

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes
  2016-06-28 10:43   ` Paolo Bonzini
  2016-06-28 10:48     ` Peter Lieven
@ 2016-07-11  9:31     ` Peter Lieven
  2016-07-11  9:44       ` Peter Lieven
  2016-07-11 10:37       ` Paolo Bonzini
  1 sibling, 2 replies; 78+ messages in thread
From: Peter Lieven @ 2016-07-11  9:31 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 28.06.2016 um 12:43 schrieb Paolo Bonzini:
>
> On 28/06/2016 11:01, Peter Lieven wrote:
>> this was causing serious framentation in conjunction with the
>> subpages since RCU was introduced. The node space was allocated
>> at approx 32kB then reallocted to approx 75kB and this a few hundred
>> times at startup. And thanks to RCU the freeing was delayed.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> The size of the node from the previous as->dispatch could be used as a
> hint for the new one perhaps, avoiding the reallocation?

This here seems also to work:

diff --git a/exec.c b/exec.c
index 0122ef7..2691c0a 100644
--- a/exec.c
+++ b/exec.c
@@ -187,10 +187,12 @@ struct CPUAddressSpace {

  static void phys_map_node_reserve(PhysPageMap *map, unsigned nodes)
  {
+    static unsigned alloc_hint = 16;
      if (map->nodes_nb + nodes > map->nodes_nb_alloc) {
-        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc * 2, 16);
+        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, alloc_hint);
          map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, map->nodes_nb + nodes);
          map->nodes = g_renew(Node, map->nodes, map->nodes_nb_alloc);
+        alloc_hint = map->nodes_nb_alloc;
      }
  }


Question is still, mmap for this?

Peter

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes
  2016-07-11  9:31     ` Peter Lieven
@ 2016-07-11  9:44       ` Peter Lieven
  2016-07-11 10:37       ` Paolo Bonzini
  1 sibling, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-07-11  9:44 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 11.07.2016 um 11:31 schrieb Peter Lieven:
> Am 28.06.2016 um 12:43 schrieb Paolo Bonzini:
>>
>> On 28/06/2016 11:01, Peter Lieven wrote:
>>> this was causing serious framentation in conjunction with the
>>> subpages since RCU was introduced. The node space was allocated
>>> at approx 32kB then reallocted to approx 75kB and this a few hundred
>>> times at startup. And thanks to RCU the freeing was delayed.
>>>
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> The size of the node from the previous as->dispatch could be used as a
>> hint for the new one perhaps, avoiding the reallocation?
>
> This here seems also to work:
>
> diff --git a/exec.c b/exec.c
> index 0122ef7..2691c0a 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -187,10 +187,12 @@ struct CPUAddressSpace {
>
>  static void phys_map_node_reserve(PhysPageMap *map, unsigned nodes)
>  {
> +    static unsigned alloc_hint = 16;
>      if (map->nodes_nb + nodes > map->nodes_nb_alloc) {
> -        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc * 2, 16);
> +        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, alloc_hint);
>          map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, map->nodes_nb + nodes);
>          map->nodes = g_renew(Node, map->nodes, map->nodes_nb_alloc);
> +        alloc_hint = map->nodes_nb_alloc;
>      }
>  }
>
>
> Question is still, mmap for this?

Side note: I added some counters and found that on my test system at maximum 453 allocations of 28 * sizeof(Node) bytes were active at
the same time. During runtime it's only 9. So this might explain why the alloc + realloc causes fragmentation of the brk heap.

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes
  2016-07-11  9:31     ` Peter Lieven
  2016-07-11  9:44       ` Peter Lieven
@ 2016-07-11 10:37       ` Paolo Bonzini
  2016-07-12 14:34         ` Peter Lieven
  1 sibling, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2016-07-11 10:37 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 11/07/2016 11:31, Peter Lieven wrote:
> Am 28.06.2016 um 12:43 schrieb Paolo Bonzini:
>>
>> On 28/06/2016 11:01, Peter Lieven wrote:
>>> this was causing serious framentation in conjunction with the
>>> subpages since RCU was introduced. The node space was allocated
>>> at approx 32kB then reallocted to approx 75kB and this a few hundred
>>> times at startup. And thanks to RCU the freeing was delayed.
>>>
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> The size of the node from the previous as->dispatch could be used as a
>> hint for the new one perhaps, avoiding the reallocation?
> 
> This here seems also to work:
> 
> diff --git a/exec.c b/exec.c
> index 0122ef7..2691c0a 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -187,10 +187,12 @@ struct CPUAddressSpace {
> 
>  static void phys_map_node_reserve(PhysPageMap *map, unsigned nodes)
>  {
> +    static unsigned alloc_hint = 16;
>      if (map->nodes_nb + nodes > map->nodes_nb_alloc) {
> -        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc * 2, 16);
> +        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, alloc_hint);
>          map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, map->nodes_nb +
> nodes);
>          map->nodes = g_renew(Node, map->nodes, map->nodes_nb_alloc);
> +        alloc_hint = map->nodes_nb_alloc;
>      }
>  }
> 
> 
> Question is still, mmap for this?

Nice!  Can you submit a patch for this?

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes
  2016-07-11 10:37       ` Paolo Bonzini
@ 2016-07-12 14:34         ` Peter Lieven
  2016-07-13 10:27           ` Paolo Bonzini
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Lieven @ 2016-07-12 14:34 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 11.07.2016 um 12:37 schrieb Paolo Bonzini:
>
> On 11/07/2016 11:31, Peter Lieven wrote:
>> Am 28.06.2016 um 12:43 schrieb Paolo Bonzini:
>>> On 28/06/2016 11:01, Peter Lieven wrote:
>>>> this was causing serious framentation in conjunction with the
>>>> subpages since RCU was introduced. The node space was allocated
>>>> at approx 32kB then reallocted to approx 75kB and this a few hundred
>>>> times at startup. And thanks to RCU the freeing was delayed.
>>>>
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>> The size of the node from the previous as->dispatch could be used as a
>>> hint for the new one perhaps, avoiding the reallocation?
>> This here seems also to work:
>>
>> diff --git a/exec.c b/exec.c
>> index 0122ef7..2691c0a 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -187,10 +187,12 @@ struct CPUAddressSpace {
>>
>>  static void phys_map_node_reserve(PhysPageMap *map, unsigned nodes)
>>  {
>> +    static unsigned alloc_hint = 16;
>>      if (map->nodes_nb + nodes > map->nodes_nb_alloc) {
>> -        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc * 2, 16);
>> +        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, alloc_hint);
>>          map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, map->nodes_nb +
>> nodes);
>>          map->nodes = g_renew(Node, map->nodes, map->nodes_nb_alloc);
>> +        alloc_hint = map->nodes_nb_alloc;
>>      }
>>  }
>>
>>
>> Question is still, mmap for this?
> Nice!  Can you submit a patch for this?

Of course, but please see my other comment. We should still consider mmap for this because we have close to 500 PhysPageMaps of about 70KB each
(roughly 35MB in total) which are all allocated at the same time - I think due to RCU.

Peter

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes
  2016-07-12 14:34         ` Peter Lieven
@ 2016-07-13 10:27           ` Paolo Bonzini
  2016-07-14 14:47             ` Peter Lieven
  0 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2016-07-13 10:27 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel



On 12/07/2016 16:34, Peter Lieven wrote:
> Am 11.07.2016 um 12:37 schrieb Paolo Bonzini:
>>
>> On 11/07/2016 11:31, Peter Lieven wrote:
>>> Am 28.06.2016 um 12:43 schrieb Paolo Bonzini:
>>>> On 28/06/2016 11:01, Peter Lieven wrote:
>>>>> this was causing serious framentation in conjunction with the
>>>>> subpages since RCU was introduced. The node space was allocated
>>>>> at approx 32kB then reallocted to approx 75kB and this a few hundred
>>>>> times at startup. And thanks to RCU the freeing was delayed.
>>>>>
>>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> The size of the node from the previous as->dispatch could be used as a
>>>> hint for the new one perhaps, avoiding the reallocation?
>>> This here seems also to work:
>>>
>>> diff --git a/exec.c b/exec.c
>>> index 0122ef7..2691c0a 100644
>>> --- a/exec.c
>>> +++ b/exec.c
>>> @@ -187,10 +187,12 @@ struct CPUAddressSpace {
>>>
>>>  static void phys_map_node_reserve(PhysPageMap *map, unsigned nodes)
>>>  {
>>> +    static unsigned alloc_hint = 16;
>>>      if (map->nodes_nb + nodes > map->nodes_nb_alloc) {
>>> -        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc * 2, 16);
>>> +        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, alloc_hint);
>>>          map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, map->nodes_nb +
>>> nodes);
>>>          map->nodes = g_renew(Node, map->nodes, map->nodes_nb_alloc);
>>> +        alloc_hint = map->nodes_nb_alloc;
>>>      }
>>>  }
>>>
>>>
>>> Question is still, mmap for this?
>> Nice!  Can you submit a patch for this?
> 
> Of course, but please see my other comment. We still should consider mmap for this cause we have close to 500 Physmaps about 70KB each which
> are all allocted at the same time - I think due to RCU.

That I think is not material for 2.7, and also I want to take a look at
creating nodes lazily.

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes
  2016-07-13 10:27           ` Paolo Bonzini
@ 2016-07-14 14:47             ` Peter Lieven
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Lieven @ 2016-07-14 14:47 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel

Am 13.07.2016 um 12:27 schrieb Paolo Bonzini:
>
> On 12/07/2016 16:34, Peter Lieven wrote:
>> Am 11.07.2016 um 12:37 schrieb Paolo Bonzini:
>>> On 11/07/2016 11:31, Peter Lieven wrote:
>>>> Am 28.06.2016 um 12:43 schrieb Paolo Bonzini:
>>>>> On 28/06/2016 11:01, Peter Lieven wrote:
>>>>>> this was causing serious framentation in conjunction with the
>>>>>> subpages since RCU was introduced. The node space was allocated
>>>>>> at approx 32kB then reallocted to approx 75kB and this a few hundred
>>>>>> times at startup. And thanks to RCU the freeing was delayed.
>>>>>>
>>>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>>> The size of the node from the previous as->dispatch could be used as a
>>>>> hint for the new one perhaps, avoiding the reallocation?
>>>> This here seems also to work:
>>>>
>>>> diff --git a/exec.c b/exec.c
>>>> index 0122ef7..2691c0a 100644
>>>> --- a/exec.c
>>>> +++ b/exec.c
>>>> @@ -187,10 +187,12 @@ struct CPUAddressSpace {
>>>>
>>>>   static void phys_map_node_reserve(PhysPageMap *map, unsigned nodes)
>>>>   {
>>>> +    static unsigned alloc_hint = 16;
>>>>       if (map->nodes_nb + nodes > map->nodes_nb_alloc) {
>>>> -        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc * 2, 16);
>>>> +        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, alloc_hint);
>>>>           map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, map->nodes_nb +
>>>> nodes);
>>>>           map->nodes = g_renew(Node, map->nodes, map->nodes_nb_alloc);
>>>> +        alloc_hint = map->nodes_nb_alloc;
>>>>       }
>>>>   }
>>>>
>>>>
>>>> Question is still, mmap for this?
>>> Nice!  Can you submit a patch for this?
>> Of course, but please see my other comment. We still should consider mmap for this cause we have close to 500 Physmaps about 70KB each which
>> are all allocted at the same time - I think due to RCU.
> That I think is not material for 2.7, and also I want to take a look at
> creating nodes lazily.

Okay, then I will send a patch to avoid the realloc and we can look at this later.

Peter


>
> Paolo


-- 

Mit freundlichen Grüßen

Peter Lieven

...........................................................

   KAMP Netzwerkdienste GmbH
   Vestische Str. 89-91 | 46117 Oberhausen
   Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
   pl@kamp.de | http://www.kamp.de

   Geschäftsführer: Heiner Lante | Michael Lante
   Amtsgericht Duisburg | HRB Nr. 12154
   USt-Id-Nr.: DE 120607556

...........................................................

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
                   ` (15 preceding siblings ...)
  2016-06-28 11:37 ` [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Paolo Bonzini
@ 2016-10-12 21:18 ` Michael R. Hines
  2016-10-18 10:47   ` Peter Lieven
  16 siblings, 1 reply; 78+ messages in thread
From: Michael R. Hines @ 2016-10-12 21:18 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel, pbonzini,
	blemasurier, patrick

Peter,

Greetings from DigitalOcean. We're experiencing the same symptoms 
without this patch.
We have, collectively, many gigabytes of un-planned-for RSS being used 
per-hypervisor
that we would like to get rid of =).

Without explicitly trying this patch (will do that ASAP), we immediately 
noticed that the
192MB mentioned immediately melts away (Yay) when we disabled the 
coroutine thread pool explicitly,
with another ~100MB in additional stack usage that would likely also go 
away if we
applied the entirety of your patch.

Is there any chance you have revisited this or have a timeline for it?

- Michael

/*
  * Michael R. Hines
  * Senior Engineer, DigitalOcean.
  */

On 06/28/2016 04:01 AM, Peter Lieven wrote:
> I recently found that Qemu is using several hundred megabytes of RSS memory
> more than older versions such as Qemu 2.2.0. So I started tracing
> memory allocation and found 2 major reasons for this.
>
> 1) We changed the qemu coroutine pool to have a per thread and a global release
>     pool. The choosen poolsize and the changed algorithm could lead to up to
>     192 free coroutines with just a single iothread. Each of the coroutines
>     in the pool each having 1MB of stack memory.
>
> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which lead to delayed freeing
>     of memory. This lead to higher heap allocations which could not effectively
>     be returned to kernel (most likely due to fragmentation).
>
> The following series is what I came up with. Beside the coroutine patches I changed
> some allocations to forcibly use mmap. All these allocations are not repeatly made
> during runtime so the impact of using mmap should be neglectible.
>
> There are still some big malloced allocations left which cannot be easily changed
> (e.g. the pixman buffers in VNC). So it might an idea to set a lower mmap threshold for
> malloc since this threshold seems to be in the order of several Megabytes on modern systems.
>
> Peter Lieven (15):
>    coroutine-ucontext: mmap stack memory
>    coroutine-ucontext: add a switch to monitor maximum stack size
>    coroutine-ucontext: reduce stack size to 64kB
>    coroutine: add a knob to disable the shared release pool
>    util: add a helper to mmap private anonymous memory
>    exec: use mmap for subpages
>    qapi: use mmap for QmpInputVisitor
>    virtio: use mmap for VirtQueue
>    loader: use mmap for ROMs
>    vmware_svga: use mmap for scratch pad
>    qom: use mmap for bigger Objects
>    util: add a function to realloc mmapped memory
>    exec: use mmap for PhysPageMap->nodes
>    vnc-tight: make the encoding palette static
>    vnc: use mmap for VncState
>
>   configure                 | 33 ++++++++++++++++++--
>   exec.c                    | 11 ++++---
>   hw/core/loader.c          | 16 +++++-----
>   hw/display/vmware_vga.c   |  3 +-
>   hw/virtio/virtio.c        |  5 +--
>   include/qemu/mmap-alloc.h |  7 +++++
>   include/qom/object.h      |  1 +
>   qapi/qmp-input-visitor.c  |  5 +--
>   qom/object.c              | 20 ++++++++++--
>   ui/vnc-enc-tight.c        | 21 ++++++-------
>   ui/vnc.c                  |  5 +--
>   ui/vnc.h                  |  1 +
>   util/coroutine-ucontext.c | 66 +++++++++++++++++++++++++++++++++++++--
>   util/mmap-alloc.c         | 27 ++++++++++++++++
>   util/qemu-coroutine.c     | 79 ++++++++++++++++++++++++++---------------------
>   15 files changed, 225 insertions(+), 75 deletions(-)
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 05/15] util: add a helper to mmap private anonymous memory
  2016-06-28  9:01 ` [Qemu-devel] [PATCH 05/15] util: add a helper to mmap private anonymous memory Peter Lieven
@ 2016-10-16  2:10   ` Michael S. Tsirkin
  2016-10-18 13:50     ` Alex Bennée
  0 siblings, 1 reply; 78+ messages in thread
From: Michael S. Tsirkin @ 2016-10-16  2:10 UTC (permalink / raw)
  To: Peter Lieven
  Cc: qemu-devel, kwolf, mreitz, pbonzini, dgilbert, peter.maydell, kraxel

On Tue, Jun 28, 2016 at 11:01:29AM +0200, Peter Lieven wrote:
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  include/qemu/mmap-alloc.h |  6 ++++++
>  util/mmap-alloc.c         | 17 +++++++++++++++++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
> index 0899b2f..a457721 100644
> --- a/include/qemu/mmap-alloc.h
> +++ b/include/qemu/mmap-alloc.h
> @@ -9,4 +9,10 @@ void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared);
>  
>  void qemu_ram_munmap(void *ptr, size_t size);
>  
> +/* qemu_anon_ram_mmap maps private anonymous memory using mmap and
> + * aborts if the allocation fails. its meant to act as an replacement
> + * for g_malloc0 and friends. */

This needs better docs. When should one use g_malloc0 and when
qemu_anon_ram_munmap?



> +void *qemu_anon_ram_mmap(size_t size);
> +void qemu_anon_ram_munmap(void *ptr, size_t size);
> +

The names are confusing - this isn't guest RAM, this is
just internal QEMU memory, isn't it?

Just rename it qemu_malloc0 then ...

>  #endif
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 629d97a..c099858 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -107,3 +107,20 @@ void qemu_ram_munmap(void *ptr, size_t size)
>          munmap(ptr, size + getpagesize());
>      }
>  }
> +
> +void *qemu_anon_ram_mmap(size_t size)
> +{
> +    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
> +                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +    if (ptr == MAP_FAILED) {
> +        abort();
> +    }
> +    return ptr;
> +}
> +
> +void qemu_anon_ram_munmap(void *ptr, size_t size)
> +{
> +    if (ptr) {
> +        munmap(ptr, size);
> +    }
> +}
> -- 
> 1.9.1

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-10-12 21:18 ` Michael R. Hines
@ 2016-10-18 10:47   ` Peter Lieven
  2016-10-19 17:40     ` Michael R. Hines
  2016-10-31 22:00     ` Michael R. Hines
  0 siblings, 2 replies; 78+ messages in thread
From: Peter Lieven @ 2016-10-18 10:47 UTC (permalink / raw)
  To: Michael R. Hines, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel, pbonzini,
	blemasurier, patrick

Am 12.10.2016 um 23:18 schrieb Michael R. Hines:
> Peter,
>
> Greetings from DigitalOcean. We're experiencing the same symptoms without this patch.
> We have, collectively, many gigabytes of un-planned-for RSS being used per-hypervisor
> that we would like to get rid of =).
>
> Without explicitly trying this patch (will do that ASAP), we immediately noticed that the
> 192MB mentioned immediately melts away (Yay) when we disabled the coroutine thread pool explicitly,
> with another ~100MB in additional stack usage that would likely also go away if we
> applied the entirety of your patch.
>
> Is there any chance you have revisited this or have a timeline for it?

Hi Michael,

the current master already includes some of the patches of this original series. There are still some changes left, but
what works for me is the current master +

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5816702..3eaef68 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -25,8 +25,6 @@ enum {
  };

  /** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
  static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
  static __thread unsigned int alloc_pool_size;
  static __thread Notifier coroutine_pool_cleanup_notifier;
@@ -49,20 +47,10 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
      if (CONFIG_COROUTINE_POOL) {
          co = QSLIST_FIRST(&alloc_pool);
          if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
- qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
-                }
-
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
-                co = QSLIST_FIRST(&alloc_pool);
+            /* Slow path; a good place to register the destructor, too.  */
+            if (!coroutine_pool_cleanup_notifier.notify) {
+                coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+ qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
              }
          }
          if (co) {
@@ -85,11 +73,6 @@ static void coroutine_delete(Coroutine *co)
      co->caller = NULL;

      if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
          if (alloc_pool_size < POOL_BATCH_SIZE) {
              QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
              alloc_pool_size++;

+ invoking qemu with the following environment variable set:

MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64 ....

The last one makes glibc automatically use mmap when the malloc'd memory exceeds 32kByte.

Hope this helps,
Peter

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 05/15] util: add a helper to mmap private anonymous memory
  2016-10-16  2:10   ` Michael S. Tsirkin
@ 2016-10-18 13:50     ` Alex Bennée
  0 siblings, 0 replies; 78+ messages in thread
From: Alex Bennée @ 2016-10-18 13:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Lieven, kwolf, peter.maydell, qemu-devel, dgilbert, kraxel,
	pbonzini, mreitz


Michael S. Tsirkin <mst@redhat.com> writes:

> On Tue, Jun 28, 2016 at 11:01:29AM +0200, Peter Lieven wrote:
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>>  include/qemu/mmap-alloc.h |  6 ++++++
>>  util/mmap-alloc.c         | 17 +++++++++++++++++
>>  2 files changed, 23 insertions(+)
>>
>> diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
>> index 0899b2f..a457721 100644
>> --- a/include/qemu/mmap-alloc.h
>> +++ b/include/qemu/mmap-alloc.h
>> @@ -9,4 +9,10 @@ void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared);
>>
>>  void qemu_ram_munmap(void *ptr, size_t size);
>>
>> +/* qemu_anon_ram_mmap maps private anonymous memory using mmap and
>> + * aborts if the allocation fails. its meant to act as an replacement
>> + * for g_malloc0 and friends. */
>
> This needs better docs. When should one use g_malloc0 and when
> qemu_anon_ram_munmap?

My concern is whether this breaks memory sanitizers, when we could just tweak
libc's allocation strategy to use mmap (which it should do for blocks
over a certain threshold).

>
>
>
>> +void *qemu_anon_ram_mmap(size_t size);
>> +void qemu_anon_ram_munmap(void *ptr, size_t size);
>> +
>
> The names are confusing - this isn't guest RAM, this is
> just internal QEMU memory, isn't it?
>
> Just rename it qemu_malloc0 then ...
>
>>  #endif
>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>> index 629d97a..c099858 100644
>> --- a/util/mmap-alloc.c
>> +++ b/util/mmap-alloc.c
>> @@ -107,3 +107,20 @@ void qemu_ram_munmap(void *ptr, size_t size)
>>          munmap(ptr, size + getpagesize());
>>      }
>>  }
>> +
>> +void *qemu_anon_ram_mmap(size_t size)
>> +{
>> +    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
>> +                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> +    if (ptr == MAP_FAILED) {
>> +        abort();
>> +    }
>> +    return ptr;
>> +}
>> +
>> +void qemu_anon_ram_munmap(void *ptr, size_t size)
>> +{
>> +    if (ptr) {
>> +        munmap(ptr, size);
>> +    }
>> +}
>> --
>> 1.9.1


--
Alex Bennée

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-10-18 10:47   ` Peter Lieven
@ 2016-10-19 17:40     ` Michael R. Hines
  2016-10-31 22:00     ` Michael R. Hines
  1 sibling, 0 replies; 78+ messages in thread
From: Michael R. Hines @ 2016-10-19 17:40 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel, pbonzini,
	blemasurier, patrick

Thank you for the response! I'll run off and test that. =)

/*
  * Michael R. Hines
  * Senior Engineer, DigitalOcean.
  */

On 10/18/2016 05:47 AM, Peter Lieven wrote:
> Am 12.10.2016 um 23:18 schrieb Michael R. Hines:
>> Peter,
>>
>> Greetings from DigitalOcean. We're experiencing the same symptoms 
>> without this patch.
>> We have, collectively, many gigabytes of un-planned-for RSS being 
>> used per-hypervisor
>> that we would like to get rid of =).
>>
>> Without explicitly trying this patch (will do that ASAP), we 
>> immediately noticed that the
>> 192MB mentioned immediately melts away (Yay) when we disabled the 
>> coroutine thread pool explicitly,
>> with another ~100MB in additional stack usage that would likely also 
>> go away if we
>> applied the entirety of your patch.
>>
>> Is there any chance you have revisited this or have a timeline for it?
>
> Hi Michael,
>
> the current master already includes some of the patches of this 
> original series. There are still some changes left, but
> what works for me is the current master +
>
> diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> index 5816702..3eaef68 100644
> --- a/util/qemu-coroutine.c
> +++ b/util/qemu-coroutine.c
> @@ -25,8 +25,6 @@ enum {
>  };
>
>  /** Free list to speed up creation */
> -static QSLIST_HEAD(, Coroutine) release_pool = 
> QSLIST_HEAD_INITIALIZER(pool);
> -static unsigned int release_pool_size;
>  static __thread QSLIST_HEAD(, Coroutine) alloc_pool = 
> QSLIST_HEAD_INITIALIZER(pool);
>  static __thread unsigned int alloc_pool_size;
>  static __thread Notifier coroutine_pool_cleanup_notifier;
> @@ -49,20 +47,10 @@ Coroutine *qemu_coroutine_create(CoroutineEntry 
> *entry)
>      if (CONFIG_COROUTINE_POOL) {
>          co = QSLIST_FIRST(&alloc_pool);
>          if (!co) {
> -            if (release_pool_size > POOL_BATCH_SIZE) {
> -                /* Slow path; a good place to register the 
> destructor, too.  */
> -                if (!coroutine_pool_cleanup_notifier.notify) {
> -                    coroutine_pool_cleanup_notifier.notify = 
> coroutine_pool_cleanup;
> - qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
> -                }
> -
> -                /* This is not exact; there could be a little skew 
> between
> -                 * release_pool_size and the actual size of 
> release_pool.  But
> -                 * it is just a heuristic, it does not need to be 
> perfect.
> -                 */
> -                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
> -                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
> -                co = QSLIST_FIRST(&alloc_pool);
> +            /* Slow path; a good place to register the destructor, 
> too.  */
> +            if (!coroutine_pool_cleanup_notifier.notify) {
> +                coroutine_pool_cleanup_notifier.notify = 
> coroutine_pool_cleanup;
> + qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
>              }
>          }
>          if (co) {
> @@ -85,11 +73,6 @@ static void coroutine_delete(Coroutine *co)
>      co->caller = NULL;
>
>      if (CONFIG_COROUTINE_POOL) {
> -        if (release_pool_size < POOL_BATCH_SIZE * 2) {
> -            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
> -            atomic_inc(&release_pool_size);
> -            return;
> -        }
>          if (alloc_pool_size < POOL_BATCH_SIZE) {
>              QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
>              alloc_pool_size++;
>
> + invoking qemu with the following environemnet variable set:
>
> MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64 ....
>
> The last one makes glibc automatically using mmap when the malloced 
> memory exceeds 32kByte.
>
> Hope this helps,
> Peter
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-10-18 10:47   ` Peter Lieven
  2016-10-19 17:40     ` Michael R. Hines
@ 2016-10-31 22:00     ` Michael R. Hines
  2016-11-01 22:02       ` Michael R. Hines
  1 sibling, 1 reply; 78+ messages in thread
From: Michael R. Hines @ 2016-10-31 22:00 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel, pbonzini,
	blemasurier, patrick

On 10/18/2016 05:47 AM, Peter Lieven wrote:
> Am 12.10.2016 um 23:18 schrieb Michael R. Hines:
>> Peter,
>>
>> Greetings from DigitalOcean. We're experiencing the same symptoms 
>> without this patch.
>> We have, collectively, many gigabytes of un-planned-for RSS being 
>> used per-hypervisor
>> that we would like to get rid of =).
>>
>> Without explicitly trying this patch (will do that ASAP), we 
>> immediately noticed that the
>> 192MB mentioned immediately melts away (Yay) when we disabled the 
>> coroutine thread pool explicitly,
>> with another ~100MB in additional stack usage that would likely also 
>> go away if we
>> applied the entirety of your patch.
>>
>> Is there any chance you have revisited this or have a timeline for it?
>
> Hi Michael,
>
> the current master already includes some of the patches of this 
> original series. There are still some changes left, but
> what works for me is the current master +
>
> diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> index 5816702..3eaef68 100644
> --- a/util/qemu-coroutine.c
> +++ b/util/qemu-coroutine.c
> @@ -25,8 +25,6 @@ enum {
>  };
>
>  /** Free list to speed up creation */
> -static QSLIST_HEAD(, Coroutine) release_pool = 
> QSLIST_HEAD_INITIALIZER(pool);
> -static unsigned int release_pool_size;
>  static __thread QSLIST_HEAD(, Coroutine) alloc_pool = 
> QSLIST_HEAD_INITIALIZER(pool);
>  static __thread unsigned int alloc_pool_size;
>  static __thread Notifier coroutine_pool_cleanup_notifier;
> @@ -49,20 +47,10 @@ Coroutine *qemu_coroutine_create(CoroutineEntry 
> *entry)
>      if (CONFIG_COROUTINE_POOL) {
>          co = QSLIST_FIRST(&alloc_pool);
>          if (!co) {
> -            if (release_pool_size > POOL_BATCH_SIZE) {
> -                /* Slow path; a good place to register the 
> destructor, too.  */
> -                if (!coroutine_pool_cleanup_notifier.notify) {
> -                    coroutine_pool_cleanup_notifier.notify = 
> coroutine_pool_cleanup;
> - qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
> -                }
> -
> -                /* This is not exact; there could be a little skew 
> between
> -                 * release_pool_size and the actual size of 
> release_pool.  But
> -                 * it is just a heuristic, it does not need to be 
> perfect.
> -                 */
> -                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
> -                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
> -                co = QSLIST_FIRST(&alloc_pool);
> +            /* Slow path; a good place to register the destructor, 
> too.  */
> +            if (!coroutine_pool_cleanup_notifier.notify) {
> +                coroutine_pool_cleanup_notifier.notify = 
> coroutine_pool_cleanup;
> + qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
>              }
>          }
>          if (co) {
> @@ -85,11 +73,6 @@ static void coroutine_delete(Coroutine *co)
>      co->caller = NULL;
>
>      if (CONFIG_COROUTINE_POOL) {
> -        if (release_pool_size < POOL_BATCH_SIZE * 2) {
> -            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
> -            atomic_inc(&release_pool_size);
> -            return;
> -        }
>          if (alloc_pool_size < POOL_BATCH_SIZE) {
>              QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
>              alloc_pool_size++;
>
> + invoking qemu with the following environemnet variable set:
>
> MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64 ....
>
> The last one makes glibc automatically using mmap when the malloced 
> memory exceeds 32kByte.
>

Peter,

I tested the above patch (and the environment variable); it doesn't
quite come close to as lean an RSS tally as the original patchset:
there's still about 70-80 MB of remaining RSS.

Any chance you could trim the remaining fat before merging this? =)


/*
  * Michael R. Hines
  * Senior Engineer, DigitalOcean.
  */

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
  2016-10-31 22:00     ` Michael R. Hines
@ 2016-11-01 22:02       ` Michael R. Hines
  0 siblings, 0 replies; 78+ messages in thread
From: Michael R. Hines @ 2016-11-01 22:02 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel
  Cc: kwolf, peter.maydell, mst, dgilbert, mreitz, kraxel, pbonzini,
	blemasurier, patrick

On 10/31/2016 05:00 PM, Michael R. Hines wrote:
> On 10/18/2016 05:47 AM, Peter Lieven wrote:
>> Am 12.10.2016 um 23:18 schrieb Michael R. Hines:
>>> Peter,
>>>
>>> Greetings from DigitalOcean. We're experiencing the same symptoms 
>>> without this patch.
>>> We have, collectively, many gigabytes of un-planned-for RSS being 
>>> used per-hypervisor
>>> that we would like to get rid of =).
>>>
>>> Without explicitly trying this patch (will do that ASAP), we 
>>> immediately noticed that the
>>> 192MB mentioned immediately melts away (Yay) when we disabled the 
>>> coroutine thread pool explicitly,
>>> with another ~100MB in additional stack usage that would likely also 
>>> go away if we
>>> applied the entirety of your patch.
>>>
>>> Is there any chance you have revisited this or have a timeline for it?
>>
>> Hi Michael,
>>
>> the current master already includes some of the patches of this 
>> original series. There are still some changes left, but
>> what works for me is the current master +
>>
>> diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
>> index 5816702..3eaef68 100644
>> --- a/util/qemu-coroutine.c
>> +++ b/util/qemu-coroutine.c
>> @@ -25,8 +25,6 @@ enum {
>>  };
>>
>>  /** Free list to speed up creation */
>> -static QSLIST_HEAD(, Coroutine) release_pool = 
>> QSLIST_HEAD_INITIALIZER(pool);
>> -static unsigned int release_pool_size;
>>  static __thread QSLIST_HEAD(, Coroutine) alloc_pool = 
>> QSLIST_HEAD_INITIALIZER(pool);
>>  static __thread unsigned int alloc_pool_size;
>>  static __thread Notifier coroutine_pool_cleanup_notifier;
>> @@ -49,20 +47,10 @@ Coroutine *qemu_coroutine_create(CoroutineEntry 
>> *entry)
>>      if (CONFIG_COROUTINE_POOL) {
>>          co = QSLIST_FIRST(&alloc_pool);
>>          if (!co) {
>> -            if (release_pool_size > POOL_BATCH_SIZE) {
>> -                /* Slow path; a good place to register the 
>> destructor, too.  */
>> -                if (!coroutine_pool_cleanup_notifier.notify) {
>> -                    coroutine_pool_cleanup_notifier.notify = 
>> coroutine_pool_cleanup;
>> - qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
>> -                }
>> -
>> -                /* This is not exact; there could be a little skew 
>> between
>> -                 * release_pool_size and the actual size of 
>> release_pool.  But
>> -                 * it is just a heuristic, it does not need to be 
>> perfect.
>> -                 */
>> -                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
>> -                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
>> -                co = QSLIST_FIRST(&alloc_pool);
>> +            /* Slow path; a good place to register the destructor, 
>> too.  */
>> +            if (!coroutine_pool_cleanup_notifier.notify) {
>> +                coroutine_pool_cleanup_notifier.notify = 
>> coroutine_pool_cleanup;
>> + qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
>>              }
>>          }
>>          if (co) {
>> @@ -85,11 +73,6 @@ static void coroutine_delete(Coroutine *co)
>>      co->caller = NULL;
>>
>>      if (CONFIG_COROUTINE_POOL) {
>> -        if (release_pool_size < POOL_BATCH_SIZE * 2) {
>> -            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
>> -            atomic_inc(&release_pool_size);
>> -            return;
>> -        }
>>          if (alloc_pool_size < POOL_BATCH_SIZE) {
>>              QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
>>              alloc_pool_size++;
>>
>> + invoking qemu with the following environemnet variable set:
>>
>> MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64 ....
>>
>> The last one makes glibc automatically using mmap when the malloced 
>> memory exceeds 32kByte.
>>
>
> Peter,
>
> I tested the above patch (and the environment variable --- it doesn't 
> quite come close to as lean of
> an RSS tally as the original patchset -------- there's still about 
> 70-80 MB of remaining RSS.
>
> Any chance you could trim the remaining fat before merging this? =)
>
>

False alarm! I didn't set the MMAP threshold low enough. Now the results
are on par with the other patchset.

Thank you!

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2016-11-01 22:02 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-28  9:01 [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Peter Lieven
2016-06-28  9:01 ` [Qemu-devel] [PATCH 01/15] coroutine-ucontext: mmap stack memory Peter Lieven
2016-06-28 10:02   ` Peter Maydell
2016-06-28 10:21     ` Peter Lieven
2016-06-28 11:04   ` Paolo Bonzini
2016-06-28  9:01 ` [Qemu-devel] [PATCH 02/15] coroutine-ucontext: add a switch to monitor maximum stack size Peter Lieven
2016-06-28  9:01 ` [Qemu-devel] [PATCH 03/15] coroutine-ucontext: reduce stack size to 64kB Peter Lieven
2016-06-28 10:54   ` Paolo Bonzini
2016-06-28 10:57     ` Dr. David Alan Gilbert
2016-06-28 11:17       ` Peter Lieven
2016-06-28 11:35         ` Dr. David Alan Gilbert
2016-06-28 12:09           ` Peter Lieven
2016-06-28 14:20             ` Dr. David Alan Gilbert
2016-06-30  6:34               ` Peter Lieven
2016-06-28 11:13     ` Peter Lieven
2016-06-28 11:26       ` Paolo Bonzini
2016-06-28  9:01 ` [Qemu-devel] [PATCH 04/15] coroutine: add a knob to disable the shared release pool Peter Lieven
2016-06-28 10:41   ` Paolo Bonzini
2016-06-28 10:47     ` Peter Lieven
2016-06-28  9:01 ` [Qemu-devel] [PATCH 05/15] util: add a helper to mmap private anonymous memory Peter Lieven
2016-10-16  2:10   ` Michael S. Tsirkin
2016-10-18 13:50     ` Alex Bennée
2016-06-28  9:01 ` [Qemu-devel] [PATCH 06/15] exec: use mmap for subpages Peter Lieven
2016-06-28 10:48   ` Paolo Bonzini
2016-06-28  9:01 ` [Qemu-devel] [PATCH 07/15] qapi: use mmap for QmpInputVisitor Peter Lieven
2016-06-28  9:29   ` Dr. David Alan Gilbert
2016-06-28  9:39     ` Peter Lieven
2016-06-28 10:10       ` Daniel P. Berrange
2016-06-28 10:17         ` Dr. David Alan Gilbert
2016-06-28 10:21           ` Daniel P. Berrange
2016-06-28 14:10           ` Eric Blake
2016-06-28 11:36   ` Paolo Bonzini
2016-06-28 14:14     ` Eric Blake
2016-06-30 14:12   ` Markus Armbruster
2016-07-04  9:02     ` Paolo Bonzini
2016-07-04 11:18       ` Markus Armbruster
2016-07-04 11:36         ` Peter Lieven
2016-07-04 11:42         ` Paolo Bonzini
2016-06-28  9:01 ` [Qemu-devel] [PATCH 08/15] virtio: use mmap for VirtQueue Peter Lieven
2016-06-28  9:01 ` [Qemu-devel] [PATCH 09/15] loader: use mmap for ROMs Peter Lieven
2016-06-28 10:41   ` Paolo Bonzini
2016-06-28 11:26     ` Peter Lieven
2016-07-04  7:30     ` Peter Lieven
2016-06-28  9:01 ` [Qemu-devel] [PATCH 10/15] vmware_svga: use mmap for scratch pad Peter Lieven
2016-06-28  9:01 ` [Qemu-devel] [PATCH 11/15] qom: use mmap for bigger Objects Peter Lieven
2016-06-28 10:08   ` Daniel P. Berrange
2016-06-28 10:10   ` Peter Maydell
2016-06-28 10:19     ` Peter Lieven
2016-06-28 10:42   ` Paolo Bonzini
2016-06-28 10:49     ` Peter Lieven
2016-06-30 14:15       ` Markus Armbruster
2016-06-28  9:01 ` [Qemu-devel] [PATCH 12/15] util: add a function to realloc mmapped memory Peter Lieven
2016-06-28  9:01 ` [Qemu-devel] [PATCH 13/15] exec: use mmap for PhysPageMap->nodes Peter Lieven
2016-06-28 10:43   ` Paolo Bonzini
2016-06-28 10:48     ` Peter Lieven
2016-07-11  9:31     ` Peter Lieven
2016-07-11  9:44       ` Peter Lieven
2016-07-11 10:37       ` Paolo Bonzini
2016-07-12 14:34         ` Peter Lieven
2016-07-13 10:27           ` Paolo Bonzini
2016-07-14 14:47             ` Peter Lieven
2016-06-28  9:01 ` [Qemu-devel] [PATCH 14/15] vnc-tight: make the encoding palette static Peter Lieven
2016-06-28 11:12   ` Paolo Bonzini
2016-06-28 11:18     ` Peter Lieven
2016-06-28  9:01 ` [Qemu-devel] [PATCH 15/15] vnc: use mmap for VncState Peter Lieven
2016-06-28 11:37 ` [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage Paolo Bonzini
2016-06-28 12:14   ` Peter Lieven
2016-06-28 12:29     ` Paolo Bonzini
2016-06-28 12:33       ` Peter Lieven
2016-06-28 12:56         ` Paolo Bonzini
2016-06-28 12:56         ` Dr. David Alan Gilbert
2016-06-28 14:43           ` Peter Lieven
2016-06-28 14:52             ` Peter Lieven
2016-10-12 21:18 ` Michael R. Hines
2016-10-18 10:47   ` Peter Lieven
2016-10-19 17:40     ` Michael R. Hines
2016-10-31 22:00     ` Michael R. Hines
2016-11-01 22:02       ` Michael R. Hines
