* [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
@ 2017-11-24  6:30 Yang Zhong
  2017-11-24 11:27 ` Stefan Hajnoczi
  2017-11-26  6:17 ` Shannon Zhao
  0 siblings, 2 replies; 22+ messages in thread
From: Yang Zhong @ 2017-11-24  6:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: stefanha, berrange, pbonzini, yang.zhong, stone.xulei,
	arei.gonglei, wangxinxin.wang, weidong.huang,
	zhang.zhanghailiang, liujunjie23

Due to issues in glibc's alloc/free mechanism for small
chunks of memory, if QEMU frequently allocates and frees
small chunks, glibc does not reuse them from its free
lists and keeps allocating from the OS, which makes the
heap size grow bigger and bigger.

This patch introduces malloc_trim(), which returns free
heap memory to the OS.

Below are test results from smaps file.
(1)without patch
55f0783e1000-55f07992a000 rw-p 00000000 00:00 0  [heap]
Size:              21796 kB
Rss:               14260 kB
Pss:               14260 kB

(2)with patch
55cc5fadf000-55cc61008000 rw-p 00000000 00:00 0  [heap]
Size:              21668 kB
Rss:                6940 kB
Pss:                6940 kB
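For reference, the Size/Rss/Pss figures above were read from /proc/<pid>/smaps; a small helper along these lines (hypothetical, not part of the patch) extracts the heap mapping's stats from a file in smaps format:

```shell
# Hypothetical helper (not part of the patch): print the Size/Rss/Pss
# fields of the [heap] mapping from a file in smaps format.
heap_stats() {
    awk '
        /\[heap\]/                   { inheap = 1; next }  # heap header line
        inheap && /^[0-9a-f]+-/      { exit }              # next mapping begins
        inheap && /^(Size|Rss|Pss):/ { print }
    ' "$1"
}

# e.g. heap_stats /proc/"$(pidof qemu-system-x86_64)"/smaps
```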

Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 configure  | 29 +++++++++++++++++++++++++++++
 util/rcu.c |  6 ++++++
 2 files changed, 35 insertions(+)

diff --git a/configure b/configure
index 0c6e757..6292ab0 100755
--- a/configure
+++ b/configure
@@ -426,6 +426,7 @@ vxhs=""
 supported_cpu="no"
 supported_os="no"
 bogus_os="no"
+malloc_trim="yes"
 
 # parse CC options first
 for opt do
@@ -3857,6 +3858,30 @@ if test "$tcmalloc" = "yes" && test "$jemalloc" = "yes" ; then
     exit 1
 fi
 
+# Even if malloc_trim() is available, these non-libc memory allocators
+# do not support it.
+if test "$tcmalloc" = "yes" || test "$jemalloc" = "yes" ; then
+    if test "$malloc_trim" = "yes" ; then
+        echo "Disabling malloc_trim with non-libc memory allocator"
+    fi
+    malloc_trim="no"
+fi
+
+#######################################
+# malloc_trim
+
+if test "$malloc_trim" != "no" ; then
+    cat > $TMPC << EOF
+#include <malloc.h>
+int main(void) { malloc_trim(0); return 0; }
+EOF
+    if compile_prog "" "" ; then
+        malloc_trim="yes"
+    else
+        malloc_trim="no"
+    fi
+fi
+
 ##########################################
 # tcmalloc probe
 
@@ -6012,6 +6037,10 @@ if test "$opengl" = "yes" ; then
   fi
 fi
 
+if test "$malloc_trim" = "yes" ; then
+  echo "CONFIG_MALLOC_TRIM=y" >> $config_host_mak
+fi
+
 if test "$avx2_opt" = "yes" ; then
   echo "CONFIG_AVX2_OPT=y" >> $config_host_mak
 fi
diff --git a/util/rcu.c b/util/rcu.c
index ca5a63e..f403b77 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -32,6 +32,9 @@
 #include "qemu/atomic.h"
 #include "qemu/thread.h"
 #include "qemu/main-loop.h"
+#if defined(CONFIG_MALLOC_TRIM)
+#include <malloc.h>
+#endif
 
 /*
  * Global grace period counter.  Bit 0 is always one in rcu_gp_ctr.
@@ -272,6 +275,9 @@ static void *call_rcu_thread(void *opaque)
             node->func(node);
         }
         qemu_mutex_unlock_iothread();
+#if defined(CONFIG_MALLOC_TRIM)
+        malloc_trim(4 * 1024 * 1024);
+#endif
     }
     abort();
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-11-24  6:30 [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim() Yang Zhong
@ 2017-11-24 11:27 ` Stefan Hajnoczi
  2017-11-26  6:17 ` Shannon Zhao
  1 sibling, 0 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2017-11-24 11:27 UTC (permalink / raw)
  To: Yang Zhong
  Cc: qemu-devel, berrange, pbonzini, stone.xulei, arei.gonglei,
	wangxinxin.wang, weidong.huang, zhang.zhanghailiang, liujunjie23


On Fri, Nov 24, 2017 at 02:30:30PM +0800, Yang Zhong wrote:
> diff --git a/configure b/configure
> index 0c6e757..6292ab0 100755
> --- a/configure
> +++ b/configure
> @@ -426,6 +426,7 @@ vxhs=""
>  supported_cpu="no"
>  supported_os="no"
>  bogus_os="no"
> +malloc_trim="yes"

Looks pretty good, sorry I forgot to mention two things:

Please add the --enable-malloc-trim/--disable-malloc-trim options so
it's easy to build QEMU with or without this feature.  For example, if
someone is debugging a performance issue they may wish to rebuild with
--disable-malloc-trim to confirm that trimming hasn't caused a
regression.

Also please change this line to malloc_trim="" so the "Disabling
malloc_trim with non-libc memory allocator" error message is only
printed when --enable-malloc-trim was explicitly given by the user.
Otherwise the message is always printed when QEMU is built with
jemalloc/tcmalloc - that's too noisy.
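A sketch of the three-state logic being suggested (illustrative only, not the actual QEMU configure script; only the option names come from this review):

```shell
# Illustrative sketch: default to empty ("auto-detect"), so the warning
# below only fires when the user explicitly asked for trimming.
malloc_trim=""

parse_opt() {
    case "$1" in
        --enable-malloc-trim)  malloc_trim="yes" ;;
        --disable-malloc-trim) malloc_trim="no"  ;;
    esac
}

# Example: the user explicitly enabled trimming but also chose tcmalloc.
parse_opt --enable-malloc-trim
tcmalloc="yes"

warning=""
if test "$tcmalloc" = "yes" ; then
    if test "$malloc_trim" = "yes" ; then
        # Explicit request conflicts with the allocator: warn loudly.
        warning="Disabling malloc_trim with non-libc memory allocator"
        echo "$warning"
    fi
    malloc_trim="no"
fi
```

With the empty default, an auto-detected build with jemalloc/tcmalloc disables trimming silently; only an explicit `--enable-malloc-trim` produces the message.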



* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-11-24  6:30 [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim() Yang Zhong
  2017-11-24 11:27 ` Stefan Hajnoczi
@ 2017-11-26  6:17 ` Shannon Zhao
  2017-11-27  3:06   ` Zhong Yang
       [not found]   ` <20171201105622.GB26237@yangzhon-Virtual>
  1 sibling, 2 replies; 22+ messages in thread
From: Shannon Zhao @ 2017-11-26  6:17 UTC (permalink / raw)
  To: Yang Zhong, qemu-devel
  Cc: zhang.zhanghailiang, liujunjie23, wangxinxin.wang, stone.xulei,
	arei.gonglei, stefanha, pbonzini, weidong.huang

Hi,

On 2017/11/24 14:30, Yang Zhong wrote:
> Since there are some issues in memory alloc/free machenism
> in glibc for little chunk memory, if Qemu frequently
> alloc/free little chunk memory, the glibc doesn't alloc
> little chunk memory from free list of glibc and still
> allocate from OS, which make the heap size bigger and bigger.
> 
> This patch introduce malloc_trim(), which will free heap memory.
> 
> Below are test results from smaps file.
> (1)without patch
> 55f0783e1000-55f07992a000 rw-p 00000000 00:00 0  [heap]
> Size:              21796 kB
> Rss:               14260 kB
> Pss:               14260 kB
> 
> (2)with patch
> 55cc5fadf000-55cc61008000 rw-p 00000000 00:00 0  [heap]
> Size:              21668 kB
> Rss:                6940 kB
> Pss:                6940 kB
> 
> Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> ---
>  configure  | 29 +++++++++++++++++++++++++++++
>  util/rcu.c |  6 ++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/configure b/configure
> index 0c6e757..6292ab0 100755
> --- a/configure
> +++ b/configure
> @@ -426,6 +426,7 @@ vxhs=""
>  supported_cpu="no"
>  supported_os="no"
>  bogus_os="no"
> +malloc_trim="yes"
>  
>  # parse CC options first
>  for opt do
> @@ -3857,6 +3858,30 @@ if test "$tcmalloc" = "yes" && test "$jemalloc" = "yes" ; then
>      exit 1
>  fi
>  
> +# Even if malloc_trim() is available, these non-libc memory allocators
> +# do not support it.
> +if test "$tcmalloc" = "yes" || test "$jemalloc" = "yes" ; then
> +    if test "$malloc_trim" = "yes" ; then
> +        echo "Disabling malloc_trim with non-libc memory allocator"
> +    fi
> +    malloc_trim="no"
> +fi
> +
> +#######################################
> +# malloc_trim
> +
> +if test "$malloc_trim" != "no" ; then
> +    cat > $TMPC << EOF
> +#include <malloc.h>
> +int main(void) { malloc_trim(0); return 0; }
> +EOF
> +    if compile_prog "" "" ; then
> +        malloc_trim="yes"
> +    else
> +        malloc_trim="no"
> +    fi
> +fi
> +
>  ##########################################
>  # tcmalloc probe
>  
> @@ -6012,6 +6037,10 @@ if test "$opengl" = "yes" ; then
>    fi
>  fi
>  
> +if test "$malloc_trim" = "yes" ; then
> +  echo "CONFIG_MALLOC_TRIM=y" >> $config_host_mak
> +fi
> +
>  if test "$avx2_opt" = "yes" ; then
>    echo "CONFIG_AVX2_OPT=y" >> $config_host_mak
>  fi
> diff --git a/util/rcu.c b/util/rcu.c
> index ca5a63e..f403b77 100644
> --- a/util/rcu.c
> +++ b/util/rcu.c
> @@ -32,6 +32,9 @@
>  #include "qemu/atomic.h"
>  #include "qemu/thread.h"
>  #include "qemu/main-loop.h"
> +#if defined(CONFIG_MALLOC_TRIM)
> +#include <malloc.h>
> +#endif
>  
>  /*
>   * Global grace period counter.  Bit 0 is always one in rcu_gp_ctr.
> @@ -272,6 +275,9 @@ static void *call_rcu_thread(void *opaque)
>              node->func(node);
>          }
>          qemu_mutex_unlock_iothread();
> +#if defined(CONFIG_MALLOC_TRIM)
> +        malloc_trim(4 * 1024 * 1024);
> +#endif
>      }
>      abort();
>  }
> 

Looks like this patch introduces a performance regression: with it applied,
the time to boot a VM with 60 scsi disks on ARM64 increases by 200+ seconds.

Thanks,
-- 
Shannon


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-11-26  6:17 ` Shannon Zhao
@ 2017-11-27  3:06   ` Zhong Yang
  2017-11-27 11:59     ` Paolo Bonzini
       [not found]   ` <20171201105622.GB26237@yangzhon-Virtual>
  1 sibling, 1 reply; 22+ messages in thread
From: Zhong Yang @ 2017-11-27  3:06 UTC (permalink / raw)
  To: Shannon Zhao
  Cc: qemu-devel, pbonzini, weidong.huang, arei.gonglei, liujunjie23,
	wangxinxin.wang, stone.xulei, zhang.zhanghailiang, stefanha,
	berrange, yang.zhong

On Sun, Nov 26, 2017 at 02:17:18PM +0800, Shannon Zhao wrote:
> Hi,
> 
> On 2017/11/24 14:30, Yang Zhong wrote:
> > Since there are some issues in memory alloc/free machenism
> > in glibc for little chunk memory, if Qemu frequently
> > alloc/free little chunk memory, the glibc doesn't alloc
> > little chunk memory from free list of glibc and still
> > allocate from OS, which make the heap size bigger and bigger.
> > 
> > This patch introduce malloc_trim(), which will free heap memory.
> > 
> > Below are test results from smaps file.
> > (1)without patch
> > 55f0783e1000-55f07992a000 rw-p 00000000 00:00 0  [heap]
> > Size:              21796 kB
> > Rss:               14260 kB
> > Pss:               14260 kB
> > 
> > (2)with patch
> > 55cc5fadf000-55cc61008000 rw-p 00000000 00:00 0  [heap]
> > Size:              21668 kB
> > Rss:                6940 kB
> > Pss:                6940 kB
> > 
> > Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> > ---
> >  configure  | 29 +++++++++++++++++++++++++++++
> >  util/rcu.c |  6 ++++++
> >  2 files changed, 35 insertions(+)
> > 
> > diff --git a/configure b/configure
> > index 0c6e757..6292ab0 100755
> > --- a/configure
> > +++ b/configure
> > @@ -426,6 +426,7 @@ vxhs=""
> >  supported_cpu="no"
> >  supported_os="no"
> >  bogus_os="no"
> > +malloc_trim="yes"
> >  
> >  # parse CC options first
> >  for opt do
> > @@ -3857,6 +3858,30 @@ if test "$tcmalloc" = "yes" && test "$jemalloc" = "yes" ; then
> >      exit 1
> >  fi
> >  
> > +# Even if malloc_trim() is available, these non-libc memory allocators
> > +# do not support it.
> > +if test "$tcmalloc" = "yes" || test "$jemalloc" = "yes" ; then
> > +    if test "$malloc_trim" = "yes" ; then
> > +        echo "Disabling malloc_trim with non-libc memory allocator"
> > +    fi
> > +    malloc_trim="no"
> > +fi
> > +
> > +#######################################
> > +# malloc_trim
> > +
> > +if test "$malloc_trim" != "no" ; then
> > +    cat > $TMPC << EOF
> > +#include <malloc.h>
> > +int main(void) { malloc_trim(0); return 0; }
> > +EOF
> > +    if compile_prog "" "" ; then
> > +        malloc_trim="yes"
> > +    else
> > +        malloc_trim="no"
> > +    fi
> > +fi
> > +
> >  ##########################################
> >  # tcmalloc probe
> >  
> > @@ -6012,6 +6037,10 @@ if test "$opengl" = "yes" ; then
> >    fi
> >  fi
> >  
> > +if test "$malloc_trim" = "yes" ; then
> > +  echo "CONFIG_MALLOC_TRIM=y" >> $config_host_mak
> > +fi
> > +
> >  if test "$avx2_opt" = "yes" ; then
> >    echo "CONFIG_AVX2_OPT=y" >> $config_host_mak
> >  fi
> > diff --git a/util/rcu.c b/util/rcu.c
> > index ca5a63e..f403b77 100644
> > --- a/util/rcu.c
> > +++ b/util/rcu.c
> > @@ -32,6 +32,9 @@
> >  #include "qemu/atomic.h"
> >  #include "qemu/thread.h"
> >  #include "qemu/main-loop.h"
> > +#if defined(CONFIG_MALLOC_TRIM)
> > +#include <malloc.h>
> > +#endif
> >  
> >  /*
> >   * Global grace period counter.  Bit 0 is always one in rcu_gp_ctr.
> > @@ -272,6 +275,9 @@ static void *call_rcu_thread(void *opaque)
> >              node->func(node);
> >          }
> >          qemu_mutex_unlock_iothread();
> > +#if defined(CONFIG_MALLOC_TRIM)
> > +        malloc_trim(4 * 1024 * 1024);
> > +#endif
> >      }
> >      abort();
> >  }
> > 
> 
> Looks like this patch introduces a performance regression. With this
> patch the time of booting a VM with 60 scsi disks on ARM64 is increased
> by 200+ seconds.
> 
  Hello Shannon,

  Thanks for your reply!
  To address your concern, I ran comparative VM boot-up tests; the results
  are below:

  #test command
  ./qemu-system-x86_64 -enable-kvm -cpu host -m 2G -smp cpus=4,cores=4,\
                       threads=1,sockets=1 -drive format=raw,\
                       file=test.img,index=0,media=disk -nographic

  #without patch
  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.979s (kernel) + 1.214s (userspace) = 6.193s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.922s (kernel) + 1.175s (userspace) = 6.097s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.990s (kernel) + 1.301s (userspace) = 6.291s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 5.063s (kernel) + 1.336s (userspace) = 6.400s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.820s (kernel) + 1.237s (userspace) = 6.057s

  avg: kernel 4.9548s, userspace 1.2526s


  #with this patch
  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 5.099s (kernel) + 1.579s (userspace) = 6.679s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 5.003s (kernel) + 1.343s (userspace) = 6.347s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.853s (kernel) + 1.220s (userspace) = 6.074s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.836s (kernel) + 1.111s (userspace) = 5.948s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.917s (kernel) + 1.166s (userspace) = 6.083s

  avg: kernel 4.9416s, userspace 1.2838s

  From the above test results, there is almost no performance regression
  on the x86 platform. Sorry, I have no ARM-based platform at hand, so I
  cannot provide the corresponding data. Thanks!

  Regards,

  Yang


> Thanks,
> -- 
> Shannon


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-11-27  3:06   ` Zhong Yang
@ 2017-11-27 11:59     ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2017-11-27 11:59 UTC (permalink / raw)
  To: Zhong Yang, Shannon Zhao
  Cc: qemu-devel, weidong.huang, arei.gonglei, liujunjie23,
	wangxinxin.wang, stone.xulei, zhang.zhanghailiang, stefanha,
	berrange

On 27/11/2017 04:06, Zhong Yang wrote:
>   #test command
>   ./qemu-system-x86_64 -enable-kvm -cpu host -m 2G -smp cpus=4,cores=4,\
>                        threads=1,sockets=1 -drive format=raw,\
>                        file=test.img,index=0,media=disk -nographic
> 
>   #without patch
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.979s (kernel) + 1.214s (userspace) = 6.193s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.922s (kernel) + 1.175s (userspace) = 6.097s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.990s (kernel) + 1.301s (userspace) = 6.291s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 5.063s (kernel) + 1.336s (userspace) = 6.400s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.820s (kernel) + 1.237s (userspace) = 6.057s
> 
>   avg: kernel 4.9548, userspace 1.2526
> 
> 
>   #with this patch
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 5.099s (kernel) + 1.579s (userspace) = 6.679s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 5.003s (kernel) + 1.343s (userspace) = 6.347s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.853s (kernel) + 1.220s (userspace) = 6.074s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.836s (kernel) + 1.111s (userspace) = 5.948s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.917s (kernel) + 1.166s (userspace) = 6.083s
> 
>   avg: kernel 4.9416s, userspace: 1.2838
> 
>   From above test results, there are almost not any performance regression
>   on x86 platform. Sorry, there is not any ARM based platform in my hand,
>   i can't give related datas.  thanks!

You are using only one disk, while Shannon is using 60. That may make a
difference, as PCI BAR setup in the guest becomes very expensive as you
add more devices.

Thanks,

Paolo


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
       [not found]     ` <74cccd14-e485-90d4-82d9-03355c05faca@redhat.com>
@ 2017-12-04 12:03       ` Yang Zhong
  2017-12-04 12:07         ` Daniel P. Berrange
  2017-12-04 12:26         ` Shannon Zhao
  0 siblings, 2 replies; 22+ messages in thread
From: Yang Zhong @ 2017-12-04 12:03 UTC (permalink / raw)
  To: Paolo Bonzini, zhaoshenglong, stefanha
  Cc: qemu-devel, weidong.huang, arei.gonglei, liujunjie23,
	wangxinxin.wang, stone.xulei, zhang.zhanghailiang, berrange

On Fri, Dec 01, 2017 at 01:52:49PM +0100, Paolo Bonzini wrote:
> On 01/12/2017 11:56, Yang Zhong wrote:
> >   This issue should be caused by much times of system call by malloc_trim(),
> >   Shannon's test script include 60 scsi disks and 31 ioh3420 devices. We need 
> >   trade-off between VM perforamance and memory optimization. Whether below 
> >   method is suitable?
> > 
> >   int num=1;
> >   ......
> > 
> >   #if defined(CONFIG_MALLOC_TRIM)
> >         if(!(num++%5))
> >         {
> >              malloc_trim(4 * 1024 * 1024);
> >         }
> >   #endif
> >  
> >   Any comments are welcome! Thanks a lot!
> 
> Indeed something like this will do, perhaps only trim once per second?
> 
  Hello Paolo,

  Thanks for the comments!
  If we trim once per second, the frequency may still be a little high; what's
  more, we would need to maintain a timer to drive it, which also costs CPU
  resources.

  I added logging and ran a test here with my QEMU command line: during VM
  boot-up, the rcu thread performed more than 600 free operations but only 9
  memory trims. With our Clear Containers QEMU command line, the trim count
  drops to 6. With Shannon's test command, the number of malloc_trim() calls
  will certainly increase.

  With the method above, the trim is only executed on every fifth callback,
  which reduces the number of trims without heavily impacting VM boot-up
  performance.

  I also considered replacing call_rcu() with synchronize_rcu() and free(),
  but that serializes malloc() and free(), which would reduce VM performance.

  The ultimate aim is to reduce trim system calls during VM boot-up and at
  runtime. Better suggestions are appreciated.
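  One possibility along those lines is to rate-limit the trim by time using
  the rcu thread's own wakeups, so no dedicated timer is needed. A hedged C
  sketch (the helper name and interval are invented for illustration, not
  committed code):

```c
#include <stdbool.h>
#include <time.h>
#if defined(__GLIBC__)
#include <malloc.h>   /* malloc_trim() is glibc-specific */
#endif

/* Sketch: called from the call_rcu thread's loop; returns true at most
 * roughly once per 'interval' seconds, driven by the loop's own wakeups,
 * so no dedicated timer thread is required. */
static bool trim_due(time_t interval)
{
    static time_t last;           /* 0 on first call, so trim is due */
    time_t now = time(NULL);
    if (now - last >= interval) {
        last = now;
        return true;
    }
    return false;
}

/* In the loop body, instead of counting callbacks:
 *
 *     if (trim_due(5)) {
 *         malloc_trim(4 * 1024 * 1024);
 *     }
 */
```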

  Regards,

  Yang

> Thanks,
> 
> Paolo


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-04 12:03       ` Yang Zhong
@ 2017-12-04 12:07         ` Daniel P. Berrange
  2017-12-04 12:16           ` Yang Zhong
  2017-12-04 12:18           ` Paolo Bonzini
  2017-12-04 12:26         ` Shannon Zhao
  1 sibling, 2 replies; 22+ messages in thread
From: Daniel P. Berrange @ 2017-12-04 12:07 UTC (permalink / raw)
  To: Yang Zhong
  Cc: Paolo Bonzini, zhaoshenglong, stefanha, qemu-devel,
	weidong.huang, arei.gonglei, liujunjie23, wangxinxin.wang,
	stone.xulei, zhang.zhanghailiang

On Mon, Dec 04, 2017 at 08:03:22PM +0800, Yang Zhong wrote:
> On Fri, Dec 01, 2017 at 01:52:49PM +0100, Paolo Bonzini wrote:
> > On 01/12/2017 11:56, Yang Zhong wrote:
> > >   This issue should be caused by much times of system call by malloc_trim(),
> > >   Shannon's test script include 60 scsi disks and 31 ioh3420 devices. We need 
> > >   trade-off between VM perforamance and memory optimization. Whether below 
> > >   method is suitable?
> > > 
> > >   int num=1;
> > >   ......
> > > 
> > >   #if defined(CONFIG_MALLOC_TRIM)
> > >         if(!(num++%5))
> > >         {
> > >              malloc_trim(4 * 1024 * 1024);
> > >         }
> > >   #endif
> > >  
> > >   Any comments are welcome! Thanks a lot!
> > 
> > Indeed something like this will do, perhaps only trim once per second?
> > 
>   Hello Paolo,
> 
>   Thanks for comments!
>   If we do trim once per second, maybe the frequency is a little high, what'e
>   more, we need maintain one timer to call this, this also cost cpu resource.
> 
>   I added the log and did the test here with my test qemu command, when VM bootup,
>   which did more than 600 times free operations and 9 times memory trim in rcu 
>   thread. If i use our ClearContainer qemu command, the memory trim will down 
>   to 6 times. As for Shannon's test command, the malloc trim number will abosultly 
>   increse.
> 
>   In my above method, the trim is only executed in the multiple of 5, which will
>   reduce trim times and do not heavily impact VM bootup performance. 
> 
>   I also want to use synchronize_rcu() and free() to replace call_rcu(), but this
>   method serialize to malloc() and free(), which will reduce VM performance.
> 
>   The ultimate aim is to reduce trim system call during the VM bootup and running.
>   It's appreciated that if you have better suggestions.

Does configuring QEMU to use tcmalloc or jemalloc instead of glibc's malloc
give you the performance & memory usage that you require? If so, it might
not be worth bothering to hack around problems in glibc's malloc at all.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-04 12:07         ` Daniel P. Berrange
@ 2017-12-04 12:16           ` Yang Zhong
  2017-12-04 12:18           ` Paolo Bonzini
  1 sibling, 0 replies; 22+ messages in thread
From: Yang Zhong @ 2017-12-04 12:16 UTC (permalink / raw)
  To: Daniel P. Berrange, pbonzini, stefanha
  Cc: qemu-devel, weidong.huang, arei.gonglei, liujunjie23,
	wangxinxin.wang, stone.xulei, zhang.zhanghailiang, zhaoshenglong,
	yang.zhong

On Mon, Dec 04, 2017 at 12:07:05PM +0000, Daniel P. Berrange wrote:
> On Mon, Dec 04, 2017 at 08:03:22PM +0800, Yang Zhong wrote:
> > On Fri, Dec 01, 2017 at 01:52:49PM +0100, Paolo Bonzini wrote:
> > > On 01/12/2017 11:56, Yang Zhong wrote:
> > > >   This issue should be caused by much times of system call by malloc_trim(),
> > > >   Shannon's test script include 60 scsi disks and 31 ioh3420 devices. We need 
> > > >   trade-off between VM perforamance and memory optimization. Whether below 
> > > >   method is suitable?
> > > > 
> > > >   int num=1;
> > > >   ......
> > > > 
> > > >   #if defined(CONFIG_MALLOC_TRIM)
> > > >         if(!(num++%5))
> > > >         {
> > > >              malloc_trim(4 * 1024 * 1024);
> > > >         }
> > > >   #endif
> > > >  
> > > >   Any comments are welcome! Thanks a lot!
> > > 
> > > Indeed something like this will do, perhaps only trim once per second?
> > > 
> >   Hello Paolo,
> > 
> >   Thanks for comments!
> >   If we do trim once per second, maybe the frequency is a little high, what'e
> >   more, we need maintain one timer to call this, this also cost cpu resource.
> > 
> >   I added the log and did the test here with my test qemu command, when VM bootup,
> >   which did more than 600 times free operations and 9 times memory trim in rcu 
> >   thread. If i use our ClearContainer qemu command, the memory trim will down 
> >   to 6 times. As for Shannon's test command, the malloc trim number will abosultly 
> >   increse.
> > 
> >   In my above method, the trim is only executed in the multiple of 5, which will
> >   reduce trim times and do not heavily impact VM bootup performance. 
> > 
> >   I also want to use synchronize_rcu() and free() to replace call_rcu(), but this
> >   method serialize to malloc() and free(), which will reduce VM performance.
> > 
> >   The ultimate aim is to reduce trim system call during the VM bootup and running.
> >   It's appreciated that if you have better suggestions.
> 
> Does configuring QEMU to use tcmalloc or jemalloc instead of glibc's malloc
> give you the performance & menmory usage that you require? If so, it might
> not be worth bothering to hack around problems in glibc's malloc at all.
> 
> 
  Hello Daniel,

  Thanks for the comment!
  I ran a comparison test with tcmalloc and jemalloc; unfortunately, with
  jemalloc there is no [heap] entry in the smaps file. The glibc and
  tcmalloc results are below:

  ##glibc malloc
  5618c0a98000-5618c1cde000 rw-p 00000000 00:00 0                          [heap]
  Size:              18712 kB
  KernelPageSize:        4 kB
  MMUPageSize:           4 kB
  Rss:               10536 kB
  Pss:               10536 kB
 
  ##tcmalloc
  557f79119000-557f7af46000 rw-p 00000000 00:00 0                          [heap]
  Size:              30900 kB
  Rss:               20244 kB
  Pss:               20244 kB

  The results show that the RSS with tcmalloc is higher than glibc's.

  Regards,

  Yang

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-04 12:07         ` Daniel P. Berrange
  2017-12-04 12:16           ` Yang Zhong
@ 2017-12-04 12:18           ` Paolo Bonzini
  1 sibling, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2017-12-04 12:18 UTC (permalink / raw)
  To: Daniel P. Berrange, Yang Zhong
  Cc: zhaoshenglong, stefanha, qemu-devel, weidong.huang, arei.gonglei,
	liujunjie23, wangxinxin.wang, stone.xulei, zhang.zhanghailiang

On 04/12/2017 13:07, Daniel P. Berrange wrote:
> On Mon, Dec 04, 2017 at 08:03:22PM +0800, Yang Zhong wrote:
>> On Fri, Dec 01, 2017 at 01:52:49PM +0100, Paolo Bonzini wrote:
>>> On 01/12/2017 11:56, Yang Zhong wrote:
>>>>   This issue should be caused by much times of system call by malloc_trim(),
>>>>   Shannon's test script include 60 scsi disks and 31 ioh3420 devices. We need 
>>>>   trade-off between VM perforamance and memory optimization. Whether below 
>>>>   method is suitable?
>>>>
>>>>   int num=1;
>>>>   ......
>>>>
>>>>   #if defined(CONFIG_MALLOC_TRIM)
>>>>         if(!(num++%5))
>>>>         {
>>>>              malloc_trim(4 * 1024 * 1024);
>>>>         }
>>>>   #endif
>>>>  
>>>>   Any comments are welcome! Thanks a lot!
>>>
>>> Indeed something like this will do, perhaps only trim once per second?
>>>
>>   Hello Paolo,
>>
>>   Thanks for comments!
>>   If we do trim once per second, maybe the frequency is a little high, what'e
>>   more, we need maintain one timer to call this, this also cost cpu resource.
>>
>>   I added the log and did the test here with my test qemu command, when VM bootup,
>>   which did more than 600 times free operations and 9 times memory trim in rcu 
>>   thread. If i use our ClearContainer qemu command, the memory trim will down 
>>   to 6 times. As for Shannon's test command, the malloc trim number will abosultly 
>>   increse.
>>
>>   In my above method, the trim is only executed in the multiple of 5, which will
>>   reduce trim times and do not heavily impact VM bootup performance. 
>>
>>   I also want to use synchronize_rcu() and free() to replace call_rcu(), but this
>>   method serialize to malloc() and free(), which will reduce VM performance.
>>
>>   The ultimate aim is to reduce trim system call during the VM bootup and running.
>>   It's appreciated that if you have better suggestions.
> 
> Does configuring QEMU to use tcmalloc or jemalloc instead of glibc's malloc
> give you the performance & menmory usage that you require? If so, it might
> not be worth bothering to hack around problems in glibc's malloc at all.

At least for tcmalloc, the default tuning is to pay little attention to
memory usage.

Paolo


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-04 12:03       ` Yang Zhong
  2017-12-04 12:07         ` Daniel P. Berrange
@ 2017-12-04 12:26         ` Shannon Zhao
  2017-12-05  6:00           ` Yang Zhong
  1 sibling, 1 reply; 22+ messages in thread
From: Shannon Zhao @ 2017-12-04 12:26 UTC (permalink / raw)
  To: Yang Zhong, Paolo Bonzini, stefanha
  Cc: qemu-devel, weidong.huang, arei.gonglei, liujunjie23,
	wangxinxin.wang, stone.xulei, zhang.zhanghailiang, berrange

Hi Yang,

On 2017/12/4 20:03, Yang Zhong wrote:
> On Fri, Dec 01, 2017 at 01:52:49PM +0100, Paolo Bonzini wrote:
>> > On 01/12/2017 11:56, Yang Zhong wrote:
>>> > >   This issue should be caused by much times of system call by malloc_trim(),
>>> > >   Shannon's test script include 60 scsi disks and 31 ioh3420 devices. We need 
>>> > >   trade-off between VM perforamance and memory optimization. Whether below 
>>> > >   method is suitable?
>>> > > 
>>> > >   int num=1;
>>> > >   ......
>>> > > 
>>> > >   #if defined(CONFIG_MALLOC_TRIM)
>>> > >         if(!(num++%5))
>>> > >         {
>>> > >              malloc_trim(4 * 1024 * 1024);
>>> > >         }
>>> > >   #endif
>>> > >  
>>> > >   Any comments are welcome! Thanks a lot!
>> > 
>> > Indeed something like this will do, perhaps only trim once per second?
>> > 
>   Hello Paolo,
> 
>   Thanks for comments!
>   If we do trim once per second, maybe the frequency is a little high, what'e
>   more, we need maintain one timer to call this, this also cost cpu resource.
> 
>   I added the log and did the test here with my test qemu command, when VM bootup,
>   which did more than 600 times free operations and 9 times memory trim in rcu 
>   thread. If i use our ClearContainer qemu command, the memory trim will down 
>   to 6 times. As for Shannon's test command, the malloc trim number will abosultly 
>   increse.
> 
>   In my above method, the trim is only executed in the multiple of 5, which will
>   reduce trim times and do not heavily impact VM bootup performance. 
> 
>   I also want to use synchronize_rcu() and free() to replace call_rcu(), but this
>   method serialize to malloc() and free(), which will reduce VM performance.
> 
>   The ultimate aim is to reduce trim system call during the VM bootup and running.
>   It's appreciated that if you have better suggestions.

Maybe we can provide a QMP command or something similar for the user to
trim the heap manually, like the kernel's /proc/sys/vm/drop_caches
interface, which lets the user drop the page caches.
That way the user decides whether the heap needs trimming.

Thanks,
-- 
Shannon


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-04 12:26         ` Shannon Zhao
@ 2017-12-05  6:00           ` Yang Zhong
  2017-12-05 14:10             ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Zhong @ 2017-12-05  6:00 UTC (permalink / raw)
  To: Shannon Zhao, pbonzini, stefanha, berrange
  Cc: qemu-devel, weidong.huang, arei.gonglei, liujunjie23,
	wangxinxin.wang, stone.xulei, yang.zhong

On Mon, Dec 04, 2017 at 08:26:29PM +0800, Shannon Zhao wrote:
> Hi Yang,
> 
> On 2017/12/4 20:03, Yang Zhong wrote:
> > On Fri, Dec 01, 2017 at 01:52:49PM +0100, Paolo Bonzini wrote:
> >> > On 01/12/2017 11:56, Yang Zhong wrote:
> >>> > >   This issue should be caused by the many system calls made by malloc_trim();
> >>> > >   Shannon's test script includes 60 scsi disks and 31 ioh3420 devices. We need a
> >>> > >   trade-off between VM performance and memory optimization. Is the method below
> >>> > >   suitable?
> >>> > > 
> >>> > >   int num=1;
> >>> > >   ......
> >>> > > 
> >>> > >   #if defined(CONFIG_MALLOC_TRIM)
> >>> > >         if(!(num++%5))
> >>> > >         {
> >>> > >              malloc_trim(4 * 1024 * 1024);
> >>> > >         }
> >>> > >   #endif
> >>> > >  
> >>> > >   Any comments are welcome! Thanks a lot!
> >> > 
> >> > Indeed something like this will do, perhaps only trim once per second?
> >> > 
> >   Hello Paolo,
> > 
> >   Thanks for comments!
> >   If we do trim once per second, maybe the frequency is a little high; what's
> >   more, we need maintain one timer to call this, this also cost cpu resource.
> > 
> >   I added the log and did the test here with my test qemu command, when VM bootup,
> >   which did more than 600 times free operations and 9 times memory trim in rcu 
> >   thread. If i use our ClearContainer qemu command, the memory trim will down 
> >   to 6 times. As for Shannon's test command, the malloc trim number will absolutely
> >   increase.
> > 
> >   In my above method, the trim is only executed in the multiple of 5, which will
> >   reduce trim times and do not heavily impact VM bootup performance. 
> > 
> >   I also want to use synchronize_rcu() and free() to replace call_rcu(), but this
> >   method serialize to malloc() and free(), which will reduce VM performance.
> > 
> >   The ultimate aim is to reduce trim system call during the VM bootup and running.
> >   It's appreciated that if you have better suggestions.
> 
> Maybe we can provide a QMP command or something else for user to trim
> the heap manually like the kernel sysfs interface
> /proc/sys/vm/drop_caches which provides an interface for user to drop
> the caches.
> So let user to decide whether it needs to trim the heap.
>
  Hello Shannon,

  Thanks for your comments!
  A QMP interface is also a good solution, but it is only suitable for a few VMs.
  With millions of VMs at a CSP (cloud service provider), it is very hard to operate.
  Thanks!

  Regards,

  Yang     

 
> Thanks,
> -- 
> Shannon


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-05  6:00           ` Yang Zhong
@ 2017-12-05 14:10             ` Paolo Bonzini
  2017-12-06  9:26               ` Yang Zhong
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2017-12-05 14:10 UTC (permalink / raw)
  To: Yang Zhong, Shannon Zhao, stefanha, berrange
  Cc: qemu-devel, weidong.huang, arei.gonglei, liujunjie23,
	wangxinxin.wang, stone.xulei

On 05/12/2017 07:00, Yang Zhong wrote:
> On Mon, Dec 04, 2017 at 08:26:29PM +0800, Shannon Zhao wrote:
>> Hi Yang,
>>
>> On 2017/12/4 20:03, Yang Zhong wrote:
>>> On Fri, Dec 01, 2017 at 01:52:49PM +0100, Paolo Bonzini wrote:
>>>>> On 01/12/2017 11:56, Yang Zhong wrote:
>>>>>>>   This issue should be caused by much times of system call by malloc_trim(),
>>>>>>>   Shannon's test script include 60 scsi disks and 31 ioh3420 devices. We need 
>>>>>>>   trade-off between VM performance and memory optimization. Whether below 
>>>>>>>   method is suitable?
>>>>>>>
>>>>>>>   int num=1;
>>>>>>>   ......
>>>>>>>
>>>>>>>   #if defined(CONFIG_MALLOC_TRIM)
>>>>>>>         if(!(num++%5))
>>>>>>>         {
>>>>>>>              malloc_trim(4 * 1024 * 1024);
>>>>>>>         }
>>>>>>>   #endif
>>>>>>>  
>>>>>>>   Any comments are welcome! Thanks a lot!
>>>>>
>>>>> Indeed something like this will do, perhaps only trim once per second?
>>>>>
>>>   Hello Paolo,
>>>
>>>   Thanks for comments!
>>>   If we do trim once per second, maybe the frequency is a little high; what's
>>>   more, we need maintain one timer to call this, this also cost cpu resource.
>>>
>>>   I added the log and did the test here with my test qemu command, when VM bootup,
>>>   which did more than 600 times free operations and 9 times memory trim in rcu 
>>>   thread. If i use our ClearContainer qemu command, the memory trim will down 
>>>   to 6 times. As for Shannon's test command, the malloc trim number will absolutely
>>>   increase.
>>>
>>>   In my above method, the trim is only executed in the multiple of 5, which will
>>>   reduce trim times and do not heavily impact VM bootup performance. 
>>>
>>>   I also want to use synchronize_rcu() and free() to replace call_rcu(), but this
>>>   method serialize to malloc() and free(), which will reduce VM performance.
>>>
>>>   The ultimate aim is to reduce trim system call during the VM bootup and running.
>>>   It's appreciated that if you have better suggestions.
>>
>> Maybe we can provide a QMP command or something else for user to trim
>> the heap manually like the kernel sysfs interface
>> /proc/sys/vm/drop_caches which provides an interface for user to drop
>> the caches.
>> So let user to decide whether it needs to trim the heap.
>>
>   Hello Shannon,
> 
>   Thanks for your comments!
>   This is also a good solution by QMP interface, but this is only suitable for few VMs.
>   If there are millions of VMs in a CSP (cloud service provider), it is very hard to operate.
>   Thanks!

I agree, we only need to tweak the conditions under which malloc_trim is
called.

Thanks,

Paolo


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-05 14:10             ` Paolo Bonzini
@ 2017-12-06  9:26               ` Yang Zhong
  2017-12-06  9:48                 ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Zhong @ 2017-12-06  9:26 UTC (permalink / raw)
  To: Paolo Bonzini, stefanha, berrange
  Cc: qemu-devel, weidong.huang, arei.gonglei, liujunjie23,
	wangxinxin.wang, stone.xulei, yang.zhong, zhaoshenglong

On Tue, Dec 05, 2017 at 03:10:23PM +0100, Paolo Bonzini wrote:
> On 05/12/2017 07:00, Yang Zhong wrote:
> > On Mon, Dec 04, 2017 at 08:26:29PM +0800, Shannon Zhao wrote:
> >> Hi Yang,
> >>
> >> On 2017/12/4 20:03, Yang Zhong wrote:
> >>> On Fri, Dec 01, 2017 at 01:52:49PM +0100, Paolo Bonzini wrote:
> >>>>> On 01/12/2017 11:56, Yang Zhong wrote:
> >>>>>>>   This issue should be caused by much times of system call by malloc_trim(),
> >>>>>>>   Shannon's test script include 60 scsi disks and 31 ioh3420 devices. We need 
> >>>>>>>   trade-off between VM performance and memory optimization. Whether below 
> >>>>>>>   method is suitable?
> >>>>>>>
> >>>>>>>   int num=1;
> >>>>>>>   ......
> >>>>>>>
> >>>>>>>   #if defined(CONFIG_MALLOC_TRIM)
> >>>>>>>         if(!(num++%5))
> >>>>>>>         {
> >>>>>>>              malloc_trim(4 * 1024 * 1024);
> >>>>>>>         }
> >>>>>>>   #endif
> >>>>>>>  
> >>>>>>>   Any comments are welcome! Thanks a lot!
> >>>>>
> >>>>> Indeed something like this will do, perhaps only trim once per second?
> >>>>>
> >>>   Hello Paolo,
> >>>
> >>>   Thanks for comments!
> >>>   If we do trim once per second, maybe the frequency is a little high; what's
> >>>   more, we need maintain one timer to call this, this also cost cpu resource.
> >>>
> >>>   I added the log and did the test here with my test qemu command, when VM bootup,
> >>>   which did more than 600 times free operations and 9 times memory trim in rcu 
> >>>   thread. If i use our ClearContainer qemu command, the memory trim will down 
> >>>   to 6 times. As for Shannon's test command, the malloc trim number will absolutely
> >>>   increase.
> >>>
> >>>   In my above method, the trim is only executed in the multiple of 5, which will
> >>>   reduce trim times and do not heavily impact VM bootup performance. 
> >>>
> >>>   I also want to use synchronize_rcu() and free() to replace call_rcu(), but this
> >>>   method serialize to malloc() and free(), which will reduce VM performance.
> >>>
> >>>   The ultimate aim is to reduce trim system call during the VM bootup and running.
> >>>   It's appreciated that if you have better suggestions.
> >>
> >> Maybe we can provide a QMP command or something else for user to trim
> >> the heap manually like the kernel sysfs interface
> >> /proc/sys/vm/drop_caches which provides an interface for user to drop
> >> the caches.
> >> So let user to decide whether it needs to trim the heap.
> >>
> >   Hello Shannon,
> > 
> >   Thanks for your comments!
> >   This is also a good solution by QMP interface, but this is only suitable for few VMs.
> >   If there are millions of VMs in a CSP (cloud service provider), it is very hard to operate.
> >   Thanks!
> 
> I agree, we only need to tweak the conditions under which malloc_trim is
> called.
> 

  Hello Paolo,

  The best option is to trim only once after guest kernel bootup or VM bootup; as for
  hotplug/unplug operations while the VM is running, the trim can still be done for each
  batch of memory frees, because trimming will not impact VM performance once the VM is
  running.

  So the key point is that it is hard for QEMU to know when guest kernel bootup is over.
  If you have any suggestions, please let me know. Thanks!

  Regards,

  Yang


> Thanks,
> 
> Paolo


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-06  9:26               ` Yang Zhong
@ 2017-12-06  9:48                 ` Paolo Bonzini
  2017-12-07 15:06                   ` Yang Zhong
  2017-12-08 11:06                   ` Yang Zhong
  0 siblings, 2 replies; 22+ messages in thread
From: Paolo Bonzini @ 2017-12-06  9:48 UTC (permalink / raw)
  To: Yang Zhong, stefanha, berrange
  Cc: qemu-devel, weidong.huang, arei.gonglei, liujunjie23,
	wangxinxin.wang, stone.xulei, zhaoshenglong

On 06/12/2017 10:26, Yang Zhong wrote:
>   Hello Paolo,
> 
>   The best option is only trim one time after guest kernel bootup or VM bootup, and as for
>   hotplug/unplug operations during the VM running, the trim still can do for each batch
>   memory free because trim will not impact VM performance during VM running status.
> 
>   So, the key point is qemu is hard to know when guest kernel bootup is over. If you have some 
>   suggestions, please let me know. thanks!

It shouldn't be hard.  Does QEMU's RCU thread actually get any
significant activity after bootup?  Hence the suggestion of keeping
malloc_trim in the RCU thread, but only do it if some time has passed
since the last time.

Maybe something like this every time the RCU thread runs:

 static uint64_t next_trim_time, last_trim_time;
 if (current time < next_trim_time) {
     next_trim_time -= last_trim_time / 2    /* or higher */
     last_trim_time -= last_trim_time / 2    /* same as previous line */
 } else {
     trim_start_time = current time
     malloc_trim(...)
     last_trim_time = current time - trim_start_time
     next_trim_time = current time + last_trim_time
 }

Where the "2" factor should be tuned so that both your and Shannon's
scenario work fine.

Thanks,

Paolo


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-06  9:48                 ` Paolo Bonzini
@ 2017-12-07 15:06                   ` Yang Zhong
  2017-12-11 16:31                     ` Paolo Bonzini
  2017-12-08 11:06                   ` Yang Zhong
  1 sibling, 1 reply; 22+ messages in thread
From: Yang Zhong @ 2017-12-07 15:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, stefanha, berrange, weidong.huang, arei.gonglei,
	liujunjie23, wangxinxin.wang, stone.xulei, zhaoshenglong,
	yang.zhong

On Wed, Dec 06, 2017 at 10:48:45AM +0100, Paolo Bonzini wrote:
> On 06/12/2017 10:26, Yang Zhong wrote:
> >   Hello Paolo,
> > 
> >   The best option is only trim one time after guest kernel bootup or VM bootup, and as for
> >   hotplug/unplug operations during the VM running, the trim still can do for each batch
> >   memory free because trim will not impact VM performance during VM running status.
> > 
> >   So, the key point is qemu is hard to know when guest kernel bootup is over. If you have some 
> >   suggestions, please let me know. thanks!
> 
> It shouldn't be hard.  Does QEMU's RCU thread actually get any
> significant activity after bootup?  Hence the suggestion of keeping
> malloc_trim in the RCU thread, but only do it if some time has passed
> since the last time.
> 
> Maybe something like this every time the RCU thread runs:
> 
>  static uint64_t next_trim_time, last_trim_time;
>  if (current time < next_trim_time) {
>      next_trim_time -= last_trim_time / 2    /* or higher */
>      last_trim_time -= last_trim_time / 2    /* same as previous line */
>  } else {
>      trim_start_time = current time
>      malloc_trim(...)
>      last_trim_time = current time - trim_start_time
>      next_trim_time = current time + last_trim_time
>  }
> 
> Where the "2" factor should be tuned so that both your and Shannon's
> scenario work fine.
> 
  Hello Paolo,

  Thanks for your help!
  I changed the patch per your advice; the new TEMP patch is below:

  static void *call_rcu_thread(void *opaque)
  {
     struct rcu_head *node;
 +    int num=1;

     rcu_register_thread();

  @@ -272,6 +273,21 @@ static void *call_rcu_thread(void *opaque)
             node->func(node);
         }
         qemu_mutex_unlock_iothread();
 +
 +        static uint64_t next_trim_time, last_trim_time;
 +        int delta=100;
 +
 +        if ( qemu_clock_get_ns(QEMU_CLOCK_HOST) < next_trim_time ) {
 +            next_trim_time -= last_trim_time / delta;   /* or higher */
 +            last_trim_time -= last_trim_time / delta;   /* same as previous line */
 +        } else {
 +            uint64_t trim_start_time = qemu_clock_get_ns(QEMU_CLOCK_HOST);
 +            malloc_trim(4 * 1024 *1024);
 +            last_trim_time = qemu_clock_get_ns(QEMU_CLOCK_HOST) - trim_start_time;
 +            next_trim_time = qemu_clock_get_ns(QEMU_CLOCK_HOST) + last_trim_time;
 +            printf("---------memory trim----------num=%d------last_trim_time=%ld next_trim_time=%ld --\n", num++,last_trim_time,next_trim_time);
 +       }
 +
 
 The print log for your reference
 ---------memory trim----------num=1------last_trim_time=165000 next_trim_time=1512658205270477000 --
 ---------memory trim----------num=2------last_trim_time=656000 next_trim_time=1512658205278032000 --
 ---------memory trim----------num=3------last_trim_time=620000 next_trim_time=1512658205298888000 --
 ---------memory trim----------num=4------last_trim_time=635000 next_trim_time=1512658205339967000 --
 ---------memory trim----------num=6------last_trim_time=526000 next_trim_time=1512658207659599000 --
 ---------memory trim----------num=7------last_trim_time=744000 next_trim_time=1512658208121249000 --
 ---------memory trim----------num=8------last_trim_time=872000 next_trim_time=1512658208132805000 --
 ---------memory trim----------num=9------last_trim_time=380000 next_trim_time=1512658208376950000 --
 ---------memory trim----------num=10------last_trim_time=521000 next_trim_time=1512658210648843000 --

 This shows each trim costs less than 1 ms; call_rcu_thread() did 10 batch frees, and the trim also ran 10 times.

 I also tried the following changes: 
    delta=1000,  and 
    next_trim_time = qemu_clock_get_ns(QEMU_CLOCK_HOST) + delta * last_trim_time

 The whole VM bootup will trim 3 times.

 Regards,

 Yang


> Thanks,
> 
> Paolo


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-06  9:48                 ` Paolo Bonzini
  2017-12-07 15:06                   ` Yang Zhong
@ 2017-12-08 11:06                   ` Yang Zhong
  1 sibling, 0 replies; 22+ messages in thread
From: Yang Zhong @ 2017-12-08 11:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, stefanha, berrange, zhaoshenglong, weidong.huang,
	arei.gonglei, liujunjie23, wangxinxin.wang, stone.xulei,
	yang.zhong

On Wed, Dec 06, 2017 at 10:48:45AM +0100, Paolo Bonzini wrote:
> On 06/12/2017 10:26, Yang Zhong wrote:
> >   Hello Paolo,
> > 
> >   The best option is only trim one time after guest kernel bootup or VM bootup, and as for
> >   hotplug/unplug operations during the VM running, the trim still can do for each batch
> >   memory free because trim will not impact VM performance during VM running status.
> > 
> >   So, the key point is qemu is hard to know when guest kernel bootup is over. If you have some 
> >   suggestions, please let me know. thanks!
> 
> It shouldn't be hard.  Does QEMU's RCU thread actually get any
> significant activity after bootup?  Hence the suggestion of keeping
> malloc_trim in the RCU thread, but only do it if some time has passed
> since the last time.
> 
> Maybe something like this every time the RCU thread runs:
> 
>  static uint64_t next_trim_time, last_trim_time;
>  if (current time < next_trim_time) {
>      next_trim_time -= last_trim_time / 2    /* or higher */
>      last_trim_time -= last_trim_time / 2    /* same as previous line */
>  } else {
>      trim_start_time = current time
>      malloc_trim(...)
>      last_trim_time = current time - trim_start_time
>      next_trim_time = current time + last_trim_time
>  }
> 
> Where the "2" factor should be tuned so that both your and Shannon's
> scenario work fine.
>
  Hello Paolo,

  As for your patch, I have commented in another mail.

  Please check the TEMP patch below.

  +++ b/util/rcu.c
  @@ -32,7 +32,7 @@
  #include "qemu/atomic.h"
  #include "qemu/thread.h"
  #include "qemu/main-loop.h"
 -
+#if defined(CONFIG_MALLOC_TRIM)
+#include <malloc.h>
+#endif
  /*
   * Global grace period counter.  Bit 0 is always one in rcu_gp_ctr.
   * Bits 1 and above are defined in synchronize_rcu.
  @@ -246,6 +246,7 @@ static void *call_rcu_thread(void *opaque)
                  qemu_event_reset(&rcu_call_ready_event);
                  n = atomic_read(&rcu_call_count);
                  if (n == 0) {
 +                    #if defined(CONFIG_MALLOC_TRIM)
 +                       malloc_trim(4 * 1024 * 1024);
 +                    #endif
                      qemu_event_wait(&rcu_call_ready_event);
                  }
              }

  If there is no call_rcu(), n == 0 and call_rcu_thread() will trim memory and
  then sleep, waiting for call_rcu() to wake the thread up.

  Once the VM has booted, if there is no activity such as hotplug, the rcu
  thread stays asleep.

  During VM bootup, while n != 0, the rcu thread will not trim.

  With this method, the number of trims decreases to around half of before.

  Regards,

  Yang
 
> Thanks,
> 
> Paolo


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-07 15:06                   ` Yang Zhong
@ 2017-12-11 16:31                     ` Paolo Bonzini
  2017-12-12  6:54                       ` Yang Zhong
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2017-12-11 16:31 UTC (permalink / raw)
  To: Yang Zhong
  Cc: qemu-devel, stefanha, berrange, weidong.huang, arei.gonglei,
	liujunjie23, wangxinxin.wang, stone.xulei, zhaoshenglong

On 07/12/2017 16:06, Yang Zhong wrote:
>  Which show trim cost time less than 1ms and call_rcu_thread() do 10 times batch free, the trim also 10 times.
> 
>  I also did below changes: 
>     delta=1000,  and 
>     next_trim_time = qemu_clock_get_ns(QEMU_CLOCK_HOST) + delta * last_trim_time
> 
>  The whole VM bootup will trim 3 times.

For any adaptive mechanism (either this one or the simple "if (n == 0)"
one), the question is:

1) what effect it has on RSS in your case

2) what effect it has on boot time in Shannon's case.

Either patch is okay if you can justify it with these two performance
indices.

Thanks,

Paolo


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-11 16:31                     ` Paolo Bonzini
@ 2017-12-12  6:54                       ` Yang Zhong
  2017-12-12  7:09                         ` Shannon Zhao
  2017-12-18  7:17                         ` Shannon Zhao
  0 siblings, 2 replies; 22+ messages in thread
From: Yang Zhong @ 2017-12-12  6:54 UTC (permalink / raw)
  To: Paolo Bonzini, zhaoshenglong
  Cc: qemu-devel, stefanha, berrange, weidong.huang, arei.gonglei,
	liujunjie23, wangxinxin.wang, stone.xulei, yang.zhong

On Mon, Dec 11, 2017 at 05:31:43PM +0100, Paolo Bonzini wrote:
> On 07/12/2017 16:06, Yang Zhong wrote:
> >  Which show trim cost time less than 1ms and call_rcu_thread() do 10 times batch free, the trim also 10 times.
> > 
> >  I also did below changes: 
> >     delta=1000,  and 
> >     next_trim_time = qemu_clock_get_ns(QEMU_CLOCK_HOST) + delta * last_trim_time
> > 
> >  The whole VM bootup will trim 3 times.
> 
> For any adaptive mechanism (either this one or the simple "if (n == 0)"
> one), the question is:
> 
> 1) what effect it has on RSS in your case
  Hello Paolo,

  I list those two TEMP patches here,

  (1). if (n==0) patch
  /*
   * Global grace period counter.  Bit 0 is always one in rcu_gp_ctr.
   * Bits 1 and above are defined in synchronize_rcu.
  @@ -246,6 +246,7 @@ static void *call_rcu_thread(void *opaque)
                  qemu_event_reset(&rcu_call_ready_event);
                  n = atomic_read(&rcu_call_count);
                  if (n == 0) {
  +                    malloc_trim(4 * 1024 * 1024);
                      qemu_event_wait(&rcu_call_ready_event);
                  }
              }

  (2). adaptive patch

       rcu_register_thread();

  @@ -272,6 +273,21 @@ static void *call_rcu_thread(void *opaque)
             node->func(node);
         }
         qemu_mutex_unlock_iothread();
  +
  +        static uint64_t next_trim_time, last_trim_time;
  +        int delta=1000;
  +
  +        if ( qemu_clock_get_ns(QEMU_CLOCK_HOST) < next_trim_time ) {
  +            next_trim_time -= last_trim_time / delta;   /* or higher */
  +            last_trim_time -= last_trim_time / delta;   /* same as previous line */
  +        } else {
  +            uint64_t trim_start_time = qemu_clock_get_ns(QEMU_CLOCK_HOST);
  +            malloc_trim(4 * 1024 *1024);
  +            last_trim_time = qemu_clock_get_ns(QEMU_CLOCK_HOST) - trim_start_time;
  +            next_trim_time = qemu_clock_get_ns(QEMU_CLOCK_HOST) + delta * last_trim_time;
  +       }
  +


   I used those two TEMP patches for testing; the results are below:

   My test command
   sudo ./qemu-system-x86_64 -enable-kvm -cpu host -m 2G -smp cpus=4,cores=4,threads=1,sockets=1 \
    -drive format=raw,file=/home/yangzhon/icx/workspace/eywa.img,index=0,media=disk -nographic
  
  (1) if (n==0) patch
    563015d84000-563016fd6000 rw-p 00000000 00:00 0                          [heap]
    Size:              18760 kB
    KernelPageSize:        4 kB
    MMUPageSize:           4 kB
    Rss:                3176 kB
    Pss:                3176 kB

  (2)adaptive patch
    55bd5975a000-55bd5a9ac000 rw-p 00000000 00:00 0                          [heap]
    Size:              18760 kB
    KernelPageSize:        4 kB
    MMUPageSize:           4 kB
    Rss:                3196 kB
    Pss:                3196 kB

  if set delta=10, then get below result

    56043a2e1000-56043b533000 rw-p 00000000 00:00 0                          [heap]
    Size:              18760 kB
    KernelPageSize:        4 kB
    MMUPageSize:           4 kB
    Rss:                3168 kB
    Pss:                3168 kB

 
  With my test command, the n==0 patch decreased the number of trims to 1/2;
  with delta=1000 in patch 2 there are 3 trims, and with delta=10 there are 10.

  Regards,

  Yang 

> 2) what effect it has on boot time in Shannon's case.
  Hello Shannon,

  It's hard for me to reproduce your commands in my x86 environment; as a comparison
  test, would you please use the above two TEMP patches to verify VM bootup time again?

  That data can help Paolo decide which patch to use and how to adjust the delta
  parameter.  Many thanks!

  Regards,

  Yang


> Either patch is okay if you can justify it with these two performance
> indices.
> 
> Thanks,
> 
> Paolo


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-12  6:54                       ` Yang Zhong
@ 2017-12-12  7:09                         ` Shannon Zhao
  2017-12-18  7:17                         ` Shannon Zhao
  1 sibling, 0 replies; 22+ messages in thread
From: Shannon Zhao @ 2017-12-12  7:09 UTC (permalink / raw)
  To: Yang Zhong, Paolo Bonzini
  Cc: qemu-devel, stefanha, berrange, weidong.huang, arei.gonglei,
	liujunjie23, wangxinxin.wang, stone.xulei



On 2017/12/12 14:54, Yang Zhong wrote:
>> 2) what effect it has on boot time in Shannon's case.
>   Hello Shannon,
> 
>   It's hard for me to reproduce your commands in my x86 environment, as a compare test,
>   would you please help me use above two TEMP patches to verify VM bootup time again?
> 
>   Those data can help Paolo to decide which patch will be used or how to adjust delta
>   parameter.  Many thanks!
> 
Sure, I'll test these patches.

Thanks,
-- 
Shannon


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-12  6:54                       ` Yang Zhong
  2017-12-12  7:09                         ` Shannon Zhao
@ 2017-12-18  7:17                         ` Shannon Zhao
  2017-12-18  7:51                           ` Yang Zhong
  1 sibling, 1 reply; 22+ messages in thread
From: Shannon Zhao @ 2017-12-18  7:17 UTC (permalink / raw)
  To: Yang Zhong, Paolo Bonzini
  Cc: qemu-devel, stefanha, berrange, weidong.huang, arei.gonglei,
	liujunjie23, wangxinxin.wang, stone.xulei



On 2017/12/12 14:54, Yang Zhong wrote:
>> > 2) what effect it has on boot time in Shannon's case.
>   Hello Shannon,
> 
>   It's hard for me to reproduce your commands in my x86 environment, as a compare test,
>   would you please help me use above two TEMP patches to verify VM bootup time again?
> 
>   Those data can help Paolo to decide which patch will be used or how to adjust delta
>   parameter.  Many thanks!
We have tested these two patches. Neither increases the VM bootup time
in my case.

Thanks,
-- 
Shannon


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-18  7:17                         ` Shannon Zhao
@ 2017-12-18  7:51                           ` Yang Zhong
  2017-12-19 12:57                             ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Zhong @ 2017-12-18  7:51 UTC (permalink / raw)
  To: Shannon Zhao, pbonzini
  Cc: qemu-devel, stefanha, berrange, weidong.huang, arei.gonglei,
	liujunjie23, wangxinxin.wang, stone.xulei, yang.zhong

On Mon, Dec 18, 2017 at 03:17:33PM +0800, Shannon Zhao wrote:
> 
> 
> On 2017/12/12 14:54, Yang Zhong wrote:
> >> > 2) what effect it has on boot time in Shannon's case.
> >   Hello Shannon,
> > 
> >   It's hard for me to reproduce your commands in my x86 environment, as a compare test,
> >   would you please help me use above two TEMP patches to verify VM bootup time again?
> > 
> >   Those data can help Paolo to decide which patch will be used or how to adjust delta
> >   parameter.  Many thanks!
> We have tested these two patches. Both don't increase the VM bootup time
> in my case.
> 
  Thanks for Shannon's great help!

  Paolo, please decide which TEMP solution is preferred; I can send out a V4 patch soon.
  
  Thanks a lot!

  Regards,

  Yang

> Thanks,
> -- 
> Shannon


* Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
  2017-12-18  7:51                           ` Yang Zhong
@ 2017-12-19 12:57                             ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2017-12-19 12:57 UTC (permalink / raw)
  To: Yang Zhong, Shannon Zhao
  Cc: qemu-devel, stefanha, berrange, weidong.huang, arei.gonglei,
	liujunjie23, wangxinxin.wang, stone.xulei

On 18/12/2017 08:51, Yang Zhong wrote:
> On Mon, Dec 18, 2017 at 03:17:33PM +0800, Shannon Zhao wrote:
>>
>>
>> On 2017/12/12 14:54, Yang Zhong wrote:
>>>>> 2) what effect it has on boot time in Shannon's case.
>>>   Hello Shannon,
>>>
>>>   It's hard for me to reproduce your commands in my x86 environment, as a compare test,
>>>   would you please help me use above two TEMP patches to verify VM bootup time again?
>>>
>>>   Those data can help Paolo to decide which patch will be used or how to adjust delta
>>>   parameter.  Many thanks!
>> We have tested these two patches. Both don't increase the VM bootup time
>> in my case.
>>
>   Thanks for Shannon's great help!
> 
>   Paolo, please make decision which TEMP solution is preferred? I can send out V4 patch soon.

I would go with the simpler patch then.

Thanks,

Paolo


end of thread, other threads:[~2017-12-19 12:57 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-24  6:30 [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim() Yang Zhong
2017-11-24 11:27 ` Stefan Hajnoczi
2017-11-26  6:17 ` Shannon Zhao
2017-11-27  3:06   ` Zhong Yang
2017-11-27 11:59     ` Paolo Bonzini
     [not found]   ` <20171201105622.GB26237@yangzhon-Virtual>
     [not found]     ` <74cccd14-e485-90d4-82d9-03355c05faca@redhat.com>
2017-12-04 12:03       ` Yang Zhong
2017-12-04 12:07         ` Daniel P. Berrange
2017-12-04 12:16           ` Yang Zhong
2017-12-04 12:18           ` Paolo Bonzini
2017-12-04 12:26         ` Shannon Zhao
2017-12-05  6:00           ` Yang Zhong
2017-12-05 14:10             ` Paolo Bonzini
2017-12-06  9:26               ` Yang Zhong
2017-12-06  9:48                 ` Paolo Bonzini
2017-12-07 15:06                   ` Yang Zhong
2017-12-11 16:31                     ` Paolo Bonzini
2017-12-12  6:54                       ` Yang Zhong
2017-12-12  7:09                         ` Shannon Zhao
2017-12-18  7:17                         ` Shannon Zhao
2017-12-18  7:51                           ` Yang Zhong
2017-12-19 12:57                             ` Paolo Bonzini
2017-12-08 11:06                   ` Yang Zhong
