All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [Qemu-devel] [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-10 15:14 [Qemu-devel] [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to 2752KB Yang Zhong
@ 2017-03-10  8:41 ` Paolo Bonzini
  2017-03-10 16:05   ` Xu, Anthony
  2017-03-10 15:14 ` [Qemu-devel] [PATCH v1 2/2] " Yang Zhong
  1 sibling, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2017-03-10  8:41 UTC (permalink / raw)
  To: Yang Zhong, qemu-devel; +Cc: anthony.xu, chao.p.peng



On 10/03/2017 16:14, Yang Zhong wrote:
> There are a lot of memory allocation during the qemu bootup, which are
> freed later by RCU thread,which means the heap size becomes biger and
> biger when allocation happens, but the heap may not be shrinked even
> after release happens,because some memory blocks in top of heap are still
> being used.Decreasing the sleep and removing qemu_mutex_unlock_iothread()
> lock, which make call_rcu_thread()thread response the free memory in time.
> This patch will reduce heap Rss around 10M.
> 
> This patch is from Anthony xu <anthony.xu@intel.com>.
> 
> Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> ---
>  util/rcu.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/util/rcu.c b/util/rcu.c
> index 9adc5e4..c5c373c 100644
> --- a/util/rcu.c
> +++ b/util/rcu.c
> @@ -167,7 +167,7 @@ void synchronize_rcu(void)
>  }
>  
>  
> -#define RCU_CALL_MIN_SIZE        30
> +#define RCU_CALL_MIN_SIZE        5
>  
>  /* Multi-producer, single-consumer queue based on urcu/static/wfqueue.h
>   * from liburcu.  Note that head is only used by the consumer.
> @@ -241,7 +241,7 @@ static void *call_rcu_thread(void *opaque)
>           * added before synchronize_rcu() starts.
>           */
>          while (n == 0 || (n < RCU_CALL_MIN_SIZE && ++tries <= 5)) {
> -            g_usleep(10000);
> +            g_usleep(100);
>              if (n == 0) {
>                  qemu_event_reset(&rcu_call_ready_event);
>                  n = atomic_read(&rcu_call_count);
> @@ -254,24 +254,20 @@ static void *call_rcu_thread(void *opaque)
>  
>          atomic_sub(&rcu_call_count, n);
>          synchronize_rcu();
> -        qemu_mutex_lock_iothread();
>          while (n > 0) {
>              node = try_dequeue();
>              while (!node) {
> -                qemu_mutex_unlock_iothread();
>                  qemu_event_reset(&rcu_call_ready_event);
>                  node = try_dequeue();
>                  if (!node) {
>                      qemu_event_wait(&rcu_call_ready_event);
>                      node = try_dequeue();
>                  }
> -                qemu_mutex_lock_iothread();
>              }
>  
>              n--;
>              node->func(node);
>          }
> -        qemu_mutex_unlock_iothread();

This is wrong.  RCU callbacks currently need the "big QEMU lock",
because they can free arbitrary QOM objects (including MemoryRegions).

Paolo

>      }
>      abort();
>  }
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-10 15:14 ` [Qemu-devel] [PATCH v1 2/2] " Yang Zhong
@ 2017-03-10  9:14   ` Paolo Bonzini
  2017-03-10 17:40     ` Xu, Anthony
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2017-03-10  9:14 UTC (permalink / raw)
  To: Yang Zhong, qemu-devel; +Cc: anthony.xu, chao.p.peng



On 10/03/2017 16:14, Yang Zhong wrote:
> There is no need to delete subregion and do memory begin/commit for
> subpage in memory_region_finalize().
> 
> This patch is from Anthony Xu <anthony.xu@intel.com>.
> 
> Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> ---
>  memory.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/memory.c b/memory.c
> index 284894b..3e9bfff 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1505,13 +1505,14 @@ static void memory_region_finalize(Object *obj)
>       * and cause an infinite loop.
>       */
>      mr->enabled = false;
> -    memory_region_transaction_begin();
> -    while (!QTAILQ_EMPTY(&mr->subregions)) {
> -        MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
> -        memory_region_del_subregion(mr, subregion);
> +    if (!mr->subpage) {
> +        memory_region_transaction_begin();
> +        while (!QTAILQ_EMPTY(&mr->subregions)) {
> +            MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
> +            memory_region_del_subregion(mr, subregion);
> +        }
> +        memory_region_transaction_commit();

Subpages never have subregions, so the loop never runs.  The
begin/commit pair then becomes:

    ++memory_region_transaction_depth;
    --memory_region_transaction_depth;
    if (!memory_region_transaction_depth) {
        if (memory_region_update_pending) {
            ...
        } else if (ioeventfd_update_pending) {
            ...
        }
        // memory_region_clear_pending()
        memory_region_update_pending = false;
        ioeventfd_update_pending = false;
   }

If memory_region_transaction_depth is > 0 the begin/commit pair does
nothing.

But if memory_region_transaction_depth is == 0, there should be no
update pending because the loop has never run.  So I don't see what your
patch can change.

Of course there could be an update pending because of a bug elsewhere,
but I tried adding this patch:

diff --git a/memory.c b/memory.c
index 284894b..2208a21 100644
--- a/memory.c
+++ b/memory.c
@@ -903,6 +903,10 @@ static void
address_space_update_topology(AddressSpace *as)
 void memory_region_transaction_begin(void)
 {
     qemu_flush_coalesced_mmio_buffer();
+    if (!memory_region_transaction_depth) {
+        assert(!memory_region_update_pending);
+        assert(!ioeventfd_update_pending);
+    }
     ++memory_region_transaction_depth;
 }

and at least a basic qemu-system-x86_64 run started just fine.  So why
does this patch make a difference?

Paolo

>      }
> -    memory_region_transaction_commit();
> -

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to 2752KB
@ 2017-03-10 15:14 Yang Zhong
  2017-03-10  8:41 ` Paolo Bonzini
  2017-03-10 15:14 ` [Qemu-devel] [PATCH v1 2/2] " Yang Zhong
  0 siblings, 2 replies; 20+ messages in thread
From: Yang Zhong @ 2017-03-10 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, anthony.xu, chao.p.peng, Yang Zhong

There are a lot of memory allocation during the qemu bootup, which are
freed later by RCU thread,which means the heap size becomes biger and
biger when allocation happens, but the heap may not be shrinked even
after release happens,because some memory blocks in top of heap are still
being used.Decreasing the sleep and removing qemu_mutex_unlock_iothread()
lock, which make call_rcu_thread()thread response the free memory in time.
This patch will reduce heap Rss around 10M.

This patch is from Anthony xu <anthony.xu@intel.com>.

Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 util/rcu.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/util/rcu.c b/util/rcu.c
index 9adc5e4..c5c373c 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -167,7 +167,7 @@ void synchronize_rcu(void)
 }
 
 
-#define RCU_CALL_MIN_SIZE        30
+#define RCU_CALL_MIN_SIZE        5
 
 /* Multi-producer, single-consumer queue based on urcu/static/wfqueue.h
  * from liburcu.  Note that head is only used by the consumer.
@@ -241,7 +241,7 @@ static void *call_rcu_thread(void *opaque)
          * added before synchronize_rcu() starts.
          */
         while (n == 0 || (n < RCU_CALL_MIN_SIZE && ++tries <= 5)) {
-            g_usleep(10000);
+            g_usleep(100);
             if (n == 0) {
                 qemu_event_reset(&rcu_call_ready_event);
                 n = atomic_read(&rcu_call_count);
@@ -254,24 +254,20 @@ static void *call_rcu_thread(void *opaque)
 
         atomic_sub(&rcu_call_count, n);
         synchronize_rcu();
-        qemu_mutex_lock_iothread();
         while (n > 0) {
             node = try_dequeue();
             while (!node) {
-                qemu_mutex_unlock_iothread();
                 qemu_event_reset(&rcu_call_ready_event);
                 node = try_dequeue();
                 if (!node) {
                     qemu_event_wait(&rcu_call_ready_event);
                     node = try_dequeue();
                 }
-                qemu_mutex_lock_iothread();
             }
 
             n--;
             node->func(node);
         }
-        qemu_mutex_unlock_iothread();
     }
     abort();
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-10 15:14 [Qemu-devel] [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to 2752KB Yang Zhong
  2017-03-10  8:41 ` Paolo Bonzini
@ 2017-03-10 15:14 ` Yang Zhong
  2017-03-10  9:14   ` Paolo Bonzini
  1 sibling, 1 reply; 20+ messages in thread
From: Yang Zhong @ 2017-03-10 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, anthony.xu, chao.p.peng, Yang Zhong

There is no need to delete subregion and do memory begin/commit for
subpage in memory_region_finalize().

This patch is from Anthony Xu <anthony.xu@intel.com>.

Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 memory.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/memory.c b/memory.c
index 284894b..3e9bfff 100644
--- a/memory.c
+++ b/memory.c
@@ -1505,13 +1505,14 @@ static void memory_region_finalize(Object *obj)
      * and cause an infinite loop.
      */
     mr->enabled = false;
-    memory_region_transaction_begin();
-    while (!QTAILQ_EMPTY(&mr->subregions)) {
-        MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
-        memory_region_del_subregion(mr, subregion);
+    if (!mr->subpage) {
+        memory_region_transaction_begin();
+        while (!QTAILQ_EMPTY(&mr->subregions)) {
+            MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
+            memory_region_del_subregion(mr, subregion);
+        }
+        memory_region_transaction_commit();
     }
-    memory_region_transaction_commit();
-
     mr->destructor(mr);
     memory_region_clear_coalescing(mr);
     g_free((char *)mr->name);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-10  8:41 ` Paolo Bonzini
@ 2017-03-10 16:05   ` Xu, Anthony
  2017-03-10 16:07     ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Xu, Anthony @ 2017-03-10 16:05 UTC (permalink / raw)
  To: Paolo Bonzini, Zhong, Yang, qemu-devel; +Cc: Peng, Chao P

> -----Original Message-----
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Friday, March 10, 2017 12:42 AM
> To: Zhong, Yang <yang.zhong@intel.com>; qemu-devel@nongnu.org
> Cc: Xu, Anthony <anthony.xu@intel.com>; Peng, Chao P
> <chao.p.peng@intel.com>
> Subject: Re: [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to
> 2752KB
> 
> 
> 
> On 10/03/2017 16:14, Yang Zhong wrote:
> > There are a lot of memory allocation during the qemu bootup, which are
> > freed later by RCU thread,which means the heap size becomes biger and
> > biger when allocation happens, but the heap may not be shrinked even
> > after release happens,because some memory blocks in top of heap are
> > still being used.Decreasing the sleep and removing
> > qemu_mutex_unlock_iothread() lock, which make call_rcu_thread()thread
> response the free memory in time.
> > This patch will reduce heap Rss around 10M.
> >
> > This patch is from Anthony xu <anthony.xu@intel.com>.
> >
> > Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> > ---
> >  util/rcu.c | 8 ++------
> >  1 file changed, 2 insertions(+), 6 deletions(-)
> >
> > diff --git a/util/rcu.c b/util/rcu.c
> > index 9adc5e4..c5c373c 100644
> > --- a/util/rcu.c
> > +++ b/util/rcu.c
> > @@ -167,7 +167,7 @@ void synchronize_rcu(void)  }
> >
> >
> > -#define RCU_CALL_MIN_SIZE        30
> > +#define RCU_CALL_MIN_SIZE        5
> >
> >  /* Multi-producer, single-consumer queue based on
> urcu/static/wfqueue.h
> >   * from liburcu.  Note that head is only used by the consumer.
> > @@ -241,7 +241,7 @@ static void *call_rcu_thread(void *opaque)
> >           * added before synchronize_rcu() starts.
> >           */
> >          while (n == 0 || (n < RCU_CALL_MIN_SIZE && ++tries <= 5)) {
> > -            g_usleep(10000);
> > +            g_usleep(100);
> >              if (n == 0) {
> >                  qemu_event_reset(&rcu_call_ready_event);
> >                  n = atomic_read(&rcu_call_count); @@ -254,24 +254,20
> > @@ static void *call_rcu_thread(void *opaque)
> >
> >          atomic_sub(&rcu_call_count, n);
> >          synchronize_rcu();
> > -        qemu_mutex_lock_iothread();
> >          while (n > 0) {
> >              node = try_dequeue();
> >              while (!node) {
> > -                qemu_mutex_unlock_iothread();
> >                  qemu_event_reset(&rcu_call_ready_event);
> >                  node = try_dequeue();
> >                  if (!node) {
> >                      qemu_event_wait(&rcu_call_ready_event);
> >                      node = try_dequeue();
> >                  }
> > -                qemu_mutex_lock_iothread();
> >              }
> >
> >              n--;
> >              node->func(node);
> >          }
> > -        qemu_mutex_unlock_iothread();
> 
> This is wrong.  RCU callbacks currently need the "big QEMU lock", because
> they can free arbitrary QOM objects (including MemoryRegions).

Using "big QEMU lock" in RCU callbacks is a little bit heavy, we'd like to remove it.

We noticed an issue related to MemoryRegion popping up after we removed the global lock.
The root cause is in MemoryRegion destruction memory_region_transaction_commit is called, which may cause issue.
Freeing MemoryRegion seems not cause any issue.
Patch2 in this series fixes this issue,
We found out only subpage MemoryRegion is freed in RCU callbacks, and memory_region_transaction_commit
is not needed for subpage MemoryRegion because it doesn't have sub regions.

Ideally, freeing QOM object should not require a global lock. 
If you see any other QOM requiring a global lock, please let us know, we are willing to fix it.


Thanks,
Anthony


> 
> Paolo
> 
> >      }
> >      abort();
> >  }
> >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-10 16:05   ` Xu, Anthony
@ 2017-03-10 16:07     ` Paolo Bonzini
  2017-03-10 19:30       ` Xu, Anthony
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2017-03-10 16:07 UTC (permalink / raw)
  To: Xu, Anthony, Zhong, Yang, qemu-devel; +Cc: Peng, Chao P



On 10/03/2017 17:05, Xu, Anthony wrote:
> Ideally, freeing QOM object should not require a global lock. 
> If you see any other QOM requiring a global lock, please let us know, we are willing to fix it.

All of them.  When unplugging a device, the device object will be freed
from an RCU callback.

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-10  9:14   ` Paolo Bonzini
@ 2017-03-10 17:40     ` Xu, Anthony
  2017-03-11 17:04       ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Xu, Anthony @ 2017-03-10 17:40 UTC (permalink / raw)
  To: Paolo Bonzini, Zhong, Yang, qemu-devel; +Cc: Peng, Chao P

> -----Original Message-----
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Friday, March 10, 2017 1:14 AM
> To: Zhong, Yang <yang.zhong@intel.com>; qemu-devel@nongnu.org
> Cc: Xu, Anthony <anthony.xu@intel.com>; Peng, Chao P
> <chao.p.peng@intel.com>
> Subject: Re: [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to
> 2752KB
> 
> 
> 
> On 10/03/2017 16:14, Yang Zhong wrote:
> > There is no need to delete subregion and do memory begin/commit for
> > subpage in memory_region_finalize().
> >
> > This patch is from Anthony Xu <anthony.xu@intel.com>.
> >
> > Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> > ---
> >  memory.c | 13 +++++++------
> >  1 file changed, 7 insertions(+), 6 deletions(-)
> >
> > diff --git a/memory.c b/memory.c
> > index 284894b..3e9bfff 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1505,13 +1505,14 @@ static void memory_region_finalize(Object
> *obj)
> >       * and cause an infinite loop.
> >       */
> >      mr->enabled = false;
> > -    memory_region_transaction_begin();
> > -    while (!QTAILQ_EMPTY(&mr->subregions)) {
> > -        MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
> > -        memory_region_del_subregion(mr, subregion);
> > +    if (!mr->subpage) {
> > +        memory_region_transaction_begin();
> > +        while (!QTAILQ_EMPTY(&mr->subregions)) {
> > +            MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
> > +            memory_region_del_subregion(mr, subregion);
> > +        }
> > +        memory_region_transaction_commit();
> 
> Subpages never have subregions, so the loop never runs.  The begin/commit
> pair then becomes:
> 
>     ++memory_region_transaction_depth;
>     --memory_region_transaction_depth;
>     if (!memory_region_transaction_depth) {
>         if (memory_region_update_pending) {
>             ...
>         } else if (ioeventfd_update_pending) {
>             ...
>         }
>         // memory_region_clear_pending()
>         memory_region_update_pending = false;
>         ioeventfd_update_pending = false;
>    }
> 
> If memory_region_transaction_depth is > 0 the begin/commit pair does
> nothing.
> 
> But if memory_region_transaction_depth is == 0, there should be no update
> pending because the loop has never run.  So I don't see what your patch can
> change.

As I mentioned in PATCH1, this patch is used to fix an issue after we remove
the global lock in RCU callback. After global lock is removed, other thread may 
set up update pending, so memory_region_transaction_commit 
may try to rebuild PhysPageMap even the loop doesn’t run, other thread may try to 
rebuild PhysPageMap at the same time, it is a race condition.
subpage MemoryRegion is a specific MemoryRegion, it doesn't belong to any address 
space, it is only used to handle subpage. We may use a new structure other than 
MemoryRegion to handle subpage to make the logic more clearer. After the change, 
RCU callback will not free any MemoryRegion. 


Anthony

> 
> Of course there could be an update pending because of a bug elsewhere,
> but I tried adding this patch:
> 
> diff --git a/memory.c b/memory.c
> index 284894b..2208a21 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -903,6 +903,10 @@ static void
> address_space_update_topology(AddressSpace *as)  void
> memory_region_transaction_begin(void)
>  {
>      qemu_flush_coalesced_mmio_buffer();
> +    if (!memory_region_transaction_depth) {
> +        assert(!memory_region_update_pending);
> +        assert(!ioeventfd_update_pending);
> +    }
>      ++memory_region_transaction_depth;
>  }
> 
> and at least a basic qemu-system-x86_64 run started just fine.  So why does
> this patch make a difference?
> 
> Paolo
> 
> >      }
> > -    memory_region_transaction_commit();
> > -

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-10 16:07     ` Paolo Bonzini
@ 2017-03-10 19:30       ` Xu, Anthony
  2017-03-11 17:02         ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Xu, Anthony @ 2017-03-10 19:30 UTC (permalink / raw)
  To: Paolo Bonzini, Zhong, Yang, qemu-devel; +Cc: Peng, Chao P

> > Ideally, freeing QOM object should not require a global lock.
> > If you see any other QOM requiring a global lock, please let us know, we
> are willing to fix it.
> 
> All of them.  When unplugging a device, the device object will be freed
> from an RCU callback.
> 

Thanks for your reply,
Some objects may not need lock at all to be freed,
Some objects may just need a small lock to be freed,

Should we let RCU callbacks to decide which lock they need instead of enforcing  
the global lock in RCU thread?

As for the device object, can we get the global lock inside the object free/destroy function for now?

If we can remove the global lock inside RCU thread, we can save 9MB heap memory, that's a lot!

Please share with us if you have other idea to do this.

Thanks
Anthony




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-10 19:30       ` Xu, Anthony
@ 2017-03-11 17:02         ` Paolo Bonzini
  0 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2017-03-11 17:02 UTC (permalink / raw)
  To: Anthony Xu; +Cc: Yang Zhong, qemu-devel, Chao P Peng



----- Original Message -----
> From: "Anthony Xu" <anthony.xu@intel.com>
> To: "Paolo Bonzini" <pbonzini@redhat.com>, "Yang Zhong" <yang.zhong@intel.com>, qemu-devel@nongnu.org
> Cc: "Chao P Peng" <chao.p.peng@intel.com>
> Sent: Friday, March 10, 2017 8:30:06 PM
> Subject: RE: [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to 2752KB
> 
> > > Ideally, freeing QOM object should not require a global lock.
> > > If you see any other QOM requiring a global lock, please let us know, we
> > are willing to fix it.
> > 
> > All of them.  When unplugging a device, the device object will be freed
> > from an RCU callback.
> > 
> 
> Thanks for your reply,
> Some objects may not need lock at all to be freed,
> Some objects may just need a small lock to be freed,
> 
> Should we let RCU callbacks to decide which lock they need instead of
> enforcing the global lock in RCU thread?

Splitting the global lock and deciding which object requires which lock
is very, very hard.

The problem is that everything calling object_unref potentially needs the
global lock, and that includes the following call chain: object_unref
-> memory_region_unref -> flatview_destroy -> flatview_unref ->
address_space_update_topology -> memory_region_transaction_commit.

Paolo

> As for the device object, can we get the global lock inside the object
> free/destroy function for now?
> 
> If we can remove the global lock inside RCU thread, we can save 9MB heap
> memory, that's a lot!
> 
> Please share with us if you have other idea to do this.
> 
> Thanks
> Anthony
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-10 17:40     ` Xu, Anthony
@ 2017-03-11 17:04       ` Paolo Bonzini
  2017-03-14  5:14         ` Xu, Anthony
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2017-03-11 17:04 UTC (permalink / raw)
  To: Anthony Xu; +Cc: Yang Zhong, qemu-devel, Chao P Peng


> > Subpages never have subregions, so the loop never runs.  The begin/commit
> > pair then becomes:
> > 
> >     ++memory_region_transaction_depth;
> >     --memory_region_transaction_depth;
> >     if (!memory_region_transaction_depth) {
> >         if (memory_region_update_pending) {
> >             ...
> >         } else if (ioeventfd_update_pending) {
> >             ...
> >         }
> >         // memory_region_clear_pending()
> >         memory_region_update_pending = false;
> >         ioeventfd_update_pending = false;
> >    }
> > 
> > If memory_region_transaction_depth is > 0 the begin/commit pair does
> > nothing.
> > 
> > But if memory_region_transaction_depth is == 0, there should be no update
> > pending because the loop has never run.  So I don't see what your patch can
> > change.
> 
> As I mentioned in PATCH1, this patch is used to fix an issue after we remove
> the global lock in RCU callback. After global lock is removed, other thread
> may set up update pending, so memory_region_transaction_commit
> may try to rebuild PhysPageMap even the loop doesn’t run, other thread may
> try to rebuild PhysPageMap at the same time, it is a race condition.
> subpage MemoryRegion is a specific MemoryRegion, it doesn't belong to any
> address space, it is only used to handle subpage. We may use a new structure
> other than MemoryRegion to handle subpage to make the logic more clearer. After
> the change, RCU callback will not free any MemoryRegion.

This is not true.  Try hot-unplugging a device.

I'm all for reducing the scope of the global QEMU lock, but this needs a plan
and a careful analysis of the involved data structures across _all_ instance_finalize
implementations.

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-11 17:04       ` Paolo Bonzini
@ 2017-03-14  5:14         ` Xu, Anthony
  2017-03-14 10:14           ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Xu, Anthony @ 2017-03-14  5:14 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Zhong, Yang, qemu-devel, Peng, Chao P

> > > Subpages never have subregions, so the loop never runs.  The
> begin/commit
> > > pair then becomes:
> > >
> > >     ++memory_region_transaction_depth;
> > >     --memory_region_transaction_depth;
> > >     if (!memory_region_transaction_depth) {
> > >         if (memory_region_update_pending) {
> > >             ...
> > >         } else if (ioeventfd_update_pending) {
> > >             ...
> > >         }
> > >         // memory_region_clear_pending()
> > >         memory_region_update_pending = false;
> > >         ioeventfd_update_pending = false;
> > >    }
> > >
> > > If memory_region_transaction_depth is > 0 the begin/commit pair does
> > > nothing.
> > >
> > > But if memory_region_transaction_depth is == 0, there should be no
> update
> > > pending because the loop has never run.  So I don't see what your patch
> can
> > > change.
> >
> > As I mentioned in PATCH1, this patch is used to fix an issue after we
> remove
> > the global lock in RCU callback. After global lock is removed, other thread
> > may set up update pending, so memory_region_transaction_commit
> > may try to rebuild PhysPageMap even the loop doesn’t run, other thread
> may
> > try to rebuild PhysPageMap at the same time, it is a race condition.
> > subpage MemoryRegion is a specific MemoryRegion, it doesn't belong to
> any
> > address space, it is only used to handle subpage. We may use a new
> structure
> > other than MemoryRegion to handle subpage to make the logic more
> clearer. After
> > the change, RCU callback will not free any MemoryRegion.
> 
> This is not true.  Try hot-unplugging a device.

You are right, hot-unplugging does cause memory region(not for subpage) freed in RCU thread.

I tried pci device hot plug/unplug,
When plug a device,
handle_hmp_command->qmp_device_add->qdev_device_add->
virtio_pci_dc_realize->pci_qdev_realize->virtio_pci_realize->
virtio_device_realize->virtio_bus_device_plugged->virtio_pci_device_plugged 
are called.
when unplug a device,
kvm_cpu_exec->memory_region_write_accessor->pci_write->
virtio_device_unrealize->virtio_pci_device_unplugged 
are called.

memory region addition and remove happen in virtio_pci_device_plugged 
and virtio_pci_device_unplugged respectively, memory region operation needs
 to acquire the global lock, but none of them happens in RCU thread.

When memory_region_finalize is called, the memory region has been removed 
from the address space (removed in virtio_pci_device_unplugged), both mr->subregions 
and mr->coalesced are empty. It makes sense to me, when memory_region_finalize is called, 
means this memory region is not used any more.
Please correct me if I'm wrong here, I only tried pci device hot plug/unplug.

If above assumption is correct, seems we don't need the global lock for memory region  
reclamation in RCU thread. Please let me know if other  memory reclamation in RCU thread 
need the global lock.

Under the assumption, I have below patch to remove the global lock in RCU thread, 
I tested vm boot ,reboot, shutdown and pci device hot plug/unplug.
Please review the patch, 
If no further issue, I will send out an official patch later.

Thanks,
Anthony


diff --git a/memory.c b/memory.c
index 6c58373..43e06e9 100644
--- a/memory.c
+++ b/memory.c
@@ -1503,15 +1503,9 @@ static void memory_region_finalize(Object *obj)
      * and cause an infinite loop.
      */
     mr->enabled = false;
-    memory_region_transaction_begin();
-    while (!QTAILQ_EMPTY(&mr->subregions)) {
-        MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
-        memory_region_del_subregion(mr, subregion);
-    }
-    memory_region_transaction_commit();
-
+    assert(QTAILQ_EMPTY(&mr->subregions));
     mr->destructor(mr);
-    memory_region_clear_coalescing(mr);
+    assert(QTAILQ_EMPTY(&mr->coalesced));
     g_free((char *)mr->name);
     g_free(mr->ioeventfds);
 }
diff --git a/util/rcu.c b/util/rcu.c
index 9adc5e4..51e0248 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -254,24 +254,20 @@ static void *call_rcu_thread(void *opaque)

         atomic_sub(&rcu_call_count, n);
         synchronize_rcu();
-        qemu_mutex_lock_iothread();
         while (n > 0) {
             node = try_dequeue();
             while (!node) {
-                qemu_mutex_unlock_iothread();
                 qemu_event_reset(&rcu_call_ready_event);
                 node = try_dequeue();
                 if (!node) {
                     qemu_event_wait(&rcu_call_ready_event);
                     node = try_dequeue();
                 }
-                qemu_mutex_lock_iothread();
             }

             n--;
             node->func(node);
         }
-        qemu_mutex_unlock_iothread();
     }
     abort();
 }





> 
> I'm all for reducing the scope of the global QEMU lock, 
Thanks,


>but this needs a plan
> and a careful analysis of the involved data structures across _all_
> instance_finalize
> implementations.
Agreed,

Below functions are registered in RCU thread
address_space_dispatch_free,
do_address_space_destroy
flatview_unref
reclaim_ramblock,
qht_map_destroy,
migration_bitmap_free

first three are address space related, should work without global lock per above analysis.
The rest are very simple, seems doesn't need global lock.




> 
> Paolo

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-14  5:14         ` Xu, Anthony
@ 2017-03-14 10:14           ` Paolo Bonzini
  2017-03-14 21:23             ` Xu, Anthony
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2017-03-14 10:14 UTC (permalink / raw)
  To: Xu, Anthony; +Cc: Zhong, Yang, qemu-devel, Peng, Chao P



On 14/03/2017 06:14, Xu, Anthony wrote:
> Below functions are registered in RCU thread
> address_space_dispatch_free,
> do_address_space_destroy
> flatview_unref
> reclaim_ramblock,
> qht_map_destroy,
> migration_bitmap_free
> 
> first three are address space related, should work without global lock per above analysis.
> The rest are very simple, seems doesn't need global lock.

flatview_unref can call object_unref and thus reach:

- all QOM instance_finalize callbacks

- all QOM property release callbacks

In turn, of QOM property release callbacks the more important ones are
release_drive (which calls blockdev_auto_del and blk_detach_dev) and
release_chr (which calls qemu_chr_fe_deinit).

Your patch is incorrect, sorry.  If it were that simple, it would have
been done already...

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-14 10:14           ` Paolo Bonzini
@ 2017-03-14 21:23             ` Xu, Anthony
  2017-03-15  8:36               ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Xu, Anthony @ 2017-03-14 21:23 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Zhong, Yang, qemu-devel, Peng, Chao P

> -----Original Message-----
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Tuesday, March 14, 2017 3:15 AM
> To: Xu, Anthony <anthony.xu@intel.com>
> Cc: Zhong, Yang <yang.zhong@intel.com>; qemu-devel@nongnu.org; Peng,
> Chao P <chao.p.peng@intel.com>
> Subject: Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from
> 12252kB to 2752KB
> 
> 
> 
> On 14/03/2017 06:14, Xu, Anthony wrote:
> > Below functions are registered in RCU thread
> > address_space_dispatch_free,
> > do_address_space_destroy
> > flatview_unref
> > reclaim_ramblock,
> > qht_map_destroy,
> > migration_bitmap_free
> >
> > first three are address space related, should work without global lock per
> above analysis.
> > The rest are very simple, seems doesn't need global lock.
> 
> flatview_unref can call object_unref and thus reach:

Okay,  flatview_unref is the one you worried about,

Flatview_unref is registered as a RCU callback only in address_space_update_topology,
Strangely, it is registered as a RCU callback, and is called directly in this function. 
Basially flatview_unref is called twice for old view. There are some comments, 
but it is not clear to me, is this a bug or by design? Is flatview_destroy called in current thread 
or RCU thread?


static void address_space_update_topology(AddressSpace *as)
{
    FlatView *old_view = address_space_get_flatview(as);
    FlatView *new_view = generate_memory_topology(as->root);

    address_space_update_topology_pass(as, old_view, new_view, false);
    address_space_update_topology_pass(as, old_view, new_view, true);

    /* Writes are protected by the BQL.  */
    atomic_rcu_set(&as->current_map, new_view);
    call_rcu(old_view, flatview_unref, rcu);

    /* Note that all the old MemoryRegions are still alive up to this
     * point.  This relieves most MemoryListeners from the need to
     * ref/unref the MemoryRegions they get---unless they use them
     * outside the iothread mutex, in which case precise reference
     * counting is necessary.
     */
    flatview_unref(old_view);

    address_space_update_ioeventfds(as);
}


Let me split the patch, Do you think below patch is correct?

--- a/memory.c
+++ b/memory.c
@@ -1503,15 +1503,9 @@ static void memory_region_finalize(Object *obj)
      * and cause an infinite loop.
      */
     mr->enabled = false;
-    memory_region_transaction_begin();
-    while (!QTAILQ_EMPTY(&mr->subregions)) {
-        MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
-        memory_region_del_subregion(mr, subregion);
-    }
-    memory_region_transaction_commit();
-
+    assert(QTAILQ_EMPTY(&mr->subregions));
     mr->destructor(mr);
-    memory_region_clear_coalescing(m
+    assert(QTAILQ_EMPTY(&mr->coalesced));
     g_free((char *)mr->name);
     g_free(mr->ioeventfds);
 }


Then let us take a look at flatview_unref, memory_region_unref may call any QOM finalize through 
Object_unref(mr->owner), that's your concern, I got it. It is way too complicated to look at each QOM 
object release callbacks and each QOM property release callbacks.  I gave up this path.

How about fall back to synchronize_rcu?
As for address space, the RCU read lock is used to protect PhysPageMap, but not the regular MemoryRegions, 
because that are not allocated in PhysPageMap build. Subpage MemoryRegion is an exception, because that is 
allocated in PhysPageMap build.

The MemoryRegions returned from address_space_translate are regular MemoryRegions, so 
address_space_write_continue and address_space_read_continue don't need RCU read lock.

Below patch reclaim address space in synchronized way, it reduces heap size from ~12MB to ~3MB.
You need to apply this patch with above patch. And when address_space_dispatch_free is called 
it already holds the global lock, we don't need to acquire the global lock in address_space_dispatch_free.
Please review this patch.

Thanks,
Anthony


diff --git a/cputlb.c b/cputlb.c
index 6c39927..98bd21f 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -347,6 +347,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
         tlb_add_large_page(env, vaddr, size);
     }

+    rcu_read_lock();
     sz = size;
     section = address_space_translate_for_iotlb(cpu, asidx, paddr, &xlat, &sz);
     assert(sz >= TARGET_PAGE_SIZE);
@@ -406,6 +407,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
     } else {
         te->addr_write = -1;
     }
+    rcu_read_unlock();
 }

 /* Add a new TLB entry, but without specifying the memory
diff --git a/exec.c b/exec.c
index 6fa337b..446d622 100644
--- a/exec.c
+++ b/exec.c
@@ -455,7 +455,7 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
     IOMMUTLBEntry iotlb = {0};
     MemoryRegionSection *section;
     MemoryRegion *mr;
-
+    rcu_read_lock();
     for (;;) {
         AddressSpaceDispatch *d = atomic_rcu_read(&as->dispatch);
         section = address_space_lookup_region(d, addr, false);
@@ -477,7 +477,7 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
                 | (addr & iotlb.addr_mask));
         as = iotlb.target_as;
     }
-
+    rcu_read_unlock();
     return iotlb;
 }

@@ -490,6 +490,7 @@ MemoryRegion *address_space_translate(AddressSpace *as, hwaddr addr,
     MemoryRegionSection *section;
     MemoryRegion *mr;

+    rcu_read_lock();
     for (;;) {
         AddressSpaceDispatch *d = atomic_rcu_read(&as->dispatch);
         section = address_space_translate_internal(d, addr, &addr, plen, true);
@@ -517,6 +518,7 @@ MemoryRegion *address_space_translate(AddressSpace *as, hwaddr addr,
     }

     *xlat = addr;
+    rcu_read_unlock();
     return mr;
 }

@@ -2413,7 +2415,8 @@ static void mem_commit(MemoryListener *listener)

     atomic_rcu_set(&as->dispatch, next);
     if (cur) {
-        call_rcu(cur, address_space_dispatch_free, rcu);
+        synchronize_rcu();
+        address_space_dispatch_free(cur);
     }
 }

@@ -2459,7 +2462,8 @@ void address_space_destroy_dispatch(AddressSpace *as)

     atomic_rcu_set(&as->dispatch, NULL);
     if (d) {
-        call_rcu(d, address_space_dispatch_free, rcu);
+        synchronize_rcu();
+        address_space_dispatch_free(d);
     }
 }

@@ -2686,12 +2690,10 @@ MemTxResult address_space_write(AddressSpace *as, hwaddr addr, MemTxAttrs attrs,
     MemTxResult result = MEMTX_OK;

     if (len > 0) {
-        rcu_read_lock();
         l = len;
         mr = address_space_translate(as, addr, &addr1, &l, true);
         result = address_space_write_continue(as, addr, attrs, buf, len,
                                               addr1, l, mr);
-        rcu_read_unlock();
     }

     return result;
@@ -2776,12 +2778,10 @@ MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
     MemTxResult result = MEMTX_OK;

     if (len > 0) {
-        rcu_read_lock();
         l = len;
         mr = address_space_translate(as, addr, &addr1, &l, false);
         result = address_space_read_continue(as, addr, attrs, buf, len,
                                              addr1, l, mr);
-        rcu_read_unlock();
     }

     return result;
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index febe519..3557ac5 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -914,8 +914,6 @@ void vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
     IOMMUTLBEntry iotlb;
     uint64_t uaddr, len;

-    rcu_read_lock();
-
     iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
                                           iova, write);
     if (iotlb.target_as != NULL) {
@@ -936,7 +934,7 @@ void vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
         }
     }
 out:
-    rcu_read_unlock();
+    return;
 }

 static int vhost_virtqueue_start(struct vhost_dev *dev,













> 
> - all QOM instance_finalize callbacks
> 
> - all QOM property release callbacks
> 
> In turn, of QOM property release callbacks the more important ones are
> release_drive (which calls blockdev_auto_del and blk_detach_dev) and
> release_chr (which calls qemu_chr_fe_deinit).
> 
> Your patch is incorrect, sorry.  If it were that simple, it would have
> been done already...
> 
> Paolo

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-14 21:23             ` Xu, Anthony
@ 2017-03-15  8:36               ` Paolo Bonzini
  2017-03-15 19:05                 ` Xu, Anthony
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2017-03-15  8:36 UTC (permalink / raw)
  To: Xu, Anthony; +Cc: Zhong, Yang, qemu-devel, Peng, Chao P



On 14/03/2017 22:23, Xu, Anthony wrote:
>> flatview_unref can call object_unref and thus reach:
> 
> Okay,  flatview_unref is the one you worried about,
> 
> Flatview_unref is registered as a RCU callback only in address_space_update_topology,
> Strangely, it is registered as a RCU callback, and is called directly in this function. 
> Basially flatview_unref is called twice for old view.

The first unref is done after as->current_map is overwritten.
as->current_map is accessed under RCU, so it needs call_rcu.  It
balances the initial reference that is present since flatview_init.

The second unref is done to balance the flatview_ref in
address_space_get_flatview.

> There are some comments, 
> but it is not clear to me, is this a bug or by design? Is flatview_destroy called in current thread 
> or RCU thread?

You cannot know, because there are also other callers of
address_space_get_flatview.  Usually it's from the RCU thread.

> Let me split the patch, Do you think below patch is correct?
> 
> --- a/memory.c
> +++ b/memory.c
> @@ -1503,15 +1503,9 @@ static void memory_region_finalize(Object *obj)
>       * and cause an infinite loop.
>       */
>      mr->enabled = false;
> -    memory_region_transaction_begin();
> -    while (!QTAILQ_EMPTY(&mr->subregions)) {
> -        MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
> -        memory_region_del_subregion(mr, subregion);
> -    }
> -    memory_region_transaction_commit();
> -
> +    assert(QTAILQ_EMPTY(&mr->subregions));
>      mr->destructor(mr);
> -    memory_region_clear_coalescing(m
> +    assert(QTAILQ_EMPTY(&mr->coalesced));
>      g_free((char *)mr->name);
>      g_free(mr->ioeventfds);
>  }

No.  Callers can:

- let memory_region_finalize remove subregions (commit 2e2b8eb, "memory:
allow destroying a non-empty MemoryRegion", 2015-10-09)

- let memory_region_finalize disable coalesced I/O (in fact there are no
callers of memory_region_clear_coalescing outside memory.c)

> Then let us take a look at flatview_unref, memory_region_unref may call any QOM finalize through 
> Object_unref(mr->owner), that's your concern, I got it. It is way too complicated to look at each QOM 
> object release callbacks and each QOM property release callbacks.  I gave up this path.
> How about fall back to synchronize_rcu?

I'm afraid it would be too slow, but you can test.

Something like the kernel's synchronize_rcu_expedited would work, but we
don't have anything like that in QEMU yet.

> As for address space, the RCU read lock is used to protect PhysPageMap, but not the regular MemoryRegions,

Nope.  The RCU read lock protects all MemoryRegions through
flatview_unref: RCU keeps the FlatView alive, and the FlatView keeps
MemoryRegions alive.

Hence...

> The MemoryRegions returned from address_space_translate are regular MemoryRegions, so 
> address_space_write_continue and address_space_read_continue don't need RCU read lock.

... this is wrong too.

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-15  8:36               ` Paolo Bonzini
@ 2017-03-15 19:05                 ` Xu, Anthony
  2017-03-15 20:21                   ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Xu, Anthony @ 2017-03-15 19:05 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Zhong, Yang, qemu-devel, Peng, Chao P

> The first unref is done after as->current_map is overwritten.
> as->current_map is accessed under RCU, so it needs call_rcu.  It
> balances the initial reference that is present since flatview_init.

Got it, thanks for explanation.

> > but it is not clear to me, is this a bug or by design? Is flatview_destroy called
> in current thread
> > or RCU thread?
> 
> You cannot know, because there are also other callers of
> address_space_get_flatview.  Usually it's from the RCU thread.

If it is from current thread, do we need to check if the global lock is acquired?
As you mentioned flatview_unref may call memory_region_unref.


> - let memory_region_finalize remove subregions (commit 2e2b8eb,
> "memory:
> allow destroying a non-empty MemoryRegion", 2015-10-09)
> 
> - let memory_region_finalize disable coalesced I/O (in fact there are no
> callers of memory_region_clear_coalescing outside memory.c)

Okay, It may have sub regions,

In the comments of memory_region_finalize
"   /* We know the region is not visible in any address space (it
     * does not have a container and cannot be a root either because
     * it has no references,
"

We know the memory region is not a root region, and the memory region has 
already been removed from address space even it has sub regions.
memory_region_transaction_commit should be called when the 
memory region is removed from address space.
memory_region_transaction_commit seems not be needed in memory_region_finalize.
Let me know if you think otherwise.


> > How about fall back to synchronize_rcu?
> 
> I'm afraid it would be too slow, but you can test.

It is not slow, the contention is not high. Yes we can test.


> Nope.  The RCU read lock protects all MemoryRegions through
> flatview_unref: RCU keeps the FlatView alive, and the FlatView keeps
> MemoryRegions alive.

I see, RCU needs to protect MemoryRegions as well.



Please review below patch, MemoryRegion is protected by RCU.

Thanks,
Anthony



diff --git a/exec.c b/exec.c
index a22f5a0..6631668 100644
--- a/exec.c
+++ b/exec.c
@@ -2521,7 +2521,8 @@ static void mem_commit(MemoryListener *listener)

     atomic_rcu_set(&as->dispatch, next);
     if (cur) {
-        call_rcu(cur, address_space_dispatch_free, rcu);
+        synchronize_rcu();
+        address_space_dispatch_free(cur);
     }
 }

@@ -2567,7 +2568,8 @@ void address_space_destroy_dispatch(AddressSpace *as)

     atomic_rcu_set(&as->dispatch, NULL);
     if (d) {
-        call_rcu(d, address_space_dispatch_free, rcu);
+        synchronize_rcu();
+        address_space_dispatch_free(d);
     }
 }

@@ -2712,19 +2714,22 @@ static bool prepare_mmio_access(MemoryRegion *mr)
     return release_lock;
 }

-/* Called within RCU critical section.  */
-static MemTxResult address_space_write_continue(AddressSpace *as, hwaddr addr,
-                                                MemTxAttrs attrs,
-                                                const uint8_t *buf,
-                                                int len, hwaddr addr1,
-                                                hwaddr l, MemoryRegion *mr)
+MemTxResult address_space_write(AddressSpace *as, hwaddr addr,
+                  MemTxAttrs attrs, const uint8_t *buf, int len)
 {
     uint8_t *ptr;
     uint64_t val;
+    hwaddr l=len;
+    hwaddr addr1;
+    MemoryRegion *mr;
     MemTxResult result = MEMTX_OK;
     bool release_lock = false;

     for (;;) {
+        rcu_read_lock();
+        mr = address_space_translate(as, addr, &addr1, &l, true);
+        memory_region_ref(mr);
+        rcu_read_unlock();
         if (!memory_access_is_direct(mr, true)) {
             release_lock |= prepare_mmio_access(mr);
             l = memory_access_size(mr, l, addr1);
@@ -2764,7 +2769,7 @@ static MemTxResult address_space_write_continue(AddressSpace *as, hwaddr addr,
             memcpy(ptr, buf, l);
             invalidate_and_set_dirty(mr, addr1, l);
         }
-
+        memory_region_unref(mr);
         if (release_lock) {
             qemu_mutex_unlock_iothread();
             release_lock = false;
@@ -2779,27 +2784,6 @@ static MemTxResult address_space_write_continue(AddressSpace *as, hwaddr addr,
         }

         l = len;
-        mr = address_space_translate(as, addr, &addr1, &l, true);
-    }
-
-    return result;
-}
-
-MemTxResult address_space_write(AddressSpace *as, hwaddr addr, MemTxAttrs attrs,
-                                const uint8_t *buf, int len)
-{
-    hwaddr l;
-    hwaddr addr1;
-    MemoryRegion *mr;
-    MemTxResult result = MEMTX_OK;
-
-    if (len > 0) {
-        rcu_read_lock();
-        l = len;
-        mr = address_space_translate(as, addr, &addr1, &l, true);
-        result = address_space_write_continue(as, addr, attrs, buf, len,
-                                              addr1, l, mr);
-        rcu_read_unlock();
     }

     return result;
diff --git a/memory.c b/memory.c
index 64b0a60..d12437c 100644
--- a/memory.c
+++ b/memory.c
@@ -1505,13 +1505,10 @@ static void memory_region_finalize(Object *obj)
      * and cause an infinite loop.
      */
     mr->enabled = false;
-    memory_region_transaction_begin();
     while (!QTAILQ_EMPTY(&mr->subregions)) {
         MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
         memory_region_del_subregion(mr, subregion);
     }
-    memory_region_transaction_commit();
-
     mr->destructor(mr);
     memory_region_clear_coalescing(mr);
     g_free((char *)mr->name);



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-15 19:05                 ` Xu, Anthony
@ 2017-03-15 20:21                   ` Paolo Bonzini
  2017-03-16  2:47                     ` Zhong, Yang
  2017-03-16 20:02                     ` Xu, Anthony
  0 siblings, 2 replies; 20+ messages in thread
From: Paolo Bonzini @ 2017-03-15 20:21 UTC (permalink / raw)
  To: Anthony Xu; +Cc: Yang Zhong, Chao P Peng, qemu-devel



----- Original Message -----
> From: "Anthony Xu" <anthony.xu@intel.com>
> To: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: "Yang Zhong" <yang.zhong@intel.com>, "Chao P Peng" <chao.p.peng@intel.com>, qemu-devel@nongnu.org
> Sent: Wednesday, March 15, 2017 8:05:48 PM
> Subject: Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
> 
> > The first unref is done after as->current_map is overwritten.
> > as->current_map is accessed under RCU, so it needs call_rcu.  It
> > balances the initial reference that is present since flatview_init.
> 
> Got it, thanks for explanation.
> 
> > > but it is not clear to me, is this a bug or by design? Is
> > > flatview_destroy called
> > in current thread
> > > or RCU thread?
> > 
> > You cannot know, because there are also other callers of
> > address_space_get_flatview.  Usually it's from the RCU thread.
> 
> If it is from current thread, do we need to check if the global lock is
> acquired?
> As you mentioned flatview_unref may call memory_region_unref.

No, you don't need to check it.  Most functions (in this case,
memory_region_transaction_commit) can only be called under the global lock.
In fact, only address_space_read/write/ld*/st*, plus memory_region_find can
be called without the global lock.

> In the comments of memory_region_finalize
> "   /* We know the region is not visible in any address space (it
>      * does not have a container and cannot be a root either because
>      * it has no references,
> "
> 
> We know the memory region is not a root region, and the memory region has
> already been removed from address space even it has sub regions.
> memory_region_transaction_commit should be called when the
> memory region is removed from address space.
> memory_region_transaction_commit seems not be needed in
> memory_region_finalize.
> Let me know if you think otherwise.

Yes, you can replace memory_region_del_subregion in memory_region_finalize
with special code that does

    assert(!mr->enabled);
    assert(subregion->container == mr);
    subregion->container = NULL;
    QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
    memory_region_unref(subregion);

The last four lines are shared with memory_region_del_subregion, so please
factor them in a new function, for example memory_region_del_subregion_internal.

> > > How about fall back to synchronize_rcu?
> > 
> > I'm afraid it would be too slow, but you can test.
> 
> It is not slow, the contention is not high. Yes we can test.

It's not a matter of contention.  It's a matter of how long an RCU
critical section can be---not just at startup, but at any point
during QEMU execution.

Have you tried tcmalloc or jemalloc?  They use the brk region
less and sometimes are more aggressive in releasing mmap-ed memory
areas.

> Please review below patch, MemoryRegion is protected by RCU.

Removing transaction begin/commit in memory_region_finalize makes
little sense because memory_region_del_subregion calls those two
functions again.  See above for an alternative.  Apart from this,
the patch is at least correct, though I'm not sure it's a good
idea (synchronize_rcu needs testing).  I'm not sure I understand the
address_space_write change).

Paolo

> Thanks,
> Anthony
> 
> 
> 
> diff --git a/exec.c b/exec.c
> index a22f5a0..6631668 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2521,7 +2521,8 @@ static void mem_commit(MemoryListener *listener)
> 
>      atomic_rcu_set(&as->dispatch, next);
>      if (cur) {
> -        call_rcu(cur, address_space_dispatch_free, rcu);
> +        synchronize_rcu();
> +        address_space_dispatch_free(cur);
>      }
>  }
> 
> @@ -2567,7 +2568,8 @@ void address_space_destroy_dispatch(AddressSpace *as)
> 
>      atomic_rcu_set(&as->dispatch, NULL);
>      if (d) {
> -        call_rcu(d, address_space_dispatch_free, rcu);
> +        synchronize_rcu();
> +        address_space_dispatch_free(d);
>      }
>  }
> 
> @@ -2712,19 +2714,22 @@ static bool prepare_mmio_access(MemoryRegion *mr)
>      return release_lock;
>  }
> 
> -/* Called within RCU critical section.  */
> -static MemTxResult address_space_write_continue(AddressSpace *as, hwaddr
> addr,
> -                                                MemTxAttrs attrs,
> -                                                const uint8_t *buf,
> -                                                int len, hwaddr addr1,
> -                                                hwaddr l, MemoryRegion *mr)
> +MemTxResult address_space_write(AddressSpace *as, hwaddr addr,
> +                  MemTxAttrs attrs, const uint8_t *buf, int len)
>  {
>      uint8_t *ptr;
>      uint64_t val;
> +    hwaddr l=len;
> +    hwaddr addr1;
> +    MemoryRegion *mr;
>      MemTxResult result = MEMTX_OK;
>      bool release_lock = false;
> 
>      for (;;) {
> +        rcu_read_lock();
> +        mr = address_space_translate(as, addr, &addr1, &l, true);
> +        memory_region_ref(mr);
> +        rcu_read_unlock();
>          if (!memory_access_is_direct(mr, true)) {
>              release_lock |= prepare_mmio_access(mr);
>              l = memory_access_size(mr, l, addr1);
> @@ -2764,7 +2769,7 @@ static MemTxResult
> address_space_write_continue(AddressSpace *as, hwaddr addr,
>              memcpy(ptr, buf, l);
>              invalidate_and_set_dirty(mr, addr1, l);
>          }
> -
> +        memory_region_unref(mr);
>          if (release_lock) {
>              qemu_mutex_unlock_iothread();
>              release_lock = false;
> @@ -2779,27 +2784,6 @@ static MemTxResult
> address_space_write_continue(AddressSpace *as, hwaddr addr,
>          }
> 
>          l = len;
> -        mr = address_space_translate(as, addr, &addr1, &l, true);
> -    }
> -
> -    return result;
> -}
> -
> -MemTxResult address_space_write(AddressSpace *as, hwaddr addr, MemTxAttrs
> attrs,
> -                                const uint8_t *buf, int len)
> -{
> -    hwaddr l;
> -    hwaddr addr1;
> -    MemoryRegion *mr;
> -    MemTxResult result = MEMTX_OK;
> -
> -    if (len > 0) {
> -        rcu_read_lock();
> -        l = len;
> -        mr = address_space_translate(as, addr, &addr1, &l, true);
> -        result = address_space_write_continue(as, addr, attrs, buf, len,
> -                                              addr1, l, mr);
> -        rcu_read_unlock();
>      }
> 
>      return result;
> diff --git a/memory.c b/memory.c
> index 64b0a60..d12437c 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1505,13 +1505,10 @@ static void memory_region_finalize(Object *obj)
>       * and cause an infinite loop.
>       */
>      mr->enabled = false;
> -    memory_region_transaction_begin();
>      while (!QTAILQ_EMPTY(&mr->subregions)) {
>          MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
>          memory_region_del_subregion(mr, subregion);
>      }
> -    memory_region_transaction_commit();
> -
>      mr->destructor(mr);
>      memory_region_clear_coalescing(mr);
>      g_free((char *)mr->name);
> 
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-15 20:21                   ` Paolo Bonzini
@ 2017-03-16  2:47                     ` Zhong, Yang
  2017-03-16 20:02                     ` Xu, Anthony
  1 sibling, 0 replies; 20+ messages in thread
From: Zhong, Yang @ 2017-03-16  2:47 UTC (permalink / raw)
  To: Paolo Bonzini, Xu, Anthony; +Cc: Peng, Chao P, qemu-devel

Hello Paolo,

As for your said tcmalloc or jemalloc for optimization memory allocation, we also tried the malloc_trim() in glibc, which can get the same memory optimization.

diff --git a/util/rcu.c b/util/rcu.c
index 9adc5e4..3643e43 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -32,7 +32,7 @@
#include "qemu/atomic.h"
#include "qemu/thread.h"
#include "qemu/main-loop.h"
-
+#include "malloc.h"
/*
  * Global grace period counter.  Bit 0 is always one in rcu_gp_ctr.
  * Bits 1 and above are defined in synchronize_rcu.
@@ -271,6 +271,7 @@ static void *call_rcu_thread(void *opaque)
             n--;
             node->func(node);
         }
+        malloc_trim(0);
         qemu_mutex_unlock_iothread();
     }
     abort();

The man info in malloc_trim  is only release free memory from the top of the heap, in fact, we found this function also release hole memory below top of heap.

Thanks!

ZhongYang


-----Original Message-----
From: Paolo Bonzini [mailto:pbonzini@redhat.com] 
Sent: Thursday, March 16, 2017 4:22 AM
To: Xu, Anthony <anthony.xu@intel.com>
Cc: Zhong, Yang <yang.zhong@intel.com>; Peng, Chao P <chao.p.peng@intel.com>; qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB



----- Original Message -----
> From: "Anthony Xu" <anthony.xu@intel.com>
> To: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: "Yang Zhong" <yang.zhong@intel.com>, "Chao P Peng" 
> <chao.p.peng@intel.com>, qemu-devel@nongnu.org
> Sent: Wednesday, March 15, 2017 8:05:48 PM
> Subject: Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size 
> from 12252kB to 2752KB
> 
> > The first unref is done after as->current_map is overwritten.
> > as->current_map is accessed under RCU, so it needs call_rcu.  It
> > balances the initial reference that is present since flatview_init.
> 
> Got it, thanks for explanation.
> 
> > > but it is not clear to me, is this a bug or by design? Is 
> > > flatview_destroy called
> > in current thread
> > > or RCU thread?
> > 
> > You cannot know, because there are also other callers of 
> > address_space_get_flatview.  Usually it's from the RCU thread.
> 
> If it is from current thread, do we need to check if the global lock 
> is acquired?
> As you mentioned flatview_unref may call memory_region_unref.

No, you don't need to check it.  Most functions (in this case,
memory_region_transaction_commit) can only be called under the global lock.
In fact, only address_space_read/write/ld*/st*, plus memory_region_find can be called without the global lock.

> In the comments of memory_region_finalize
> "   /* We know the region is not visible in any address space (it
>      * does not have a container and cannot be a root either because
>      * it has no references,
> "
> 
> We know the memory region is not a root region, and the memory region 
> has already been removed from address space even it has sub regions.
> memory_region_transaction_commit should be called when the memory 
> region is removed from address space.
> memory_region_transaction_commit seems not be needed in 
> memory_region_finalize.
> Let me know if you think otherwise.

Yes, you can replace memory_region_del_subregion in memory_region_finalize with special code that does

    assert(!mr->enabled);
    assert(subregion->container == mr);
    subregion->container = NULL;
    QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
    memory_region_unref(subregion);

The last four lines are shared with memory_region_del_subregion, so please factor them in a new function, for example memory_region_del_subregion_internal.

> > > How about fall back to synchronize_rcu?
> > 
> > I'm afraid it would be too slow, but you can test.
> 
> It is not slow, the contention is not high. Yes we can test.

It's not a matter of contention.  It's a matter of how long an RCU critical section can be---not just at startup, but at any point during QEMU execution.

Have you tried tcmalloc or jemalloc?  They use the brk region less and sometimes are more aggressive in releasing mmap-ed memory areas.

> Please review below patch, MemoryRegion is protected by RCU.

Removing transaction begin/commit in memory_region_finalize makes little sense because memory_region_del_subregion calls those two functions again.  See above for an alternative.  Apart from this, the patch is at least correct, though I'm not sure it's a good idea (synchronize_rcu needs testing).  I'm not sure I understand the address_space_write change).

Paolo

> Thanks,
> Anthony
> 
> 
> 
> diff --git a/exec.c b/exec.c
> index a22f5a0..6631668 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2521,7 +2521,8 @@ static void mem_commit(MemoryListener *listener)
> 
>      atomic_rcu_set(&as->dispatch, next);
>      if (cur) {
> -        call_rcu(cur, address_space_dispatch_free, rcu);
> +        synchronize_rcu();
> +        address_space_dispatch_free(cur);
>      }
>  }
> 
> @@ -2567,7 +2568,8 @@ void address_space_destroy_dispatch(AddressSpace 
> *as)
> 
>      atomic_rcu_set(&as->dispatch, NULL);
>      if (d) {
> -        call_rcu(d, address_space_dispatch_free, rcu);
> +        synchronize_rcu();
> +        address_space_dispatch_free(d);
>      }
>  }
> 
> @@ -2712,19 +2714,22 @@ static bool prepare_mmio_access(MemoryRegion *mr)
>      return release_lock;
>  }
> 
> -/* Called within RCU critical section.  */ -static MemTxResult 
> address_space_write_continue(AddressSpace *as, hwaddr addr,
> -                                                MemTxAttrs attrs,
> -                                                const uint8_t *buf,
> -                                                int len, hwaddr addr1,
> -                                                hwaddr l, MemoryRegion *mr)
> +MemTxResult address_space_write(AddressSpace *as, hwaddr addr,
> +                  MemTxAttrs attrs, const uint8_t *buf, int len)
>  {
>      uint8_t *ptr;
>      uint64_t val;
> +    hwaddr l=len;
> +    hwaddr addr1;
> +    MemoryRegion *mr;
>      MemTxResult result = MEMTX_OK;
>      bool release_lock = false;
> 
>      for (;;) {
> +        rcu_read_lock();
> +        mr = address_space_translate(as, addr, &addr1, &l, true);
> +        memory_region_ref(mr);
> +        rcu_read_unlock();
>          if (!memory_access_is_direct(mr, true)) {
>              release_lock |= prepare_mmio_access(mr);
>              l = memory_access_size(mr, l, addr1); @@ -2764,7 +2769,7 
> @@ static MemTxResult address_space_write_continue(AddressSpace *as, 
> hwaddr addr,
>              memcpy(ptr, buf, l);
>              invalidate_and_set_dirty(mr, addr1, l);
>          }
> -
> +        memory_region_unref(mr);
>          if (release_lock) {
>              qemu_mutex_unlock_iothread();
>              release_lock = false;
> @@ -2779,27 +2784,6 @@ static MemTxResult 
> address_space_write_continue(AddressSpace *as, hwaddr addr,
>          }
> 
>          l = len;
> -        mr = address_space_translate(as, addr, &addr1, &l, true);
> -    }
> -
> -    return result;
> -}
> -
> -MemTxResult address_space_write(AddressSpace *as, hwaddr addr, 
> MemTxAttrs attrs,
> -                                const uint8_t *buf, int len)
> -{
> -    hwaddr l;
> -    hwaddr addr1;
> -    MemoryRegion *mr;
> -    MemTxResult result = MEMTX_OK;
> -
> -    if (len > 0) {
> -        rcu_read_lock();
> -        l = len;
> -        mr = address_space_translate(as, addr, &addr1, &l, true);
> -        result = address_space_write_continue(as, addr, attrs, buf, len,
> -                                              addr1, l, mr);
> -        rcu_read_unlock();
>      }
> 
>      return result;
> diff --git a/memory.c b/memory.c
> index 64b0a60..d12437c 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1505,13 +1505,10 @@ static void memory_region_finalize(Object *obj)
>       * and cause an infinite loop.
>       */
>      mr->enabled = false;
> -    memory_region_transaction_begin();
>      while (!QTAILQ_EMPTY(&mr->subregions)) {
>          MemoryRegion *subregion = QTAILQ_FIRST(&mr->subregions);
>          memory_region_del_subregion(mr, subregion);
>      }
> -    memory_region_transaction_commit();
> -
>      mr->destructor(mr);
>      memory_region_clear_coalescing(mr);
>      g_free((char *)mr->name);
> 
> 
> 

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-15 20:21                   ` Paolo Bonzini
  2017-03-16  2:47                     ` Zhong, Yang
@ 2017-03-16 20:02                     ` Xu, Anthony
  2017-03-22 11:46                       ` Paolo Bonzini
  1 sibling, 1 reply; 20+ messages in thread
From: Xu, Anthony @ 2017-03-16 20:02 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Zhong, Yang, Peng, Chao P, qemu-devel

> > memory_region_finalize.
> > Let me know if you think otherwise.
> 
> Yes, you can replace memory_region_del_subregion in
> memory_region_finalize
> with special code that does
> 
>     assert(!mr->enabled);
>     assert(subregion->container == mr);
>     subregion->container = NULL;
>     QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
>     memory_region_unref(subregion);
> 
> The last four lines are shared with memory_region_del_subregion, so please
> factor them in a new function, for example
> memory_region_del_subregion_internal.


After adding synchronize_rcu, I saw an infinite recursive call,
  mem_commit-> memory_region_finalize-> mem_commit->
memory_region_finalize-> ......
it caused a segment fault, because 8M stack space is used up, and found when 
memory_region_finalize is called, memory_region_transaction_depth is 0 and 
memory_region_update_pending is true. That's not normal!

As you mentioned in your previous email, that should  not happen.
>" But if memory_region_transaction_depth is == 0, there should be no
>update pending because the loop has never run"

The root cause is,
MEMORY_LISTENER_CALL_GLOBAL(commit, Forward) is called before 
memory_region_clear_pending() in memory_region_transaction_commit.
so when MEMORY_LISTENER_CALL_GLOBAL(commit, Forward) is called,
memory_region_transaction_depth is 0 and memory_region_update_pending is true.
mem_commit may call memory_region_finalize, it causes infinite recursive call.

my previous fix for this is not complete.
We may clear memory_region_update_pending before calling 
MEMORY_LISTENER_CALL_GLOBAL(commit, Forward)

Please review below patch

diff --git a/memory.c b/memory.c
index 64b0a60..4c95aaf 100644
--- a/memory.c
+++ b/memory.c
@@ -906,12 +906,6 @@ void memory_region_transaction_begin(void)
     ++memory_region_transaction_depth;
 }

-static void memory_region_clear_pending(void)
-{
-    memory_region_update_pending = false;
-    ioeventfd_update_pending = false;
-}
-
 void memory_region_transaction_commit(void)
 {
     AddressSpace *as;
@@ -927,14 +921,14 @@ void memory_region_transaction_commit(void)
             QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
                 address_space_update_topology(as);
             }
-
+            memory_region_update_pending = false;
             MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
         } else if (ioeventfd_update_pending) {
             QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
                 address_space_update_ioeventfds(as);
             }
+            ioeventfd_update_pending = false;
         }
-        memory_region_clear_pending();
    }
 }




> > It is not slow, the contention is not high. Yes we can test.
> 
> It's not a matter of contention.  It's a matter of how long an RCU
> critical section can be---not just at startup, but at any point
> during QEMU execution.
The thing is, seems both address_space_translate and address_space_dispatch_free 
are called under the global lock. When synchronize_rcu is called, no other threads 
are in RCU critical section.

Seems RCU is not that useful for address space.


> 
> Have you tried tcmalloc or jemalloc?  They use the brk region
> less and sometimes are more aggressive in releasing mmap-ed memory
> areas.

They may be aggressive. But if memory are freed afterward, it may not help in some cases,
for example, starting a lot of VMs at the same time.


> 
> > Please review below patch, MemoryRegion is protected by RCU.
> 
> Removing transaction begin/commit in memory_region_finalize makes
> little sense because memory_region_del_subregion calls those two
> functions again.  See above for an alternative.  Apart from this,
> the patch is at least correct, though I'm not sure it's a good
> idea (synchronize_rcu needs testing). 
> I'm not sure I understand the
> address_space_write change).

After adding synchronize_rcu, we noticed a RCU dead loop.  synchronize_rcu is called 
inside RCU critical section. It happened when guest OS programmed the PCI bar.
The call trace is like,
address_space_write-> pci_host_config_write_common -> 
memory_region_transaction_commit ->mem_commit-> synchronize_rcu
pci_host_config_write_common is called inside RCU critical section.

The address_space_write change fixed this issue.


Thanks,
Anthony

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-16 20:02                     ` Xu, Anthony
@ 2017-03-22 11:46                       ` Paolo Bonzini
  2017-03-22 18:26                         ` Xu, Anthony
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2017-03-22 11:46 UTC (permalink / raw)
  To: Xu, Anthony; +Cc: Zhong, Yang, Peng, Chao P, qemu-devel



On 16/03/2017 21:02, Xu, Anthony wrote:
>>> memory_region_finalize.
>>> Let me know if you think otherwise.
>>
>> Yes, you can replace memory_region_del_subregion in
>> memory_region_finalize
>> with special code that does
>>
>>     assert(!mr->enabled);
>>     assert(subregion->container == mr);
>>     subregion->container = NULL;
>>     QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
>>     memory_region_unref(subregion);
>>
>> The last four lines are shared with memory_region_del_subregion, so please
>> factor them in a new function, for example
>> memory_region_del_subregion_internal.
> 
> After adding synchronize_rcu, I saw an infinite recursive call,
>   mem_commit-> memory_region_finalize-> mem_commit->
> memory_region_finalize-> ......
> it caused a segment fault, because 8M stack space is used up, and found when 
> memory_region_finalize is called, memory_region_transaction_depth is 0 and 
> memory_region_update_pending is true. That's not normal!

Ok, this is a bug.  This would fix it:

> Please review below patch
> 
> diff --git a/memory.c b/memory.c
> index 64b0a60..4c95aaf 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -906,12 +906,6 @@ void memory_region_transaction_begin(void)
>      ++memory_region_transaction_depth;
>  }
> 
> -static void memory_region_clear_pending(void)
> -{
> -    memory_region_update_pending = false;
> -    ioeventfd_update_pending = false;
> -}
> -
>  void memory_region_transaction_commit(void)
>  {
>      AddressSpace *as;
> @@ -927,14 +921,14 @@ void memory_region_transaction_commit(void)
>              QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
>                  address_space_update_topology(as);
>              }
> -
> +            memory_region_update_pending = false;
>              MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
>          } else if (ioeventfd_update_pending) {
>              QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
>                  address_space_update_ioeventfds(as);
>              }
> +            ioeventfd_update_pending = false;
>          }
> -        memory_region_clear_pending();
>     }
>  }

So please send it to the list with Signed-off-by line.

> The thing is, seems both address_space_translate and address_space_dispatch_free 
> are called under the global lock. When synchronize_rcu is called, no other threads 
> are in RCU critical section.

No, not necessarily.  address_space_write for example is called outside
the global lock by KVM and it calls address_space_translate.

> Seems RCU is not that useful for address space.

I suggest that you study the code more closely...  there is this in
kvm-all.c:

            DPRINTF("handle_mmio\n");
            /* Called outside BQL */
            address_space_rw(&address_space_memory,
                             run->mmio.phys_addr, attrs,
                             run->mmio.data,
                             run->mmio.len,
                             run->mmio.is_write);

and adding a simple assert() would have quickly disproved your theory.

> After adding synchronize_rcu, we noticed a RCU dead loop.  synchronize_rcu is called 
> inside RCU critical section. It happened when guest OS programmed the PCI bar.
> The call trace is like,
> address_space_write-> pci_host_config_write_common -> 
> memory_region_transaction_commit ->mem_commit-> synchronize_rcu
> pci_host_config_write_common is called inside RCU critical section.
> 
> The address_space_write change fixed this issue.

It's not a fix if the code is not thread-safe anymore!  But I think you
have the answer now as to why you cannot use synchronize_rcu.

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/2] reduce qemu's heap Rss size from 12252kB to 2752KB
  2017-03-22 11:46                       ` Paolo Bonzini
@ 2017-03-22 18:26                         ` Xu, Anthony
  0 siblings, 0 replies; 20+ messages in thread
From: Xu, Anthony @ 2017-03-22 18:26 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Zhong, Yang, Peng, Chao P, qemu-devel

> So please send it to the list with Signed-off-by line.
Thanks, 

> 
>             DPRINTF("handle_mmio\n");
>             /* Called outside BQL */
>             address_space_rw(&address_space_memory,
>                              run->mmio.phys_addr, attrs,
>                              run->mmio.data,
>                              run->mmio.len,
>                              run->mmio.is_write);
> 
> and adding a simple assert() would have quickly disproved your theory.
You are right here.
If it is a PCI bar write, a memory region operation(del/add) may 
be called inside address_space_rw, and 
memory_region_transaction_commit will be called.
If address_space_rw is called  without the global lock, 
memory_region_transaction_commit is called with the global lock.
 It conflicts with what you said before.

>No, you don't need to check it.  Most functions (in this case,
>memory_region_transaction_commit) can only be called under the global lock.

And if two vcpus program the different PCI bars at the same time 
(it is unlikely, but QEMU should not assume it), without the global lock, 
region operations may be called at the same time. 
Are memory_region_del_subregion and 
memory_region_add_subregion_overlap thread-safe?

> It's not a fix if the code is not thread-safe anymore!  But I think you
> have the answer now as to why you cannot use synchronize_rcu.

Can you elaborate why the code is not thread-safe?

Thanks
Anthony

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-03-22 18:26 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-10 15:14 [Qemu-devel] [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to 2752KB Yang Zhong
2017-03-10  8:41 ` Paolo Bonzini
2017-03-10 16:05   ` Xu, Anthony
2017-03-10 16:07     ` Paolo Bonzini
2017-03-10 19:30       ` Xu, Anthony
2017-03-11 17:02         ` Paolo Bonzini
2017-03-10 15:14 ` [Qemu-devel] [PATCH v1 2/2] " Yang Zhong
2017-03-10  9:14   ` Paolo Bonzini
2017-03-10 17:40     ` Xu, Anthony
2017-03-11 17:04       ` Paolo Bonzini
2017-03-14  5:14         ` Xu, Anthony
2017-03-14 10:14           ` Paolo Bonzini
2017-03-14 21:23             ` Xu, Anthony
2017-03-15  8:36               ` Paolo Bonzini
2017-03-15 19:05                 ` Xu, Anthony
2017-03-15 20:21                   ` Paolo Bonzini
2017-03-16  2:47                     ` Zhong, Yang
2017-03-16 20:02                     ` Xu, Anthony
2017-03-22 11:46                       ` Paolo Bonzini
2017-03-22 18:26                         ` Xu, Anthony

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.