* [Qemu-devel] [PATCH 0/3] migration: fix migrate_cancel problems of multifd
@ 2019-06-25 13:18 Ivan Ren
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 1/3] migration: fix migrate_cancel leads live_migration thread endless loop Ivan Ren
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Ivan Ren @ 2019-06-25 13:18 UTC (permalink / raw)
  To: quintela, dgilbert; +Cc: qemu-devel

These patches fix problems encountered in multifd migration when
trying to cancel the migration with migrate_cancel.

Ivan Ren (3):
  migration: fix migrate_cancel leads live_migration thread endless loop
  migration: fix migrate_cancel leads live_migration thread hung forever
  migration: fix migrate_cancel multifd migration leads destination hung
    forever

 migration/ram.c | 62 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 51 insertions(+), 11 deletions(-)

-- 
2.17.2 (Apple Git-113)




* [Qemu-devel] [PATCH 1/3] migration: fix migrate_cancel leads live_migration thread endless loop
  2019-06-25 13:18 [Qemu-devel] [PATCH 0/3] migration: fix migrate_cancel problems of multifd Ivan Ren
@ 2019-06-25 13:18 ` Ivan Ren
  2019-07-24  8:43   ` Juan Quintela
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 2/3] migration: fix migrate_cancel leads live_migration thread hung forever Ivan Ren
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Ivan Ren @ 2019-06-25 13:18 UTC (permalink / raw)
  To: quintela, dgilbert; +Cc: qemu-devel

When we 'migrate_cancel' a multifd migration, the live_migration thread
may go into an endless loop in the multifd_send_pages function.

Steps to reproduce:

(qemu) migrate_set_capability multifd on
(qemu) migrate -d url
(qemu) [wait a while]
(qemu) migrate_cancel

The live_migration thread may then spin at 100% CPU usage with the
following stack:

pthread_mutex_lock
qemu_mutex_lock_impl
multifd_send_pages
multifd_queue_page
ram_save_multifd_page
ram_save_target_page
ram_save_host_page
ram_find_and_save_block
ram_find_and_save_block
ram_save_iterate
qemu_savevm_state_iterate
migration_iteration_run
migration_thread
qemu_thread_start
start_thread
clone
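
For reference, the loop that is spinning looks roughly like this (a
paraphrase of multifd_send_pages in migration/ram.c, not the literal
code).  If migrate_cancel makes every send thread exit without
clearing its pending_job, no iteration ever finds a free channel and
the loop never terminates:

    for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
        MultiFDSendParams *p = &multifd_send_state->params[i];

        qemu_mutex_lock(&p->mutex);
        if (!p->pending_job) {        /* never true once all channels quit */
            p->pending_job++;
            next_channel = (i + 1) % migrate_multifd_channels();
            break;
        }
        qemu_mutex_unlock(&p->mutex); /* spin again, at 100% CPU */
    }

The patch below breaks the spin by checking p->quit under the lock and
returning an error instead.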

Signed-off-by: Ivan Ren <ivanren@tencent.com>
---
 migration/ram.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 908517fc2b..f8908286c2 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -920,7 +920,7 @@ struct {
  * false.
  */
 
-static void multifd_send_pages(void)
+static int multifd_send_pages(void)
 {
     int i;
     static int next_channel;
@@ -933,6 +933,11 @@ static void multifd_send_pages(void)
         p = &multifd_send_state->params[i];
 
         qemu_mutex_lock(&p->mutex);
+        if (p->quit) {
+            error_report("%s: channel %d has already quit!", __func__, i);
+            qemu_mutex_unlock(&p->mutex);
+            return -1;
+        }
         if (!p->pending_job) {
             p->pending_job++;
             next_channel = (i + 1) % migrate_multifd_channels();
@@ -951,9 +956,11 @@ static void multifd_send_pages(void)
     ram_counters.transferred += transferred;
     qemu_mutex_unlock(&p->mutex);
     qemu_sem_post(&p->sem);
+
+    return 1;
 }
 
-static void multifd_queue_page(RAMBlock *block, ram_addr_t offset)
+static int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
 {
     MultiFDPages_t *pages = multifd_send_state->pages;
 
@@ -968,15 +975,19 @@ static void multifd_queue_page(RAMBlock *block, ram_addr_t offset)
         pages->used++;
 
         if (pages->used < pages->allocated) {
-            return;
+            return 1;
         }
     }
 
-    multifd_send_pages();
+    if (multifd_send_pages() < 0) {
+        return -1;
+    }
 
     if (pages->block != block) {
-        multifd_queue_page(block, offset);
+            return multifd_queue_page(block, offset);
     }
+
+    return 1;
 }
 
 static void multifd_send_terminate_threads(Error *err)
@@ -1049,7 +1060,10 @@ static void multifd_send_sync_main(void)
         return;
     }
     if (multifd_send_state->pages->used) {
-        multifd_send_pages();
+        if (multifd_send_pages() < 0) {
+            error_report("%s: multifd_send_pages fail", __func__);
+            return;
+        }
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
@@ -1058,6 +1072,12 @@ static void multifd_send_sync_main(void)
 
         qemu_mutex_lock(&p->mutex);
 
+        if (p->quit) {
+            error_report("%s: channel %d has already quit", __func__, i);
+            qemu_mutex_unlock(&p->mutex);
+            return;
+        }
+
         p->packet_num = multifd_send_state->packet_num++;
         p->flags |= MULTIFD_FLAG_SYNC;
         p->pending_job++;
@@ -2000,7 +2020,9 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss, bool last_stage)
 static int ram_save_multifd_page(RAMState *rs, RAMBlock *block,
                                  ram_addr_t offset)
 {
-    multifd_queue_page(block, offset);
+    if (multifd_queue_page(block, offset) < 0) {
+        return -1;
+    }
     ram_counters.normal++;
 
     return 1;
-- 
2.17.2 (Apple Git-113)




* [Qemu-devel] [PATCH 2/3] migration: fix migrate_cancel leads live_migration thread hung forever
  2019-06-25 13:18 [Qemu-devel] [PATCH 0/3] migration: fix migrate_cancel problems of multifd Ivan Ren
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 1/3] migration: fix migrate_cancel leads live_migration thread endless loop Ivan Ren
@ 2019-06-25 13:18 ` Ivan Ren
  2019-07-24  8:47   ` Juan Quintela
  2019-07-24  9:18   ` Juan Quintela
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 3/3] migration: fix migrate_cancel multifd migration leads destination " Ivan Ren
  2019-07-14 14:53 ` [Qemu-devel] [PATCH 0/3] migration: fix migrate_cancel problems of multifd Ivan Ren
  3 siblings, 2 replies; 11+ messages in thread
From: Ivan Ren @ 2019-06-25 13:18 UTC (permalink / raw)
  To: quintela, dgilbert; +Cc: qemu-devel

When we 'migrate_cancel' a multifd migration, the live_migration thread
may hang forever at certain points, because multifd_send_thread has
already exited on a socket error:
1. multifd_send_pages may hang at
   qemu_sem_wait(&multifd_send_state->channels_ready)
2. multifd_send_sync_main may hang at
   qemu_sem_wait(&multifd_send_state->sem_sync)
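
Both waits look roughly like this (a paraphrase of migration/ram.c,
not the literal code):

    /* 1. in multifd_send_pages() */
    qemu_sem_wait(&multifd_send_state->channels_ready);

    /* 2. in multifd_send_sync_main(), once per channel */
    for (i = 0; i < migrate_multifd_channels(); i++) {
        qemu_sem_wait(&multifd_send_state->sem_sync);
    }

Once every send thread has exited, nothing posts either semaphore
again, so the patch below posts both from the error path of
multifd_send_thread before the thread exits.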

Signed-off-by: Ivan Ren <ivanren@tencent.com>
---
 migration/ram.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index f8908286c2..e4eb9c441f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1097,7 +1097,11 @@ static void *multifd_send_thread(void *opaque)
 {
     MultiFDSendParams *p = opaque;
     Error *local_err = NULL;
-    int ret;
+    int ret = 0;
+
+    uint32_t used = 0;
+    uint64_t packet_num = 0;
+    uint32_t flags = 0;
 
     trace_multifd_send_thread_start(p->id);
     rcu_register_thread();
@@ -1113,9 +1117,9 @@ static void *multifd_send_thread(void *opaque)
         qemu_mutex_lock(&p->mutex);
 
         if (p->pending_job) {
-            uint32_t used = p->pages->used;
-            uint64_t packet_num = p->packet_num;
-            uint32_t flags = p->flags;
+            used = p->pages->used;
+            packet_num = p->packet_num;
+            flags = p->flags;
 
             p->next_packet_size = used * qemu_target_page_size();
             multifd_send_fill_packet(p);
@@ -1164,6 +1168,17 @@ out:
         multifd_send_terminate_threads(local_err);
     }
 
+    /*
+     * An error happened and this thread is about to exit; before
+     * leaving, wake up anyone who may be waiting on this channel.
+     */
+    if (ret != 0) {
+        if (flags & MULTIFD_FLAG_SYNC) {
+            qemu_sem_post(&multifd_send_state->sem_sync);
+        }
+        qemu_sem_post(&multifd_send_state->channels_ready);
+    }
+
     qemu_mutex_lock(&p->mutex);
     p->running = false;
     qemu_mutex_unlock(&p->mutex);
-- 
2.17.2 (Apple Git-113)




* [Qemu-devel] [PATCH 3/3] migration: fix migrate_cancel multifd migration leads destination hung forever
  2019-06-25 13:18 [Qemu-devel] [PATCH 0/3] migration: fix migrate_cancel problems of multifd Ivan Ren
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 1/3] migration: fix migrate_cancel leads live_migration thread endless loop Ivan Ren
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 2/3] migration: fix migrate_cancel leads live_migration thread hung forever Ivan Ren
@ 2019-06-25 13:18 ` Ivan Ren
  2019-07-24  7:01   ` Ivan Ren
  2019-07-24  9:01   ` Juan Quintela
  2019-07-14 14:53 ` [Qemu-devel] [PATCH 0/3] migration: fix migrate_cancel problems of multifd Ivan Ren
  3 siblings, 2 replies; 11+ messages in thread
From: Ivan Ren @ 2019-06-25 13:18 UTC (permalink / raw)
  To: quintela, dgilbert; +Cc: qemu-devel

When we migrate_cancel a multifd migration, if a sequence like the
following runs:

        [source]                              [destination]

multifd_send_sync_main[finish]
                                    multifd_recv_thread wait &p->sem_sync
shutdown to_dst_file
                                    detect error from_src_file
send  RAM_SAVE_FLAG_EOS[fail]       [no chance to run multifd_recv_sync_main]
                                    multifd_load_cleanup
                                    join multifd receive thread forever

the destination qemu will hang with the following stack:

pthread_join
qemu_thread_join
multifd_load_cleanup
process_incoming_migration_co
coroutine_trampoline
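
The receive thread is parked in the MULTIFD_FLAG_SYNC handling, which
looks roughly like this (a paraphrase of multifd_recv_thread in
migration/ram.c, not the literal code):

    if (flags & MULTIFD_FLAG_SYNC) {
        qemu_sem_post(&multifd_recv_state->sem_sync);
        qemu_sem_wait(&p->sem_sync);  /* nothing posts this after the error */
    }

so multifd_load_cleanup has to post p->sem_sync itself before joining
the thread, which is what the patch below does.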

Signed-off-by: Ivan Ren <ivanren@tencent.com>
---
 migration/ram.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index e4eb9c441f..504c8ccb03 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1291,6 +1291,11 @@ int multifd_load_cleanup(Error **errp)
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
         if (p->running) {
+            /*
+             * multifd_recv_thread may be stuck in the MULTIFD_FLAG_SYNC
+             * handling code; waking it up here is harmless in the
+             * cleanup phase.
+             */
+            qemu_sem_post(&p->sem_sync);
             qemu_thread_join(&p->thread);
         }
         object_unref(OBJECT(p->c));
-- 
2.17.2 (Apple Git-113)




* Re: [Qemu-devel] [PATCH 0/3] migration: fix migrate_cancel problems of multifd
  2019-06-25 13:18 [Qemu-devel] [PATCH 0/3] migration: fix migrate_cancel problems of multifd Ivan Ren
                   ` (2 preceding siblings ...)
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 3/3] migration: fix migrate_cancel multifd migration leads destination " Ivan Ren
@ 2019-07-14 14:53 ` Ivan Ren
  3 siblings, 0 replies; 11+ messages in thread
From: Ivan Ren @ 2019-07-14 14:53 UTC (permalink / raw)
  To: quintela, dgilbert; +Cc: qemu-devel

The problem still exists in mainline. Ping for review.

On Tue, Jun 25, 2019 at 9:18 PM Ivan Ren <renyime@gmail.com> wrote:

> These patches fix problems encountered in multifd migration when
> trying to cancel the migration with migrate_cancel.
>
> Ivan Ren (3):
>   migration: fix migrate_cancel leads live_migration thread endless loop
>   migration: fix migrate_cancel leads live_migration thread hung forever
>   migration: fix migrate_cancel multifd migration leads destination hung
>     forever
>
>  migration/ram.c | 62 ++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 51 insertions(+), 11 deletions(-)
>
> --
> 2.17.2 (Apple Git-113)
>
>


* Re: [Qemu-devel] [PATCH 3/3] migration: fix migrate_cancel multifd migration leads destination hung forever
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 3/3] migration: fix migrate_cancel multifd migration leads destination " Ivan Ren
@ 2019-07-24  7:01   ` Ivan Ren
  2019-07-24  9:01   ` Juan Quintela
  1 sibling, 0 replies; 11+ messages in thread
From: Ivan Ren @ 2019-07-24  7:01 UTC (permalink / raw)
  To: quintela, dgilbert; +Cc: qemu-devel

Ping for review.

The problem still exists in qemu-4.1.0-rc2:

Threads:  24 total,   0 running,  24 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 39434172+total, 36798950+free,  2948836 used, 23403388 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 38926476+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
286108 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:01.19 qemu-system-x86
286109 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 qemu-system-x86
286113 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 IO mon_iothread
286114 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 0/KVM
286115 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 1/KVM
286116 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 2/KVM
286117 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 3/KVM
286118 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 4/KVM
286119 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 5/KVM
286120 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 6/KVM
286121 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 7/KVM
286122 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 8/KVM
286123 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 9/KVM
286124 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 10/KVM
286125 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 11/KVM
286126 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 12/KVM
286127 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 13/KVM
286128 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 14/KVM
286129 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 CPU 15/KVM
286131 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.00 vnc_worker
286132 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.01 multifdrecv_0
286133 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.01 multifdrecv_1
286134 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.01 multifdrecv_2
286136 root      20   0  0.127t 110596  12684 S  0.0  0.0   0:00.01 multifdrecv_3

Thread 2 (Thread 0x7f9d075fe700 (LWP 286136)):
#0  0x00007fbd67123a0b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007fbd67123a9f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fbd67123b3b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00005582236e2514 in qemu_sem_wait (sem=sem@entry=0x558226364dc8) at util/qemu-thread-posix.c:319
#4  0x00005582232efb67 in multifd_recv_thread (opaque=opaque@entry=0x558226364d30) at /qemu-4.1.0-rc2/migration/ram.c:1356
#5  0x00005582236e1b26 in qemu_thread_start (args=<optimized out>) at util/qemu-thread-posix.c:502
#6  0x00007fbd6711de25 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fbd66e4a35d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fbd6c7a4cc0 (LWP 286108)):
#0  0x00007fbd6711ef57 in pthread_join () from /lib64/libpthread.so.0
#1  0x00005582236e28ff in qemu_thread_join (thread=thread@entry=0x558226364b00) at util/qemu-thread-posix.c:570
#2  0x00005582232f36b4 in multifd_load_cleanup (errp=errp@entry=0x7fbd341fff58) at /qemu-4.1.0-rc2/migration/ram.c:1259
#3  0x00005582235765a9 in process_incoming_migration_co (opaque=<optimized out>) at migration/migration.c:510
#4  0x00005582236f4c0a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:115
#5  0x00007fbd66d98d40 in ?? () from /lib64/libc.so.6
#6  0x00007ffec0bf24d0 in ?? ()
#7  0x0000000000000000 in ?? ()

On Tue, Jun 25, 2019 at 9:18 PM Ivan Ren <renyime@gmail.com> wrote:

> When we migrate_cancel a multifd migration, if a sequence like the
> following runs:
>
>         [source]                              [destination]
>
> multifd_send_sync_main[finish]
>                                     multifd_recv_thread wait &p->sem_sync
> shutdown to_dst_file
>                                     detect error from_src_file
> send  RAM_SAVE_FLAG_EOS[fail]       [no chance to run multifd_recv_sync_main]
>                                     multifd_load_cleanup
>                                     join multifd receive thread forever
>
> the destination qemu will hang with the following stack:
>
> pthread_join
> qemu_thread_join
> multifd_load_cleanup
> process_incoming_migration_co
> coroutine_trampoline
>
> Signed-off-by: Ivan Ren <ivanren@tencent.com>
> ---
>  migration/ram.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index e4eb9c441f..504c8ccb03 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1291,6 +1291,11 @@ int multifd_load_cleanup(Error **errp)
>          MultiFDRecvParams *p = &multifd_recv_state->params[i];
>
>          if (p->running) {
> +            /*
> +             * multifd_recv_thread may be stuck in the MULTIFD_FLAG_SYNC
> +             * handling code; waking it up here is harmless in the
> +             * cleanup phase.
> +             */
> +            qemu_sem_post(&p->sem_sync);
>              qemu_thread_join(&p->thread);
>          }
>          object_unref(OBJECT(p->c));
> --
> 2.17.2 (Apple Git-113)
>
>


* Re: [Qemu-devel] [PATCH 1/3] migration: fix migrate_cancel leads live_migration thread endless loop
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 1/3] migration: fix migrate_cancel leads live_migration thread endless loop Ivan Ren
@ 2019-07-24  8:43   ` Juan Quintela
  0 siblings, 0 replies; 11+ messages in thread
From: Juan Quintela @ 2019-07-24  8:43 UTC (permalink / raw)
  To: Ivan Ren; +Cc: dgilbert, qemu-devel

Ivan Ren <renyime@gmail.com> wrote:
> When we 'migrate_cancel' a multifd migration, the live_migration thread
> may go into an endless loop in the multifd_send_pages function.
>
> Steps to reproduce:
>
> (qemu) migrate_set_capability multifd on
> (qemu) migrate -d url
> (qemu) [wait a while]
> (qemu) migrate_cancel
>
> The live_migration thread may then spin at 100% CPU usage with the
> following stack:
>
> pthread_mutex_lock
> qemu_mutex_lock_impl
> multifd_send_pages
> multifd_queue_page
> ram_save_multifd_page
> ram_save_target_page
> ram_save_host_page
> ram_find_and_save_block
> ram_find_and_save_block
> ram_save_iterate
> qemu_savevm_state_iterate
> migration_iteration_run
> migration_thread
> qemu_thread_start
> start_thread
> clone
>
> Signed-off-by: Ivan Ren <ivanren@tencent.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

Will send it for rc3.



* Re: [Qemu-devel] [PATCH 2/3] migration: fix migrate_cancel leads live_migration thread hung forever
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 2/3] migration: fix migrate_cancel leads live_migration thread hung forever Ivan Ren
@ 2019-07-24  8:47   ` Juan Quintela
  2019-07-24  9:18   ` Juan Quintela
  1 sibling, 0 replies; 11+ messages in thread
From: Juan Quintela @ 2019-07-24  8:47 UTC (permalink / raw)
  To: Ivan Ren; +Cc: dgilbert, qemu-devel

Ivan Ren <renyime@gmail.com> wrote:
> When we 'migrate_cancel' a multifd migration, the live_migration thread
> may hang forever at certain points, because multifd_send_thread has
> already exited on a socket error:
> 1. multifd_send_pages may hang at
>    qemu_sem_wait(&multifd_send_state->channels_ready)
> 2. multifd_send_sync_main may hang at
>    qemu_sem_wait(&multifd_send_state->sem_sync)
>
> Signed-off-by: Ivan Ren <ivanren@tencent.com>
> ---
>  migration/ram.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index f8908286c2..e4eb9c441f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1097,7 +1097,11 @@ static void *multifd_send_thread(void *opaque)
>  {
>      MultiFDSendParams *p = opaque;
>      Error *local_err = NULL;
> -    int ret;
> +    int ret = 0;

I agree with the ret change.

> +
> +    uint32_t used = 0;
> +    uint64_t packet_num = 0;
> +    uint32_t flags = 0;

This movement is unneeded.

>      trace_multifd_send_thread_start(p->id);
>      rcu_register_thread();
> @@ -1113,9 +1117,9 @@ static void *multifd_send_thread(void *opaque)
>          qemu_mutex_lock(&p->mutex);
>  
>          if (p->pending_job) {
> -            uint32_t used = p->pages->used;
> -            uint64_t packet_num = p->packet_num;
> -            uint32_t flags = p->flags;
> +            used = p->pages->used;
> +            packet_num = p->packet_num;
> +            flags = p->flags;
>  
>              p->next_packet_size = used * qemu_target_page_size();
>              multifd_send_fill_packet(p);
> @@ -1164,6 +1168,17 @@ out:
>          multifd_send_terminate_threads(local_err);
>      }
>  
> +    /*
> +     * An error happened and this thread is about to exit; before
> +     * leaving, wake up anyone who may be waiting on this channel.
> +     */
> +    if (ret != 0) {
> +        if (flags & MULTIFD_FLAG_SYNC) {
> +            qemu_sem_post(&multifd_send_state->sem_sync);
> +        }
> +        qemu_sem_post(&multifd_send_state->channels_ready);
> +    }

The real change is just this one.  Good catch, thanks.

Reviewed-by: Juan Quintela <quintela@redhat.com>


Later, Juan.



* Re: [Qemu-devel] [PATCH 3/3] migration: fix migrate_cancel multifd migration leads destination hung forever
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 3/3] migration: fix migrate_cancel multifd migration leads destination " Ivan Ren
  2019-07-24  7:01   ` Ivan Ren
@ 2019-07-24  9:01   ` Juan Quintela
  2019-07-24 11:30     ` Ivan Ren
  1 sibling, 1 reply; 11+ messages in thread
From: Juan Quintela @ 2019-07-24  9:01 UTC (permalink / raw)
  To: Ivan Ren; +Cc: dgilbert, qemu-devel

Ivan Ren <renyime@gmail.com> wrote:
> When we migrate_cancel a multifd migration, if a sequence like the
> following runs:
>
>         [source]                              [destination]
>
> multifd_send_sync_main[finish]
>                                     multifd_recv_thread wait &p->sem_sync
> shutdown to_dst_file
>                                     detect error from_src_file
> send  RAM_SAVE_FLAG_EOS[fail]       [no chance to run multifd_recv_sync_main]
>                                     multifd_load_cleanup
>                                     join multifd receive thread forever
>
> the destination qemu will hang with the following stack:
>
> pthread_join
> qemu_thread_join
> multifd_load_cleanup
> process_incoming_migration_co
> coroutine_trampoline
>
> Signed-off-by: Ivan Ren <ivanren@tencent.com>

I think this one is not enough.  We need to set some error code, or
disable the running bit at that point.

> ---
>  migration/ram.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index e4eb9c441f..504c8ccb03 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1291,6 +1291,11 @@ int multifd_load_cleanup(Error **errp)
>          MultiFDRecvParams *p = &multifd_recv_state->params[i];
>  
>          if (p->running) {
> +            /*
> +             * multifd_recv_thread may be stuck in the MULTIFD_FLAG_SYNC
> +             * handling code; waking it up here is harmless in the
> +             * cleanup phase.
> +             */
> +            qemu_sem_post(&p->sem_sync);
>              qemu_thread_join(&p->thread);
>          }
>          object_unref(OBJECT(p->c));

Let's see where we wait for p->sem_sync:


static void *multifd_recv_thread(void *opaque)
{
    ....
    while (true) {
        uint32_t used;
        uint32_t flags;

        ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
                                       p->packet_len, &local_err);
        .....

        if (flags & MULTIFD_FLAG_SYNC) {
            qemu_sem_post(&multifd_recv_state->sem_sync);
            qemu_sem_wait(&p->sem_sync);
        }
    }
    if (local_err) {
        multifd_recv_terminate_threads(local_err);
    }
    qemu_mutex_lock(&p->mutex);
    p->running = false;
    qemu_mutex_unlock(&p->mutex);

    rcu_unregister_thread();
    trace_multifd_recv_thread_end(p->id, p->num_packets, p->num_pages);

    return NULL;
}

If we just post it there, we get out of the wait (that bit is ok), but
then we go back to the beginning of the loop, we (probably) get an
error from qio_channel_read_all_eof(), and we go back to
multifd_recv_terminate_threads(), or wait there.

I think that it is better to *also* set a p->quit variable there, and
not even try to receive anything for that channel?

I will send a patch later.
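
Something along these lines (a hypothetical, untested sketch of the
idea, not the actual patch; locking around p->quit elided for
brevity):

    /* in multifd_load_cleanup(): mark the channel as quitting before
     * waking it, so the thread does not attempt another read */
    if (p->running) {
        p->quit = true;
        qemu_sem_post(&p->sem_sync);
        qemu_thread_join(&p->thread);
    }

    /* in multifd_recv_thread(): check the flag at the top of the loop */
    while (true) {
        qemu_mutex_lock(&p->mutex);
        if (p->quit) {
            qemu_mutex_unlock(&p->mutex);
            break;
        }
        qemu_mutex_unlock(&p->mutex);
        /* read and process the next packet as before */
    }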

Good catch.

Later, Juan.



* Re: [Qemu-devel] [PATCH 2/3] migration: fix migrate_cancel leads live_migration thread hung forever
  2019-06-25 13:18 ` [Qemu-devel] [PATCH 2/3] migration: fix migrate_cancel leads live_migration thread hung forever Ivan Ren
  2019-07-24  8:47   ` Juan Quintela
@ 2019-07-24  9:18   ` Juan Quintela
  1 sibling, 0 replies; 11+ messages in thread
From: Juan Quintela @ 2019-07-24  9:18 UTC (permalink / raw)
  To: Ivan Ren; +Cc: dgilbert, qemu-devel

Ivan Ren <renyime@gmail.com> wrote:
> When we 'migrate_cancel' a multifd migration, the live_migration thread
> may hang forever at certain points, because multifd_send_thread has
> already exited on a socket error:
> 1. multifd_send_pages may hang at
>    qemu_sem_wait(&multifd_send_state->channels_ready)
> 2. multifd_send_sync_main may hang at
>    qemu_sem_wait(&multifd_send_state->sem_sync)
>
> Signed-off-by: Ivan Ren <ivanren@tencent.com>
> ---
> +    if (ret != 0) {
> +        if (flags & MULTIFD_FLAG_SYNC) {
> +            qemu_sem_post(&multifd_send_state->sem_sync);

flags is also needed after the loop, so hoisting it out was required
after all.

Sorry, Juan.


> +        }
> +        qemu_sem_post(&multifd_send_state->channels_ready);
> +    }
> +
>      qemu_mutex_lock(&p->mutex);
>      p->running = false;
>      qemu_mutex_unlock(&p->mutex);



* Re: [Qemu-devel] [PATCH 3/3] migration: fix migrate_cancel multifd migration leads destination hung forever
  2019-07-24  9:01   ` Juan Quintela
@ 2019-07-24 11:30     ` Ivan Ren
  0 siblings, 0 replies; 11+ messages in thread
From: Ivan Ren @ 2019-07-24 11:30 UTC (permalink / raw)
  To: quintela; +Cc: dgilbert, qemu-devel

> If we just post it there, we get out of the wait (that bit is ok), but
> then we go back to the beginning of the loop, we (probably) get an
> error from qio_channel_read_all_eof(), and we go back to
> multifd_recv_terminate_threads(), or wait there.
>
> I think that it is better to *also* set a p->quit variable there, and
> not even try to receive anything for that channel?
>
> I will send a patch later.

Yes, agreed.

Thanks for the review.

On Wed, Jul 24, 2019 at 5:01 PM Juan Quintela <quintela@redhat.com> wrote:

> Ivan Ren <renyime@gmail.com> wrote:
> > When we migrate_cancel a multifd migration, if a sequence like the
> > following runs:
> >
> >         [source]                              [destination]
> >
> > multifd_send_sync_main[finish]
> >                                     multifd_recv_thread wait &p->sem_sync
> > shutdown to_dst_file
> >                                     detect error from_src_file
> > send  RAM_SAVE_FLAG_EOS[fail]       [no chance to run multifd_recv_sync_main]
> >                                     multifd_load_cleanup
> >                                     join multifd receive thread forever
> >
> > the destination qemu will hang with the following stack:
> >
> > pthread_join
> > qemu_thread_join
> > multifd_load_cleanup
> > process_incoming_migration_co
> > coroutine_trampoline
> >
> > Signed-off-by: Ivan Ren <ivanren@tencent.com>
>
> I think this one is not enough.  We need to set some error code, or
> disable the running bit at that point.
>
> > ---
> >  migration/ram.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index e4eb9c441f..504c8ccb03 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1291,6 +1291,11 @@ int multifd_load_cleanup(Error **errp)
> >          MultiFDRecvParams *p = &multifd_recv_state->params[i];
> >
> >          if (p->running) {
> > +            /*
> > +             * multifd_recv_thread may be stuck in the MULTIFD_FLAG_SYNC
> > +             * handling code; waking it up here is harmless in the
> > +             * cleanup phase.
> > +             */
> > +            qemu_sem_post(&p->sem_sync);
> >              qemu_thread_join(&p->thread);
> >          }
> >          object_unref(OBJECT(p->c));
>
> Let's see where we wait for p->sem_sync:
>
>
> static void *multifd_recv_thread(void *opaque)
> {
>     ....
>     while (true) {
>         uint32_t used;
>         uint32_t flags;
>
>         ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
>                                        p->packet_len, &local_err);
>         .....
>
>         if (flags & MULTIFD_FLAG_SYNC) {
>             qemu_sem_post(&multifd_recv_state->sem_sync);
>             qemu_sem_wait(&p->sem_sync);
>         }
>     }
>     if (local_err) {
>         multifd_recv_terminate_threads(local_err);
>     }
>     qemu_mutex_lock(&p->mutex);
>     p->running = false;
>     qemu_mutex_unlock(&p->mutex);
>
>     rcu_unregister_thread();
>     trace_multifd_recv_thread_end(p->id, p->num_packets, p->num_pages);
>
>     return NULL;
> }
>
> If we just post it there, we get out of the wait (that bit is ok), but
> then we go back to the beginning of the loop, we (probably) get an
> error from qio_channel_read_all_eof(), and we go back to
> multifd_recv_terminate_threads(), or wait there.
>
> I think that it is better to *also* set a p->quit variable there, and
> not even try to receive anything for that channel?
>
> I will send a patch later.
>
> Good catch.
>
> Later, Juan.
>

