All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
@ 2018-10-22 11:08 Fei Li
  2018-10-22 11:08 ` [Qemu-devel] [PATCH RFC 1/2] migration: fix the multifd code Fei Li
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Fei Li @ 2018-10-22 11:08 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, dgilbert, peterx

Hi,
these two patches are to fix live migration issues. The first is
about multifd, and the second is to fix some error handling.

But I have a question about using multifd migration.
In our current code, when multifd is used during migration, if there
is an error before the destination receives all new channels (I mean
multifd_recv_new_channel(ioc)), the destination does not exit but
keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
the source exits.

My question is about the state of the destination host if fails during
this period. I did a test, after applying [1/2] patch, if
multifd_new_send_channel_async() fails, the destination host hangs for
a while then later pops up a window saying
    "'QEMU (...) [stopped]' is not responding.
    You may choose to wait a short while for it to continue or force
    the application to quit entirely."
But after closing the window by clicking, the qemu on the dest still
hangs there until I exclusively kill the qemu on the source.

The source host keeps running as expected, but I guess the hang
phenonmenon in the dest is not right.
Would someone kindly give some suggestions on this? Thanks a lot.


Fei Li (2):
  migration: fix the multifd code
  migration: fix some error handling

 migration/migration.c    |  5 +----
 migration/postcopy-ram.c |  3 +++
 migration/ram.c          | 33 +++++++++++++++++++++++----------
 migration/ram.h          |  2 +-
 4 files changed, 28 insertions(+), 15 deletions(-)

-- 
2.13.7

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH RFC 1/2] migration: fix the multifd code
  2018-10-22 11:08 [Qemu-devel] [PATCH RFC 0/2] Fix migration issues Fei Li
@ 2018-10-22 11:08 ` Fei Li
  2018-10-22 11:08 ` [Qemu-devel] [PATCH RFC 2/2] migration: fix some error handling Fei Li
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Fei Li @ 2018-10-22 11:08 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, dgilbert, peterx

When multifd is used during migration, if there is an error before
the destination receives all new channels, the destination does not
exit but keeps waiting in our current code. However, a segmentaion
fault will occur in the source when multifd_save_cleanup() is called
again as the multifd_send_state has been freed earlier in the first
error handling.  This can happen when migrate_fd_connect() fails and
multifd_fd_cleanup() is called, and then multifd_new_send_channel_
async() fails and multifd_save_cleanup() is called again.

If the QIOChannel *c of multifd_recv_state->params[i] (p->c) is not
initialized, there is no need to close the channel. Or else a
segmentation fault will occur in multifd_recv_terminate_threads()
when multifd_recv_initial_packet() fails.

Signed-off-by: Fei Li <fli@suse.com>
---
 migration/ram.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 7e7deec4d8..2dd873ba35 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -907,6 +907,11 @@ static void multifd_send_terminate_threads(Error *err)
         }
     }
 
+    /* in case multifd_send_state has been freed earlier */
+    if (!multifd_send_state) {
+        return;
+    }
+
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
@@ -922,7 +927,7 @@ int multifd_save_cleanup(Error **errp)
     int i;
     int ret = 0;
 
-    if (!migrate_use_multifd()) {
+    if (!migrate_use_multifd() || !multifd_send_state) {
         return 0;
     }
     multifd_send_terminate_threads(NULL);
@@ -1070,6 +1075,10 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
     QIOChannel *sioc = QIO_CHANNEL(qio_task_get_source(task));
     Error *local_err = NULL;
 
+    if (!multifd_send_state) {
+        return;
+    }
+
     if (qio_task_propagate_error(task, &local_err)) {
         if (multifd_save_cleanup(&local_err) != 0) {
             migrate_set_error(migrate_get_current(), local_err);
@@ -1131,7 +1140,7 @@ struct {
     uint64_t packet_num;
 } *multifd_recv_state;
 
-static void multifd_recv_terminate_threads(Error *err)
+static void multifd_recv_terminate_threads(Error *err, bool channel)
 {
     int i;
 
@@ -1145,6 +1154,11 @@ static void multifd_recv_terminate_threads(Error *err)
         }
     }
 
+    /* in case p->c is not initialized */
+    if (!channel) {
+        return;
+    }
+
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
@@ -1166,7 +1180,7 @@ int multifd_load_cleanup(Error **errp)
     if (!migrate_use_multifd()) {
         return 0;
     }
-    multifd_recv_terminate_threads(NULL);
+    multifd_recv_terminate_threads(NULL, true);
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
@@ -1269,7 +1283,7 @@ static void *multifd_recv_thread(void *opaque)
     }
 
     if (local_err) {
-        multifd_recv_terminate_threads(local_err);
+        multifd_recv_terminate_threads(local_err, true);
     }
     qemu_mutex_lock(&p->mutex);
     p->running = false;
@@ -1331,7 +1345,7 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
 
     id = multifd_recv_initial_packet(ioc, &local_err);
     if (id < 0) {
-        multifd_recv_terminate_threads(local_err);
+        multifd_recv_terminate_threads(local_err, false);
         return false;
     }
 
@@ -1339,7 +1353,7 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
     if (p->c != NULL) {
         error_setg(&local_err, "multifd: received id '%d' already setup'",
                    id);
-        multifd_recv_terminate_threads(local_err);
+        multifd_recv_terminate_threads(local_err, true);
         return false;
     }
     p->c = ioc;
-- 
2.13.7

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH RFC 2/2] migration: fix some error handling
  2018-10-22 11:08 [Qemu-devel] [PATCH RFC 0/2] Fix migration issues Fei Li
  2018-10-22 11:08 ` [Qemu-devel] [PATCH RFC 1/2] migration: fix the multifd code Fei Li
@ 2018-10-22 11:08 ` Fei Li
  2018-10-24 21:27 ` [Qemu-devel] [PATCH RFC 0/2] Fix migration issues Peter Xu
  2018-10-25 12:55 ` Dr. David Alan Gilbert
  3 siblings, 0 replies; 12+ messages in thread
From: Fei Li @ 2018-10-22 11:08 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, dgilbert, peterx

Add error handling for qemu_ram_foreach_migratable_block() when
it fails.

Always call migrate_set_error() to set the error state without relying
on whether multifd_save_cleanup() succeeds. As the passed &local_err
is never used in multifd_save_cleanup(), remove it.

Signed-off-by: Fei Li <fli@suse.com>
---
 migration/migration.c    | 5 +----
 migration/postcopy-ram.c | 3 +++
 migration/ram.c          | 7 +++----
 migration/ram.h          | 2 +-
 4 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 7696729340..ca07c243be 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1372,7 +1372,6 @@ static void migrate_fd_cleanup(void *opaque)
     qemu_savevm_state_cleanup();
 
     if (s->to_dst_file) {
-        Error *local_err = NULL;
         QEMUFile *tmp;
 
         trace_migrate_fd_cleanup();
@@ -1383,9 +1382,7 @@ static void migrate_fd_cleanup(void *opaque)
         }
         qemu_mutex_lock_iothread();
 
-        if (multifd_save_cleanup(&local_err) != 0) {
-            error_report_err(local_err);
-        }
+        multifd_save_cleanup();
         qemu_mutex_lock(&s->qemu_file_lock);
         tmp = s->to_dst_file;
         s->to_dst_file = NULL;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index e5c02a32c5..4ca2bc02b3 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1117,6 +1117,9 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
 
     /* Mark so that we get notified of accesses to unwritten areas */
     if (qemu_ram_foreach_migratable_block(ram_block_enable_notify, mis)) {
+        error_report("ram_block_enable_notify failed");
+        close(mis->userfault_event_fd);
+        close(mis->userfault_fd);
         return -1;
     }
 
diff --git a/migration/ram.c b/migration/ram.c
index 2dd873ba35..e7c8ccb38a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -922,7 +922,7 @@ static void multifd_send_terminate_threads(Error *err)
     }
 }
 
-int multifd_save_cleanup(Error **errp)
+int multifd_save_cleanup(void)
 {
     int i;
     int ret = 0;
@@ -1080,9 +1080,8 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
     }
 
     if (qio_task_propagate_error(task, &local_err)) {
-        if (multifd_save_cleanup(&local_err) != 0) {
-            migrate_set_error(migrate_get_current(), local_err);
-        }
+        multifd_save_cleanup();
+        migrate_set_error(migrate_get_current(), local_err);
     } else {
         p->c = QIO_CHANNEL(sioc);
         qio_channel_set_delay(p->c, false);
diff --git a/migration/ram.h b/migration/ram.h
index 83ff1bc11a..c4fafea368 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -43,7 +43,7 @@ uint64_t ram_bytes_remaining(void);
 uint64_t ram_bytes_total(void);
 
 int multifd_save_setup(void);
-int multifd_save_cleanup(Error **errp);
+int multifd_save_cleanup(void);
 int multifd_load_setup(void);
 int multifd_load_cleanup(Error **errp);
 bool multifd_recv_all_channels_created(void);
-- 
2.13.7

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
  2018-10-22 11:08 [Qemu-devel] [PATCH RFC 0/2] Fix migration issues Fei Li
  2018-10-22 11:08 ` [Qemu-devel] [PATCH RFC 1/2] migration: fix the multifd code Fei Li
  2018-10-22 11:08 ` [Qemu-devel] [PATCH RFC 2/2] migration: fix some error handling Fei Li
@ 2018-10-24 21:27 ` Peter Xu
  2018-10-25  9:04   ` Fei Li
  2018-10-25 12:55 ` Dr. David Alan Gilbert
  3 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2018-10-24 21:27 UTC (permalink / raw)
  To: Fei Li; +Cc: qemu-devel, quintela, dgilbert

On Mon, Oct 22, 2018 at 07:08:52PM +0800, Fei Li wrote:
> Hi,
> these two patches are to fix live migration issues. The first is
> about multifd, and the second is to fix some error handling.
> 
> But I have a question about using multifd migration.
> In our current code, when multifd is used during migration, if there
> is an error before the destination receives all new channels (I mean
> multifd_recv_new_channel(ioc)), the destination does not exit but
> keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
> the source exits.
> 
> My question is about the state of the destination host if fails during
> this period. I did a test, after applying [1/2] patch, if
> multifd_new_send_channel_async() fails, the destination host hangs for
> a while then later pops up a window saying
>     "'QEMU (...) [stopped]' is not responding.
>     You may choose to wait a short while for it to continue or force
>     the application to quit entirely."
> But after closing the window by clicking, the qemu on the dest still
> hangs there until I exclusively kill the qemu on the source.
> 
> The source host keeps running as expected, but I guess the hang
> phenonmenon in the dest is not right.
> Would someone kindly give some suggestions on this? Thanks a lot.

Note that it's during KVM forum so the response from anyone might be
slow (it ends this week).

I think the thing you described seems normal since we can't guarantee
the network is always stable, normally I'll expect that the migration
will fail but it won't matter much since after all it's a precopy so
we lose nothing.  So I'm curious about when the error you mentioned
happens (e.g., total channel number is N, you only got M channels
connected, with M < N) could you just simply kill the destination?
Then AFAIU the source can just continue to run, right?

> 
> 
> Fei Li (2):
>   migration: fix the multifd code
>   migration: fix some error handling
> 
>  migration/migration.c    |  5 +----
>  migration/postcopy-ram.c |  3 +++
>  migration/ram.c          | 33 +++++++++++++++++++++++----------
>  migration/ram.h          |  2 +-
>  4 files changed, 28 insertions(+), 15 deletions(-)
> 
> -- 
> 2.13.7
> 

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
  2018-10-24 21:27 ` [Qemu-devel] [PATCH RFC 0/2] Fix migration issues Peter Xu
@ 2018-10-25  9:04   ` Fei Li
  2018-10-25 12:58     ` Peter Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Fei Li @ 2018-10-25  9:04 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, quintela, dgilbert



On 10/25/2018 05:27 AM, Peter Xu wrote:
> On Mon, Oct 22, 2018 at 07:08:52PM +0800, Fei Li wrote:
>> Hi,
>> these two patches are to fix live migration issues. The first is
>> about multifd, and the second is to fix some error handling.
>>
>> But I have a question about using multifd migration.
>> In our current code, when multifd is used during migration, if there
>> is an error before the destination receives all new channels (I mean
>> multifd_recv_new_channel(ioc)), the destination does not exit but
>> keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
>> the source exits.
>>
>> My question is about the state of the destination host if fails during
>> this period. I did a test, after applying [1/2] patch, if
>> multifd_new_send_channel_async() fails, the destination host hangs for
>> a while then later pops up a window saying
>>      "'QEMU (...) [stopped]' is not responding.
>>      You may choose to wait a short while for it to continue or force
>>      the application to quit entirely."
>> But after closing the window by clicking, the qemu on the dest still
>> hangs there until I exclusively kill the qemu on the source.
>>
>> The source host keeps running as expected, but I guess the hang
>> phenonmenon in the dest is not right.
>> Would someone kindly give some suggestions on this? Thanks a lot.
> Note that it's during KVM forum so the response from anyone might be
> slow (it ends this week).
Thanks for the kindly reminder. :)
> I think the thing you described seems normal since we can't guarantee
> the network is always stable, normally I'll expect that the migration
> will fail but it won't matter much since after all it's a precopy so
> we lose nothing.  So I'm curious about when the error you mentioned
> happens (e.g., total channel number is N, you only got M channels
> connected, with M < N) could you just simply kill the destination?
> Then AFAIU the source can just continue to run, right?
Yes, for the M < N situation, IMO the destination can be simply killed by
adding exit(EXIT_FAILURE) when it failed to receive packet via some
channel. The code is as below which has been tested, and result is the
source continues to run and the destination exits.
I'd like to write a separate patch if the below code/idea is acceptable
to fix the hang issue.

@@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
  /* Return true if multifd is ready for the migration, otherwise false */
  bool multifd_recv_new_channel(QIOChannel *ioc)
  {
+    MigrationIncomingState *mis = migration_incoming_get_current();
      MultiFDRecvParams *p;
      Error *local_err = NULL;
      int id;

      id = multifd_recv_initial_packet(ioc, &local_err);
      if (id < 0) {
-        multifd_recv_terminate_threads(local_err);
-        return false;
+        error_reportf_err(local_err,
+                          "failed to receive packet via multifd channel 
%x: ",
+                          multifd_recv_state->count);
+        goto fail;
      }

      p = &multifd_recv_state->params[id];
      if (p->c != NULL) {
          error_setg(&local_err, "multifd: received id '%d' already setup'",
                     id);
-        multifd_recv_terminate_threads(local_err);
-        return false;
+        goto fail;
      }
      p->c = ioc;
      object_ref(OBJECT(ioc));
@@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
                         QEMU_THREAD_JOINABLE);
      atomic_inc(&multifd_recv_state->count);
      return multifd_recv_state->count == migrate_multifd_channels();
+fail:
+    multifd_recv_terminate_threads(local_err);
+    qemu_fclose(mis->from_src_file);
+    mis->from_src_file = NULL;
+    exit(EXIT_FAILURE);
  }

Have a nice day, thanks a lot
Fei
>>
>> Fei Li (2):
>>    migration: fix the multifd code
>>    migration: fix some error handling
>>
>>   migration/migration.c    |  5 +----
>>   migration/postcopy-ram.c |  3 +++
>>   migration/ram.c          | 33 +++++++++++++++++++++++----------
>>   migration/ram.h          |  2 +-
>>   4 files changed, 28 insertions(+), 15 deletions(-)
>>
>> -- 
>> 2.13.7
>>
> Regards,
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
  2018-10-22 11:08 [Qemu-devel] [PATCH RFC 0/2] Fix migration issues Fei Li
                   ` (2 preceding siblings ...)
  2018-10-24 21:27 ` [Qemu-devel] [PATCH RFC 0/2] Fix migration issues Peter Xu
@ 2018-10-25 12:55 ` Dr. David Alan Gilbert
  2018-10-26 12:59   ` Fei Li
  3 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-25 12:55 UTC (permalink / raw)
  To: Fei Li; +Cc: qemu-devel, quintela, peterx

* Fei Li (fli@suse.com) wrote:
> Hi,
> these two patches are to fix live migration issues. The first is
> about multifd, and the second is to fix some error handling.
> 
> But I have a question about using multifd migration.
> In our current code, when multifd is used during migration, if there
> is an error before the destination receives all new channels (I mean
> multifd_recv_new_channel(ioc)), the destination does not exit but
> keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
> the source exits.
> 
> My question is about the state of the destination host if fails during
> this period. I did a test, after applying [1/2] patch, if
> multifd_new_send_channel_async() fails, the destination host hangs for
> a while then later pops up a window saying
>     "'QEMU (...) [stopped]' is not responding.
>     You may choose to wait a short while for it to continue or force
>     the application to quit entirely."
> But after closing the window by clicking, the qemu on the dest still
> hangs there until I exclusively kill the qemu on the source.

That sounds like the main thread is blocked for some reason? But I don't
normally use the window setup;  if you try with -nographic and can see
the HMP (or a QMP) monitor, can you see if the monitor still responds?
If it doesn't then try and get a backtrace.

The monitor really shouldn't block, so it would be interesting to see.

Dave

> The source host keeps running as expected, but I guess the hang
> phenonmenon in the dest is not right.
> Would someone kindly give some suggestions on this? Thanks a lot.
> 
> 
> Fei Li (2):
>   migration: fix the multifd code
>   migration: fix some error handling
> 
>  migration/migration.c    |  5 +----
>  migration/postcopy-ram.c |  3 +++
>  migration/ram.c          | 33 +++++++++++++++++++++++----------
>  migration/ram.h          |  2 +-
>  4 files changed, 28 insertions(+), 15 deletions(-)
> 
> -- 
> 2.13.7
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
  2018-10-25  9:04   ` Fei Li
@ 2018-10-25 12:58     ` Peter Xu
  2018-10-26 13:10       ` Fei Li
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2018-10-25 12:58 UTC (permalink / raw)
  To: Fei Li; +Cc: qemu-devel, quintela, dgilbert

On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:

[...]

> @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
>  /* Return true if multifd is ready for the migration, otherwise false */
>  bool multifd_recv_new_channel(QIOChannel *ioc)
>  {
> +    MigrationIncomingState *mis = migration_incoming_get_current();
>      MultiFDRecvParams *p;
>      Error *local_err = NULL;
>      int id;
> 
>      id = multifd_recv_initial_packet(ioc, &local_err);
>      if (id < 0) {
> -        multifd_recv_terminate_threads(local_err);
> -        return false;
> +        error_reportf_err(local_err,
> +                          "failed to receive packet via multifd channel %x:
> ",
> +                          multifd_recv_state->count);
> +        goto fail;
>      }
> 
>      p = &multifd_recv_state->params[id];
>      if (p->c != NULL) {
>          error_setg(&local_err, "multifd: received id '%d' already setup'",
>                     id);
> -        multifd_recv_terminate_threads(local_err);
> -        return false;
> +        goto fail;
>      }
>      p->c = ioc;
>      object_ref(OBJECT(ioc));
> @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
>                         QEMU_THREAD_JOINABLE);
>      atomic_inc(&multifd_recv_state->count);
>      return multifd_recv_state->count == migrate_multifd_channels();
> +fail:
> +    multifd_recv_terminate_threads(local_err);
> +    qemu_fclose(mis->from_src_file);
> +    mis->from_src_file = NULL;
> +    exit(EXIT_FAILURE);
>  }

Yeah I think it makes sense to at least report some details when error
happens, but I'm not sure whether it's good to explicitly exit() here.
IMHO you can add an Error** in multifd_recv_new_channel() parameter
list to do that, and even through migration_ioc_process_incoming().
What do you think?

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
  2018-10-25 12:55 ` Dr. David Alan Gilbert
@ 2018-10-26 12:59   ` Fei Li
  0 siblings, 0 replies; 12+ messages in thread
From: Fei Li @ 2018-10-26 12:59 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: qemu-devel, quintela, peterx



On 10/25/2018 08:55 PM, Dr. David Alan Gilbert wrote:
> * Fei Li (fli@suse.com) wrote:
>> Hi,
>> these two patches are to fix live migration issues. The first is
>> about multifd, and the second is to fix some error handling.
>>
>> But I have a question about using multifd migration.
>> In our current code, when multifd is used during migration, if there
>> is an error before the destination receives all new channels (I mean
>> multifd_recv_new_channel(ioc)), the destination does not exit but
>> keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
>> the source exits.
>>
>> My question is about the state of the destination host if fails during
>> this period. I did a test, after applying [1/2] patch, if
>> multifd_new_send_channel_async() fails, the destination host hangs for
>> a while then later pops up a window saying
>>      "'QEMU (...) [stopped]' is not responding.
>>      You may choose to wait a short while for it to continue or force
>>      the application to quit entirely."
>> But after closing the window by clicking, the qemu on the dest still
>> hangs there until I exclusively kill the qemu on the source.
> That sounds like the main thread is blocked for some reason?
Yes, the main thread on  the dst is keeps looping.
> But I don't
> normally use the window setup;  if you try with -nographic and can see
> the HMP (or a QMP) monitor, can you see if the monitor still responds?

Thanks for the `-nographic` reminder, I harvested an interesting 
phenonmenon:
If I do the `migrate -d tcp:ip_addr:port` before the guest's graphic appears
(it's dark now), there is no hang and the guest starts up properly later.
But if I do the live migration after the guest fully starts up, I mean when
I can operate something using my mouse inside the guest, the hang
situation is there.
This is true for using `-nographic` for both src and dst,
and using `-nographic` for only src or dst.


The hang phenonmenon is that the dst seems never responds (I
waited three minutes), and the cursor just keeps flashing. After I
exclusively kill the src, then the dst quit. Just as follows:
(Same result if gdb is not used in src)
src:
(qemu) ...
(qemu) q
(gdb) q
dst:
(qemu) Up to now, dst has received the 0 channel
Up to now, dst has received the 1 channel

(qemu)
(qemu)

To check the migtation state in the src:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off 
zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off 
release-ram: off block: off return-path: off pause-before-switchover: 
off x-multifd: on dirty-bitmaps: off postcopy-blocktime: off 
late-block-activate: off
Migration status: setup /* I added some codes to set the status to 
"failed", but still not working, details see below */
total time: 0 milliseconds

I guess maybe the source should to proactive to tell the dst and
disconnects from the source side, so I tried to set the above
"Migration status" to be "failed", and use qemu_fclose(s->to_dst_file)
when multifd_new_send_channel_async() fails.
(BTW: I even tried:
  if (s->vm_was_running) {   vm_start();   }   )
But the hang situation is still there.
> If it doesn't then try and get a backtrace.
>
> The monitor really shouldn't block, so it would be interesting to see.
>
> Dave
I set two breakpoints and get the following backtrace, hope they can 
help. :)

Thread 1 "qemu-system-x86" hit Breakpoint 1, multifd_recv_new_channel (
     ioc=0x555557995af0) at /build/gitcode/qemu-build/migration/ram.c:1368
1368    {
(gdb) c
Continuing.

Thread 1 "qemu-system-x86" hit Breakpoint 2, qio_channel_socket_readv (
     ioc=0x555557995af0, iov=0x5555568777d0, niov=1, fds=0x0, nfds=0x0,
     errp=0x7fffffffdb38) at io/channel-socket.c:463
463    {
(gdb) n
464        QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
(gdb)
......
483     retry:
(gdb)
484        ret = recvmsg(sioc->fd, &msg, sflags);
(gdb) bt
#0  qio_channel_socket_readv (ioc=0x555557995af0, iov=0x5555568777d0, 
niov=1,
     fds=0x0, nfds=0x0, errp=0x7fffffffdb38) at io/channel-socket.c:484
#1  0x0000555555d156c5 in qio_channel_readv_full (ioc=0x555557995af0,
     iov=0x5555568777d0, niov=1, fds=0x0, nfds=0x0, errp=0x7fffffffdb38)
     at io/channel.c:65
#2  0x0000555555d15b26 in qio_channel_readv (ioc=0x555557995af0,
     iov=0x5555568777d0, niov=1, errp=0x7fffffffdb38) at io/channel.c:197
#3  0x0000555555d15853 in qio_channel_readv_all_eof (ioc=0x555557995af0,
     iov=0x7fffffffda70, niov=1, errp=0x7fffffffdb38) at io/channel.c:106
#4  0x0000555555d1595c in qio_channel_readv_all (ioc=0x555557995af0,
     iov=0x7fffffffda70, niov=1, errp=0x7fffffffdb38) at io/channel.c:142
#5  0x0000555555d15d0c in qio_channel_read_all (ioc=0x555557995af0,
     buf=0x7fffffffdad0 "\340\"zVUU", buflen=25, errp=0x7fffffffdb38)
     at io/channel.c:246
#6  0x000055555587695c in multifd_recv_initial_packet (c=0x555557995af0,
     errp=0x7fffffffdb38) at /build/gitcode/qemu-build/migration/ram.c:653
#7  0x00005555558788fb in multifd_recv_new_channel (ioc=0x555557995af0)
     at /build/gitcode/qemu-build/migration/ram.c:1374
#8  0x0000555555bc9978 in migration_ioc_process_incoming 
(ioc=0x555557995af0)
     at migration/migration.c:573
#9  0x0000555555bd0c69 in migration_channel_process_incoming 
(ioc=0x555557995af0)
     at migration/channel.c:47
#10 0x0000555555bcf7e8 in socket_accept_incoming_migration (
     listener=0x5555578dcae0, cioc=0x555557995af0, opaque=0x0)
     at migration/socket.c:166
#11 0x0000555555d2051f in qio_net_listener_channel_func 
(ioc=0x5555579c7180,
     condition=G_IO_IN, opaque=0x5555578dcae0) at io/net-listener.c:53
#12 0x0000555555d1c0a2 in qio_channel_fd_source_dispatch 
(source=0x5555568d5970,
---Type <return> to continue, or q <return> to quit---
     callback=0x555555d20473 <qio_net_listener_channel_func>,
     user_data=0x5555578dcae0) at io/channel-watch.c:84
#13 0x00007ffff6353dc5 in g_main_context_dispatch ()
    from /usr/lib64/libglib-2.0.so.0
#14 0x0000555555d7d1ad in glib_pollfds_poll () at util/main-loop.c:215
#15 0x0000555555d7d227 in os_host_main_loop_wait (timeout=0) at 
util/main-loop.c:238
#16 0x0000555555d7d2e0 in main_loop_wait (nonblocking=0) at 
util/main-loop.c:497
#17 0x00005555559cd679 in main_loop () at vl.c:1884
#18 0x00005555559d4f1e in main (argc=32, argv=0x7fffffffe0b8, 
envp=0x7fffffffe1c0)
     at vl.c:4618
(gdb) n

Thread 1 "qemu-system-x86" received signal SIGINT, Interrupt.
0x00007ffff5606f64 in recvmsg () from /lib64/libpthread.so.0
(gdb) c
Continuing.

After I input above `n`, the dst just hangs here, seems waiting for the 
result of
recvmsg(sioc->fd, &msg, sflags); Later even I use ctrl+c to kill it, the 
dst still hangs.

Have a nice day, thanks
Fei
>
>> The source host keeps running as expected, but I guess the hang
>> phenonmenon in the dest is not right.
>> Would someone kindly give some suggestions on this? Thanks a lot.
>>
>>
>> Fei Li (2):
>>    migration: fix the multifd code
>>    migration: fix some error handling
>>
>>   migration/migration.c    |  5 +----
>>   migration/postcopy-ram.c |  3 +++
>>   migration/ram.c          | 33 +++++++++++++++++++++++----------
>>   migration/ram.h          |  2 +-
>>   4 files changed, 28 insertions(+), 15 deletions(-)
>>
>> -- 
>> 2.13.7
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
  2018-10-25 12:58     ` Peter Xu
@ 2018-10-26 13:10       ` Fei Li
  2018-10-26 13:35         ` Peter Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Fei Li @ 2018-10-26 13:10 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, dgilbert, quintela



On 10/25/2018 08:58 PM, Peter Xu wrote:
> On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
>
> [...]
>
>> @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
>>   /* Return true if multifd is ready for the migration, otherwise false */
>>   bool multifd_recv_new_channel(QIOChannel *ioc)
>>   {
>> +    MigrationIncomingState *mis = migration_incoming_get_current();
>>       MultiFDRecvParams *p;
>>       Error *local_err = NULL;
>>       int id;
>>
>>       id = multifd_recv_initial_packet(ioc, &local_err);
>>       if (id < 0) {
>> -        multifd_recv_terminate_threads(local_err);
>> -        return false;
>> +        error_reportf_err(local_err,
>> +                          "failed to receive packet via multifd channel %x:
>> ",
>> +                          multifd_recv_state->count);
>> +        goto fail;
>>       }
>>
>>       p = &multifd_recv_state->params[id];
>>       if (p->c != NULL) {
>>           error_setg(&local_err, "multifd: received id '%d' already setup'",
>>                      id);
>> -        multifd_recv_terminate_threads(local_err);
>> -        return false;
>> +        goto fail;
>>       }
>>       p->c = ioc;
>>       object_ref(OBJECT(ioc));
>> @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
>>                          QEMU_THREAD_JOINABLE);
>>       atomic_inc(&multifd_recv_state->count);
>>       return multifd_recv_state->count == migrate_multifd_channels();
>> +fail:
>> +    multifd_recv_terminate_threads(local_err);
>> +    qemu_fclose(mis->from_src_file);
>> +    mis->from_src_file = NULL;
>> +    exit(EXIT_FAILURE);
>>   }
> Yeah I think it makes sense to at least report some details when error
> happens, but I'm not sure whether it's good to explicitly exit() here.
> IMHO you can add an Error** in multifd_recv_new_channel() parameter
> list to do that, and even through migration_ioc_process_incoming().
> What do you think?
>
> Regards,
>
You mean exit() in migration_ioc_process_incoming(), or further
caller migration_channel_process_incoming()? Actually either is
ok for me. :) But today I find if using postcopy and multifd together
to do live migration, it seems the hang still occurs even with the
above codes, so sad about that. I will keep debugging and see
how to fix this.

Have a nice day, thanks
Fei

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
  2018-10-26 13:10       ` Fei Li
@ 2018-10-26 13:35         ` Peter Xu
  2018-10-26 15:24           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2018-10-26 13:35 UTC (permalink / raw)
  To: Fei Li; +Cc: qemu-devel, dgilbert, quintela

On Fri, Oct 26, 2018 at 09:10:19PM +0800, Fei Li wrote:
> 
> 
> On 10/25/2018 08:58 PM, Peter Xu wrote:
> > On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
> > 
> > [...]
> > 
> > > @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
> > >   /* Return true if multifd is ready for the migration, otherwise false */
> > >   bool multifd_recv_new_channel(QIOChannel *ioc)
> > >   {
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > >       MultiFDRecvParams *p;
> > >       Error *local_err = NULL;
> > >       int id;
> > > 
> > >       id = multifd_recv_initial_packet(ioc, &local_err);
> > >       if (id < 0) {
> > > -        multifd_recv_terminate_threads(local_err);
> > > -        return false;
> > > +        error_reportf_err(local_err,
> > > +                          "failed to receive packet via multifd channel %x:
> > > ",
> > > +                          multifd_recv_state->count);
> > > +        goto fail;
> > >       }
> > > 
> > >       p = &multifd_recv_state->params[id];
> > >       if (p->c != NULL) {
> > >           error_setg(&local_err, "multifd: received id '%d' already setup'",
> > >                      id);
> > > -        multifd_recv_terminate_threads(local_err);
> > > -        return false;
> > > +        goto fail;
> > >       }
> > >       p->c = ioc;
> > >       object_ref(OBJECT(ioc));
> > > @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
> > >                          QEMU_THREAD_JOINABLE);
> > >       atomic_inc(&multifd_recv_state->count);
> > >       return multifd_recv_state->count == migrate_multifd_channels();
> > > +fail:
> > > +    multifd_recv_terminate_threads(local_err);
> > > +    qemu_fclose(mis->from_src_file);
> > > +    mis->from_src_file = NULL;
> > > +    exit(EXIT_FAILURE);
> > >   }
> > Yeah I think it makes sense to at least report some details when error
> > happens, but I'm not sure whether it's good to explicitly exit() here.
> > IMHO you can add an Error** in multifd_recv_new_channel() parameter
> > list to do that, and even through migration_ioc_process_incoming().
> > What do you think?
> > 
> > Regards,
> > 
> You mean exit() in migration_ioc_process_incoming(), or further
> caller migration_channel_process_incoming()? Actually either is
> ok for me. :) But today I find if using postcopy and multifd together
> to do live migration, it seems the hang still occurs even with the
> above codes, so sad about that. I will keep debugging and see
> how to fix this.

Maybe you can move the error_report_err() in
migration_channel_process_incoming() out of the TLS path so we can
report the error if either TLS or non-TLS case got something wrong.

And I don't even know whether multifd could work with postcopy...

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
  2018-10-26 13:35         ` Peter Xu
@ 2018-10-26 15:24           ` Dr. David Alan Gilbert
  2018-10-29  7:15             ` Fei Li
  0 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-26 15:24 UTC (permalink / raw)
  To: Peter Xu; +Cc: Fei Li, qemu-devel, quintela

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Oct 26, 2018 at 09:10:19PM +0800, Fei Li wrote:
> > 
> > 
> > On 10/25/2018 08:58 PM, Peter Xu wrote:
> > > On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
> > > 
> > > [...]
> > > 
> > > > @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
> > > >   /* Return true if multifd is ready for the migration, otherwise false */
> > > >   bool multifd_recv_new_channel(QIOChannel *ioc)
> > > >   {
> > > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > >       MultiFDRecvParams *p;
> > > >       Error *local_err = NULL;
> > > >       int id;
> > > > 
> > > >       id = multifd_recv_initial_packet(ioc, &local_err);
> > > >       if (id < 0) {
> > > > -        multifd_recv_terminate_threads(local_err);
> > > > -        return false;
> > > > +        error_reportf_err(local_err,
> > > > +                          "failed to receive packet via multifd channel %x:
> > > > ",
> > > > +                          multifd_recv_state->count);
> > > > +        goto fail;
> > > >       }
> > > > 
> > > >       p = &multifd_recv_state->params[id];
> > > >       if (p->c != NULL) {
> > > >           error_setg(&local_err, "multifd: received id '%d' already setup'",
> > > >                      id);
> > > > -        multifd_recv_terminate_threads(local_err);
> > > > -        return false;
> > > > +        goto fail;
> > > >       }
> > > >       p->c = ioc;
> > > >       object_ref(OBJECT(ioc));
> > > > @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
> > > >                          QEMU_THREAD_JOINABLE);
> > > >       atomic_inc(&multifd_recv_state->count);
> > > >       return multifd_recv_state->count == migrate_multifd_channels();
> > > > +fail:
> > > > +    multifd_recv_terminate_threads(local_err);
> > > > +    qemu_fclose(mis->from_src_file);
> > > > +    mis->from_src_file = NULL;
> > > > +    exit(EXIT_FAILURE);
> > > >   }
> > > Yeah I think it makes sense to at least report some details when error
> > > happens, but I'm not sure whether it's good to explicitly exit() here.
> > > IMHO you can add an Error** in multifd_recv_new_channel() parameter
> > > list to do that, and even through migration_ioc_process_incoming().
> > > What do you think?
> > > 
> > > Regards,
> > > 
> > You mean exit() in migration_ioc_process_incoming(), or further
> > caller migration_channel_process_incoming()? Actually either is
> > ok for me. :) But today I find if using postcopy and multifd together
> > to do live migration, it seems the hang still occurs even with the
> > above codes, so sad about that. I will keep debugging and see
> > how to fix this.
> 
> Maybe you can move the error_report_err() in
> migration_channel_process_incoming() out of the TLS path so we can
> report the error if either TLS or non-TLS case got something wrong.
> 
> And I don't even know whether multifd could work with postcopy...

Nope, it's not expected to work yet.

Dave

> Regards,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
  2018-10-26 15:24           ` Dr. David Alan Gilbert
@ 2018-10-29  7:15             ` Fei Li
  0 siblings, 0 replies; 12+ messages in thread
From: Fei Li @ 2018-10-29  7:15 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Peter Xu; +Cc: qemu-devel, quintela



On 10/26/2018 11:24 PM, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
>> On Fri, Oct 26, 2018 at 09:10:19PM +0800, Fei Li wrote:
>>>
>>> On 10/25/2018 08:58 PM, Peter Xu wrote:
>>>> On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
>>>>
>>>> [...]
>>>>
>>>>> @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
>>>>>    /* Return true if multifd is ready for the migration, otherwise false */
>>>>>    bool multifd_recv_new_channel(QIOChannel *ioc)
>>>>>    {
>>>>> +    MigrationIncomingState *mis = migration_incoming_get_current();
>>>>>        MultiFDRecvParams *p;
>>>>>        Error *local_err = NULL;
>>>>>        int id;
>>>>>
>>>>>        id = multifd_recv_initial_packet(ioc, &local_err);
>>>>>        if (id < 0) {
>>>>> -        multifd_recv_terminate_threads(local_err);
>>>>> -        return false;
>>>>> +        error_reportf_err(local_err,
>>>>> +                          "failed to receive packet via multifd channel %x:
>>>>> ",
>>>>> +                          multifd_recv_state->count);
>>>>> +        goto fail;
>>>>>        }
>>>>>
>>>>>        p = &multifd_recv_state->params[id];
>>>>>        if (p->c != NULL) {
>>>>>            error_setg(&local_err, "multifd: received id '%d' already setup'",
>>>>>                       id);
>>>>> -        multifd_recv_terminate_threads(local_err);
>>>>> -        return false;
>>>>> +        goto fail;
>>>>>        }
>>>>>        p->c = ioc;
>>>>>        object_ref(OBJECT(ioc));
>>>>> @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
>>>>>                           QEMU_THREAD_JOINABLE);
>>>>>        atomic_inc(&multifd_recv_state->count);
>>>>>        return multifd_recv_state->count == migrate_multifd_channels();
>>>>> +fail:
>>>>> +    multifd_recv_terminate_threads(local_err);
>>>>> +    qemu_fclose(mis->from_src_file);
>>>>> +    mis->from_src_file = NULL;
>>>>> +    exit(EXIT_FAILURE);
>>>>>    }
>>>> Yeah I think it makes sense to at least report some details when error
>>>> happens, but I'm not sure whether it's good to explicitly exit() here.
>>>> IMHO you can add an Error** in multifd_recv_new_channel() parameter
>>>> list to do that, and even through migration_ioc_process_incoming().
>>>> What do you think?
>>>>
>>>> Regards,
>>>>
>>> You mean exit() in migration_ioc_process_incoming(), or further
>>> caller migration_channel_process_incoming()? Actually either is
>>> ok for me. :) But today I find if using postcopy and multifd together
>>> to do live migration, it seems the hang still occurs even with the
>>> above codes, so sad about that. I will keep debugging and see
>>> how to fix this.
>> Maybe you can move the error_report_err() in
>> migration_channel_process_incoming() out of the TLS path so we can
>> report the error if either TLS or non-TLS case got something wrong.
Thanks for the advice. I will do the update in the next version. :)
>>
>> And I don't even know whether multifd could work with postcopy...
> Nope, it's not expected to work yet.
>
> Dave
Thanks for the helpful information. :)

BTW, in the next version, I'd like to merge these three migration 
patches into
the "[PATCH RFC v6 ] qemu_thread_create: propagate the error to callers 
to handle",
and cc you inside the patches. Please help to review.

Have a nice day, thanks again
Fei
>
>> Regards,
>>
>> -- 
>> Peter Xu
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-10-29  7:16 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-22 11:08 [Qemu-devel] [PATCH RFC 0/2] Fix migration issues Fei Li
2018-10-22 11:08 ` [Qemu-devel] [PATCH RFC 1/2] migration: fix the multifd code Fei Li
2018-10-22 11:08 ` [Qemu-devel] [PATCH RFC 2/2] migration: fix some error handling Fei Li
2018-10-24 21:27 ` [Qemu-devel] [PATCH RFC 0/2] Fix migration issues Peter Xu
2018-10-25  9:04   ` Fei Li
2018-10-25 12:58     ` Peter Xu
2018-10-26 13:10       ` Fei Li
2018-10-26 13:35         ` Peter Xu
2018-10-26 15:24           ` Dr. David Alan Gilbert
2018-10-29  7:15             ` Fei Li
2018-10-25 12:55 ` Dr. David Alan Gilbert
2018-10-26 12:59   ` Fei Li

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.