[PATCH 0/3] migration: Three more fixes for postcopy recovery

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] migration: Three more fixes for postcopy recovery
@ 2021-07-08 19:06 Peter Xu
  2021-07-08 19:06 ` [PATCH 1/3] migration: Release return path early for paused postcopy Peter Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Peter Xu @ 2021-07-08 19:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: Lukas Straub, Juan Quintela, Li Xiaohui, Dr . David Alan Gilbert,
	peterx, Leonardo Bras Soares Passos

A few more issues spotted by either Xiaohui or me.  Please see and review
individual patches for what they do.  Thanks,

Peter Xu (3):
  migration: Release return path early for paused postcopy
  migration: Don't do migrate cleanup if during postcopy resume
  migration: Clear error at entry of migrate_fd_connect()

 migration/migration.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

-- 
2.31.1




^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/3] migration: Release return path early for paused postcopy
  2021-07-08 19:06 [PATCH 0/3] migration: Three more fixes for postcopy recovery Peter Xu
@ 2021-07-08 19:06 ` Peter Xu
  2021-07-08 19:13   ` Peter Xu
  2021-07-12 17:44   ` Dr. David Alan Gilbert
  2021-07-08 19:06 ` [PATCH 2/3] migration: Don't do migrate cleanup if during postcopy resume Peter Xu
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 9+ messages in thread
From: Peter Xu @ 2021-07-08 19:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: Lukas Straub, Juan Quintela, Li Xiaohui, Dr . David Alan Gilbert,
	peterx, Li Xiaohui, Leonardo Bras Soares Passos

When postcopy pause triggered, we rely on the migration thread to cleanup the
to_dst_file handle, and the return path thread to cleanup the from_dst_file
handle (which is stored in the local variable "rp").

Within the process, from_dst_file cleanup (qemu_fclose) is postponed until it's
setup again due to a postcopy recovery.

It used to work before yank was born; after yank is introduced we rely on the
refcount of IOC to correctly unregister yank function in channel_close().  If
without the early and on-time release of from_dst_file handle the yank function
will be leftover during paused postcopy.

Without this patch, below steps (quoted from Xiaohui) could trigger qemu src
crash:

  1.Boot vm on src host
  2.Boot vm on dst host
  3.Enable postcopy on src&dst host
  4.Load stressapptest in vm and set postcopy speed to 50M
  5.Start migration from src to dst host, change into postcopy mode when migration is active.
  6.When postcopy is active, down the network card(do migration via this network) on dst host.
  7.Wait untill postcopy is paused on src&dst host.
  8.Before up network card, recover migration on dst host, will get error like following.
  9.Ignore the error of step 8, go on recovering migration on src host:

  After step 9, qemu on src host will core dump after some seconds:
  qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
  1.sh: line 38: 44662 Aborted                 (core dumped)

Reported-by: Li Xiaohui <xiaohuixiaohli@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 5ff7ba9d5c..8786104c9a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2818,12 +2818,12 @@ out:
              * Maybe there is something we can do: it looks like a
              * network down issue, and we pause for a recovery.
              */
+            qemu_fclose(rp);
+            ms->rp_state.from_dst_file = NULL;
+            rp = NULL;
             if (postcopy_pause_return_path_thread(ms)) {
                 /* Reload rp, reset the rest */
-                if (rp != ms->rp_state.from_dst_file) {
-                    qemu_fclose(rp);
-                    rp = ms->rp_state.from_dst_file;
-                }
+                rp = ms->rp_state.from_dst_file;
                 ms->rp_state.error = false;
                 goto retry;
             }
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/3] migration: Don't do migrate cleanup if during postcopy resume
  2021-07-08 19:06 [PATCH 0/3] migration: Three more fixes for postcopy recovery Peter Xu
  2021-07-08 19:06 ` [PATCH 1/3] migration: Release return path early for paused postcopy Peter Xu
@ 2021-07-08 19:06 ` Peter Xu
  2021-07-12 18:33   ` Dr. David Alan Gilbert
  2021-07-08 19:06 ` [PATCH 3/3] migration: Clear error at entry of migrate_fd_connect() Peter Xu
  2021-07-13 11:04 ` [PATCH 0/3] migration: Three more fixes for postcopy recovery Dr. David Alan Gilbert
  3 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2021-07-08 19:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: Lukas Straub, Juan Quintela, Li Xiaohui, Dr . David Alan Gilbert,
	peterx, Leonardo Bras Soares Passos

Below process could crash qemu with postcopy recovery:

  1. (hmp) migrate -d ..
  2. (hmp) migrate_start_postcopy
  3. [network down, postcopy paused]
  4. (hmp) migrate -r $WRONG_PORT
     when try the recover on an invalid $WRONG_PORT, cleanup_bh will be cleared
  5. (hmp) migrate -r $RIGHT_PORT
     [qemu crash on assert(cleanup_bh)]

The thing is we shouldn't cleanup if it's postcopy resume; the error is set
mostly because the channel is wrong, so we return directly waiting for the user
to retry.

migrate_fd_cleanup() should only be called when migration is cancelled or
completed.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 8786104c9a..bb1edf862a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3975,7 +3975,18 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
     }
     if (error_in) {
         migrate_fd_error(s, error_in);
-        migrate_fd_cleanup(s);
+        if (resume) {
+            /*
+             * Don't do cleanup for resume if channel is invalid, but only dump
+             * the error.  We wait for another channel connect from the user.
+             * The error_report still gives HMP user a hint on what failed.
+             * It's normally done in migrate_fd_cleanup(), but call it here
+             * explicitly.
+             */
+            error_report_err(error_copy(s->error));
+        } else {
+            migrate_fd_cleanup(s);
+        }
         return;
     }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/3] migration: Clear error at entry of migrate_fd_connect()
  2021-07-08 19:06 [PATCH 0/3] migration: Three more fixes for postcopy recovery Peter Xu
  2021-07-08 19:06 ` [PATCH 1/3] migration: Release return path early for paused postcopy Peter Xu
  2021-07-08 19:06 ` [PATCH 2/3] migration: Don't do migrate cleanup if during postcopy resume Peter Xu
@ 2021-07-08 19:06 ` Peter Xu
  2021-07-12 18:40   ` Dr. David Alan Gilbert
  2021-07-13 11:04 ` [PATCH 0/3] migration: Three more fixes for postcopy recovery Dr. David Alan Gilbert
  3 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2021-07-08 19:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: Lukas Straub, Juan Quintela, Li Xiaohui, Dr . David Alan Gilbert,
	peterx, Leonardo Bras Soares Passos

For each "migrate" command, remember to clear the s->error before going on.
For one reason, when there's a new error it'll be still remembered; see
migrate_set_error() who only sets the error if error==NULL.  Meanwhile if a
failed migration completes (e.g., postcopy recovered and finished), we
shouldn't dump an error when calling migrate_fd_cleanup() at last.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index bb1edf862a..338df01e97 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1855,6 +1855,15 @@ void migrate_set_error(MigrationState *s, const Error *error)
     }
 }
 
+static void migrate_error_free(MigrationState *s)
+{
+    QEMU_LOCK_GUARD(&s->error_mutex);
+    if (s->error) {
+        error_free(s->error);
+        s->error = NULL;
+    }
+}
+
 void migrate_fd_error(MigrationState *s, const Error *error)
 {
     trace_migrate_fd_error(error_get_pretty(error));
@@ -3966,6 +3975,13 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
     int64_t rate_limit;
     bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED;
 
+    /*
+     * If there's a previous error, free it and prepare for another one.
+     * Meanwhile if migration completes successfully, there won't have an error
+     * dumped when calling migrate_fd_cleanup().
+     */
+    migrate_error_free(s);
+
     s->expected_downtime = s->parameters.downtime_limit;
     if (resume) {
         assert(s->cleanup_bh);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/3] migration: Release return path early for paused postcopy
  2021-07-08 19:06 ` [PATCH 1/3] migration: Release return path early for paused postcopy Peter Xu
@ 2021-07-08 19:13   ` Peter Xu
  2021-07-12 17:44   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 9+ messages in thread
From: Peter Xu @ 2021-07-08 19:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Lukas Straub, Juan Quintela, Li Xiaohui, Dr . David Alan Gilbert,
	Li Xiaohui, Leonardo Bras Soares Passos

On Thu, Jul 08, 2021 at 03:06:51PM -0400, Peter Xu wrote:
> Reported-by: Li Xiaohui <xiaohuixiaohli@redhat.com>
                           ^^^^^^^^^^^^^^^^^^^^^^^^^
I don't know how did I messed this up.. but it should be <xiaohli@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/3] migration: Release return path early for paused postcopy
  2021-07-08 19:06 ` [PATCH 1/3] migration: Release return path early for paused postcopy Peter Xu
  2021-07-08 19:13   ` Peter Xu
@ 2021-07-12 17:44   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 9+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-12 17:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: Lukas Straub, Juan Quintela, Li Xiaohui, qemu-devel, Li Xiaohui,
	Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> When postcopy pause triggered, we rely on the migration thread to cleanup the
> to_dst_file handle, and the return path thread to cleanup the from_dst_file
> handle (which is stored in the local variable "rp").
> 
> Within the process, from_dst_file cleanup (qemu_fclose) is postponed until it's
> setup again due to a postcopy recovery.
> 
> It used to work before yank was born; after yank is introduced we rely on the
> refcount of IOC to correctly unregister yank function in channel_close().  If
> without the early and on-time release of from_dst_file handle the yank function
> will be leftover during paused postcopy.
> 
> Without this patch, below steps (quoted from Xiaohui) could trigger qemu src
> crash:
> 
>   1.Boot vm on src host
>   2.Boot vm on dst host
>   3.Enable postcopy on src&dst host
>   4.Load stressapptest in vm and set postcopy speed to 50M
>   5.Start migration from src to dst host, change into postcopy mode when migration is active.
>   6.When postcopy is active, down the network card(do migration via this network) on dst host.
>   7.Wait untill postcopy is paused on src&dst host.
>   8.Before up network card, recover migration on dst host, will get error like following.
>   9.Ignore the error of step 8, go on recovering migration on src host:
> 
>   After step 9, qemu on src host will core dump after some seconds:
>   qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
>   1.sh: line 38: 44662 Aborted                 (core dumped)
> 
> Reported-by: Li Xiaohui <xiaohuixiaohli@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

(and I can cleanup the email address problem)

> ---
>  migration/migration.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 5ff7ba9d5c..8786104c9a 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2818,12 +2818,12 @@ out:
>               * Maybe there is something we can do: it looks like a
>               * network down issue, and we pause for a recovery.
>               */
> +            qemu_fclose(rp);
> +            ms->rp_state.from_dst_file = NULL;
> +            rp = NULL;
>              if (postcopy_pause_return_path_thread(ms)) {
>                  /* Reload rp, reset the rest */
> -                if (rp != ms->rp_state.from_dst_file) {
> -                    qemu_fclose(rp);
> -                    rp = ms->rp_state.from_dst_file;
> -                }
> +                rp = ms->rp_state.from_dst_file;
>                  ms->rp_state.error = false;
>                  goto retry;
>              }
> -- 
> 2.31.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/3] migration: Don't do migrate cleanup if during postcopy resume
  2021-07-08 19:06 ` [PATCH 2/3] migration: Don't do migrate cleanup if during postcopy resume Peter Xu
@ 2021-07-12 18:33   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 9+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-12 18:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Li Xiaohui, Lukas Straub, qemu-devel,
	Leonardo Bras Soares Passos, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Below process could crash qemu with postcopy recovery:
> 
>   1. (hmp) migrate -d ..
>   2. (hmp) migrate_start_postcopy
>   3. [network down, postcopy paused]
>   4. (hmp) migrate -r $WRONG_PORT
>      when try the recover on an invalid $WRONG_PORT, cleanup_bh will be cleared
>   5. (hmp) migrate -r $RIGHT_PORT
>      [qemu crash on assert(cleanup_bh)]
> 
> The thing is we shouldn't cleanup if it's postcopy resume; the error is set
> mostly because the channel is wrong, so we return directly waiting for the user
> to retry.
> 
> migrate_fd_cleanup() should only be called when migration is cancelled or
> completed.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/migration.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 8786104c9a..bb1edf862a 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3975,7 +3975,18 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>      }
>      if (error_in) {
>          migrate_fd_error(s, error_in);
> -        migrate_fd_cleanup(s);
> +        if (resume) {
> +            /*
> +             * Don't do cleanup for resume if channel is invalid, but only dump
> +             * the error.  We wait for another channel connect from the user.
> +             * The error_report still gives HMP user a hint on what failed.
> +             * It's normally done in migrate_fd_cleanup(), but call it here
> +             * explicitly.
> +             */
> +            error_report_err(error_copy(s->error));
> +        } else {
> +            migrate_fd_cleanup(s);
> +        }
>          return;
>      }
>  
> -- 
> 2.31.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/3] migration: Clear error at entry of migrate_fd_connect()
  2021-07-08 19:06 ` [PATCH 3/3] migration: Clear error at entry of migrate_fd_connect() Peter Xu
@ 2021-07-12 18:40   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 9+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-12 18:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: Li Xiaohui, Lukas Straub, qemu-devel,
	Leonardo Bras Soares Passos, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> For each "migrate" command, remember to clear the s->error before going on.
> For one reason, when there's a new error it'll be still remembered; see
> migrate_set_error() who only sets the error if error==NULL.  Meanwhile if a
> failed migration completes (e.g., postcopy recovered and finished), we
> shouldn't dump an error when calling migrate_fd_cleanup() at last.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/migration.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index bb1edf862a..338df01e97 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1855,6 +1855,15 @@ void migrate_set_error(MigrationState *s, const Error *error)
>      }
>  }
>  
> +static void migrate_error_free(MigrationState *s)
> +{
> +    QEMU_LOCK_GUARD(&s->error_mutex);
> +    if (s->error) {
> +        error_free(s->error);
> +        s->error = NULL;
> +    }
> +}
> +
>  void migrate_fd_error(MigrationState *s, const Error *error)
>  {
>      trace_migrate_fd_error(error_get_pretty(error));
> @@ -3966,6 +3975,13 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>      int64_t rate_limit;
>      bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED;
>  
> +    /*
> +     * If there's a previous error, free it and prepare for another one.
> +     * Meanwhile if migration completes successfully, there won't have an error
> +     * dumped when calling migrate_fd_cleanup().
> +     */
> +    migrate_error_free(s);
> +
>      s->expected_downtime = s->parameters.downtime_limit;
>      if (resume) {
>          assert(s->cleanup_bh);
> -- 
> 2.31.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3] migration: Three more fixes for postcopy recovery
  2021-07-08 19:06 [PATCH 0/3] migration: Three more fixes for postcopy recovery Peter Xu
                   ` (2 preceding siblings ...)
  2021-07-08 19:06 ` [PATCH 3/3] migration: Clear error at entry of migrate_fd_connect() Peter Xu
@ 2021-07-13 11:04 ` Dr. David Alan Gilbert
  3 siblings, 0 replies; 9+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-13 11:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Li Xiaohui, Lukas Straub, qemu-devel,
	Leonardo Bras Soares Passos, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> A few more issues spotted by either Xiaohui or me.  Please see and review
> individual patches for what they do.  Thanks,
> 
> Peter Xu (3):
>   migration: Release return path early for paused postcopy
>   migration: Don't do migrate cleanup if during postcopy resume
>   migration: Clear error at entry of migrate_fd_connect()
> 
>  migration/migration.c | 37 ++++++++++++++++++++++++++++++++-----
>  1 file changed, 32 insertions(+), 5 deletions(-)

Queued

> 
> -- 
> 2.31.1
> 
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-07-13 11:06 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-08 19:06 [PATCH 0/3] migration: Three more fixes for postcopy recovery Peter Xu
2021-07-08 19:06 ` [PATCH 1/3] migration: Release return path early for paused postcopy Peter Xu
2021-07-08 19:13   ` Peter Xu
2021-07-12 17:44   ` Dr. David Alan Gilbert
2021-07-08 19:06 ` [PATCH 2/3] migration: Don't do migrate cleanup if during postcopy resume Peter Xu
2021-07-12 18:33   ` Dr. David Alan Gilbert
2021-07-08 19:06 ` [PATCH 3/3] migration: Clear error at entry of migrate_fd_connect() Peter Xu
2021-07-12 18:40   ` Dr. David Alan Gilbert
2021-07-13 11:04 ` [PATCH 0/3] migration: Three more fixes for postcopy recovery Dr. David Alan Gilbert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.