All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, Laurent Vivier <lvivier@redhat.com>,
	Alexey Perevalov <a.perevalov@samsung.com>,
	Juan Quintela <quintela@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [Qemu-devel] [RFC 11/29] migration: new postcopy-pause state
Date: Mon, 31 Jul 2017 20:06:18 +0100	[thread overview]
Message-ID: <20170731190617.GC3095@work-vm> (raw)
In-Reply-To: <1501229198-30588-12-git-send-email-peterx@redhat.com>

* Peter Xu (peterx@redhat.com) wrote:
> Introducing a new state "postcopy-paused", which can be used to pause a
> postcopy migration. It is targeted to support network failures during
> postcopy migration. Now when network down for postcopy, the source side
> will not fail the migration. Instead we convert the status into this new
> paused state, and we will try to wait for a rescue in the future.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

I think this should probably be split into:
   a) A patch that adds a new state and the entries in query_migrate etc
   b) A patch that wires up the semaphore and the use of the state.

> ---
>  migration/migration.c  | 78 +++++++++++++++++++++++++++++++++++++++++++++++---
>  migration/migration.h  |  3 ++
>  migration/trace-events |  1 +
>  qapi-schema.json       |  5 +++-
>  4 files changed, 82 insertions(+), 5 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index efee87e..0bc70c8 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -470,6 +470,7 @@ static bool migration_is_setup_or_active(int state)
>      switch (state) {
>      case MIGRATION_STATUS_ACTIVE:
>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> +    case MIGRATION_STATUS_POSTCOPY_PAUSED:
>      case MIGRATION_STATUS_SETUP:
>          return true;
>  
> @@ -545,6 +546,7 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>      case MIGRATION_STATUS_ACTIVE:
>      case MIGRATION_STATUS_CANCELLING:
>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> +    case MIGRATION_STATUS_POSTCOPY_PAUSED:
>           /* TODO add some postcopy stats */
>          info->has_status = true;
>          info->has_total_time = true;
> @@ -991,6 +993,8 @@ static void migrate_fd_cleanup(void *opaque)
>  
>      notifier_list_notify(&migration_state_notifiers, s);
>      block_cleanup_parameters(s);
> +
> +    qemu_sem_destroy(&s->postcopy_pause_sem);
>  }
>  
>  void migrate_fd_error(MigrationState *s, const Error *error)
> @@ -1134,6 +1138,7 @@ MigrationState *migrate_init(void)
>      s->migration_thread_running = false;
>      error_free(s->error);
>      s->error = NULL;
> +    qemu_sem_init(&s->postcopy_pause_sem, 0);
>  
>      migrate_set_state(&s->state, MIGRATION_STATUS_NONE, MIGRATION_STATUS_SETUP);
>  
> @@ -1942,6 +1947,69 @@ static bool postcopy_should_start(MigrationState *s)
>  }
>  
>  /*
> + * We don't return until we are in a safe state to continue current
> + * postcopy migration.  Returns true to continue the migration, or
> + * false to terminate current migration.
> + */
> +static bool postcopy_pause(MigrationState *s)
> +{
> +    assert(s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);

I never like asserts on the sending side.

> +    migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> +                      MIGRATION_STATUS_POSTCOPY_PAUSED);
> +
> +    /* Current channel is possibly broken. Release it. */
> +    assert(s->to_dst_file);
> +    qemu_file_shutdown(s->to_dst_file);
> +    qemu_fclose(s->to_dst_file);
> +    s->to_dst_file = NULL;

That does scare me a little; I think it's OK, I'm not sure what happens
to the ->from_dst_file fd and the return-path processing.

> +    /*
> +     * We wait until things fixed up. Then someone will setup the
> +     * status back for us.
> +     */
> +    while (s->state == MIGRATION_STATUS_POSTCOPY_PAUSED) {
> +        qemu_sem_wait(&s->postcopy_pause_sem);
> +    }

Something should get written to stderr prior to this, so when we
find a migration apparently stuck we can tell why.

Dave

> +
> +    trace_postcopy_pause_continued();
> +
> +    return true;
> +}
> +
> +/* Return true if we want to stop the migration, otherwise false. */
> +static bool migration_detect_error(MigrationState *s)
> +{
> +    int ret;
> +
> +    /* Try to detect any file errors */
> +    ret = qemu_file_get_error(s->to_dst_file);
> +
> +    if (!ret) {
> +        /* Everything is fine */
> +        return false;
> +    }
> +
> +    if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret == -EIO) {
> +        /*
> +         * For postcopy, we allow the network to be down for a
> +         * while. After that, it can be continued by a
> +         * recovery phase.
> +         */
> +        return !postcopy_pause(s);
> +    } else {
> +        /*
> +         * For precopy (or postcopy with error outside IO), we fail
> +         * with no time.
> +         */
> +        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> +        trace_migration_thread_file_err();
> +
> +        /* Time to stop the migration, now. */
> +        return true;
> +    }
> +}
> +
> +/*
>   * Master migration thread on the source VM.
>   * It drives the migration and pumps the data down the outgoing channel.
>   */
> @@ -2037,12 +2105,14 @@ static void *migration_thread(void *opaque)
>              }
>          }
>  
> -        if (qemu_file_get_error(s->to_dst_file)) {
> -            migrate_set_state(&s->state, current_active_state,
> -                              MIGRATION_STATUS_FAILED);
> -            trace_migration_thread_file_err();
> +        /*
> +         * Try to detect any kind of failures, and see whether we
> +         * should stop the migration now.
> +         */
> +        if (migration_detect_error(s)) {
>              break;
>          }
> +
>          current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>          if (current_time >= initial_time + BUFFER_DELAY) {
>              uint64_t transferred_bytes = qemu_ftell(s->to_dst_file) -
> diff --git a/migration/migration.h b/migration/migration.h
> index e902bae..24cdaf6 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -151,6 +151,9 @@ struct MigrationState
>      bool send_configuration;
>      /* Whether we send section footer during migration */
>      bool send_section_footer;
> +
> +    /* Needed by postcopy-pause state */
> +    QemuSemaphore postcopy_pause_sem;
>  };
>  
>  void migrate_set_state(int *state, int old_state, int new_state);
> diff --git a/migration/trace-events b/migration/trace-events
> index 08d00fa..2211acc 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -98,6 +98,7 @@ migration_thread_setup_complete(void) ""
>  open_return_path_on_source(void) ""
>  open_return_path_on_source_continue(void) ""
>  postcopy_start(void) ""
> +postcopy_pause_continued(void) ""
>  postcopy_start_set_run(void) ""
>  source_return_path_thread_bad_end(void) ""
>  source_return_path_thread_end(void) ""
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 9c6c3e1..2a36b80 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -667,6 +667,8 @@
>  #
>  # @postcopy-active: like active, but now in postcopy mode. (since 2.5)
>  #
> +# @postcopy-paused: during postcopy but paused. (since 2.10)
> +#
>  # @completed: migration is finished.
>  #
>  # @failed: some error occurred during migration process.
> @@ -679,7 +681,8 @@
>  ##
>  { 'enum': 'MigrationStatus',
>    'data': [ 'none', 'setup', 'cancelling', 'cancelled',
> -            'active', 'postcopy-active', 'completed', 'failed', 'colo' ] }
> +            'active', 'postcopy-active', 'postcopy-paused',
> +            'completed', 'failed', 'colo' ] }
>  
>  ##
>  # @MigrationInfo:
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

  parent reply	other threads:[~2017-07-31 19:06 UTC|newest]

Thread overview: 116+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-28  8:06 [Qemu-devel] [RFC 00/29] Migration: postcopy failure recovery Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 01/29] migration: fix incorrect postcopy recved_bitmap Peter Xu
2017-07-31 16:34   ` Dr. David Alan Gilbert
2017-08-01  2:11     ` Peter Xu
2017-08-01  5:48       ` Alexey Perevalov
2017-08-01  6:02         ` Peter Xu
2017-08-01  6:12           ` Alexey Perevalov
2017-07-28  8:06 ` [Qemu-devel] [RFC 02/29] migration: fix comment disorder in RAMState Peter Xu
2017-07-31 16:39   ` Dr. David Alan Gilbert
2017-07-28  8:06 ` [Qemu-devel] [RFC 03/29] io: fix qio_channel_socket_accept err handling Peter Xu
2017-07-31 16:53   ` Dr. David Alan Gilbert
2017-08-01  2:25     ` Peter Xu
2017-08-01  8:32       ` Daniel P. Berrange
2017-08-01  8:55         ` Dr. David Alan Gilbert
2017-08-02  3:21           ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 04/29] bitmap: introduce bitmap_invert() Peter Xu
2017-07-31 17:11   ` Dr. David Alan Gilbert
2017-08-01  2:43     ` Peter Xu
2017-08-01  8:40       ` Dr. David Alan Gilbert
2017-08-02  3:20         ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 05/29] bitmap: introduce bitmap_count_one() Peter Xu
2017-07-31 17:58   ` Dr. David Alan Gilbert
2017-07-28  8:06 ` [Qemu-devel] [RFC 06/29] migration: dump str in migrate_set_state trace Peter Xu
2017-07-31 18:27   ` Dr. David Alan Gilbert
2017-07-28  8:06 ` [Qemu-devel] [RFC 07/29] migration: better error handling with QEMUFile Peter Xu
2017-07-31 18:39   ` Dr. David Alan Gilbert
2017-08-01  5:49     ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 08/29] migration: reuse mis->userfault_quit_fd Peter Xu
2017-07-31 18:42   ` Dr. David Alan Gilbert
2017-07-28  8:06 ` [Qemu-devel] [RFC 09/29] migration: provide postcopy_fault_thread_notify() Peter Xu
2017-07-31 18:45   ` Dr. David Alan Gilbert
2017-08-01  3:01     ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 10/29] migration: new property "x-postcopy-fast" Peter Xu
2017-07-31 18:52   ` Dr. David Alan Gilbert
2017-08-01  3:13     ` Peter Xu
2017-08-01  8:50       ` Dr. David Alan Gilbert
2017-08-02  3:31         ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 11/29] migration: new postcopy-pause state Peter Xu
2017-07-28 15:53   ` Eric Blake
2017-07-31  7:02     ` Peter Xu
2017-07-31 19:06   ` Dr. David Alan Gilbert [this message]
2017-08-01  6:28     ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 12/29] migration: allow dst vm pause on postcopy Peter Xu
2017-08-01  9:47   ` Dr. David Alan Gilbert
2017-08-02  5:06     ` Peter Xu
2017-08-03 14:03       ` Dr. David Alan Gilbert
2017-08-04  3:43         ` Peter Xu
2017-08-04  9:33           ` Dr. David Alan Gilbert
2017-08-04  9:44             ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 13/29] migration: allow src return path to pause Peter Xu
2017-08-01 10:01   ` Dr. David Alan Gilbert
2017-07-28  8:06 ` [Qemu-devel] [RFC 14/29] migration: allow send_rq to fail Peter Xu
2017-08-01 10:30   ` Dr. David Alan Gilbert
2017-07-28  8:06 ` [Qemu-devel] [RFC 15/29] migration: allow fault thread to pause Peter Xu
2017-08-01 10:41   ` Dr. David Alan Gilbert
2017-07-28  8:06 ` [Qemu-devel] [RFC 16/29] qmp: hmp: add migrate "resume" option Peter Xu
2017-07-28 15:57   ` Eric Blake
2017-07-31  7:05     ` Peter Xu
2017-08-01 10:42   ` Dr. David Alan Gilbert
2017-08-01 11:03   ` Daniel P. Berrange
2017-08-02  5:56     ` Peter Xu
2017-08-02  9:28       ` Daniel P. Berrange
2017-07-28  8:06 ` [Qemu-devel] [RFC 17/29] migration: rebuild channel on source Peter Xu
2017-08-01 10:59   ` Dr. David Alan Gilbert
2017-08-02  6:14     ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 18/29] migration: new state "postcopy-recover" Peter Xu
2017-08-01 11:36   ` Dr. David Alan Gilbert
2017-08-02  6:42     ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 19/29] migration: let dst listen on port always Peter Xu
2017-08-01 10:56   ` Daniel P. Berrange
2017-08-02  7:02     ` Peter Xu
2017-08-02  9:26       ` Daniel P. Berrange
2017-08-02 11:02         ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 20/29] migration: wakeup dst ram-load-thread for recover Peter Xu
2017-08-03  9:28   ` Dr. David Alan Gilbert
2017-08-04  5:46     ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 21/29] migration: new cmd MIG_CMD_RECV_BITMAP Peter Xu
2017-08-03  9:49   ` Dr. David Alan Gilbert
2017-08-04  6:08     ` Peter Xu
2017-08-04  6:15       ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 22/29] migration: new message MIG_RP_MSG_RECV_BITMAP Peter Xu
2017-08-03 10:50   ` Dr. David Alan Gilbert
2017-08-04  6:59     ` Peter Xu
2017-08-04  9:49       ` Dr. David Alan Gilbert
2017-08-07  6:11         ` Peter Xu
2017-08-07  9:04           ` Dr. David Alan Gilbert
2017-07-28  8:06 ` [Qemu-devel] [RFC 23/29] migration: new cmd MIG_CMD_POSTCOPY_RESUME Peter Xu
2017-08-03 11:05   ` Dr. David Alan Gilbert
2017-08-04  7:04     ` Peter Xu
2017-08-04  7:09       ` Peter Xu
2017-08-04  8:30         ` Dr. David Alan Gilbert
2017-08-04  9:22           ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 24/29] migration: new message MIG_RP_MSG_RESUME_ACK Peter Xu
2017-08-03 11:21   ` Dr. David Alan Gilbert
2017-08-04  7:23     ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 25/29] migration: introduce SaveVMHandlers.resume_prepare Peter Xu
2017-08-03 11:38   ` Dr. David Alan Gilbert
2017-08-04  7:39     ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 26/29] migration: synchronize dirty bitmap for resume Peter Xu
2017-08-03 11:56   ` Dr. David Alan Gilbert
2017-08-04  7:49     ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 27/29] migration: setup ramstate " Peter Xu
2017-08-03 12:37   ` Dr. David Alan Gilbert
2017-08-04  8:39     ` Peter Xu
2017-07-28  8:06 ` [Qemu-devel] [RFC 28/29] migration: final handshake for the resume Peter Xu
2017-08-03 13:47   ` Dr. David Alan Gilbert
2017-08-04  9:05     ` Peter Xu
2017-08-04  9:53       ` Dr. David Alan Gilbert
2017-07-28  8:06 ` [Qemu-devel] [RFC 29/29] migration: reset migrate thread vars when resumed Peter Xu
2017-08-03 13:54   ` Dr. David Alan Gilbert
2017-08-04  8:52     ` Peter Xu
2017-08-04  9:52       ` Dr. David Alan Gilbert
2017-08-07  6:57         ` Peter Xu
2017-07-28 10:06 ` [Qemu-devel] [RFC 00/29] Migration: postcopy failure recovery Peter Xu
2017-08-03 15:57 ` Dr. David Alan Gilbert
2017-08-21  7:47   ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170731190617.GC3095@work-vm \
    --to=dgilbert@redhat.com \
    --cc=a.perevalov@samsung.com \
    --cc=aarcange@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.