All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling
@ 2017-12-15 17:16 Dr. David Alan Gilbert (git)
  2017-12-15 17:16 ` [Qemu-devel] [PATCH 1/2] migration: Allow migrate_fd_connect to take an Error * Dr. David Alan Gilbert (git)
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-12-15 17:16 UTC (permalink / raw)
  To: qemu-devel, berrange; +Cc: quintela, peterx

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Hi,
  Where a channel fails asynchronously during connect, call
back through the migration code so it can clean up.
  In particular this causes the transition of a 'cancelling' state
to 'cancelled' in the case of:

    migrate -d tcp:deadhost:port
     <host tries to connect>
    migrate_cancel

previously the status would get stuck in cancelling because
the final cleanup didn't happen.

  This is the second part of the fix for:
https://bugzilla.redhat.com/show_bug.cgi?id=1525899

Dave

Dr. David Alan Gilbert (2):
  migration: Allow migrate_fd_connect to take an Error *
  migration: Route errors down  through migration_channel_connect

 migration/channel.c    | 32 ++++++++++++++++----------------
 migration/channel.h    |  3 ++-
 migration/exec.c       |  2 +-
 migration/fd.c         |  2 +-
 migration/migration.c  |  7 ++++++-
 migration/migration.h  |  2 +-
 migration/rdma.c       |  2 +-
 migration/socket.c     |  4 +---
 migration/tls.c        |  3 +--
 migration/trace-events |  2 +-
 10 files changed, 31 insertions(+), 28 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH 1/2] migration: Allow migrate_fd_connect to take an Error *
  2017-12-15 17:16 [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling Dr. David Alan Gilbert (git)
@ 2017-12-15 17:16 ` Dr. David Alan Gilbert (git)
  2018-01-28 18:53   ` Juan Quintela
  2017-12-15 17:16 ` [Qemu-devel] [PATCH 2/2] migration: Route errors down through migration_channel_connect Dr. David Alan Gilbert (git)
  2017-12-19  5:16 ` [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling Peter Xu
  2 siblings, 1 reply; 11+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-12-15 17:16 UTC (permalink / raw)
  To: qemu-devel, berrange; +Cc: quintela, peterx

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Allow whatever is performing the connection to pass migrate_fd_connect
an error to indicate there was a problem during connection, an allow
us to clean up.

The caller must free the error.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/channel.c   | 2 +-
 migration/migration.c | 7 ++++++-
 migration/migration.h | 2 +-
 migration/rdma.c      | 2 +-
 4 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/migration/channel.c b/migration/channel.c
index 70ec7ea3b7..fdb7ddbd17 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -78,6 +78,6 @@ void migration_channel_connect(MigrationState *s,
 
         s->to_dst_file = f;
 
-        migrate_fd_connect(s);
+        migrate_fd_connect(s, NULL);
     }
 }
diff --git a/migration/migration.c b/migration/migration.c
index 4de3b551fe..10a4260f27 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2333,10 +2333,15 @@ static void *migration_thread(void *opaque)
     return NULL;
 }
 
-void migrate_fd_connect(MigrationState *s)
+void migrate_fd_connect(MigrationState *s, Error *error_in)
 {
     s->expected_downtime = s->parameters.downtime_limit;
     s->cleanup_bh = qemu_bh_new(migrate_fd_cleanup, s);
+    if (error_in) {
+        migrate_fd_error(s, error_in);
+        migrate_fd_cleanup(s);
+        return;
+    }
 
     qemu_file_set_blocking(s->to_dst_file, true);
     qemu_file_set_rate_limit(s->to_dst_file,
diff --git a/migration/migration.h b/migration/migration.h
index 663415fe48..e67e5e87c2 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -168,7 +168,7 @@ uint64_t migrate_max_downtime(void);
 void migrate_set_error(MigrationState *s, const Error *error);
 void migrate_fd_error(MigrationState *s, const Error *error);
 
-void migrate_fd_connect(MigrationState *s);
+void migrate_fd_connect(MigrationState *s, Error *error_in);
 
 MigrationState *migrate_init(void);
 bool migration_is_blocked(Error **errp);
diff --git a/migration/rdma.c b/migration/rdma.c
index ca56594328..cc7049c594 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -3758,7 +3758,7 @@ void rdma_start_outgoing_migration(void *opaque,
     trace_rdma_start_outgoing_migration_after_rdma_connect();
 
     s->to_dst_file = qemu_fopen_rdma(rdma, "wb");
-    migrate_fd_connect(s);
+    migrate_fd_connect(s, NULL);
     return;
 err:
     g_free(rdma);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH 2/2] migration: Route errors down through migration_channel_connect
  2017-12-15 17:16 [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling Dr. David Alan Gilbert (git)
  2017-12-15 17:16 ` [Qemu-devel] [PATCH 1/2] migration: Allow migrate_fd_connect to take an Error * Dr. David Alan Gilbert (git)
@ 2017-12-15 17:16 ` Dr. David Alan Gilbert (git)
  2018-01-28 18:53   ` Juan Quintela
  2017-12-19  5:16 ` [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling Peter Xu
  2 siblings, 1 reply; 11+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-12-15 17:16 UTC (permalink / raw)
  To: qemu-devel, berrange; +Cc: quintela, peterx

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Route async errors (especially from sockets) down through
migration_channel_connect and on to migrate_fd_connect where they
can be cleaned up.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/channel.c    | 32 ++++++++++++++++----------------
 migration/channel.h    |  3 ++-
 migration/exec.c       |  2 +-
 migration/fd.c         |  2 +-
 migration/socket.c     |  4 +---
 migration/tls.c        |  3 +--
 migration/trace-events |  2 +-
 7 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/migration/channel.c b/migration/channel.c
index fdb7ddbd17..c5eaf0fa0e 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -55,29 +55,29 @@ void migration_channel_process_incoming(QIOChannel *ioc)
  * @s: Current migration state
  * @ioc: Channel to which we are connecting
  * @hostname: Where we want to connect
+ * @error: Error indicating failure to connect, free'd here
  */
 void migration_channel_connect(MigrationState *s,
                                QIOChannel *ioc,
-                               const char *hostname)
+                               const char *hostname,
+                               Error *error)
 {
     trace_migration_set_outgoing_channel(
-        ioc, object_get_typename(OBJECT(ioc)), hostname);
+        ioc, object_get_typename(OBJECT(ioc)), hostname, error);
 
-    if (s->parameters.tls_creds &&
-        *s->parameters.tls_creds &&
-        !object_dynamic_cast(OBJECT(ioc),
-                             TYPE_QIO_CHANNEL_TLS)) {
-        Error *local_err = NULL;
-        migration_tls_channel_connect(s, ioc, hostname, &local_err);
-        if (local_err) {
-            migrate_fd_error(s, local_err);
-            error_free(local_err);
-        }
-    } else {
-        QEMUFile *f = qemu_fopen_channel_output(ioc);
+    if (!error) {
+        if (s->parameters.tls_creds &&
+            *s->parameters.tls_creds &&
+            !object_dynamic_cast(OBJECT(ioc),
+                                 TYPE_QIO_CHANNEL_TLS)) {
+            migration_tls_channel_connect(s, ioc, hostname, &error);
+        } else {
+            QEMUFile *f = qemu_fopen_channel_output(ioc);
 
-        s->to_dst_file = f;
+            s->to_dst_file = f;
 
-        migrate_fd_connect(s, NULL);
+        }
     }
+    migrate_fd_connect(s, error);
+    error_free(error);
 }
diff --git a/migration/channel.h b/migration/channel.h
index e4b40579a1..67a461c28a 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -22,5 +22,6 @@ void migration_channel_process_incoming(QIOChannel *ioc);
 
 void migration_channel_connect(MigrationState *s,
                                QIOChannel *ioc,
-                               const char *hostname);
+                               const char *hostname,
+                               Error *error_in);
 #endif
diff --git a/migration/exec.c b/migration/exec.c
index f3be1baf2e..c9537974ad 100644
--- a/migration/exec.c
+++ b/migration/exec.c
@@ -39,7 +39,7 @@ void exec_start_outgoing_migration(MigrationState *s, const char *command, Error
     }
 
     qio_channel_set_name(ioc, "migration-exec-outgoing");
-    migration_channel_connect(s, ioc, NULL);
+    migration_channel_connect(s, ioc, NULL, NULL);
     object_unref(OBJECT(ioc));
 }
 
diff --git a/migration/fd.c b/migration/fd.c
index 30de4b9847..6284a97cba 100644
--- a/migration/fd.c
+++ b/migration/fd.c
@@ -39,7 +39,7 @@ void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error **
     }
 
     qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing");
-    migration_channel_connect(s, ioc, NULL);
+    migration_channel_connect(s, ioc, NULL, NULL);
     object_unref(OBJECT(ioc));
 }
 
diff --git a/migration/socket.c b/migration/socket.c
index dee869044a..eff685f7b2 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -79,12 +79,10 @@ static void socket_outgoing_migration(QIOTask *task,
 
     if (qio_task_propagate_error(task, &err)) {
         trace_migration_socket_outgoing_error(error_get_pretty(err));
-        migrate_fd_error(data->s, err);
-        error_free(err);
     } else {
         trace_migration_socket_outgoing_connected(data->hostname);
-        migration_channel_connect(data->s, sioc, data->hostname);
     }
+    migration_channel_connect(data->s, sioc, data->hostname, err);
     object_unref(OBJECT(sioc));
 }
 
diff --git a/migration/tls.c b/migration/tls.c
index 026a008667..a29b35b33c 100644
--- a/migration/tls.c
+++ b/migration/tls.c
@@ -118,11 +118,10 @@ static void migration_tls_outgoing_handshake(QIOTask *task,
 
     if (qio_task_propagate_error(task, &err)) {
         trace_migration_tls_outgoing_handshake_error(error_get_pretty(err));
-        migrate_fd_error(s, err);
     } else {
         trace_migration_tls_outgoing_handshake_complete();
-        migration_channel_connect(s, ioc, NULL);
     }
+    migration_channel_connect(s, ioc, NULL, err);
     object_unref(OBJECT(ioc));
 }
 
diff --git a/migration/trace-events b/migration/trace-events
index 6f29fcc686..93961dea16 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -114,7 +114,7 @@ migrate_transferred(uint64_t tranferred, uint64_t time_spent, double bandwidth,
 process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
 process_incoming_migration_co_postcopy_end_main(void) ""
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
-migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
+migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname, void *err)  "ioc=%p ioctype=%s hostname=%s err=%p"
 
 # migration/rdma.c
 qemu_rdma_accept_incoming_migration(void) ""
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling
  2017-12-15 17:16 [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling Dr. David Alan Gilbert (git)
  2017-12-15 17:16 ` [Qemu-devel] [PATCH 1/2] migration: Allow migrate_fd_connect to take an Error * Dr. David Alan Gilbert (git)
  2017-12-15 17:16 ` [Qemu-devel] [PATCH 2/2] migration: Route errors down through migration_channel_connect Dr. David Alan Gilbert (git)
@ 2017-12-19  5:16 ` Peter Xu
  2017-12-19 10:14   ` Dr. David Alan Gilbert
  2 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2017-12-19  5:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, berrange, quintela

On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Hi,
>   Where a channel fails asynchronously during connect, call
> back through the migration code so it can clean up.
>   In particular this causes the transition of a 'cancelling' state
> to 'cancelled' in the case of:
> 
>     migrate -d tcp:deadhost:port
>      <host tries to connect>
>     migrate_cancel
> 
> previously the status would get stuck in cancelling because
> the final cleanup didn't happen.
> 
>   This is the second part of the fix for:
> https://bugzilla.redhat.com/show_bug.cgi?id=1525899

IIUC this series tries to deliver the connection error a long way
until migrate_fd_connect() to handle it. But, haven't we already have
a function migrate_fd_error() to do that (which is faster, and
simpler)?

void migrate_fd_error(MigrationState *s, const Error *error)
{
    trace_migrate_fd_error(error_get_pretty(error));
    assert(s->to_dst_file == NULL);
    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
                      MIGRATION_STATUS_FAILED);
    migrate_set_error(s, error);
    notifier_list_notify(&migration_state_notifiers, s);
    block_cleanup_parameters(s);
}

I think it's not handling the case when cancelling.  If we let it to
handle the cancelling case well, would it be a simpler fix?

Moreover, I think this is another good example that migration is not
handling the cleanup "cleanly" in general... I really hope we can do
this better in 2.12.  I'll see whether I can give it a shot, but in
all cases it'll be after the merging of existing patches since there
are already quite a lot of dangling patches.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling
  2017-12-19  5:16 ` [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling Peter Xu
@ 2017-12-19 10:14   ` Dr. David Alan Gilbert
  2017-12-19 11:21     ` Peter Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Dr. David Alan Gilbert @ 2017-12-19 10:14 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, berrange, quintela

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Hi,
> >   Where a channel fails asynchronously during connect, call
> > back through the migration code so it can clean up.
> >   In particular this causes the transition of a 'cancelling' state
> > to 'cancelled' in the case of:
> > 
> >     migrate -d tcp:deadhost:port
> >      <host tries to connect>
> >     migrate_cancel
> > 
> > previously the status would get stuck in cancelling because
> > the final cleanup didn't happen.
> > 
> >   This is the second part of the fix for:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1525899
> 
> IIUC this series tries to deliver the connection error a long way
> until migrate_fd_connect() to handle it. But, haven't we already have
> a function migrate_fd_error() to do that (which is faster, and
> simpler)?
> 
> void migrate_fd_error(MigrationState *s, const Error *error)
> {
>     trace_migrate_fd_error(error_get_pretty(error));
>     assert(s->to_dst_file == NULL);
>     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                       MIGRATION_STATUS_FAILED);
>     migrate_set_error(s, error);
>     notifier_list_notify(&migration_state_notifiers, s);
>     block_cleanup_parameters(s);
> }
> 
> I think it's not handling the case when cancelling.  If we let it to
> handle the cancelling case well, would it be a simpler fix?
>
> Moreover, I think this is another good example that migration is not
> handling the cleanup "cleanly" in general... I really hope we can do
> this better in 2.12.  I'll see whether I can give it a shot, but in
> all cases it'll be after the merging of existing patches since there
> are already quite a lot of dangling patches.

No, I think migrate_fd_error is the cause of the problem here, not the
answer.

If we stick to the simple rule that a migration must always call
migrate_fd_cleanup then the cancellation problems are fixed - I think
that's how we make migration 'clean' - a single cleanup routine
that always gets called.

Dave


> Thanks,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling
  2017-12-19 10:14   ` Dr. David Alan Gilbert
@ 2017-12-19 11:21     ` Peter Xu
  2017-12-19 11:33       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2017-12-19 11:21 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: qemu-devel, berrange, quintela

On Tue, Dec 19, 2017 at 10:14:08AM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Hi,
> > >   Where a channel fails asynchronously during connect, call
> > > back through the migration code so it can clean up.
> > >   In particular this causes the transition of a 'cancelling' state
> > > to 'cancelled' in the case of:
> > > 
> > >     migrate -d tcp:deadhost:port
> > >      <host tries to connect>
> > >     migrate_cancel
> > > 
> > > previously the status would get stuck in cancelling because
> > > the final cleanup didn't happen.
> > > 
> > >   This is the second part of the fix for:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1525899
> > 
> > IIUC this series tries to deliver the connection error a long way
> > until migrate_fd_connect() to handle it. But, haven't we already have
> > a function migrate_fd_error() to do that (which is faster, and
> > simpler)?
> > 
> > void migrate_fd_error(MigrationState *s, const Error *error)
> > {
> >     trace_migrate_fd_error(error_get_pretty(error));
> >     assert(s->to_dst_file == NULL);
> >     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> >                       MIGRATION_STATUS_FAILED);
> >     migrate_set_error(s, error);
> >     notifier_list_notify(&migration_state_notifiers, s);
> >     block_cleanup_parameters(s);
> > }
> > 
> > I think it's not handling the case when cancelling.  If we let it to
> > handle the cancelling case well, would it be a simpler fix?
> >
> > Moreover, I think this is another good example that migration is not
> > handling the cleanup "cleanly" in general... I really hope we can do
> > this better in 2.12.  I'll see whether I can give it a shot, but in
> > all cases it'll be after the merging of existing patches since there
> > are already quite a lot of dangling patches.
> 
> No, I think migrate_fd_error is the cause of the problem here, not the
> answer.

Could I ask why migrate_fd_error() is problematic?  Yeah I agree that
we should have a single point to clean things up, then can we call
migrate_fd_cleanup() somehow inside migrate_fd_error()?

The thing I don't really understand is: why we want the error be
delivered via those functions (migration_channel_connect,
migrate_fd_connect, etc.) to finally be cleaned up.

> 
> If we stick to the simple rule that a migration must always call
> migrate_fd_cleanup then the cancellation problems are fixed - I think
> that's how we make migration 'clean' - a single cleanup routine
> that always gets called.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling
  2017-12-19 11:21     ` Peter Xu
@ 2017-12-19 11:33       ` Dr. David Alan Gilbert
  2017-12-20  3:10         ` Peter Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Dr. David Alan Gilbert @ 2017-12-19 11:33 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, berrange, quintela

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Dec 19, 2017 at 10:14:08AM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > Hi,
> > > >   Where a channel fails asynchronously during connect, call
> > > > back through the migration code so it can clean up.
> > > >   In particular this causes the transition of a 'cancelling' state
> > > > to 'cancelled' in the case of:
> > > > 
> > > >     migrate -d tcp:deadhost:port
> > > >      <host tries to connect>
> > > >     migrate_cancel
> > > > 
> > > > previously the status would get stuck in cancelling because
> > > > the final cleanup didn't happen.
> > > > 
> > > >   This is the second part of the fix for:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1525899
> > > 
> > > IIUC this series tries to deliver the connection error a long way
> > > until migrate_fd_connect() to handle it. But, haven't we already have
> > > a function migrate_fd_error() to do that (which is faster, and
> > > simpler)?
> > > 
> > > void migrate_fd_error(MigrationState *s, const Error *error)
> > > {
> > >     trace_migrate_fd_error(error_get_pretty(error));
> > >     assert(s->to_dst_file == NULL);
> > >     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> > >                       MIGRATION_STATUS_FAILED);
> > >     migrate_set_error(s, error);
> > >     notifier_list_notify(&migration_state_notifiers, s);
> > >     block_cleanup_parameters(s);
> > > }
> > > 
> > > I think it's not handling the case when cancelling.  If we let it to
> > > handle the cancelling case well, would it be a simpler fix?
> > >
> > > Moreover, I think this is another good example that migration is not
> > > handling the cleanup "cleanly" in general... I really hope we can do
> > > this better in 2.12.  I'll see whether I can give it a shot, but in
> > > all cases it'll be after the merging of existing patches since there
> > > are already quite a lot of dangling patches.
> > 
> > No, I think migrate_fd_error is the cause of the problem here, not the
> > answer.
> 
> Could I ask why migrate_fd_error() is problematic?  Yeah I agree that
> we should have a single point to clean things up, then can we call
> migrate_fd_cleanup() somehow inside migrate_fd_error()?
> 
> The thing I don't really understand is: why we want the error be
> delivered via those functions (migration_channel_connect,
> migrate_fd_connect, etc.) to finally be cleaned up.

migrate_fd_cleanup has a lot of code for dealing with different cases:
   a) Closing the to_dst_file
   b) joining the thread if its already running
   c) Cleaning up multifd (stub)
   d) finishing of cancel
   e) notification
   f) block cleanup

we seem to have copied some of those into migrate_fd_error - but not all
of them.  In this case the one we're missing is (d) got finishing
cancel;
when you issue a cancel command we move from whatever state we were in
to the 'cancelling' state, various things get cleaned up and eventually
we move to cancelled; that wasn't happening because we missed the
code in migrate_fd_cleanup out.  So we could copy that code (copied code
is bad) or we could just make sure migrate_fd_cleanup is called like
it normally is.

The other thinking is that at the moment the migration/socket.c and tls.c
etc code has to choose between callbacks into the main migration code,
either migration_channel_connet or migrate_fd_error - now it's simpler,
once you've asked for an outgoing migration you always get a callback
to migration_channel_connect.  Much more predictable.

Dave

> > 
> > If we stick to the simple rule that a migration must always call
> > migrate_fd_cleanup then the cancellation problems are fixed - I think
> > that's how we make migration 'clean' - a single cleanup routine
> > that always gets called.
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling
  2017-12-19 11:33       ` Dr. David Alan Gilbert
@ 2017-12-20  3:10         ` Peter Xu
  2017-12-20 18:15           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2017-12-20  3:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: qemu-devel, berrange, quintela

On Tue, Dec 19, 2017 at 11:33:21AM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Tue, Dec 19, 2017 at 10:14:08AM +0000, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > 
> > > > > Hi,
> > > > >   Where a channel fails asynchronously during connect, call
> > > > > back through the migration code so it can clean up.
> > > > >   In particular this causes the transition of a 'cancelling' state
> > > > > to 'cancelled' in the case of:
> > > > > 
> > > > >     migrate -d tcp:deadhost:port
> > > > >      <host tries to connect>
> > > > >     migrate_cancel
> > > > > 
> > > > > previously the status would get stuck in cancelling because
> > > > > the final cleanup didn't happen.
> > > > > 
> > > > >   This is the second part of the fix for:
> > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1525899
> > > > 
> > > > IIUC this series tries to deliver the connection error a long way
> > > > until migrate_fd_connect() to handle it. But, haven't we already have
> > > > a function migrate_fd_error() to do that (which is faster, and
> > > > simpler)?
> > > > 
> > > > void migrate_fd_error(MigrationState *s, const Error *error)
> > > > {
> > > >     trace_migrate_fd_error(error_get_pretty(error));
> > > >     assert(s->to_dst_file == NULL);
> > > >     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> > > >                       MIGRATION_STATUS_FAILED);
> > > >     migrate_set_error(s, error);
> > > >     notifier_list_notify(&migration_state_notifiers, s);
> > > >     block_cleanup_parameters(s);
> > > > }
> > > > 
> > > > I think it's not handling the case when cancelling.  If we let it to
> > > > handle the cancelling case well, would it be a simpler fix?
> > > >
> > > > Moreover, I think this is another good example that migration is not
> > > > handling the cleanup "cleanly" in general... I really hope we can do
> > > > this better in 2.12.  I'll see whether I can give it a shot, but in
> > > > all cases it'll be after the merging of existing patches since there
> > > > are already quite a lot of dangling patches.
> > > 
> > > No, I think migrate_fd_error is the cause of the problem here, not the
> > > answer.
> > 
> > Could I ask why migrate_fd_error() is problematic?  Yeah I agree that
> > we should have a single point to clean things up, then can we call
> > migrate_fd_cleanup() somehow inside migrate_fd_error()?
> > 
> > The thing I don't really understand is: why we want the error be
> > delivered via those functions (migration_channel_connect,
> > migrate_fd_connect, etc.) to finally be cleaned up.
> 
> migrate_fd_cleanup has a lot of code for dealing with different cases:
>    a) Closing the to_dst_file
>    b) joining the thread if its already running
>    c) Cleaning up multifd (stub)
>    d) finishing of cancel
>    e) notification
>    f) block cleanup
> 
> we seem to have copied some of those into migrate_fd_error - but not all
> of them.  In this case the one we're missing is (d) got finishing
> cancel;
> when you issue a cancel command we move from whatever state we were in
> to the 'cancelling' state, various things get cleaned up and eventually
> we move to cancelled; that wasn't happening because we missed the
> code in migrate_fd_cleanup out.  So we could copy that code (copied code
> is bad) or we could just make sure migrate_fd_cleanup is called like
> it normally is.

I fully agree on above.

> 
> The other thinking is that at the moment the migration/socket.c and tls.c
> etc code has to choose between callbacks into the main migration code,
> either migration_channel_connet or migrate_fd_error - now it's simpler,
> once you've asked for an outgoing migration you always get a callback
> to migration_channel_connect.  Much more predictable.

This is the point I still don't understand, on why we must go into
migrate_fd_connect(), even if error happens before that point.

What I meant is pasted at the end.  Again, I don't know whether it
works, just want to show what I meant.  I'm fine with current approach
too.  Thanks,

----------------------------------

diff --git a/migration/migration.c b/migration/migration.c                     
index 4de3b551fe..fd9b509ab1 100644    
--- a/migration/migration.c            
+++ b/migration/migration.c            
@@ -1074,8 +1074,10 @@ static void migrate_fd_cleanup(void *opaque)            
 {                                     
     MigrationState *s = opaque;       
                                       
-    qemu_bh_delete(s->cleanup_bh);    
-    s->cleanup_bh = NULL;             
+    if (s->cleanup_bh) {              
+        qemu_bh_delete(s->cleanup_bh);
+        s->cleanup_bh = NULL;         
+    }                                 
                                       
     if (s->to_dst_file) {             
         Error *local_err = NULL;      
@@ -1124,11 +1126,15 @@ void migrate_fd_error(MigrationState *s, const Error *error)                                                                          
 {                                     
     trace_migrate_fd_error(error_get_pretty(error));                          
     assert(s->to_dst_file == NULL);   
+    /*                                
+     * If we are still in setup, switch to failure.  It's also                
+     * possible that the migration has been cancelled, then we do             
+     * nothing here.                  
+     */                               
     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,                      
                       MIGRATION_STATUS_FAILED);                               
     migrate_set_error(s, error);      
-    notifier_list_notify(&migration_state_notifiers, s);                      
-    block_cleanup_parameters(s);      
+    migrate_fd_cleanup(s);            
 }                                     
                                       
 static void migrate_fd_cancel(MigrationState *s)                              

-- 
Peter Xu

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling
  2017-12-20  3:10         ` Peter Xu
@ 2017-12-20 18:15           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 11+ messages in thread
From: Dr. David Alan Gilbert @ 2017-12-20 18:15 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, berrange, quintela

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Dec 19, 2017 at 11:33:21AM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Tue, Dec 19, 2017 at 10:14:08AM +0000, Dr. David Alan Gilbert wrote:
> > > > * Peter Xu (peterx@redhat.com) wrote:
> > > > > On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > 
> > > > > > Hi,
> > > > > >   Where a channel fails asynchronously during connect, call
> > > > > > back through the migration code so it can clean up.
> > > > > >   In particular this causes the transition of a 'cancelling' state
> > > > > > to 'cancelled' in the case of:
> > > > > > 
> > > > > >     migrate -d tcp:deadhost:port
> > > > > >      <host tries to connect>
> > > > > >     migrate_cancel
> > > > > > 
> > > > > > previously the status would get stuck in cancelling because
> > > > > > the final cleanup didn't happen.
> > > > > > 
> > > > > >   This is the second part of the fix for:
> > > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1525899
> > > > > 
> > > > > IIUC this series tries to deliver the connection error a long way
> > > > > until migrate_fd_connect() to handle it. But, haven't we already have
> > > > > a function migrate_fd_error() to do that (which is faster, and
> > > > > simpler)?
> > > > > 
> > > > > void migrate_fd_error(MigrationState *s, const Error *error)
> > > > > {
> > > > >     trace_migrate_fd_error(error_get_pretty(error));
> > > > >     assert(s->to_dst_file == NULL);
> > > > >     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> > > > >                       MIGRATION_STATUS_FAILED);
> > > > >     migrate_set_error(s, error);
> > > > >     notifier_list_notify(&migration_state_notifiers, s);
> > > > >     block_cleanup_parameters(s);
> > > > > }
> > > > > 
> > > > > I think it's not handling the case when cancelling.  If we let it to
> > > > > handle the cancelling case well, would it be a simpler fix?
> > > > >
> > > > > Moreover, I think this is another good example that migration is not
> > > > > handling the cleanup "cleanly" in general... I really hope we can do
> > > > > this better in 2.12.  I'll see whether I can give it a shot, but in
> > > > > all cases it'll be after the merging of existing patches since there
> > > > > are already quite a lot of dangling patches.
> > > > 
> > > > No, I think migrate_fd_error is the cause of the problem here, not the
> > > > answer.
> > > 
> > > Could I ask why migrate_fd_error() is problematic?  Yeah I agree that
> > > we should have a single point to clean things up, then can we call
> > > migrate_fd_cleanup() somehow inside migrate_fd_error()?
> > > 
> > > The thing I don't really understand is: why we want the error be
> > > delivered via those functions (migration_channel_connect,
> > > migrate_fd_connect, etc.) to finally be cleaned up.
> > 
> > migrate_fd_cleanup has a lot of code for dealing with different cases:
> >    a) Closing the to_dst_file
> >    b) joining the thread if its already running
> >    c) Cleaning up multifd (stub)
> >    d) finishing of cancel
> >    e) notification
> >    f) block cleanup
> > 
> > we seem to have copied some of those into migrate_fd_error - but not all
> > of them.  In this case the one we're missing is (d) got finishing
> > cancel;
> > when you issue a cancel command we move from whatever state we were in
> > to the 'cancelling' state, various things get cleaned up and eventually
> > we move to cancelled; that wasn't happening because we missed the
> > code in migrate_fd_cleanup out.  So we could copy that code (copied code
> > is bad) or we could just make sure migrate_fd_cleanup is called like
> > it normally is.
> 
> I fully agree on above.
> 
> > 
> > The other thinking is that at the moment the migration/socket.c and tls.c
> > etc code has to choose between callbacks into the main migration code,
> > either migration_channel_connet or migrate_fd_error - now it's simpler,
> > once you've asked for an outgoing migration you always get a callback
> > to migration_channel_connect.  Much more predictable.
> 
> This is the point I still don't understand, on why we must go into
> migrate_fd_connect(), even if error happens before that point.

To me it just doesn't feel clean, and is the reason this error happened
in the first place.

Dave

> What I meant is pasted at the end.  Again, I don't know whether it
> works, just want to show what I meant.  I'm fine with current approach
> too.  Thanks,
> 
> ----------------------------------
> 
> diff --git a/migration/migration.c b/migration/migration.c                     
> index 4de3b551fe..fd9b509ab1 100644    
> --- a/migration/migration.c            
> +++ b/migration/migration.c            
> @@ -1074,8 +1074,10 @@ static void migrate_fd_cleanup(void *opaque)            
>  {                                     
>      MigrationState *s = opaque;       
>                                        
> -    qemu_bh_delete(s->cleanup_bh);    
> -    s->cleanup_bh = NULL;             
> +    if (s->cleanup_bh) {              
> +        qemu_bh_delete(s->cleanup_bh);
> +        s->cleanup_bh = NULL;         
> +    }                                 
>                                        
>      if (s->to_dst_file) {             
>          Error *local_err = NULL;      
> @@ -1124,11 +1126,15 @@ void migrate_fd_error(MigrationState *s, const Error *error)                                                                          
>  {                                     
>      trace_migrate_fd_error(error_get_pretty(error));                          
>      assert(s->to_dst_file == NULL);   
> +    /*                                
> +     * If we are still in setup, switch to failure.  It's also                
> +     * possible that the migration has been cancelled, then we do             
> +     * nothing here.                  
> +     */                               
>      migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,                      
>                        MIGRATION_STATUS_FAILED);                               
>      migrate_set_error(s, error);      
> -    notifier_list_notify(&migration_state_notifiers, s);                      
> -    block_cleanup_parameters(s);      
> +    migrate_fd_cleanup(s);            
>  }                                     
>                                        
>  static void migrate_fd_cancel(MigrationState *s)                              
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] migration: Allow migrate_fd_connect to take an Error *
  2017-12-15 17:16 ` [Qemu-devel] [PATCH 1/2] migration: Allow migrate_fd_connect to take an Error * Dr. David Alan Gilbert (git)
@ 2018-01-28 18:53   ` Juan Quintela
  0 siblings, 0 replies; 11+ messages in thread
From: Juan Quintela @ 2018-01-28 18:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, berrange, peterx

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Allow whatever is performing the connection to pass migrate_fd_connect
> an error to indicate there was a problem during connection, an allow
> us to clean up.
>
> The caller must free the error.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: Route errors down through migration_channel_connect
  2017-12-15 17:16 ` [Qemu-devel] [PATCH 2/2] migration: Route errors down through migration_channel_connect Dr. David Alan Gilbert (git)
@ 2018-01-28 18:53   ` Juan Quintela
  0 siblings, 0 replies; 11+ messages in thread
From: Juan Quintela @ 2018-01-28 18:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, berrange, peterx

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Route async errors (especially from sockets) down through
> migration_channel_connect and on to migrate_fd_connect where they
> can be cleaned up.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-01-28 18:54 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-15 17:16 [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling Dr. David Alan Gilbert (git)
2017-12-15 17:16 ` [Qemu-devel] [PATCH 1/2] migration: Allow migrate_fd_connect to take an Error * Dr. David Alan Gilbert (git)
2018-01-28 18:53   ` Juan Quintela
2017-12-15 17:16 ` [Qemu-devel] [PATCH 2/2] migration: Route errors down through migration_channel_connect Dr. David Alan Gilbert (git)
2018-01-28 18:53   ` Juan Quintela
2017-12-19  5:16 ` [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling Peter Xu
2017-12-19 10:14   ` Dr. David Alan Gilbert
2017-12-19 11:21     ` Peter Xu
2017-12-19 11:33       ` Dr. David Alan Gilbert
2017-12-20  3:10         ` Peter Xu
2017-12-20 18:15           ` Dr. David Alan Gilbert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.