All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/5] migrations: Fix potential rare race of migration-test after yank
@ 2021-07-21 19:34 Peter Xu
  2021-07-21 19:34 ` [PATCH v2 1/5] migration: Fix missing join() of rp_thread Peter Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Peter Xu @ 2021-07-21 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos

v2:
- Pick r-b for Dave on patch 1/3
- Move migration_file_get_ioc() from patch 5 to patch 4, meanwhile rename it to
  qemu_file_get_ioc(). [Dave]
- Patch 2 "migration: Shutdown src in await_return_path_close_on_source()" is
  replaced by patch "migration: Make from_dst_file accesses thread-safe" [Dave]

Patch 1 fixes a possible race that migration thread can accidentally skip
join() of rp_thread even if the return thread is enabled.  Patch 1 is suspected
to also be the root cause of the recent hard-to-reproduce migration-test
failure here reported by PMM:

https://lore.kernel.org/qemu-devel/YPamXAHwan%2FPPXLf@work-vm/

I didn't reproduce it myself; but after co-debugged with Dave it's suspected
that the race of rp_thread could be the cause.  It's not exposed before because
yank is soo strict on releasing instances, while we're not that strict before,
and didn't join() on rp_thread wasn't so dangerous after all when migration
succeeded before.

Patch 2 fixes another theoretical race on accessing from_dst_file spotted by
Dave.  I don't think there's known issues with it, but may still worth fixing.

Patch 3 should be a cleanup on yank that I think would be nice to have.

Patch 4-5 are further cleanups to remove the ref==1 check in channel_close(),
finally, as I always thought that's a bit hackish.  So I used explicit
unregister of the yank function at necessary places to replace that ref==1 one.

I still think having patch 3-5 altogether would be great, however I think patch
1 should still be the most important to be reviewed.  Also it would be great to
know whether patch 1 could fix the known yank crash.

Please review, thanks.

Peter Xu (5):
  migration: Fix missing join() of rp_thread
  migration: Make from_dst_file accesses thread-safe
  migration: Introduce migration_ioc_[un]register_yank()
  migration: Teach QEMUFile to be QIOChannel-aware
  migration: Move the yank unregister of channel_close out

 migration/channel.c           | 15 ++-----------
 migration/migration.c         | 41 +++++++++++++++++++++++++++-------
 migration/migration.h         | 15 ++++++++++---
 migration/multifd.c           |  8 ++-----
 migration/qemu-file-channel.c | 11 ++-------
 migration/qemu-file.c         | 17 +++++++++++++-
 migration/qemu-file.h         |  4 +++-
 migration/ram.c               |  3 ++-
 migration/savevm.c            | 11 +++++++--
 migration/yank_functions.c    | 42 +++++++++++++++++++++++++++++++++++
 migration/yank_functions.h    |  3 +++
 11 files changed, 126 insertions(+), 44 deletions(-)

-- 
2.31.1




^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/5] migration: Fix missing join() of rp_thread
  2021-07-21 19:34 [PATCH v2 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
@ 2021-07-21 19:34 ` Peter Xu
  2021-07-21 19:34 ` [PATCH v2 2/5] migration: Make from_dst_file accesses thread-safe Peter Xu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Peter Xu @ 2021-07-21 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos

It's possible that the migration thread skip the join() of the rp_thread in
below race and crash on src right at finishing migration:

       migration_thread                     rp_thread
       ----------------                     ---------
    migration_completion()
                                        (before rp_thread quits)
                                        from_dst_file=NULL
                                        [thread got scheduled out]
      s->rp_state.from_dst_file==NULL
        (skip join() of rp_thread)
    migrate_fd_cleanup()
      qemu_fclose(s->to_dst_file)
      yank_unregister_instance()
        assert(yank_find_entry())  <------- crash

It could mostly happen with postcopy, but that shouldn't be required, e.g., I
think it could also trigger with MIGRATION_CAPABILITY_RETURN_PATH set.

It's suspected that above race could be the root cause of a recent (but rare)
migration-test break reported by either Dave or PMM:

https://lore.kernel.org/qemu-devel/YPamXAHwan%2FPPXLf@work-vm/

The issue is: from_dst_file is reset in the rp_thread, so if the thread reset
it to NULL fast enough then the migration thread will assume there's no
rp_thread at all.

This could potentially cause more severe issue (e.g. crash) after the yank code.

Fix it by using a boolean to keep "whether we've created rp_thread".

Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 4 +++-
 migration/migration.h | 7 +++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 2d306582eb..21b94f75a3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2867,6 +2867,7 @@ static int open_return_path_on_source(MigrationState *ms,
 
     qemu_thread_create(&ms->rp_state.rp_thread, "return path",
                        source_return_path_thread, ms, QEMU_THREAD_JOINABLE);
+    ms->rp_state.rp_thread_created = true;
 
     trace_open_return_path_on_source_continue();
 
@@ -2891,6 +2892,7 @@ static int await_return_path_close_on_source(MigrationState *ms)
     }
     trace_await_return_path_close_on_source_joining();
     qemu_thread_join(&ms->rp_state.rp_thread);
+    ms->rp_state.rp_thread_created = false;
     trace_await_return_path_close_on_source_close();
     return ms->rp_state.error;
 }
@@ -3170,7 +3172,7 @@ static void migration_completion(MigrationState *s)
      * it will wait for the destination to send it's status in
      * a SHUT command).
      */
-    if (s->rp_state.from_dst_file) {
+    if (s->rp_state.rp_thread_created) {
         int rp_error;
         trace_migration_return_path_end_before();
         rp_error = await_return_path_close_on_source(s);
diff --git a/migration/migration.h b/migration/migration.h
index 2ebb740dfa..c302879fad 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -195,6 +195,13 @@ struct MigrationState {
         QEMUFile     *from_dst_file;
         QemuThread    rp_thread;
         bool          error;
+        /*
+         * We can also check non-zero of rp_thread, but there's no "official"
+         * way to do this, so this bool makes it slightly more elegant.
+         * Checking from_dst_file for this is racy because from_dst_file will
+         * be cleared in the rp_thread!
+         */
+        bool          rp_thread_created;
         QemuSemaphore rp_sem;
     } rp_state;
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/5] migration: Make from_dst_file accesses thread-safe
  2021-07-21 19:34 [PATCH v2 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
  2021-07-21 19:34 ` [PATCH v2 1/5] migration: Fix missing join() of rp_thread Peter Xu
@ 2021-07-21 19:34 ` Peter Xu
  2021-07-21 21:15   ` Eric Blake
  2021-07-22 15:19   ` Dr. David Alan Gilbert
  2021-07-21 19:34 ` [PATCH v2 3/5] migration: Introduce migration_ioc_[un]register_yank() Peter Xu
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 14+ messages in thread
From: Peter Xu @ 2021-07-21 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos

Accessing from_dst_file is potentially racy in current code base like below:

  if (s->from_dst_file)
    do_something(s->from_dst_file);

Because from_dst_file can be reset right after the check in another
thread (rp_thread).  One example is migrate_fd_cancel().

Use the same qemu_file_lock to protect it too, just like to_dst_file.

When it's safe to access without lock, comment it.

There's one special reference in migration_thread() that can be replaced by
the newly introduced rp_thread_created flag.

Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 32 +++++++++++++++++++++++++-------
 migration/migration.h |  8 +++++---
 migration/ram.c       |  1 +
 3 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 21b94f75a3..fa70400f98 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1879,10 +1879,12 @@ static void migrate_fd_cancel(MigrationState *s)
     QEMUFile *f = migrate_get_current()->to_dst_file;
     trace_migrate_fd_cancel();
 
+    qemu_mutex_lock(&s->qemu_file_lock);
     if (s->rp_state.from_dst_file) {
         /* shutdown the rp socket, so causing the rp thread to shutdown */
         qemu_file_shutdown(s->rp_state.from_dst_file);
     }
+    qemu_mutex_unlock(&s->qemu_file_lock);
 
     do {
         old_state = s->state;
@@ -2686,6 +2688,22 @@ static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
     return 0;
 }
 
+/* Release ms->rp_state.from_dst_file in a safe way */
+static void migration_release_from_dst_file(MigrationState *ms)
+{
+    QEMUFile *file = ms->rp_state.from_dst_file;
+
+    qemu_mutex_lock(&ms->qemu_file_lock);
+    /*
+     * Reset the from_dst_file pointer first before releasing it, as we can't
+     * block within lock section
+     */
+    ms->rp_state.from_dst_file = NULL;
+    qemu_mutex_unlock(&ms->qemu_file_lock);
+
+    qemu_fclose(file);
+}
+
 /*
  * Handles messages sent on the return path towards the source VM
  *
@@ -2827,11 +2845,13 @@ out:
              * Maybe there is something we can do: it looks like a
              * network down issue, and we pause for a recovery.
              */
-            qemu_fclose(rp);
-            ms->rp_state.from_dst_file = NULL;
+            migration_release_from_dst_file(ms);
             rp = NULL;
             if (postcopy_pause_return_path_thread(ms)) {
-                /* Reload rp, reset the rest */
+                /*
+                 * Reload rp, reset the rest.  Referencing it is save since
+                 * it's reset only by us above, or when migration completes
+                 */
                 rp = ms->rp_state.from_dst_file;
                 ms->rp_state.error = false;
                 goto retry;
@@ -2843,8 +2863,7 @@ out:
     }
 
     trace_source_return_path_thread_end();
-    ms->rp_state.from_dst_file = NULL;
-    qemu_fclose(rp);
+    migration_release_from_dst_file(ms);
     rcu_unregister_thread();
     return NULL;
 }
@@ -2852,7 +2871,6 @@ out:
 static int open_return_path_on_source(MigrationState *ms,
                                       bool create_thread)
 {
-
     ms->rp_state.from_dst_file = qemu_file_get_return_path(ms->to_dst_file);
     if (!ms->rp_state.from_dst_file) {
         return -1;
@@ -3746,7 +3764,7 @@ static void *migration_thread(void *opaque)
      * If we opened the return path, we need to make sure dst has it
      * opened as well.
      */
-    if (s->rp_state.from_dst_file) {
+    if (s->rp_state.rp_thread_created) {
         /* Now tell the dest that it should open its end so it can reply */
         qemu_savevm_send_open_return_path(s->to_dst_file);
 
diff --git a/migration/migration.h b/migration/migration.h
index c302879fad..7a5aa8c2fd 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -154,12 +154,13 @@ struct MigrationState {
     QemuThread thread;
     QEMUBH *vm_start_bh;
     QEMUBH *cleanup_bh;
+    /* Protected by qemu_file_lock */
     QEMUFile *to_dst_file;
     QIOChannelBuffer *bioc;
     /*
-     * Protects to_dst_file pointer.  We need to make sure we won't
-     * yield or hang during the critical section, since this lock will
-     * be used in OOB command handler.
+     * Protects to_dst_file/from_dst_file pointers.  We need to make sure we
+     * won't yield or hang during the critical section, since this lock will be
+     * used in OOB command handler.
      */
     QemuMutex qemu_file_lock;
 
@@ -192,6 +193,7 @@ struct MigrationState {
 
     /* State related to return path */
     struct {
+        /* Protected by qemu_file_lock */
         QEMUFile     *from_dst_file;
         QemuThread    rp_thread;
         bool          error;
diff --git a/migration/ram.c b/migration/ram.c
index b5fc454b2f..f728f5072f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4012,6 +4012,7 @@ static void ram_dirty_bitmap_reload_notify(MigrationState *s)
 int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
 {
     int ret = -EINVAL;
+    /* from_dst_file is always valid because we're within rp_thread */
     QEMUFile *file = s->rp_state.from_dst_file;
     unsigned long *le_bitmap, nbits = block->used_length >> TARGET_PAGE_BITS;
     uint64_t local_size = DIV_ROUND_UP(nbits, 8);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 3/5] migration: Introduce migration_ioc_[un]register_yank()
  2021-07-21 19:34 [PATCH v2 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
  2021-07-21 19:34 ` [PATCH v2 1/5] migration: Fix missing join() of rp_thread Peter Xu
  2021-07-21 19:34 ` [PATCH v2 2/5] migration: Make from_dst_file accesses thread-safe Peter Xu
@ 2021-07-21 19:34 ` Peter Xu
  2021-07-21 19:34 ` [PATCH v2 4/5] migration: Teach QEMUFile to be QIOChannel-aware Peter Xu
  2021-07-21 19:34 ` [PATCH v2 5/5] migration: Move the yank unregister of channel_close out Peter Xu
  4 siblings, 0 replies; 14+ messages in thread
From: Peter Xu @ 2021-07-21 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos

There're plenty of places in migration/* that checks against either socket or
tls typed ioc for yank operations.  Provide two helpers to hide all these
information.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/channel.c           | 15 ++-------------
 migration/multifd.c           |  8 ++------
 migration/qemu-file-channel.c |  8 ++------
 migration/yank_functions.c    | 28 ++++++++++++++++++++++++++++
 migration/yank_functions.h    |  2 ++
 5 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/migration/channel.c b/migration/channel.c
index 01275a9162..c4fc000a1a 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -44,13 +44,7 @@ void migration_channel_process_incoming(QIOChannel *ioc)
                              TYPE_QIO_CHANNEL_TLS)) {
         migration_tls_channel_process_incoming(s, ioc, &local_err);
     } else {
-        if (object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_SOCKET) ||
-            object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_TLS)) {
-            yank_register_function(MIGRATION_YANK_INSTANCE,
-                                   migration_yank_iochannel,
-                                   QIO_CHANNEL(ioc));
-        }
-
+        migration_ioc_register_yank(ioc);
         migration_ioc_process_incoming(ioc, &local_err);
     }
 
@@ -94,12 +88,7 @@ void migration_channel_connect(MigrationState *s,
         } else {
             QEMUFile *f = qemu_fopen_channel_output(ioc);
 
-            if (object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_SOCKET) ||
-                object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_TLS)) {
-                yank_register_function(MIGRATION_YANK_INSTANCE,
-                                       migration_yank_iochannel,
-                                       QIO_CHANNEL(ioc));
-            }
+            migration_ioc_register_yank(ioc);
 
             qemu_mutex_lock(&s->qemu_file_lock);
             s->to_dst_file = f;
diff --git a/migration/multifd.c b/migration/multifd.c
index ab41590e71..377da78f5b 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -987,12 +987,8 @@ int multifd_load_cleanup(Error **errp)
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
-        if ((object_dynamic_cast(OBJECT(p->c), TYPE_QIO_CHANNEL_SOCKET) ||
-             object_dynamic_cast(OBJECT(p->c), TYPE_QIO_CHANNEL_TLS))
-            && OBJECT(p->c)->ref == 1) {
-            yank_unregister_function(MIGRATION_YANK_INSTANCE,
-                                     migration_yank_iochannel,
-                                     QIO_CHANNEL(p->c));
+        if (OBJECT(p->c)->ref == 1) {
+            migration_ioc_unregister_yank(p->c);
         }
 
         object_unref(OBJECT(p->c));
diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
index fad340ea7a..867a5ed0c3 100644
--- a/migration/qemu-file-channel.c
+++ b/migration/qemu-file-channel.c
@@ -107,12 +107,8 @@ static int channel_close(void *opaque, Error **errp)
     int ret;
     QIOChannel *ioc = QIO_CHANNEL(opaque);
     ret = qio_channel_close(ioc, errp);
-    if ((object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_SOCKET) ||
-         object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_TLS))
-        && OBJECT(ioc)->ref == 1) {
-        yank_unregister_function(MIGRATION_YANK_INSTANCE,
-                                 migration_yank_iochannel,
-                                 QIO_CHANNEL(ioc));
+    if (OBJECT(ioc)->ref == 1) {
+        migration_ioc_unregister_yank(ioc);
     }
     object_unref(OBJECT(ioc));
     return ret;
diff --git a/migration/yank_functions.c b/migration/yank_functions.c
index 96c90e17dc..23697173ae 100644
--- a/migration/yank_functions.c
+++ b/migration/yank_functions.c
@@ -11,6 +11,9 @@
 #include "qapi/error.h"
 #include "io/channel.h"
 #include "yank_functions.h"
+#include "qemu/yank.h"
+#include "io/channel-socket.h"
+#include "io/channel-tls.h"
 
 void migration_yank_iochannel(void *opaque)
 {
@@ -18,3 +21,28 @@ void migration_yank_iochannel(void *opaque)
 
     qio_channel_shutdown(ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
 }
+
+/* Return whether yank is supported on this ioc */
+static bool migration_ioc_yank_supported(QIOChannel *ioc)
+{
+    return object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_SOCKET) ||
+        object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_TLS);
+}
+
+void migration_ioc_register_yank(QIOChannel *ioc)
+{
+    if (migration_ioc_yank_supported(ioc)) {
+        yank_register_function(MIGRATION_YANK_INSTANCE,
+                               migration_yank_iochannel,
+                               QIO_CHANNEL(ioc));
+    }
+}
+
+void migration_ioc_unregister_yank(QIOChannel *ioc)
+{
+    if (migration_ioc_yank_supported(ioc)) {
+        yank_unregister_function(MIGRATION_YANK_INSTANCE,
+                                 migration_yank_iochannel,
+                                 QIO_CHANNEL(ioc));
+    }
+}
diff --git a/migration/yank_functions.h b/migration/yank_functions.h
index 055ea22523..74c7f18c91 100644
--- a/migration/yank_functions.h
+++ b/migration/yank_functions.h
@@ -15,3 +15,5 @@
  * @opaque: QIOChannel to shutdown
  */
 void migration_yank_iochannel(void *opaque);
+void migration_ioc_register_yank(QIOChannel *ioc);
+void migration_ioc_unregister_yank(QIOChannel *ioc);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 4/5] migration: Teach QEMUFile to be QIOChannel-aware
  2021-07-21 19:34 [PATCH v2 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
                   ` (2 preceding siblings ...)
  2021-07-21 19:34 ` [PATCH v2 3/5] migration: Introduce migration_ioc_[un]register_yank() Peter Xu
@ 2021-07-21 19:34 ` Peter Xu
  2021-07-22 15:21   ` Dr. David Alan Gilbert
  2021-07-21 19:34 ` [PATCH v2 5/5] migration: Move the yank unregister of channel_close out Peter Xu
  4 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2021-07-21 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos

migration uses QIOChannel typed qemufiles.  In follow up patches, we'll need
the capability to identify this fact, so that we can get the backing QIOChannel
from a QEMUFile.

We can also define types for QEMUFile but so far since we only need to be able
to identify QIOChannel, introduce a boolean which is simpler.

Introduce another helper qemu_file_get_ioc() to return the ioc backend of a
qemufile if has_ioc is set.

No functional change.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/qemu-file-channel.c |  4 ++--
 migration/qemu-file.c         | 17 ++++++++++++++++-
 migration/qemu-file.h         |  4 +++-
 migration/ram.c               |  2 +-
 migration/savevm.c            |  4 ++--
 5 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
index 867a5ed0c3..2f8b1fcd46 100644
--- a/migration/qemu-file-channel.c
+++ b/migration/qemu-file-channel.c
@@ -187,11 +187,11 @@ static const QEMUFileOps channel_output_ops = {
 QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc)
 {
     object_ref(OBJECT(ioc));
-    return qemu_fopen_ops(ioc, &channel_input_ops);
+    return qemu_fopen_ops(ioc, &channel_input_ops, true);
 }
 
 QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
 {
     object_ref(OBJECT(ioc));
-    return qemu_fopen_ops(ioc, &channel_output_ops);
+    return qemu_fopen_ops(ioc, &channel_output_ops, true);
 }
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 1eacf9e831..6338d8e2ff 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -55,6 +55,8 @@ struct QEMUFile {
     Error *last_error_obj;
     /* has the file has been shutdown */
     bool shutdown;
+    /* Whether opaque points to a QIOChannel */
+    bool has_ioc;
 };
 
 /*
@@ -101,7 +103,7 @@ bool qemu_file_mode_is_not_valid(const char *mode)
     return false;
 }
 
-QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops)
+QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops, bool has_ioc)
 {
     QEMUFile *f;
 
@@ -109,6 +111,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops)
 
     f->opaque = opaque;
     f->ops = ops;
+    f->has_ioc = has_ioc;
     return f;
 }
 
@@ -851,3 +854,15 @@ void qemu_file_set_blocking(QEMUFile *f, bool block)
         f->ops->set_blocking(f->opaque, block, NULL);
     }
 }
+
+/*
+ * Return the ioc object if it's a migration channel.  Note: it can return NULL
+ * for callers passing in a non-migration qemufile.  E.g. see qemu_fopen_bdrv()
+ * and its usage in e.g. load_snapshot().  So we need to check against NULL
+ * before using it.  If without the check, migration_incoming_state_destroy()
+ * could fail for load_snapshot().
+ */
+QIOChannel *qemu_file_get_ioc(QEMUFile *file)
+{
+    return file->has_ioc ? QIO_CHANNEL(file->opaque) : NULL;
+}
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index a9b6d6ccb7..3f36d4dc8c 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -27,6 +27,7 @@
 
 #include <zlib.h>
 #include "exec/cpu-common.h"
+#include "io/channel.h"
 
 /* Read a chunk of data from a file at the given position.  The pos argument
  * can be ignored if the file is only be used for streaming.  The number of
@@ -119,7 +120,7 @@ typedef struct QEMUFileHooks {
     QEMURamSaveFunc *save_page;
 } QEMUFileHooks;
 
-QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
+QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops, bool has_ioc);
 void qemu_file_set_hooks(QEMUFile *f, const QEMUFileHooks *hooks);
 int qemu_get_fd(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
@@ -179,5 +180,6 @@ void ram_control_load_hook(QEMUFile *f, uint64_t flags, void *data);
 size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
                              ram_addr_t offset, size_t size,
                              uint64_t *bytes_sent);
+QIOChannel *qemu_file_get_ioc(QEMUFile *file);
 
 #endif
diff --git a/migration/ram.c b/migration/ram.c
index f728f5072f..08b3cb7a4a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -550,7 +550,7 @@ static int compress_threads_save_setup(void)
         /* comp_param[i].file is just used as a dummy buffer to save data,
          * set its ops to empty.
          */
-        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops);
+        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops, false);
         comp_param[i].done = true;
         comp_param[i].quit = false;
         qemu_mutex_init(&comp_param[i].mutex);
diff --git a/migration/savevm.c b/migration/savevm.c
index 72848b946c..96b5e5d639 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -168,9 +168,9 @@ static const QEMUFileOps bdrv_write_ops = {
 static QEMUFile *qemu_fopen_bdrv(BlockDriverState *bs, int is_writable)
 {
     if (is_writable) {
-        return qemu_fopen_ops(bs, &bdrv_write_ops);
+        return qemu_fopen_ops(bs, &bdrv_write_ops, false);
     }
-    return qemu_fopen_ops(bs, &bdrv_read_ops);
+    return qemu_fopen_ops(bs, &bdrv_read_ops, false);
 }
 
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 5/5] migration: Move the yank unregister of channel_close out
  2021-07-21 19:34 [PATCH v2 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
                   ` (3 preceding siblings ...)
  2021-07-21 19:34 ` [PATCH v2 4/5] migration: Teach QEMUFile to be QIOChannel-aware Peter Xu
@ 2021-07-21 19:34 ` Peter Xu
  2021-07-22 15:27   ` Dr. David Alan Gilbert
  4 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2021-07-21 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos

It's efficient, but hackish to call yank unregister calls in channel_close(),
especially it'll be hard to debug when qemu crashed with some yank function
leaked.

Remove that hack, but instead explicitly unregister yank functions at the
places where needed, they are:

  (on src)
  - migrate_fd_cleanup
  - postcopy_pause

  (on dst)
  - migration_incoming_state_destroy
  - postcopy_pause_incoming

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c         |  5 +++++
 migration/qemu-file-channel.c |  3 ---
 migration/savevm.c            |  7 +++++++
 migration/yank_functions.c    | 14 ++++++++++++++
 migration/yank_functions.h    |  1 +
 5 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index fa70400f98..bfeb65b8f7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -59,6 +59,7 @@
 #include "multifd.h"
 #include "qemu/yank.h"
 #include "sysemu/cpus.h"
+#include "yank_functions.h"
 
 #define MAX_THROTTLE  (128 << 20)      /* Migration transfer speed throttling */
 
@@ -273,6 +274,7 @@ void migration_incoming_state_destroy(void)
     }
 
     if (mis->from_src_file) {
+        migration_ioc_unregister_yank_from_file(mis->from_src_file);
         qemu_fclose(mis->from_src_file);
         mis->from_src_file = NULL;
     }
@@ -1811,6 +1813,7 @@ static void migrate_fd_cleanup(MigrationState *s)
          * Close the file handle without the lock to make sure the
          * critical section won't block for long.
          */
+        migration_ioc_unregister_yank_from_file(tmp);
         qemu_fclose(tmp);
     }
 
@@ -3352,6 +3355,8 @@ static MigThrError postcopy_pause(MigrationState *s)
 
         /* Current channel is possibly broken. Release it. */
         assert(s->to_dst_file);
+        /* Unregister yank for current channel */
+        migration_ioc_unregister_yank_from_file(s->to_dst_file);
         qemu_mutex_lock(&s->qemu_file_lock);
         file = s->to_dst_file;
         s->to_dst_file = NULL;
diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
index 2f8b1fcd46..bb5a5752df 100644
--- a/migration/qemu-file-channel.c
+++ b/migration/qemu-file-channel.c
@@ -107,9 +107,6 @@ static int channel_close(void *opaque, Error **errp)
     int ret;
     QIOChannel *ioc = QIO_CHANNEL(opaque);
     ret = qio_channel_close(ioc, errp);
-    if (OBJECT(ioc)->ref == 1) {
-        migration_ioc_unregister_yank(ioc);
-    }
     object_unref(OBJECT(ioc));
     return ret;
 }
diff --git a/migration/savevm.c b/migration/savevm.c
index 96b5e5d639..7b7b64bd13 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -65,6 +65,7 @@
 #include "qemu/bitmap.h"
 #include "net/announce.h"
 #include "qemu/yank.h"
+#include "yank_functions.h"
 
 const unsigned int postcopy_ram_discard_version;
 
@@ -2568,6 +2569,12 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
     /* Clear the triggered bit to allow one recovery */
     mis->postcopy_recover_triggered = false;
 
+    /*
+     * Unregister yank with either from/to src would work, since ioc behind it
+     * is the same
+     */
+    migration_ioc_unregister_yank_from_file(mis->from_src_file);
+
     assert(mis->from_src_file);
     qemu_file_shutdown(mis->from_src_file);
     qemu_fclose(mis->from_src_file);
diff --git a/migration/yank_functions.c b/migration/yank_functions.c
index 23697173ae..8c08aef14a 100644
--- a/migration/yank_functions.c
+++ b/migration/yank_functions.c
@@ -14,6 +14,7 @@
 #include "qemu/yank.h"
 #include "io/channel-socket.h"
 #include "io/channel-tls.h"
+#include "qemu-file.h"
 
 void migration_yank_iochannel(void *opaque)
 {
@@ -46,3 +47,16 @@ void migration_ioc_unregister_yank(QIOChannel *ioc)
                                  QIO_CHANNEL(ioc));
     }
 }
+
+void migration_ioc_unregister_yank_from_file(QEMUFile *file)
+{
+    QIOChannel *ioc = qemu_file_get_ioc(file);
+
+    if (ioc) {
+        /*
+         * For migration qemufiles, we'll always reach here.  Though we'll skip
+         * calls from e.g. savevm/loadvm as they don't use yank.
+         */
+        migration_ioc_unregister_yank(ioc);
+    }
+}
diff --git a/migration/yank_functions.h b/migration/yank_functions.h
index 74c7f18c91..a7577955ed 100644
--- a/migration/yank_functions.h
+++ b/migration/yank_functions.h
@@ -17,3 +17,4 @@
 void migration_yank_iochannel(void *opaque);
 void migration_ioc_register_yank(QIOChannel *ioc);
 void migration_ioc_unregister_yank(QIOChannel *ioc);
+void migration_ioc_unregister_yank_from_file(QEMUFile *file);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/5] migration: Make from_dst_file accesses thread-safe
  2021-07-21 19:34 ` [PATCH v2 2/5] migration: Make from_dst_file accesses thread-safe Peter Xu
@ 2021-07-21 21:15   ` Eric Blake
  2021-07-22 14:44     ` Peter Xu
  2021-07-22 15:19   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Blake @ 2021-07-21 21:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	qemu-devel, Dr . David Alan Gilbert, Leonardo Bras Soares Passos

On Wed, Jul 21, 2021 at 03:34:06PM -0400, Peter Xu wrote:
> Accessing from_dst_file is potentially racy in current code base like below:
> 
>   if (s->from_dst_file)
>     do_something(s->from_dst_file);
> 
> Because from_dst_file can be reset right after the check in another
> thread (rp_thread).  One example is migrate_fd_cancel().
> 
> Use the same qemu_file_lock to protect it too, just like to_dst_file.
> 
> When it's safe to access without lock, comment it.
> 
> There's one special reference in migration_thread() that can be replaced by
> the newly introduced rp_thread_created flag.
> 
> Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c | 32 +++++++++++++++++++++++++-------
>  migration/migration.h |  8 +++++---
>  migration/ram.c       |  1 +
>  3 files changed, 31 insertions(+), 10 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 21b94f75a3..fa70400f98 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1879,10 +1879,12 @@ static void migrate_fd_cancel(MigrationState *s)
>      QEMUFile *f = migrate_get_current()->to_dst_file;
>      trace_migrate_fd_cancel();
>  
> +    qemu_mutex_lock(&s->qemu_file_lock);
>      if (s->rp_state.from_dst_file) {
>          /* shutdown the rp socket, so causing the rp thread to shutdown */
>          qemu_file_shutdown(s->rp_state.from_dst_file);
>      }
> +    qemu_mutex_unlock(&s->qemu_file_lock);

Worth using WITH_QEMU_LOCK_GUARD?

> @@ -2827,11 +2845,13 @@ out:
>               * Maybe there is something we can do: it looks like a
>               * network down issue, and we pause for a recovery.
>               */
> -            qemu_fclose(rp);
> -            ms->rp_state.from_dst_file = NULL;
> +            migration_release_from_dst_file(ms);
>              rp = NULL;
>              if (postcopy_pause_return_path_thread(ms)) {
> -                /* Reload rp, reset the rest */
> +                /*
> +                 * Reload rp, reset the rest.  Referencing it is save since

s/save/safe/

> +                 * it's reset only by us above, or when migration completes
> +                 */
>                  rp = ms->rp_state.from_dst_file;
>                  ms->rp_state.error = false;
>                  goto retry;

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/5] migration: Make from_dst_file accesses thread-safe
  2021-07-21 21:15   ` Eric Blake
@ 2021-07-22 14:44     ` Peter Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Xu @ 2021-07-22 14:44 UTC (permalink / raw)
  To: Eric Blake
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	qemu-devel, Dr . David Alan Gilbert, Leonardo Bras Soares Passos

On Wed, Jul 21, 2021 at 04:15:27PM -0500, Eric Blake wrote:
> On Wed, Jul 21, 2021 at 03:34:06PM -0400, Peter Xu wrote:
> > Accessing from_dst_file is potentially racy in current code base like below:
> > 
> >   if (s->from_dst_file)
> >     do_something(s->from_dst_file);
> > 
> > Because from_dst_file can be reset right after the check in another
> > thread (rp_thread).  One example is migrate_fd_cancel().
> > 
> > Use the same qemu_file_lock to protect it too, just like to_dst_file.
> > 
> > When it's safe to access without lock, comment it.
> > 
> > There's one special reference in migration_thread() that can be replaced by
> > the newly introduced rp_thread_created flag.
> > 
> > Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  migration/migration.c | 32 +++++++++++++++++++++++++-------
> >  migration/migration.h |  8 +++++---
> >  migration/ram.c       |  1 +
> >  3 files changed, 31 insertions(+), 10 deletions(-)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 21b94f75a3..fa70400f98 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1879,10 +1879,12 @@ static void migrate_fd_cancel(MigrationState *s)
> >      QEMUFile *f = migrate_get_current()->to_dst_file;
> >      trace_migrate_fd_cancel();
> >  
> > +    qemu_mutex_lock(&s->qemu_file_lock);
> >      if (s->rp_state.from_dst_file) {
> >          /* shutdown the rp socket, so causing the rp thread to shutdown */
> >          qemu_file_shutdown(s->rp_state.from_dst_file);
> >      }
> > +    qemu_mutex_unlock(&s->qemu_file_lock);
> 
> Worth using WITH_QEMU_LOCK_GUARD?

Sure.

> 
> > @@ -2827,11 +2845,13 @@ out:
> >               * Maybe there is something we can do: it looks like a
> >               * network down issue, and we pause for a recovery.
> >               */
> > -            qemu_fclose(rp);
> > -            ms->rp_state.from_dst_file = NULL;
> > +            migration_release_from_dst_file(ms);
> >              rp = NULL;
> >              if (postcopy_pause_return_path_thread(ms)) {
> > -                /* Reload rp, reset the rest */
> > +                /*
> > +                 * Reload rp, reset the rest.  Referencing it is save since
> 
> s/save/safe/

Will fix.

I'll wait for some more comments before I repost.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/5] migration: Make from_dst_file accesses thread-safe
  2021-07-21 19:34 ` [PATCH v2 2/5] migration: Make from_dst_file accesses thread-safe Peter Xu
  2021-07-21 21:15   ` Eric Blake
@ 2021-07-22 15:19   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 14+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-22 15:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> Accessing from_dst_file is potentially racy in current code base like below:
> 
>   if (s->from_dst_file)
>     do_something(s->from_dst_file);
> 
> Because from_dst_file can be reset right after the check in another
> thread (rp_thread).  One example is migrate_fd_cancel().
> 
> Use the same qemu_file_lock to protect it too, just like to_dst_file.
> 
> When it's safe to access without lock, comment it.
> 
> There's one special reference in migration_thread() that can be replaced by
> the newly introduced rp_thread_created flag.
> 
> Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Yep, with Eric's comments

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/migration.c | 32 +++++++++++++++++++++++++-------
>  migration/migration.h |  8 +++++---
>  migration/ram.c       |  1 +
>  3 files changed, 31 insertions(+), 10 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 21b94f75a3..fa70400f98 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1879,10 +1879,12 @@ static void migrate_fd_cancel(MigrationState *s)
>      QEMUFile *f = migrate_get_current()->to_dst_file;
>      trace_migrate_fd_cancel();
>  
> +    qemu_mutex_lock(&s->qemu_file_lock);
>      if (s->rp_state.from_dst_file) {
>          /* shutdown the rp socket, so causing the rp thread to shutdown */
>          qemu_file_shutdown(s->rp_state.from_dst_file);
>      }
> +    qemu_mutex_unlock(&s->qemu_file_lock);
>  
>      do {
>          old_state = s->state;
> @@ -2686,6 +2688,22 @@ static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
>      return 0;
>  }
>  
> +/* Release ms->rp_state.from_dst_file in a safe way */
> +static void migration_release_from_dst_file(MigrationState *ms)
> +{
> +    QEMUFile *file = ms->rp_state.from_dst_file;
> +
> +    qemu_mutex_lock(&ms->qemu_file_lock);
> +    /*
> +     * Reset the from_dst_file pointer first before releasing it, as we can't
> +     * block within lock section
> +     */
> +    ms->rp_state.from_dst_file = NULL;
> +    qemu_mutex_unlock(&ms->qemu_file_lock);
> +
> +    qemu_fclose(file);
> +}
> +
>  /*
>   * Handles messages sent on the return path towards the source VM
>   *
> @@ -2827,11 +2845,13 @@ out:
>               * Maybe there is something we can do: it looks like a
>               * network down issue, and we pause for a recovery.
>               */
> -            qemu_fclose(rp);
> -            ms->rp_state.from_dst_file = NULL;
> +            migration_release_from_dst_file(ms);
>              rp = NULL;
>              if (postcopy_pause_return_path_thread(ms)) {
> -                /* Reload rp, reset the rest */
> +                /*
> +                 * Reload rp, reset the rest.  Referencing it is save since
> +                 * it's reset only by us above, or when migration completes
> +                 */
>                  rp = ms->rp_state.from_dst_file;
>                  ms->rp_state.error = false;
>                  goto retry;
> @@ -2843,8 +2863,7 @@ out:
>      }
>  
>      trace_source_return_path_thread_end();
> -    ms->rp_state.from_dst_file = NULL;
> -    qemu_fclose(rp);
> +    migration_release_from_dst_file(ms);
>      rcu_unregister_thread();
>      return NULL;
>  }
> @@ -2852,7 +2871,6 @@ out:
>  static int open_return_path_on_source(MigrationState *ms,
>                                        bool create_thread)
>  {
> -
>      ms->rp_state.from_dst_file = qemu_file_get_return_path(ms->to_dst_file);
>      if (!ms->rp_state.from_dst_file) {
>          return -1;
> @@ -3746,7 +3764,7 @@ static void *migration_thread(void *opaque)
>       * If we opened the return path, we need to make sure dst has it
>       * opened as well.
>       */
> -    if (s->rp_state.from_dst_file) {
> +    if (s->rp_state.rp_thread_created) {
>          /* Now tell the dest that it should open its end so it can reply */
>          qemu_savevm_send_open_return_path(s->to_dst_file);
>  
> diff --git a/migration/migration.h b/migration/migration.h
> index c302879fad..7a5aa8c2fd 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -154,12 +154,13 @@ struct MigrationState {
>      QemuThread thread;
>      QEMUBH *vm_start_bh;
>      QEMUBH *cleanup_bh;
> +    /* Protected by qemu_file_lock */
>      QEMUFile *to_dst_file;
>      QIOChannelBuffer *bioc;
>      /*
> -     * Protects to_dst_file pointer.  We need to make sure we won't
> -     * yield or hang during the critical section, since this lock will
> -     * be used in OOB command handler.
> +     * Protects to_dst_file/from_dst_file pointers.  We need to make sure we
> +     * won't yield or hang during the critical section, since this lock will be
> +     * used in OOB command handler.
>       */
>      QemuMutex qemu_file_lock;
>  
> @@ -192,6 +193,7 @@ struct MigrationState {
>  
>      /* State related to return path */
>      struct {
> +        /* Protected by qemu_file_lock */
>          QEMUFile     *from_dst_file;
>          QemuThread    rp_thread;
>          bool          error;
> diff --git a/migration/ram.c b/migration/ram.c
> index b5fc454b2f..f728f5072f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -4012,6 +4012,7 @@ static void ram_dirty_bitmap_reload_notify(MigrationState *s)
>  int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
>  {
>      int ret = -EINVAL;
> +    /* from_dst_file is always valid because we're within rp_thread */
>      QEMUFile *file = s->rp_state.from_dst_file;
>      unsigned long *le_bitmap, nbits = block->used_length >> TARGET_PAGE_BITS;
>      uint64_t local_size = DIV_ROUND_UP(nbits, 8);
> -- 
> 2.31.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/5] migration: Teach QEMUFile to be QIOChannel-aware
  2021-07-21 19:34 ` [PATCH v2 4/5] migration: Teach QEMUFile to be QIOChannel-aware Peter Xu
@ 2021-07-22 15:21   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 14+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-22 15:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> migration uses QIOChannel typed qemufiles.  In follow up patches, we'll need
> the capability to identify this fact, so that we can get the backing QIOChannel
> from a QEMUFile.
> 
> We can also define types for QEMUFile but so far since we only need to be able
> to identify QIOChannel, introduce a boolean which is simpler.
> 
> Introduce another helper qemu_file_get_ioc() to return the ioc backend of a
> qemufile if has_ioc is set.
> 
> No functional change.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Yep, one day we'll sort out the block case, but until then

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/qemu-file-channel.c |  4 ++--
>  migration/qemu-file.c         | 17 ++++++++++++++++-
>  migration/qemu-file.h         |  4 +++-
>  migration/ram.c               |  2 +-
>  migration/savevm.c            |  4 ++--
>  5 files changed, 24 insertions(+), 7 deletions(-)
> 
> diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
> index 867a5ed0c3..2f8b1fcd46 100644
> --- a/migration/qemu-file-channel.c
> +++ b/migration/qemu-file-channel.c
> @@ -187,11 +187,11 @@ static const QEMUFileOps channel_output_ops = {
>  QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc)
>  {
>      object_ref(OBJECT(ioc));
> -    return qemu_fopen_ops(ioc, &channel_input_ops);
> +    return qemu_fopen_ops(ioc, &channel_input_ops, true);
>  }
>  
>  QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
>  {
>      object_ref(OBJECT(ioc));
> -    return qemu_fopen_ops(ioc, &channel_output_ops);
> +    return qemu_fopen_ops(ioc, &channel_output_ops, true);
>  }
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index 1eacf9e831..6338d8e2ff 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -55,6 +55,8 @@ struct QEMUFile {
>      Error *last_error_obj;
>      /* has the file has been shutdown */
>      bool shutdown;
> +    /* Whether opaque points to a QIOChannel */
> +    bool has_ioc;
>  };
>  
>  /*
> @@ -101,7 +103,7 @@ bool qemu_file_mode_is_not_valid(const char *mode)
>      return false;
>  }
>  
> -QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops)
> +QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops, bool has_ioc)
>  {
>      QEMUFile *f;
>  
> @@ -109,6 +111,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops)
>  
>      f->opaque = opaque;
>      f->ops = ops;
> +    f->has_ioc = has_ioc;
>      return f;
>  }
>  
> @@ -851,3 +854,15 @@ void qemu_file_set_blocking(QEMUFile *f, bool block)
>          f->ops->set_blocking(f->opaque, block, NULL);
>      }
>  }
> +
> +/*
> + * Return the ioc object if it's a migration channel.  Note: it can return NULL
> + * for callers passing in a non-migration qemufile.  E.g. see qemu_fopen_bdrv()
> + * and its usage in e.g. load_snapshot().  So we need to check against NULL
> + * before using it.  If without the check, migration_incoming_state_destroy()
> + * could fail for load_snapshot().
> + */
> +QIOChannel *qemu_file_get_ioc(QEMUFile *file)
> +{
> +    return file->has_ioc ? QIO_CHANNEL(file->opaque) : NULL;
> +}
> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
> index a9b6d6ccb7..3f36d4dc8c 100644
> --- a/migration/qemu-file.h
> +++ b/migration/qemu-file.h
> @@ -27,6 +27,7 @@
>  
>  #include <zlib.h>
>  #include "exec/cpu-common.h"
> +#include "io/channel.h"
>  
>  /* Read a chunk of data from a file at the given position.  The pos argument
>   * can be ignored if the file is only be used for streaming.  The number of
> @@ -119,7 +120,7 @@ typedef struct QEMUFileHooks {
>      QEMURamSaveFunc *save_page;
>  } QEMUFileHooks;
>  
> -QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
> +QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops, bool has_ioc);
>  void qemu_file_set_hooks(QEMUFile *f, const QEMUFileHooks *hooks);
>  int qemu_get_fd(QEMUFile *f);
>  int qemu_fclose(QEMUFile *f);
> @@ -179,5 +180,6 @@ void ram_control_load_hook(QEMUFile *f, uint64_t flags, void *data);
>  size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
>                               ram_addr_t offset, size_t size,
>                               uint64_t *bytes_sent);
> +QIOChannel *qemu_file_get_ioc(QEMUFile *file);
>  
>  #endif
> diff --git a/migration/ram.c b/migration/ram.c
> index f728f5072f..08b3cb7a4a 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -550,7 +550,7 @@ static int compress_threads_save_setup(void)
>          /* comp_param[i].file is just used as a dummy buffer to save data,
>           * set its ops to empty.
>           */
> -        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops);
> +        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops, false);
>          comp_param[i].done = true;
>          comp_param[i].quit = false;
>          qemu_mutex_init(&comp_param[i].mutex);
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 72848b946c..96b5e5d639 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -168,9 +168,9 @@ static const QEMUFileOps bdrv_write_ops = {
>  static QEMUFile *qemu_fopen_bdrv(BlockDriverState *bs, int is_writable)
>  {
>      if (is_writable) {
> -        return qemu_fopen_ops(bs, &bdrv_write_ops);
> +        return qemu_fopen_ops(bs, &bdrv_write_ops, false);
>      }
> -    return qemu_fopen_ops(bs, &bdrv_read_ops);
> +    return qemu_fopen_ops(bs, &bdrv_read_ops, false);
>  }
>  
>  
> -- 
> 2.31.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 5/5] migration: Move the yank unregister of channel_close out
  2021-07-21 19:34 ` [PATCH v2 5/5] migration: Move the yank unregister of channel_close out Peter Xu
@ 2021-07-22 15:27   ` Dr. David Alan Gilbert
  2021-07-22 16:19     ` Peter Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-22 15:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> It's efficient, but hackish to call yank unregister calls in channel_close(),
> especially it'll be hard to debug when qemu crashed with some yank function
> leaked.
> 
> Remove that hack, but instead explicitly unregister yank functions at the
> places where needed, they are:
> 
>   (on src)
>   - migrate_fd_cleanup
>   - postcopy_pause
> 
>   (on dst)
>   - migration_incoming_state_destroy
>   - postcopy_pause_incoming
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c         |  5 +++++
>  migration/qemu-file-channel.c |  3 ---
>  migration/savevm.c            |  7 +++++++
>  migration/yank_functions.c    | 14 ++++++++++++++
>  migration/yank_functions.h    |  1 +
>  5 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index fa70400f98..bfeb65b8f7 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -59,6 +59,7 @@
>  #include "multifd.h"
>  #include "qemu/yank.h"
>  #include "sysemu/cpus.h"
> +#include "yank_functions.h"
>  
>  #define MAX_THROTTLE  (128 << 20)      /* Migration transfer speed throttling */
>  
> @@ -273,6 +274,7 @@ void migration_incoming_state_destroy(void)
>      }
>  
>      if (mis->from_src_file) {
> +        migration_ioc_unregister_yank_from_file(mis->from_src_file);
>          qemu_fclose(mis->from_src_file);
>          mis->from_src_file = NULL;
>      }
> @@ -1811,6 +1813,7 @@ static void migrate_fd_cleanup(MigrationState *s)
>           * Close the file handle without the lock to make sure the
>           * critical section won't block for long.
>           */
> +        migration_ioc_unregister_yank_from_file(tmp);
>          qemu_fclose(tmp);
>      }
>  
> @@ -3352,6 +3355,8 @@ static MigThrError postcopy_pause(MigrationState *s)
>  
>          /* Current channel is possibly broken. Release it. */
>          assert(s->to_dst_file);
> +        /* Unregister yank for current channel */
> +        migration_ioc_unregister_yank_from_file(s->to_dst_file);

Should this go inside the lock?

Dave

>          qemu_mutex_lock(&s->qemu_file_lock);
>          file = s->to_dst_file;
>          s->to_dst_file = NULL;
> diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
> index 2f8b1fcd46..bb5a5752df 100644
> --- a/migration/qemu-file-channel.c
> +++ b/migration/qemu-file-channel.c
> @@ -107,9 +107,6 @@ static int channel_close(void *opaque, Error **errp)
>      int ret;
>      QIOChannel *ioc = QIO_CHANNEL(opaque);
>      ret = qio_channel_close(ioc, errp);
> -    if (OBJECT(ioc)->ref == 1) {
> -        migration_ioc_unregister_yank(ioc);
> -    }
>      object_unref(OBJECT(ioc));
>      return ret;
>  }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 96b5e5d639..7b7b64bd13 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -65,6 +65,7 @@
>  #include "qemu/bitmap.h"
>  #include "net/announce.h"
>  #include "qemu/yank.h"
> +#include "yank_functions.h"
>  
>  const unsigned int postcopy_ram_discard_version;
>  
> @@ -2568,6 +2569,12 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
>      /* Clear the triggered bit to allow one recovery */
>      mis->postcopy_recover_triggered = false;
>  
> +    /*
> +     * Unregister yank with either from/to src would work, since ioc behind it
> +     * is the same
> +     */
> +    migration_ioc_unregister_yank_from_file(mis->from_src_file);
> +
>      assert(mis->from_src_file);
>      qemu_file_shutdown(mis->from_src_file);
>      qemu_fclose(mis->from_src_file);
> diff --git a/migration/yank_functions.c b/migration/yank_functions.c
> index 23697173ae..8c08aef14a 100644
> --- a/migration/yank_functions.c
> +++ b/migration/yank_functions.c
> @@ -14,6 +14,7 @@
>  #include "qemu/yank.h"
>  #include "io/channel-socket.h"
>  #include "io/channel-tls.h"
> +#include "qemu-file.h"
>  
>  void migration_yank_iochannel(void *opaque)
>  {
> @@ -46,3 +47,16 @@ void migration_ioc_unregister_yank(QIOChannel *ioc)
>                                   QIO_CHANNEL(ioc));
>      }
>  }
> +
> +void migration_ioc_unregister_yank_from_file(QEMUFile *file)
> +{
> +    QIOChannel *ioc = qemu_file_get_ioc(file);
> +
> +    if (ioc) {
> +        /*
> +         * For migration qemufiles, we'll always reach here.  Though we'll skip
> +         * calls from e.g. savevm/loadvm as they don't use yank.
> +         */
> +        migration_ioc_unregister_yank(ioc);
> +    }
> +}
> diff --git a/migration/yank_functions.h b/migration/yank_functions.h
> index 74c7f18c91..a7577955ed 100644
> --- a/migration/yank_functions.h
> +++ b/migration/yank_functions.h
> @@ -17,3 +17,4 @@
>  void migration_yank_iochannel(void *opaque);
>  void migration_ioc_register_yank(QIOChannel *ioc);
>  void migration_ioc_unregister_yank(QIOChannel *ioc);
> +void migration_ioc_unregister_yank_from_file(QEMUFile *file);
> -- 
> 2.31.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 5/5] migration: Move the yank unregister of channel_close out
  2021-07-22 15:27   ` Dr. David Alan Gilbert
@ 2021-07-22 16:19     ` Peter Xu
  2021-07-22 17:09       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2021-07-22 16:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	qemu-devel, Leonardo Bras Soares Passos

On Thu, Jul 22, 2021 at 04:27:35PM +0100, Dr. David Alan Gilbert wrote:
> > @@ -3352,6 +3355,8 @@ static MigThrError postcopy_pause(MigrationState *s)
> >  
> >          /* Current channel is possibly broken. Release it. */
> >          assert(s->to_dst_file);
> > +        /* Unregister yank for current channel */
> > +        migration_ioc_unregister_yank_from_file(s->to_dst_file);
> 
> Should this go inside the lock?

Shouldn't need to; as we've got the assert() right above so otherwise we'll
abrt otherwise :)

The mutex lock/unlock right below this one is not protecting us from someone
changing it but really for being able to wait until someone finished using it
then we won't crash someone else.

I think the rational is to_dst_file is managed by migration thread while
from_dst_file is managed by rp_thread.

Maybe I add a comment above?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 5/5] migration: Move the yank unregister of channel_close out
  2021-07-22 16:19     ` Peter Xu
@ 2021-07-22 17:09       ` Dr. David Alan Gilbert
  2021-07-22 17:35         ` Peter Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-22 17:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Jul 22, 2021 at 04:27:35PM +0100, Dr. David Alan Gilbert wrote:
> > > @@ -3352,6 +3355,8 @@ static MigThrError postcopy_pause(MigrationState *s)
> > >  
> > >          /* Current channel is possibly broken. Release it. */
> > >          assert(s->to_dst_file);
> > > +        /* Unregister yank for current channel */
> > > +        migration_ioc_unregister_yank_from_file(s->to_dst_file);
> > 
> > Should this go inside the lock?
> 
> Shouldn't need to; as we've got the assert() right above so otherwise we'll
> abrt otherwise :)

Hmm OK; although with out always having to think about it then you worry
about something getting in after the assert.

> The mutex lock/unlock right below this one is not protecting us from someone
> changing it but really for being able to wait until someone finished using it
> then we won't crash someone else.
> 
> I think the rational is to_dst_file is managed by migration thread while
> from_dst_file is managed by rp_thread.
> 
> Maybe I add a comment above?

OK, I almost pushed this further, but then I started to get worried that
we'd trace off a race with ordering on two locks with yank_lock; so yeh
lets just add a comment.

Dave

> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 5/5] migration: Move the yank unregister of channel_close out
  2021-07-22 17:09       ` Dr. David Alan Gilbert
@ 2021-07-22 17:35         ` Peter Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Xu @ 2021-07-22 17:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	qemu-devel, Leonardo Bras Soares Passos

On Thu, Jul 22, 2021 at 06:09:03PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Thu, Jul 22, 2021 at 04:27:35PM +0100, Dr. David Alan Gilbert wrote:
> > > > @@ -3352,6 +3355,8 @@ static MigThrError postcopy_pause(MigrationState *s)
> > > >  
> > > >          /* Current channel is possibly broken. Release it. */
> > > >          assert(s->to_dst_file);
> > > > +        /* Unregister yank for current channel */
> > > > +        migration_ioc_unregister_yank_from_file(s->to_dst_file);
> > > 
> > > Should this go inside the lock?
> > 
> > Shouldn't need to; as we've got the assert() right above so otherwise we'll
> > abrt otherwise :)
> 
> Hmm OK; although with out always having to think about it then you worry
> about something getting in after the assert.

Right; the point is still that to_dst_file shouldn't be changed by other
thread, or it'll bring some more challenge.

If it can be mnodified when reaching this line, it means it can also reach
earlier at assert(), then we coredump.  So we should guaratee it won't happen,
or we'd better also remove that assertion..

I think the challenge here is, we do have a mutex to protect the files and from
that pov it's the same as other mutex.  However it's different in that this
mutex is also used in the oob handlers so we want to "keep it non-blocking and
as small as possible for the critical sections".

I didn't put it into the mutex protection because the yank code will further go
to take the yank_lock so it'll make things slightly more complicated (then the
file mutex is correlated to yank_lock too!).  I don't think that's a problem
because yank_lock is also "non-blocking" (ah! it can still block, got scheduled
out, but there's no explicit things that will proactively sleep..).  So since I
think it's safe without the lock then I kept it outside of the lock then we
don't need to discuss the safely of having it inside the critical section.

(However then I noticed we still need justification on why it can be moved out..)

> 
> > The mutex lock/unlock right below this one is not protecting us from someone
> > changing it but really for being able to wait until someone finished using it
> > then we won't crash someone else.
> > 
> > I think the rational is to_dst_file is managed by migration thread while
> > from_dst_file is managed by rp_thread.
> > 
> > Maybe I add a comment above?
> 
> OK, I almost pushed this further, but then I started to get worried that
> we'd trace off a race with ordering on two locks with yank_lock; so yeh
> lets just add a comment.

Agreed.  I think ordering is not a huge problem as yank_lock is very limitedly
used in yank and protect yank instance/functions only, so there shouldn't be a
path we take file mutex within it.  Will repost shortly, thanks.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-07-22 17:44 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-21 19:34 [PATCH v2 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
2021-07-21 19:34 ` [PATCH v2 1/5] migration: Fix missing join() of rp_thread Peter Xu
2021-07-21 19:34 ` [PATCH v2 2/5] migration: Make from_dst_file accesses thread-safe Peter Xu
2021-07-21 21:15   ` Eric Blake
2021-07-22 14:44     ` Peter Xu
2021-07-22 15:19   ` Dr. David Alan Gilbert
2021-07-21 19:34 ` [PATCH v2 3/5] migration: Introduce migration_ioc_[un]register_yank() Peter Xu
2021-07-21 19:34 ` [PATCH v2 4/5] migration: Teach QEMUFile to be QIOChannel-aware Peter Xu
2021-07-22 15:21   ` Dr. David Alan Gilbert
2021-07-21 19:34 ` [PATCH v2 5/5] migration: Move the yank unregister of channel_close out Peter Xu
2021-07-22 15:27   ` Dr. David Alan Gilbert
2021-07-22 16:19     ` Peter Xu
2021-07-22 17:09       ` Dr. David Alan Gilbert
2021-07-22 17:35         ` Peter Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.