All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank
@ 2021-07-22 17:58 Peter Xu
  2021-07-22 17:58 ` [PATCH v3 1/5] migration: Fix missing join() of rp_thread Peter Xu
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Peter Xu @ 2021-07-22 17:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos,
	Eric Blake

v3:
- Use WITH_QEMU_LOCK_GUARD() for patch 2 [Eric]
  (potentially I can also replace other existing uses of qemu_file_lock into
   WITH_QEMU_LOCK_GUARD, but I decided to took Dave's r-b first and leave that
   for later)
- Added r-bs for Dave on patch 2/4
- Add a comment in patch 5 to explain why it's safe to unregister yank without
  qemu_file_lock, in postcopy_pause() [Dave]

v2:
- Pick r-b for Dave on patch 1/3
- Move migration_file_get_ioc() from patch 5 to patch 4, meanwhile rename it to
  qemu_file_get_ioc(). [Dave]
- Patch 2 "migration: Shutdown src in await_return_path_close_on_source()" is
  replaced by patch "migration: Make from_dst_file accesses thread-safe" [Dave]

Patch 1 fixes a possible race that migration thread can accidentally skip
join() of rp_thread even if the return thread is enabled.  Patch 1 is suspected
to also be the root cause of the recent hard-to-reproduce migration-test
failure here reported by PMM:

https://lore.kernel.org/qemu-devel/YPamXAHwan%2FPPXLf@work-vm/

I didn't reproduce it myself; but after co-debugged with Dave it's suspected
that the race of rp_thread could be the cause.  It's not exposed before because
yank is soo strict on releasing instances, while we're not that strict before,
and didn't join() on rp_thread wasn't so dangerous after all when migration
succeeded before.

Patch 2 fixes another theoretical race on accessing from_dst_file spotted by
Dave.  I don't think there's known issues with it, but may still worth fixing.

Patch 3 should be a cleanup on yank that I think would be nice to have.

Patch 4-5 are further cleanups to remove the ref==1 check in channel_close(),
finally, as I always thought that's a bit hackish.  So I used explicit
unregister of the yank function at necessary places to replace that ref==1 one.

I still think having patch 3-5 altogether would be great, however I think patch
1 should still be the most important to be reviewed.  Also it would be great to
know whether patch 1 could fix the known yank crash.

Please review, thanks.

Peter Xu (5):
  migration: Fix missing join() of rp_thread
  migration: Make from_dst_file accesses thread-safe
  migration: Introduce migration_ioc_[un]register_yank()
  migration: Teach QEMUFile to be QIOChannel-aware
  migration: Move the yank unregister of channel_close out

 migration/channel.c           | 15 ++-------
 migration/migration.c         | 57 +++++++++++++++++++++++++++--------
 migration/migration.h         | 15 +++++++--
 migration/multifd.c           |  8 ++---
 migration/qemu-file-channel.c | 11 ++-----
 migration/qemu-file.c         | 17 ++++++++++-
 migration/qemu-file.h         |  4 ++-
 migration/ram.c               |  3 +-
 migration/savevm.c            | 11 +++++--
 migration/yank_functions.c    | 42 ++++++++++++++++++++++++++
 migration/yank_functions.h    |  3 ++
 11 files changed, 138 insertions(+), 48 deletions(-)

-- 
2.31.1




^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v3 1/5] migration: Fix missing join() of rp_thread
  2021-07-22 17:58 [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
@ 2021-07-22 17:58 ` Peter Xu
  2021-07-22 17:58 ` [PATCH v3 2/5] migration: Make from_dst_file accesses thread-safe Peter Xu
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2021-07-22 17:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos,
	Eric Blake

It's possible that the migration thread skip the join() of the rp_thread in
below race and crash on src right at finishing migration:

       migration_thread                     rp_thread
       ----------------                     ---------
    migration_completion()
                                        (before rp_thread quits)
                                        from_dst_file=NULL
                                        [thread got scheduled out]
      s->rp_state.from_dst_file==NULL
        (skip join() of rp_thread)
    migrate_fd_cleanup()
      qemu_fclose(s->to_dst_file)
      yank_unregister_instance()
        assert(yank_find_entry())  <------- crash

It could mostly happen with postcopy, but that shouldn't be required, e.g., I
think it could also trigger with MIGRATION_CAPABILITY_RETURN_PATH set.

It's suspected that above race could be the root cause of a recent (but rare)
migration-test break reported by either Dave or PMM:

https://lore.kernel.org/qemu-devel/YPamXAHwan%2FPPXLf@work-vm/

The issue is: from_dst_file is reset in the rp_thread, so if the thread reset
it to NULL fast enough then the migration thread will assume there's no
rp_thread at all.

This could potentially cause more severe issue (e.g. crash) after the yank code.

Fix it by using a boolean to keep "whether we've created rp_thread".

Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 4 +++-
 migration/migration.h | 7 +++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 2d306582eb..21b94f75a3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2867,6 +2867,7 @@ static int open_return_path_on_source(MigrationState *ms,
 
     qemu_thread_create(&ms->rp_state.rp_thread, "return path",
                        source_return_path_thread, ms, QEMU_THREAD_JOINABLE);
+    ms->rp_state.rp_thread_created = true;
 
     trace_open_return_path_on_source_continue();
 
@@ -2891,6 +2892,7 @@ static int await_return_path_close_on_source(MigrationState *ms)
     }
     trace_await_return_path_close_on_source_joining();
     qemu_thread_join(&ms->rp_state.rp_thread);
+    ms->rp_state.rp_thread_created = false;
     trace_await_return_path_close_on_source_close();
     return ms->rp_state.error;
 }
@@ -3170,7 +3172,7 @@ static void migration_completion(MigrationState *s)
      * it will wait for the destination to send it's status in
      * a SHUT command).
      */
-    if (s->rp_state.from_dst_file) {
+    if (s->rp_state.rp_thread_created) {
         int rp_error;
         trace_migration_return_path_end_before();
         rp_error = await_return_path_close_on_source(s);
diff --git a/migration/migration.h b/migration/migration.h
index 2ebb740dfa..c302879fad 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -195,6 +195,13 @@ struct MigrationState {
         QEMUFile     *from_dst_file;
         QemuThread    rp_thread;
         bool          error;
+        /*
+         * We can also check non-zero of rp_thread, but there's no "official"
+         * way to do this, so this bool makes it slightly more elegant.
+         * Checking from_dst_file for this is racy because from_dst_file will
+         * be cleared in the rp_thread!
+         */
+        bool          rp_thread_created;
         QemuSemaphore rp_sem;
     } rp_state;
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v3 2/5] migration: Make from_dst_file accesses thread-safe
  2021-07-22 17:58 [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
  2021-07-22 17:58 ` [PATCH v3 1/5] migration: Fix missing join() of rp_thread Peter Xu
@ 2021-07-22 17:58 ` Peter Xu
  2021-07-22 19:32   ` Peter Xu
  2021-07-22 17:58 ` [PATCH v3 3/5] migration: Introduce migration_ioc_[un]register_yank() Peter Xu
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2021-07-22 17:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos,
	Eric Blake

Accessing from_dst_file is potentially racy in current code base like below:

  if (s->from_dst_file)
    do_something(s->from_dst_file);

Because from_dst_file can be reset right after the check in another
thread (rp_thread).  One example is migrate_fd_cancel().

Use the same qemu_file_lock to protect it too, just like to_dst_file.

When it's safe to access without lock, comment it.

There's one special reference in migration_thread() that can be replaced by
the newly introduced rp_thread_created flag.

Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 39 +++++++++++++++++++++++++++++----------
 migration/migration.h |  8 +++++---
 migration/ram.c       |  1 +
 3 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 21b94f75a3..b2c48b7e17 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1879,9 +1879,11 @@ static void migrate_fd_cancel(MigrationState *s)
     QEMUFile *f = migrate_get_current()->to_dst_file;
     trace_migrate_fd_cancel();
 
-    if (s->rp_state.from_dst_file) {
-        /* shutdown the rp socket, so causing the rp thread to shutdown */
-        qemu_file_shutdown(s->rp_state.from_dst_file);
+    WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
+        if (s->rp_state.from_dst_file) {
+            /* shutdown the rp socket, so causing the rp thread to shutdown */
+            qemu_file_shutdown(s->rp_state.from_dst_file);
+        }
     }
 
     do {
@@ -2686,6 +2688,23 @@ static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
     return 0;
 }
 
+/* Release ms->rp_state.from_dst_file in a safe way */
+static void migration_release_from_dst_file(MigrationState *ms)
+{
+    QEMUFile *file;
+
+    WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
+        /*
+         * Reset the from_dst_file pointer first before releasing it, as we
+         * can't block within lock section
+         */
+        file = ms->rp_state.from_dst_file;
+        ms->rp_state.from_dst_file = NULL;
+    }
+
+    qemu_fclose(file);
+}
+
 /*
  * Handles messages sent on the return path towards the source VM
  *
@@ -2827,11 +2846,13 @@ out:
              * Maybe there is something we can do: it looks like a
              * network down issue, and we pause for a recovery.
              */
-            qemu_fclose(rp);
-            ms->rp_state.from_dst_file = NULL;
+            migration_release_from_dst_file(ms);
             rp = NULL;
             if (postcopy_pause_return_path_thread(ms)) {
-                /* Reload rp, reset the rest */
+                /*
+                 * Reload rp, reset the rest.  Referencing it is safe since
+                 * it's reset only by us above, or when migration completes
+                 */
                 rp = ms->rp_state.from_dst_file;
                 ms->rp_state.error = false;
                 goto retry;
@@ -2843,8 +2864,7 @@ out:
     }
 
     trace_source_return_path_thread_end();
-    ms->rp_state.from_dst_file = NULL;
-    qemu_fclose(rp);
+    migration_release_from_dst_file(ms);
     rcu_unregister_thread();
     return NULL;
 }
@@ -2852,7 +2872,6 @@ out:
 static int open_return_path_on_source(MigrationState *ms,
                                       bool create_thread)
 {
-
     ms->rp_state.from_dst_file = qemu_file_get_return_path(ms->to_dst_file);
     if (!ms->rp_state.from_dst_file) {
         return -1;
@@ -3746,7 +3765,7 @@ static void *migration_thread(void *opaque)
      * If we opened the return path, we need to make sure dst has it
      * opened as well.
      */
-    if (s->rp_state.from_dst_file) {
+    if (s->rp_state.rp_thread_created) {
         /* Now tell the dest that it should open its end so it can reply */
         qemu_savevm_send_open_return_path(s->to_dst_file);
 
diff --git a/migration/migration.h b/migration/migration.h
index c302879fad..7a5aa8c2fd 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -154,12 +154,13 @@ struct MigrationState {
     QemuThread thread;
     QEMUBH *vm_start_bh;
     QEMUBH *cleanup_bh;
+    /* Protected by qemu_file_lock */
     QEMUFile *to_dst_file;
     QIOChannelBuffer *bioc;
     /*
-     * Protects to_dst_file pointer.  We need to make sure we won't
-     * yield or hang during the critical section, since this lock will
-     * be used in OOB command handler.
+     * Protects to_dst_file/from_dst_file pointers.  We need to make sure we
+     * won't yield or hang during the critical section, since this lock will be
+     * used in OOB command handler.
      */
     QemuMutex qemu_file_lock;
 
@@ -192,6 +193,7 @@ struct MigrationState {
 
     /* State related to return path */
     struct {
+        /* Protected by qemu_file_lock */
         QEMUFile     *from_dst_file;
         QemuThread    rp_thread;
         bool          error;
diff --git a/migration/ram.c b/migration/ram.c
index b5fc454b2f..f728f5072f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4012,6 +4012,7 @@ static void ram_dirty_bitmap_reload_notify(MigrationState *s)
 int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
 {
     int ret = -EINVAL;
+    /* from_dst_file is always valid because we're within rp_thread */
     QEMUFile *file = s->rp_state.from_dst_file;
     unsigned long *le_bitmap, nbits = block->used_length >> TARGET_PAGE_BITS;
     uint64_t local_size = DIV_ROUND_UP(nbits, 8);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v3 3/5] migration: Introduce migration_ioc_[un]register_yank()
  2021-07-22 17:58 [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
  2021-07-22 17:58 ` [PATCH v3 1/5] migration: Fix missing join() of rp_thread Peter Xu
  2021-07-22 17:58 ` [PATCH v3 2/5] migration: Make from_dst_file accesses thread-safe Peter Xu
@ 2021-07-22 17:58 ` Peter Xu
  2021-07-22 17:58 ` [PATCH v3 4/5] migration: Teach QEMUFile to be QIOChannel-aware Peter Xu
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2021-07-22 17:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos,
	Eric Blake

There're plenty of places in migration/* that checks against either socket or
tls typed ioc for yank operations.  Provide two helpers to hide all these
information.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/channel.c           | 15 ++-------------
 migration/multifd.c           |  8 ++------
 migration/qemu-file-channel.c |  8 ++------
 migration/yank_functions.c    | 28 ++++++++++++++++++++++++++++
 migration/yank_functions.h    |  2 ++
 5 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/migration/channel.c b/migration/channel.c
index 01275a9162..c4fc000a1a 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -44,13 +44,7 @@ void migration_channel_process_incoming(QIOChannel *ioc)
                              TYPE_QIO_CHANNEL_TLS)) {
         migration_tls_channel_process_incoming(s, ioc, &local_err);
     } else {
-        if (object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_SOCKET) ||
-            object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_TLS)) {
-            yank_register_function(MIGRATION_YANK_INSTANCE,
-                                   migration_yank_iochannel,
-                                   QIO_CHANNEL(ioc));
-        }
-
+        migration_ioc_register_yank(ioc);
         migration_ioc_process_incoming(ioc, &local_err);
     }
 
@@ -94,12 +88,7 @@ void migration_channel_connect(MigrationState *s,
         } else {
             QEMUFile *f = qemu_fopen_channel_output(ioc);
 
-            if (object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_SOCKET) ||
-                object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_TLS)) {
-                yank_register_function(MIGRATION_YANK_INSTANCE,
-                                       migration_yank_iochannel,
-                                       QIO_CHANNEL(ioc));
-            }
+            migration_ioc_register_yank(ioc);
 
             qemu_mutex_lock(&s->qemu_file_lock);
             s->to_dst_file = f;
diff --git a/migration/multifd.c b/migration/multifd.c
index ab41590e71..377da78f5b 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -987,12 +987,8 @@ int multifd_load_cleanup(Error **errp)
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
-        if ((object_dynamic_cast(OBJECT(p->c), TYPE_QIO_CHANNEL_SOCKET) ||
-             object_dynamic_cast(OBJECT(p->c), TYPE_QIO_CHANNEL_TLS))
-            && OBJECT(p->c)->ref == 1) {
-            yank_unregister_function(MIGRATION_YANK_INSTANCE,
-                                     migration_yank_iochannel,
-                                     QIO_CHANNEL(p->c));
+        if (OBJECT(p->c)->ref == 1) {
+            migration_ioc_unregister_yank(p->c);
         }
 
         object_unref(OBJECT(p->c));
diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
index fad340ea7a..867a5ed0c3 100644
--- a/migration/qemu-file-channel.c
+++ b/migration/qemu-file-channel.c
@@ -107,12 +107,8 @@ static int channel_close(void *opaque, Error **errp)
     int ret;
     QIOChannel *ioc = QIO_CHANNEL(opaque);
     ret = qio_channel_close(ioc, errp);
-    if ((object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_SOCKET) ||
-         object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_TLS))
-        && OBJECT(ioc)->ref == 1) {
-        yank_unregister_function(MIGRATION_YANK_INSTANCE,
-                                 migration_yank_iochannel,
-                                 QIO_CHANNEL(ioc));
+    if (OBJECT(ioc)->ref == 1) {
+        migration_ioc_unregister_yank(ioc);
     }
     object_unref(OBJECT(ioc));
     return ret;
diff --git a/migration/yank_functions.c b/migration/yank_functions.c
index 96c90e17dc..23697173ae 100644
--- a/migration/yank_functions.c
+++ b/migration/yank_functions.c
@@ -11,6 +11,9 @@
 #include "qapi/error.h"
 #include "io/channel.h"
 #include "yank_functions.h"
+#include "qemu/yank.h"
+#include "io/channel-socket.h"
+#include "io/channel-tls.h"
 
 void migration_yank_iochannel(void *opaque)
 {
@@ -18,3 +21,28 @@ void migration_yank_iochannel(void *opaque)
 
     qio_channel_shutdown(ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
 }
+
+/* Return whether yank is supported on this ioc */
+static bool migration_ioc_yank_supported(QIOChannel *ioc)
+{
+    return object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_SOCKET) ||
+        object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_TLS);
+}
+
+void migration_ioc_register_yank(QIOChannel *ioc)
+{
+    if (migration_ioc_yank_supported(ioc)) {
+        yank_register_function(MIGRATION_YANK_INSTANCE,
+                               migration_yank_iochannel,
+                               QIO_CHANNEL(ioc));
+    }
+}
+
+void migration_ioc_unregister_yank(QIOChannel *ioc)
+{
+    if (migration_ioc_yank_supported(ioc)) {
+        yank_unregister_function(MIGRATION_YANK_INSTANCE,
+                                 migration_yank_iochannel,
+                                 QIO_CHANNEL(ioc));
+    }
+}
diff --git a/migration/yank_functions.h b/migration/yank_functions.h
index 055ea22523..74c7f18c91 100644
--- a/migration/yank_functions.h
+++ b/migration/yank_functions.h
@@ -15,3 +15,5 @@
  * @opaque: QIOChannel to shutdown
  */
 void migration_yank_iochannel(void *opaque);
+void migration_ioc_register_yank(QIOChannel *ioc);
+void migration_ioc_unregister_yank(QIOChannel *ioc);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v3 4/5] migration: Teach QEMUFile to be QIOChannel-aware
  2021-07-22 17:58 [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
                   ` (2 preceding siblings ...)
  2021-07-22 17:58 ` [PATCH v3 3/5] migration: Introduce migration_ioc_[un]register_yank() Peter Xu
@ 2021-07-22 17:58 ` Peter Xu
  2021-07-22 17:58 ` [PATCH v3 5/5] migration: Move the yank unregister of channel_close out Peter Xu
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2021-07-22 17:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos,
	Eric Blake

migration uses QIOChannel typed qemufiles.  In follow up patches, we'll need
the capability to identify this fact, so that we can get the backing QIOChannel
from a QEMUFile.

We can also define types for QEMUFile but so far since we only need to be able
to identify QIOChannel, introduce a boolean which is simpler.

Introduce another helper qemu_file_get_ioc() to return the ioc backend of a
qemufile if has_ioc is set.

No functional change.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/qemu-file-channel.c |  4 ++--
 migration/qemu-file.c         | 17 ++++++++++++++++-
 migration/qemu-file.h         |  4 +++-
 migration/ram.c               |  2 +-
 migration/savevm.c            |  4 ++--
 5 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
index 867a5ed0c3..2f8b1fcd46 100644
--- a/migration/qemu-file-channel.c
+++ b/migration/qemu-file-channel.c
@@ -187,11 +187,11 @@ static const QEMUFileOps channel_output_ops = {
 QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc)
 {
     object_ref(OBJECT(ioc));
-    return qemu_fopen_ops(ioc, &channel_input_ops);
+    return qemu_fopen_ops(ioc, &channel_input_ops, true);
 }
 
 QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
 {
     object_ref(OBJECT(ioc));
-    return qemu_fopen_ops(ioc, &channel_output_ops);
+    return qemu_fopen_ops(ioc, &channel_output_ops, true);
 }
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 1eacf9e831..6338d8e2ff 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -55,6 +55,8 @@ struct QEMUFile {
     Error *last_error_obj;
     /* has the file has been shutdown */
     bool shutdown;
+    /* Whether opaque points to a QIOChannel */
+    bool has_ioc;
 };
 
 /*
@@ -101,7 +103,7 @@ bool qemu_file_mode_is_not_valid(const char *mode)
     return false;
 }
 
-QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops)
+QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops, bool has_ioc)
 {
     QEMUFile *f;
 
@@ -109,6 +111,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops)
 
     f->opaque = opaque;
     f->ops = ops;
+    f->has_ioc = has_ioc;
     return f;
 }
 
@@ -851,3 +854,15 @@ void qemu_file_set_blocking(QEMUFile *f, bool block)
         f->ops->set_blocking(f->opaque, block, NULL);
     }
 }
+
+/*
+ * Return the ioc object if it's a migration channel.  Note: it can return NULL
+ * for callers passing in a non-migration qemufile.  E.g. see qemu_fopen_bdrv()
+ * and its usage in e.g. load_snapshot().  So we need to check against NULL
+ * before using it.  If without the check, migration_incoming_state_destroy()
+ * could fail for load_snapshot().
+ */
+QIOChannel *qemu_file_get_ioc(QEMUFile *file)
+{
+    return file->has_ioc ? QIO_CHANNEL(file->opaque) : NULL;
+}
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index a9b6d6ccb7..3f36d4dc8c 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -27,6 +27,7 @@
 
 #include <zlib.h>
 #include "exec/cpu-common.h"
+#include "io/channel.h"
 
 /* Read a chunk of data from a file at the given position.  The pos argument
  * can be ignored if the file is only be used for streaming.  The number of
@@ -119,7 +120,7 @@ typedef struct QEMUFileHooks {
     QEMURamSaveFunc *save_page;
 } QEMUFileHooks;
 
-QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
+QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops, bool has_ioc);
 void qemu_file_set_hooks(QEMUFile *f, const QEMUFileHooks *hooks);
 int qemu_get_fd(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
@@ -179,5 +180,6 @@ void ram_control_load_hook(QEMUFile *f, uint64_t flags, void *data);
 size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
                              ram_addr_t offset, size_t size,
                              uint64_t *bytes_sent);
+QIOChannel *qemu_file_get_ioc(QEMUFile *file);
 
 #endif
diff --git a/migration/ram.c b/migration/ram.c
index f728f5072f..08b3cb7a4a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -550,7 +550,7 @@ static int compress_threads_save_setup(void)
         /* comp_param[i].file is just used as a dummy buffer to save data,
          * set its ops to empty.
          */
-        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops);
+        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops, false);
         comp_param[i].done = true;
         comp_param[i].quit = false;
         qemu_mutex_init(&comp_param[i].mutex);
diff --git a/migration/savevm.c b/migration/savevm.c
index 72848b946c..96b5e5d639 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -168,9 +168,9 @@ static const QEMUFileOps bdrv_write_ops = {
 static QEMUFile *qemu_fopen_bdrv(BlockDriverState *bs, int is_writable)
 {
     if (is_writable) {
-        return qemu_fopen_ops(bs, &bdrv_write_ops);
+        return qemu_fopen_ops(bs, &bdrv_write_ops, false);
     }
-    return qemu_fopen_ops(bs, &bdrv_read_ops);
+    return qemu_fopen_ops(bs, &bdrv_read_ops, false);
 }
 
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v3 5/5] migration: Move the yank unregister of channel_close out
  2021-07-22 17:58 [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
                   ` (3 preceding siblings ...)
  2021-07-22 17:58 ` [PATCH v3 4/5] migration: Teach QEMUFile to be QIOChannel-aware Peter Xu
@ 2021-07-22 17:58 ` Peter Xu
  2021-07-22 18:41   ` Dr. David Alan Gilbert
  2021-07-24 17:21 ` [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank Lukas Straub
  2021-07-26 11:45 ` Dr. David Alan Gilbert
  6 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2021-07-22 17:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, peterx, Leonardo Bras Soares Passos,
	Eric Blake

It's efficient, but hackish to call yank unregister calls in channel_close(),
especially it'll be hard to debug when qemu crashed with some yank function
leaked.

Remove that hack, but instead explicitly unregister yank functions at the
places where needed, they are:

  (on src)
  - migrate_fd_cleanup
  - postcopy_pause

  (on dst)
  - migration_incoming_state_destroy
  - postcopy_pause_incoming

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c         | 14 +++++++++++++-
 migration/qemu-file-channel.c |  3 ---
 migration/savevm.c            |  7 +++++++
 migration/yank_functions.c    | 14 ++++++++++++++
 migration/yank_functions.h    |  1 +
 5 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index b2c48b7e17..a50330016c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -59,6 +59,7 @@
 #include "multifd.h"
 #include "qemu/yank.h"
 #include "sysemu/cpus.h"
+#include "yank_functions.h"
 
 #define MAX_THROTTLE  (128 << 20)      /* Migration transfer speed throttling */
 
@@ -273,6 +274,7 @@ void migration_incoming_state_destroy(void)
     }
 
     if (mis->from_src_file) {
+        migration_ioc_unregister_yank_from_file(mis->from_src_file);
         qemu_fclose(mis->from_src_file);
         mis->from_src_file = NULL;
     }
@@ -1811,6 +1813,7 @@ static void migrate_fd_cleanup(MigrationState *s)
          * Close the file handle without the lock to make sure the
          * critical section won't block for long.
          */
+        migration_ioc_unregister_yank_from_file(tmp);
         qemu_fclose(tmp);
     }
 
@@ -3351,8 +3354,17 @@ static MigThrError postcopy_pause(MigrationState *s)
     while (true) {
         QEMUFile *file;
 
-        /* Current channel is possibly broken. Release it. */
+        /*
+         * Current channel is possibly broken. Release it.  Note that this is
+         * guaranteed even without lock because to_dst_file should only be
+         * modified by the migration thread.  That also guarantees that the
+         * unregister of yank is safe too without the lock.  It should be safe
+         * even to be within the qemu_file_lock, but we didn't do that to avoid
+         * taking more mutex (yank_lock) within qemu_file_lock.  TL;DR: we make
+         * the qemu_file_lock critical section as small as possible.
+         */
         assert(s->to_dst_file);
+        migration_ioc_unregister_yank_from_file(s->to_dst_file);
         qemu_mutex_lock(&s->qemu_file_lock);
         file = s->to_dst_file;
         s->to_dst_file = NULL;
diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
index 2f8b1fcd46..bb5a5752df 100644
--- a/migration/qemu-file-channel.c
+++ b/migration/qemu-file-channel.c
@@ -107,9 +107,6 @@ static int channel_close(void *opaque, Error **errp)
     int ret;
     QIOChannel *ioc = QIO_CHANNEL(opaque);
     ret = qio_channel_close(ioc, errp);
-    if (OBJECT(ioc)->ref == 1) {
-        migration_ioc_unregister_yank(ioc);
-    }
     object_unref(OBJECT(ioc));
     return ret;
 }
diff --git a/migration/savevm.c b/migration/savevm.c
index 96b5e5d639..7b7b64bd13 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -65,6 +65,7 @@
 #include "qemu/bitmap.h"
 #include "net/announce.h"
 #include "qemu/yank.h"
+#include "yank_functions.h"
 
 const unsigned int postcopy_ram_discard_version;
 
@@ -2568,6 +2569,12 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
     /* Clear the triggered bit to allow one recovery */
     mis->postcopy_recover_triggered = false;
 
+    /*
+     * Unregister yank with either from/to src would work, since ioc behind it
+     * is the same
+     */
+    migration_ioc_unregister_yank_from_file(mis->from_src_file);
+
     assert(mis->from_src_file);
     qemu_file_shutdown(mis->from_src_file);
     qemu_fclose(mis->from_src_file);
diff --git a/migration/yank_functions.c b/migration/yank_functions.c
index 23697173ae..8c08aef14a 100644
--- a/migration/yank_functions.c
+++ b/migration/yank_functions.c
@@ -14,6 +14,7 @@
 #include "qemu/yank.h"
 #include "io/channel-socket.h"
 #include "io/channel-tls.h"
+#include "qemu-file.h"
 
 void migration_yank_iochannel(void *opaque)
 {
@@ -46,3 +47,16 @@ void migration_ioc_unregister_yank(QIOChannel *ioc)
                                  QIO_CHANNEL(ioc));
     }
 }
+
+void migration_ioc_unregister_yank_from_file(QEMUFile *file)
+{
+    QIOChannel *ioc = qemu_file_get_ioc(file);
+
+    if (ioc) {
+        /*
+         * For migration qemufiles, we'll always reach here.  Though we'll skip
+         * calls from e.g. savevm/loadvm as they don't use yank.
+         */
+        migration_ioc_unregister_yank(ioc);
+    }
+}
diff --git a/migration/yank_functions.h b/migration/yank_functions.h
index 74c7f18c91..a7577955ed 100644
--- a/migration/yank_functions.h
+++ b/migration/yank_functions.h
@@ -17,3 +17,4 @@
 void migration_yank_iochannel(void *opaque);
 void migration_ioc_register_yank(QIOChannel *ioc);
 void migration_ioc_unregister_yank(QIOChannel *ioc);
+void migration_ioc_unregister_yank_from_file(QEMUFile *file);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 5/5] migration: Move the yank unregister of channel_close out
  2021-07-22 17:58 ` [PATCH v3 5/5] migration: Move the yank unregister of channel_close out Peter Xu
@ 2021-07-22 18:41   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 10+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-22 18:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	qemu-devel, Leonardo Bras Soares Passos, Eric Blake

* Peter Xu (peterx@redhat.com) wrote:
> It's efficient, but hackish to call yank unregister calls in channel_close(),
> especially it'll be hard to debug when qemu crashed with some yank function
> leaked.
> 
> Remove that hack, but instead explicitly unregister yank functions at the
> places where needed, they are:
> 
>   (on src)
>   - migrate_fd_cleanup
>   - postcopy_pause
> 
>   (on dst)
>   - migration_incoming_state_destroy
>   - postcopy_pause_incoming
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/migration.c         | 14 +++++++++++++-
>  migration/qemu-file-channel.c |  3 ---
>  migration/savevm.c            |  7 +++++++
>  migration/yank_functions.c    | 14 ++++++++++++++
>  migration/yank_functions.h    |  1 +
>  5 files changed, 35 insertions(+), 4 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index b2c48b7e17..a50330016c 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -59,6 +59,7 @@
>  #include "multifd.h"
>  #include "qemu/yank.h"
>  #include "sysemu/cpus.h"
> +#include "yank_functions.h"
>  
>  #define MAX_THROTTLE  (128 << 20)      /* Migration transfer speed throttling */
>  
> @@ -273,6 +274,7 @@ void migration_incoming_state_destroy(void)
>      }
>  
>      if (mis->from_src_file) {
> +        migration_ioc_unregister_yank_from_file(mis->from_src_file);
>          qemu_fclose(mis->from_src_file);
>          mis->from_src_file = NULL;
>      }
> @@ -1811,6 +1813,7 @@ static void migrate_fd_cleanup(MigrationState *s)
>           * Close the file handle without the lock to make sure the
>           * critical section won't block for long.
>           */
> +        migration_ioc_unregister_yank_from_file(tmp);
>          qemu_fclose(tmp);
>      }
>  
> @@ -3351,8 +3354,17 @@ static MigThrError postcopy_pause(MigrationState *s)
>      while (true) {
>          QEMUFile *file;
>  
> -        /* Current channel is possibly broken. Release it. */
> +        /*
> +         * Current channel is possibly broken. Release it.  Note that this is
> +         * guaranteed even without lock because to_dst_file should only be
> +         * modified by the migration thread.  That also guarantees that the
> +         * unregister of yank is safe too without the lock.  It should be safe
> +         * even to be within the qemu_file_lock, but we didn't do that to avoid
> +         * taking more mutex (yank_lock) within qemu_file_lock.  TL;DR: we make
> +         * the qemu_file_lock critical section as small as possible.
> +         */
>          assert(s->to_dst_file);
> +        migration_ioc_unregister_yank_from_file(s->to_dst_file);
>          qemu_mutex_lock(&s->qemu_file_lock);
>          file = s->to_dst_file;
>          s->to_dst_file = NULL;
> diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
> index 2f8b1fcd46..bb5a5752df 100644
> --- a/migration/qemu-file-channel.c
> +++ b/migration/qemu-file-channel.c
> @@ -107,9 +107,6 @@ static int channel_close(void *opaque, Error **errp)
>      int ret;
>      QIOChannel *ioc = QIO_CHANNEL(opaque);
>      ret = qio_channel_close(ioc, errp);
> -    if (OBJECT(ioc)->ref == 1) {
> -        migration_ioc_unregister_yank(ioc);
> -    }
>      object_unref(OBJECT(ioc));
>      return ret;
>  }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 96b5e5d639..7b7b64bd13 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -65,6 +65,7 @@
>  #include "qemu/bitmap.h"
>  #include "net/announce.h"
>  #include "qemu/yank.h"
> +#include "yank_functions.h"
>  
>  const unsigned int postcopy_ram_discard_version;
>  
> @@ -2568,6 +2569,12 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
>      /* Clear the triggered bit to allow one recovery */
>      mis->postcopy_recover_triggered = false;
>  
> +    /*
> +     * Unregister yank with either from/to src would work, since ioc behind it
> +     * is the same
> +     */
> +    migration_ioc_unregister_yank_from_file(mis->from_src_file);
> +
>      assert(mis->from_src_file);
>      qemu_file_shutdown(mis->from_src_file);
>      qemu_fclose(mis->from_src_file);
> diff --git a/migration/yank_functions.c b/migration/yank_functions.c
> index 23697173ae..8c08aef14a 100644
> --- a/migration/yank_functions.c
> +++ b/migration/yank_functions.c
> @@ -14,6 +14,7 @@
>  #include "qemu/yank.h"
>  #include "io/channel-socket.h"
>  #include "io/channel-tls.h"
> +#include "qemu-file.h"
>  
>  void migration_yank_iochannel(void *opaque)
>  {
> @@ -46,3 +47,16 @@ void migration_ioc_unregister_yank(QIOChannel *ioc)
>                                   QIO_CHANNEL(ioc));
>      }
>  }
> +
> +void migration_ioc_unregister_yank_from_file(QEMUFile *file)
> +{
> +    QIOChannel *ioc = qemu_file_get_ioc(file);
> +
> +    if (ioc) {
> +        /*
> +         * For migration qemufiles, we'll always reach here.  Though we'll skip
> +         * calls from e.g. savevm/loadvm as they don't use yank.
> +         */
> +        migration_ioc_unregister_yank(ioc);
> +    }
> +}
> diff --git a/migration/yank_functions.h b/migration/yank_functions.h
> index 74c7f18c91..a7577955ed 100644
> --- a/migration/yank_functions.h
> +++ b/migration/yank_functions.h
> @@ -17,3 +17,4 @@
>  void migration_yank_iochannel(void *opaque);
>  void migration_ioc_register_yank(QIOChannel *ioc);
>  void migration_ioc_unregister_yank(QIOChannel *ioc);
> +void migration_ioc_unregister_yank_from_file(QEMUFile *file);
> -- 
> 2.31.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 2/5] migration: Make from_dst_file accesses thread-safe
  2021-07-22 17:58 ` [PATCH v3 2/5] migration: Make from_dst_file accesses thread-safe Peter Xu
@ 2021-07-22 19:32   ` Peter Xu
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2021-07-22 19:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	Dr . David Alan Gilbert, Leonardo Bras Soares Passos, Eric Blake

On Thu, Jul 22, 2021 at 01:58:38PM -0400, Peter Xu wrote:
> Accessing from_dst_file is potentially racy in current code base like below:
> 
>   if (s->from_dst_file)
>     do_something(s->from_dst_file);
> 
> Because from_dst_file can be reset right after the check in another
> thread (rp_thread).  One example is migrate_fd_cancel().
> 
> Use the same qemu_file_lock to protect it too, just like to_dst_file.
> 
> When it's safe to access without lock, comment it.
> 
> There's one special reference in migration_thread() that can be replaced by
> the newly introduced rp_thread_created flag.
> 
> Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

(Dave should have helped fixing this which I appreciated a lot, but just make
 it be together with the record..)

Below needs to be squashed into the patch:

diff --git a/migration/migration.c b/migration/migration.c
index a50330016c..041b8451a6 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2696,7 +2696,7 @@ static void migration_release_from_dst_file(MigrationState *ms)
 {
     QEMUFile *file;

-    WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
+    WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
         /*
          * Reset the from_dst_file pointer first before releasing it, as we
          * can't block within lock section

-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank
  2021-07-22 17:58 [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
                   ` (4 preceding siblings ...)
  2021-07-22 17:58 ` [PATCH v3 5/5] migration: Move the yank unregister of channel_close out Peter Xu
@ 2021-07-24 17:21 ` Lukas Straub
  2021-07-26 11:45 ` Dr. David Alan Gilbert
  6 siblings, 0 replies; 10+ messages in thread
From: Lukas Straub @ 2021-07-24 17:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Daniel P . Berrange, Juan Quintela, qemu-devel,
	Dr . David Alan Gilbert, Leonardo Bras Soares Passos, Eric Blake

[-- Attachment #1: Type: text/plain, Size: 3243 bytes --]

On Thu, 22 Jul 2021 13:58:36 -0400
Peter Xu <peterx@redhat.com> wrote:

> v3:
> - Use WITH_QEMU_LOCK_GUARD() for patch 2 [Eric]
>   (potentially I can also replace other existing uses of qemu_file_lock into
>    WITH_QEMU_LOCK_GUARD, but I decided to took Dave's r-b first and leave that
>    for later)
> - Added r-bs for Dave on patch 2/4
> - Add a comment in patch 5 to explain why it's safe to unregister yank without
>   qemu_file_lock, in postcopy_pause() [Dave]
> 
> v2:
> - Pick r-b for Dave on patch 1/3
> - Move migration_file_get_ioc() from patch 5 to patch 4, meanwhile rename it to
>   qemu_file_get_ioc(). [Dave]
> - Patch 2 "migration: Shutdown src in await_return_path_close_on_source()" is
>   replaced by patch "migration: Make from_dst_file accesses thread-safe" [Dave]
> 
> Patch 1 fixes a possible race that migration thread can accidentally skip
> join() of rp_thread even if the return thread is enabled.  Patch 1 is suspected
> to also be the root cause of the recent hard-to-reproduce migration-test
> failure here reported by PMM:
> 
> https://lore.kernel.org/qemu-devel/YPamXAHwan%2FPPXLf@work-vm/
> 
> I didn't reproduce it myself; but after co-debugged with Dave it's suspected
> that the race of rp_thread could be the cause.  It's not exposed before because
> yank is soo strict on releasing instances, while we're not that strict before,
> and didn't join() on rp_thread wasn't so dangerous after all when migration
> succeeded before.
> 
> Patch 2 fixes another theoretical race on accessing from_dst_file spotted by
> Dave.  I don't think there's known issues with it, but may still worth fixing.
> 
> Patch 3 should be a cleanup on yank that I think would be nice to have.
> 
> Patch 4-5 are further cleanups to remove the ref==1 check in channel_close(),
> finally, as I always thought that's a bit hackish.  So I used explicit
> unregister of the yank function at necessary places to replace that ref==1 one.
> 
> I still think having patch 3-5 altogether would be great, however I think patch
> 1 should still be the most important to be reviewed.  Also it would be great to
> know whether patch 1 could fix the known yank crash.
> 
> Please review, thanks.
> 
> Peter Xu (5):
>   migration: Fix missing join() of rp_thread
>   migration: Make from_dst_file accesses thread-safe
>   migration: Introduce migration_ioc_[un]register_yank()
>   migration: Teach QEMUFile to be QIOChannel-aware
>   migration: Move the yank unregister of channel_close out
> 
>  migration/channel.c           | 15 ++-------
>  migration/migration.c         | 57 +++++++++++++++++++++++++++--------
>  migration/migration.h         | 15 +++++++--
>  migration/multifd.c           |  8 ++---
>  migration/qemu-file-channel.c | 11 ++-----
>  migration/qemu-file.c         | 17 ++++++++++-
>  migration/qemu-file.h         |  4 ++-
>  migration/ram.c               |  3 +-
>  migration/savevm.c            | 11 +++++--
>  migration/yank_functions.c    | 42 ++++++++++++++++++++++++++
>  migration/yank_functions.h    |  3 ++
>  11 files changed, 138 insertions(+), 48 deletions(-)
> 

Looks good to me,
Reviewed-by: Lukas Straub <lukasstraub2@web.de>

-- 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank
  2021-07-22 17:58 [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
                   ` (5 preceding siblings ...)
  2021-07-24 17:21 ` [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank Lukas Straub
@ 2021-07-26 11:45 ` Dr. David Alan Gilbert
  6 siblings, 0 replies; 10+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-26 11:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Lukas Straub, Daniel P . Berrange, Juan Quintela,
	qemu-devel, Leonardo Bras Soares Passos, Eric Blake

* Peter Xu (peterx@redhat.com) wrote:
> v3:
> - Use WITH_QEMU_LOCK_GUARD() for patch 2 [Eric]
>   (potentially I can also replace other existing uses of qemu_file_lock into
>    WITH_QEMU_LOCK_GUARD, but I decided to took Dave's r-b first and leave that
>    for later)
> - Added r-bs for Dave on patch 2/4
> - Add a comment in patch 5 to explain why it's safe to unregister yank without
>   qemu_file_lock, in postcopy_pause() [Dave]
> 
> v2:
> - Pick r-b for Dave on patch 1/3
> - Move migration_file_get_ioc() from patch 5 to patch 4, meanwhile rename it to
>   qemu_file_get_ioc(). [Dave]
> - Patch 2 "migration: Shutdown src in await_return_path_close_on_source()" is
>   replaced by patch "migration: Make from_dst_file accesses thread-safe" [Dave]
> 
> Patch 1 fixes a possible race that migration thread can accidentally skip
> join() of rp_thread even if the return thread is enabled.  Patch 1 is suspected
> to also be the root cause of the recent hard-to-reproduce migration-test
> failure here reported by PMM:
> 
> https://lore.kernel.org/qemu-devel/YPamXAHwan%2FPPXLf@work-vm/
> 
> I didn't reproduce it myself; but after co-debugged with Dave it's suspected
> that the race of rp_thread could be the cause.  It's not exposed before because
> yank is soo strict on releasing instances, while we're not that strict before,
> and didn't join() on rp_thread wasn't so dangerous after all when migration
> succeeded before.
> 
> Patch 2 fixes another theoretical race on accessing from_dst_file spotted by
> Dave.  I don't think there's known issues with it, but may still worth fixing.
> 
> Patch 3 should be a cleanup on yank that I think would be nice to have.
> 
> Patch 4-5 are further cleanups to remove the ref==1 check in channel_close(),
> finally, as I always thought that's a bit hackish.  So I used explicit
> unregister of the yank function at necessary places to replace that ref==1 one.
> 
> I still think having patch 3-5 altogether would be great, however I think patch
> 1 should still be the most important to be reviewed.  Also it would be great to
> know whether patch 1 could fix the known yank crash.
> 
> Please review, thanks.

Queued with the fix; it survived over a long weekend running about 50k
times OK.

Dave

> Peter Xu (5):
>   migration: Fix missing join() of rp_thread
>   migration: Make from_dst_file accesses thread-safe
>   migration: Introduce migration_ioc_[un]register_yank()
>   migration: Teach QEMUFile to be QIOChannel-aware
>   migration: Move the yank unregister of channel_close out
> 
>  migration/channel.c           | 15 ++-------
>  migration/migration.c         | 57 +++++++++++++++++++++++++++--------
>  migration/migration.h         | 15 +++++++--
>  migration/multifd.c           |  8 ++---
>  migration/qemu-file-channel.c | 11 ++-----
>  migration/qemu-file.c         | 17 ++++++++++-
>  migration/qemu-file.h         |  4 ++-
>  migration/ram.c               |  3 +-
>  migration/savevm.c            | 11 +++++--
>  migration/yank_functions.c    | 42 ++++++++++++++++++++++++++
>  migration/yank_functions.h    |  3 ++
>  11 files changed, 138 insertions(+), 48 deletions(-)
> 
> -- 
> 2.31.1
> 
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-07-26 11:47 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-22 17:58 [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank Peter Xu
2021-07-22 17:58 ` [PATCH v3 1/5] migration: Fix missing join() of rp_thread Peter Xu
2021-07-22 17:58 ` [PATCH v3 2/5] migration: Make from_dst_file accesses thread-safe Peter Xu
2021-07-22 19:32   ` Peter Xu
2021-07-22 17:58 ` [PATCH v3 3/5] migration: Introduce migration_ioc_[un]register_yank() Peter Xu
2021-07-22 17:58 ` [PATCH v3 4/5] migration: Teach QEMUFile to be QIOChannel-aware Peter Xu
2021-07-22 17:58 ` [PATCH v3 5/5] migration: Move the yank unregister of channel_close out Peter Xu
2021-07-22 18:41   ` Dr. David Alan Gilbert
2021-07-24 17:21 ` [PATCH v3 0/5] migrations: Fix potential rare race of migration-test after yank Lukas Straub
2021-07-26 11:45 ` Dr. David Alan Gilbert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.