* [PATCH 0/5] migration: Fix migration state reference counting
@ 2024-01-19 23:39 Fabiano Rosas
  2024-01-19 23:39 ` [PATCH 1/5] migration: Fix use-after-free of migration state object Fabiano Rosas
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-01-19 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu

We currently have a bug when running migration code in bottom
halves. The issue has already been reported in Gitlab[1] and it
started happening very frequently on my machine for some reason.

The issue is that we're dropping the last reference to the
MigrationState object while the cleanup bottom half is still running
and it leads to a use-after-free. More details in the commit message.

This series fixes the issue and does some refactoring around migration
BH scheduling, aiming to consolidate some code so that it is less
error-prone.

1- https://gitlab.com/qemu-project/qemu/-/issues/1969

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1144927625
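
The shape of the bug is the usual "deferred callback outlives the last
reference" pattern. Here is a toy model of what goes wrong (not QEMU
code, all names are made up):

#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    int refcount;
    bool multifd;                   /* stands in for s->capabilities[...] */
} State;

static State *current_state;        /* stands in for current_migration */

static void state_unref(State *s)
{
    if (--s->refcount == 0) {
        free(s);
    }
}

/* plays the role of process_incoming_migration_bh() */
static void pending_bh(void)
{
    if (current_state->multifd) {   /* use-after-free: may read "true" */
        /* ...tears down multifd state that was never initialized... */
    }
}

int main(void)
{
    current_state = calloc(1, sizeof(*current_state));
    current_state->refcount = 1;

    /* the BH is scheduled here, but not dispatched yet... */
    state_unref(current_state);     /* shutdown drops the last reference */
    pending_bh();                   /* ...and the BH only runs now */
    return 0;
}

Patch 1 fixes exactly this by taking a reference when the BH is
scheduled and dropping it when the BH is done.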

Fabiano Rosas (5):
  migration: Fix use-after-free of migration state object
  migration: Take reference to migration state around
    bg_migration_vm_start_bh
  migration: Reference migration state around
    loadvm_postcopy_handle_run_bh
  migration: Add a wrapper to qemu_bh_schedule
  migration: Centralize BH creation and dispatch

 migration/migration.c | 82 +++++++++++++++++++++++++------------------
 migration/migration.h |  5 +--
 migration/savevm.c    |  5 +--
 3 files changed, 49 insertions(+), 43 deletions(-)

-- 
2.35.3




* [PATCH 1/5] migration: Fix use-after-free of migration state object
  2024-01-19 23:39 [PATCH 0/5] migration: Fix migration state reference counting Fabiano Rosas
@ 2024-01-19 23:39 ` Fabiano Rosas
  2024-01-19 23:43   ` Fabiano Rosas
  2024-01-22  9:49   ` Peter Xu
  2024-01-19 23:39 ` [PATCH 2/5] migration: Take reference to migration state around bg_migration_vm_start_bh Fabiano Rosas
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-01-19 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu

We're currently allowing the process_incoming_migration_bh bottom-half
to run without holding a reference to the 'current_migration' object,
which leads to a segmentation fault if the BH is still live after
migration_shutdown() has dropped the last reference to
current_migration.

In my system the bug manifests as migrate_multifd() returning true
when it shouldn't and multifd_load_shutdown() calling
multifd_recv_terminate_threads() which crashes due to an uninitialized
multifd_recv_state.

Fix the issue by holding a reference to the object when scheduling the
BH and dropping it before returning from the BH. The same is already
done for the cleanup_bh at migrate_fd_cleanup_schedule().

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 219447dea1..cf17b68e57 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -648,6 +648,7 @@ static void process_incoming_migration_bh(void *opaque)
                       MIGRATION_STATUS_COMPLETED);
     qemu_bh_delete(mis->bh);
     migration_incoming_state_destroy();
+    object_unref(OBJECT(migrate_get_current()));
 }
 
 static void coroutine_fn
@@ -713,6 +714,7 @@ process_incoming_migration_co(void *opaque)
     }
 
     mis->bh = qemu_bh_new(process_incoming_migration_bh, mis);
+    object_ref(OBJECT(migrate_get_current()));
     qemu_bh_schedule(mis->bh);
     return;
 fail:
-- 
2.35.3




* [PATCH 2/5] migration: Take reference to migration state around bg_migration_vm_start_bh
  2024-01-19 23:39 [PATCH 0/5] migration: Fix migration state reference counting Fabiano Rosas
  2024-01-19 23:39 ` [PATCH 1/5] migration: Fix use-after-free of migration state object Fabiano Rosas
@ 2024-01-19 23:39 ` Fabiano Rosas
  2024-01-19 23:39 ` [PATCH 3/5] migration: Reference migration state around loadvm_postcopy_handle_run_bh Fabiano Rosas
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-01-19 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu

We need to hold a reference to the current_migration object around
async calls to avoid it being freed while still in use.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index cf17b68e57..b1213b59ce 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3382,6 +3382,7 @@ static void bg_migration_vm_start_bh(void *opaque)
 
     vm_resume(s->vm_old_state);
     migration_downtime_end(s);
+    object_unref(OBJECT(s));
 }
 
 /**
@@ -3486,6 +3487,7 @@ static void *bg_migration_thread(void *opaque)
      * writes to virtio VQs memory which is in write-protected region.
      */
     s->vm_start_bh = qemu_bh_new(bg_migration_vm_start_bh, s);
+    object_ref(OBJECT(s));
     qemu_bh_schedule(s->vm_start_bh);
 
     bql_unlock();
-- 
2.35.3




* [PATCH 3/5] migration: Reference migration state around loadvm_postcopy_handle_run_bh
  2024-01-19 23:39 [PATCH 0/5] migration: Fix migration state reference counting Fabiano Rosas
  2024-01-19 23:39 ` [PATCH 1/5] migration: Fix use-after-free of migration state object Fabiano Rosas
  2024-01-19 23:39 ` [PATCH 2/5] migration: Take reference to migration state around bg_migration_vm_start_bh Fabiano Rosas
@ 2024-01-19 23:39 ` Fabiano Rosas
  2024-01-19 23:39 ` [PATCH 4/5] migration: Add a wrapper to qemu_bh_schedule Fabiano Rosas
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-01-19 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu

We need to hold a reference to the current_migration object around
async calls to avoid it being freed while still in use. Even in this
load-side function, we might still use the MigrationState, e.g. to
check for capabilities.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/savevm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index 6410705ebe..93387350c7 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2174,6 +2174,7 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
     qemu_bh_delete(mis->bh);
 
     trace_vmstate_downtime_checkpoint("dst-postcopy-bh-vm-started");
+    object_unref(OBJECT(migrate_get_current()));
 }
 
 /* After all discards we can start running and asking for pages */
@@ -2189,6 +2190,7 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
 
     postcopy_state_set(POSTCOPY_INCOMING_RUNNING);
     mis->bh = qemu_bh_new(loadvm_postcopy_handle_run_bh, mis);
+    object_ref(OBJECT(migrate_get_current()));
     qemu_bh_schedule(mis->bh);
 
     /* We need to finish reading the stream from the package
-- 
2.35.3




* [PATCH 4/5] migration: Add a wrapper to qemu_bh_schedule
  2024-01-19 23:39 [PATCH 0/5] migration: Fix migration state reference counting Fabiano Rosas
                   ` (2 preceding siblings ...)
  2024-01-19 23:39 ` [PATCH 3/5] migration: Reference migration state around loadvm_postcopy_handle_run_bh Fabiano Rosas
@ 2024-01-19 23:39 ` Fabiano Rosas
  2024-01-19 23:39 ` [PATCH 5/5] migration: Centralize BH creation and dispatch Fabiano Rosas
  2024-01-23  2:19 ` [PATCH 0/5] migration: Fix migration state reference counting Peter Xu
  5 siblings, 0 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-01-19 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu

Wrap qemu_bh_schedule() to ensure we always hold a reference to the
current_migration object.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index b1213b59ce..0e7f101d64 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -199,6 +199,16 @@ void migration_object_init(void)
     dirty_bitmap_mig_init();
 }
 
+static void migration_bh_schedule(MigrationState *s, QEMUBH *bh)
+{
+    /*
+     * Ref the state for bh, because it may be called when
+     * there're already no other refs
+     */
+    object_ref(OBJECT(s));
+    qemu_bh_schedule(bh);
+}
+
 void migration_cancel(const Error *error)
 {
     if (error) {
@@ -714,8 +724,7 @@ process_incoming_migration_co(void *opaque)
     }
 
     mis->bh = qemu_bh_new(process_incoming_migration_bh, mis);
-    object_ref(OBJECT(migrate_get_current()));
-    qemu_bh_schedule(mis->bh);
+    migration_bh_schedule(migrate_get_current(), mis->bh);
     return;
 fail:
     migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
@@ -1332,16 +1341,6 @@ static void migrate_fd_cleanup(MigrationState *s)
     yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
 
-static void migrate_fd_cleanup_schedule(MigrationState *s)
-{
-    /*
-     * Ref the state for bh, because it may be called when
-     * there're already no other refs
-     */
-    object_ref(OBJECT(s));
-    qemu_bh_schedule(s->cleanup_bh);
-}
-
 static void migrate_fd_cleanup_bh(void *opaque)
 {
     MigrationState *s = opaque;
@@ -3140,7 +3139,7 @@ static void migration_iteration_finish(MigrationState *s)
         error_report("%s: Unknown ending state %d", __func__, s->state);
         break;
     }
-    migrate_fd_cleanup_schedule(s);
+    migration_bh_schedule(s, s->cleanup_bh);
     bql_unlock();
 }
 
@@ -3171,7 +3170,7 @@ static void bg_migration_iteration_finish(MigrationState *s)
         break;
     }
 
-    migrate_fd_cleanup_schedule(s);
+    migration_bh_schedule(s, s->cleanup_bh);
     bql_unlock();
 }
 
@@ -3487,9 +3486,7 @@ static void *bg_migration_thread(void *opaque)
      * writes to virtio VQs memory which is in write-protected region.
      */
     s->vm_start_bh = qemu_bh_new(bg_migration_vm_start_bh, s);
-    object_ref(OBJECT(s));
-    qemu_bh_schedule(s->vm_start_bh);
-
+    migration_bh_schedule(s, s->vm_start_bh);
     bql_unlock();
 
     while (migration_is_active(s)) {
-- 
2.35.3




* [PATCH 5/5] migration: Centralize BH creation and dispatch
  2024-01-19 23:39 [PATCH 0/5] migration: Fix migration state reference counting Fabiano Rosas
                   ` (3 preceding siblings ...)
  2024-01-19 23:39 ` [PATCH 4/5] migration: Add a wrapper to qemu_bh_schedule Fabiano Rosas
@ 2024-01-19 23:39 ` Fabiano Rosas
  2024-01-23  2:19 ` [PATCH 0/5] migration: Fix migration state reference counting Peter Xu
  5 siblings, 0 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-01-19 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu

Now that the migration state reference counting is correct, further
wrap the bottom half dispatch process to avoid future issues.

Move BH creation and scheduling together and wrap the dispatch with an
intermediary function that will ensure we always keep the ref/unref
balanced.

Also move the responsibility of deleting the BH into the wrapper and
remove the now unnecessary pointers.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 65 +++++++++++++++++++++++++------------------
 migration/migration.h |  5 +---
 migration/savevm.c    |  7 +----
 3 files changed, 40 insertions(+), 37 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 0e7f101d64..d5f705ceef 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -199,8 +199,39 @@ void migration_object_init(void)
     dirty_bitmap_mig_init();
 }
 
-static void migration_bh_schedule(MigrationState *s, QEMUBH *bh)
+typedef struct {
+    QEMUBH *bh;
+    QEMUBHFunc *cb;
+    void *opaque;
+} MigrationBH;
+
+static void migration_bh_dispatch_bh(void *opaque)
 {
+    MigrationState *s = migrate_get_current();
+    MigrationBH *migbh = opaque;
+
+    /* cleanup this BH */
+    qemu_bh_delete(migbh->bh);
+    migbh->bh = NULL;
+
+    /* dispatch the other one */
+    migbh->cb(migbh->opaque);
+    object_unref(OBJECT(s));
+
+    g_free(migbh);
+}
+
+void migration_bh_schedule(QEMUBHFunc *cb, void *opaque)
+{
+    MigrationState *s = migrate_get_current();
+    MigrationBH *migbh = g_new0(MigrationBH, 1);
+    QEMUBH *bh = qemu_bh_new(migration_bh_dispatch_bh, migbh);
+
+    /* Store these to dispatch when the BH runs */
+    migbh->bh = bh;
+    migbh->cb = cb;
+    migbh->opaque = opaque;
+
     /*
      * Ref the state for bh, because it may be called when
      * there're already no other refs
@@ -656,9 +687,7 @@ static void process_incoming_migration_bh(void *opaque)
      */
     migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
                       MIGRATION_STATUS_COMPLETED);
-    qemu_bh_delete(mis->bh);
     migration_incoming_state_destroy();
-    object_unref(OBJECT(migrate_get_current()));
 }
 
 static void coroutine_fn
@@ -723,8 +752,7 @@ process_incoming_migration_co(void *opaque)
         goto fail;
     }
 
-    mis->bh = qemu_bh_new(process_incoming_migration_bh, mis);
-    migration_bh_schedule(migrate_get_current(), mis->bh);
+    migration_bh_schedule(process_incoming_migration_bh, mis);
     return;
 fail:
     migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
@@ -1285,9 +1313,6 @@ void migrate_set_state(int *state, int old_state, int new_state)
 
 static void migrate_fd_cleanup(MigrationState *s)
 {
-    qemu_bh_delete(s->cleanup_bh);
-    s->cleanup_bh = NULL;
-
     g_free(s->hostname);
     s->hostname = NULL;
     json_writer_free(s->vmdesc);
@@ -1343,9 +1368,7 @@ static void migrate_fd_cleanup(MigrationState *s)
 
 static void migrate_fd_cleanup_bh(void *opaque)
 {
-    MigrationState *s = opaque;
-    migrate_fd_cleanup(s);
-    object_unref(OBJECT(s));
+    migrate_fd_cleanup(opaque);
 }
 
 void migrate_set_error(MigrationState *s, const Error *error)
@@ -1568,8 +1591,6 @@ int migrate_init(MigrationState *s, Error **errp)
      * parameters/capabilities that the user set, and
      * locks.
      */
-    s->cleanup_bh = 0;
-    s->vm_start_bh = 0;
     s->to_dst_file = NULL;
     s->state = MIGRATION_STATUS_NONE;
     s->rp_state.from_dst_file = NULL;
@@ -3139,7 +3160,8 @@ static void migration_iteration_finish(MigrationState *s)
         error_report("%s: Unknown ending state %d", __func__, s->state);
         break;
     }
-    migration_bh_schedule(s, s->cleanup_bh);
+
+    migration_bh_schedule(migrate_fd_cleanup_bh, s);
     bql_unlock();
 }
 
@@ -3170,7 +3192,7 @@ static void bg_migration_iteration_finish(MigrationState *s)
         break;
     }
 
-    migration_bh_schedule(s, s->cleanup_bh);
+    migration_bh_schedule(migrate_fd_cleanup_bh, s);
     bql_unlock();
 }
 
@@ -3376,12 +3398,8 @@ static void bg_migration_vm_start_bh(void *opaque)
 {
     MigrationState *s = opaque;
 
-    qemu_bh_delete(s->vm_start_bh);
-    s->vm_start_bh = NULL;
-
     vm_resume(s->vm_old_state);
     migration_downtime_end(s);
-    object_unref(OBJECT(s));
 }
 
 /**
@@ -3485,8 +3503,7 @@ static void *bg_migration_thread(void *opaque)
      * calling VM state change notifiers from vm_start() would initiate
      * writes to virtio VQs memory which is in write-protected region.
      */
-    s->vm_start_bh = qemu_bh_new(bg_migration_vm_start_bh, s);
-    migration_bh_schedule(s, s->vm_start_bh);
+    migration_bh_schedule(bg_migration_vm_start_bh, s);
     bql_unlock();
 
     while (migration_is_active(s)) {
@@ -3542,12 +3559,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
     migrate_error_free(s);
 
     s->expected_downtime = migrate_downtime_limit();
-    if (resume) {
-        assert(s->cleanup_bh);
-    } else {
-        assert(!s->cleanup_bh);
-        s->cleanup_bh = qemu_bh_new(migrate_fd_cleanup_bh, s);
-    }
     if (error_in) {
         migrate_fd_error(s, error_in);
         if (resume) {
diff --git a/migration/migration.h b/migration/migration.h
index 17972dac34..db98979e6e 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -159,8 +159,6 @@ struct MigrationIncomingState {
     /* PostCopyFD's for external userfaultfds & handlers of shared memory */
     GArray   *postcopy_remote_fds;
 
-    QEMUBH *bh;
-
     int state;
 
     /*
@@ -255,8 +253,6 @@ struct MigrationState {
 
     /*< public >*/
     QemuThread thread;
-    QEMUBH *vm_start_bh;
-    QEMUBH *cleanup_bh;
     /* Protected by qemu_file_lock */
     QEMUFile *to_dst_file;
     /* Postcopy specific transfer channel */
@@ -528,6 +524,7 @@ int foreach_not_ignored_block(RAMBlockIterFunc func, void *opaque);
 void migration_make_urgent_request(void);
 void migration_consume_urgent_request(void);
 bool migration_rate_limit(void);
+void migration_bh_schedule(QEMUBHFunc *cb, void *opaque);
 void migration_cancel(const Error *error);
 
 void migration_populate_vfio_info(MigrationInfo *info);
diff --git a/migration/savevm.c b/migration/savevm.c
index 93387350c7..d612c8a902 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2171,10 +2171,7 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
         runstate_set(RUN_STATE_PAUSED);
     }
 
-    qemu_bh_delete(mis->bh);
-
     trace_vmstate_downtime_checkpoint("dst-postcopy-bh-vm-started");
-    object_unref(OBJECT(migrate_get_current()));
 }
 
 /* After all discards we can start running and asking for pages */
@@ -2189,9 +2186,7 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
     }
 
     postcopy_state_set(POSTCOPY_INCOMING_RUNNING);
-    mis->bh = qemu_bh_new(loadvm_postcopy_handle_run_bh, mis);
-    object_ref(OBJECT(migrate_get_current()));
-    qemu_bh_schedule(mis->bh);
+    migration_bh_schedule(loadvm_postcopy_handle_run_bh, mis);
 
     /* We need to finish reading the stream from the package
      * and also stop reading anything more from the stream that loaded the
-- 
2.35.3




* Re: [PATCH 1/5] migration: Fix use-after-free of migration state object
  2024-01-19 23:39 ` [PATCH 1/5] migration: Fix use-after-free of migration state object Fabiano Rosas
@ 2024-01-19 23:43   ` Fabiano Rosas
  2024-01-22  9:49   ` Peter Xu
  1 sibling, 0 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-01-19 23:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu

Fabiano Rosas <farosas@suse.de> writes:

> We're currently allowing the process_incoming_migration_bh bottom-half
> to run without holding a reference to the 'current_migration' object,
> which leads to a segmentation fault if the BH is still live after
> migration_shutdown() has dropped the last reference to
> current_migration.
>
> In my system the bug manifests as migrate_multifd() returning true
> when it shouldn't and multifd_load_shutdown() calling
> multifd_recv_terminate_threads() which crashes due to an uninitialized
> multifd_recv_state.
>
> Fix the issue by holding a reference to the object when scheduling the
> BH and dropping it before returning from the BH. The same is already
> done for the cleanup_bh at migrate_fd_cleanup_schedule().
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1969



* Re: [PATCH 1/5] migration: Fix use-after-free of migration state object
  2024-01-19 23:39 ` [PATCH 1/5] migration: Fix use-after-free of migration state object Fabiano Rosas
  2024-01-19 23:43   ` Fabiano Rosas
@ 2024-01-22  9:49   ` Peter Xu
  2024-01-22 10:21     ` Peter Xu
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Xu @ 2024-01-22  9:49 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel

On Fri, Jan 19, 2024 at 08:39:18PM -0300, Fabiano Rosas wrote:
> We're currently allowing the process_incoming_migration_bh bottom-half
> to run without holding a reference to the 'current_migration' object,
> which leads to a segmentation fault if the BH is still live after
> migration_shutdown() has dropped the last reference to
> current_migration.
> 
> In my system the bug manifests as migrate_multifd() returning true
> when it shouldn't and multifd_load_shutdown() calling
> multifd_recv_terminate_threads() which crashes due to an uninitialized
> multifd_recv_state.
> 
> Fix the issue by holding a reference to the object when scheduling the
> BH and dropping it before returning from the BH. The same is already
> done for the cleanup_bh at migrate_fd_cleanup_schedule().
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/migration.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 219447dea1..cf17b68e57 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -648,6 +648,7 @@ static void process_incoming_migration_bh(void *opaque)
>                        MIGRATION_STATUS_COMPLETED);
>      qemu_bh_delete(mis->bh);
>      migration_incoming_state_destroy();
> +    object_unref(OBJECT(migrate_get_current()));
>  }
>  
>  static void coroutine_fn
> @@ -713,6 +714,7 @@ process_incoming_migration_co(void *opaque)
>      }
>  
>      mis->bh = qemu_bh_new(process_incoming_migration_bh, mis);
> +    object_ref(OBJECT(migrate_get_current()));
>      qemu_bh_schedule(mis->bh);
>      return;
>  fail:
> -- 
> 2.35.3
> 

I know I missed something, but I'd better ask: use-after-free needs to
happen only after migration_shutdown() / qemu_cleanup(), am I right?

If so, shouldn't qemu_main_loop() already returned?  Then how could any BH
keep running (including migration's) without qemu_main_loop()?

-- 
Peter Xu




* Re: [PATCH 1/5] migration: Fix use-after-free of migration state object
  2024-01-22  9:49   ` Peter Xu
@ 2024-01-22 10:21     ` Peter Xu
  2024-01-22 16:55       ` Fabiano Rosas
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2024-01-22 10:21 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel

On Mon, Jan 22, 2024 at 05:49:01PM +0800, Peter Xu wrote:
> On Fri, Jan 19, 2024 at 08:39:18PM -0300, Fabiano Rosas wrote:
> > We're currently allowing the process_incoming_migration_bh bottom-half
> > to run without holding a reference to the 'current_migration' object,
> > which leads to a segmentation fault if the BH is still live after
> > migration_shutdown() has dropped the last reference to
> > current_migration.
> > 
> > In my system the bug manifests as migrate_multifd() returning true
> > when it shouldn't and multifd_load_shutdown() calling
> > multifd_recv_terminate_threads() which crashes due to an uninitialized
> > multifd_recv_state.
> > 
> > Fix the issue by holding a reference to the object when scheduling the
> > BH and dropping it before returning from the BH. The same is already
> > done for the cleanup_bh at migrate_fd_cleanup_schedule().
> > 
> > Signed-off-by: Fabiano Rosas <farosas@suse.de>
> > ---
> >  migration/migration.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 219447dea1..cf17b68e57 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -648,6 +648,7 @@ static void process_incoming_migration_bh(void *opaque)
> >                        MIGRATION_STATUS_COMPLETED);
> >      qemu_bh_delete(mis->bh);
> >      migration_incoming_state_destroy();
> > +    object_unref(OBJECT(migrate_get_current()));
> >  }
> >  
> >  static void coroutine_fn
> > @@ -713,6 +714,7 @@ process_incoming_migration_co(void *opaque)
> >      }
> >  
> >      mis->bh = qemu_bh_new(process_incoming_migration_bh, mis);
> > +    object_ref(OBJECT(migrate_get_current()));
> >      qemu_bh_schedule(mis->bh);
> >      return;
> >  fail:
> > -- 
> > 2.35.3
> > 
> 
> I know I missed something, but I'd better ask: use-after-free needs to
> happen only after migration_shutdown() / qemu_cleanup(), am I right?
> 
> If so, shouldn't qemu_main_loop() already returned?  Then how could any BH
> keep running (including migration's) without qemu_main_loop()?

Hmm, I saw a pretty old stack mentioned in commit fd392cfa8e6:

    Original output:
    qemu-system-x86_64: terminating on signal 15 from pid 31980 (<unknown process>)
    =================================================================
    ==31958==ERROR: AddressSanitizer: heap-use-after-free on address 0x61900001d210
      at pc 0x555558a535ca bp 0x7fffffffb190 sp 0x7fffffffb188
    READ of size 8 at 0x61900001d210 thread T0 (qemu-vm-0)
        #0 0x555558a535c9 in migrate_fd_cleanup migration/migration.c:1502:23
        #1 0x5555594fde0a in aio_bh_call util/async.c:90:5
        #2 0x5555594fe522 in aio_bh_poll util/async.c:118:13
        #3 0x555559524783 in aio_poll util/aio-posix.c:725:17
        #4 0x555559504fb3 in aio_wait_bh_oneshot util/aio-wait.c:71:5
        #5 0x5555573bddf6 in virtio_blk_data_plane_stop
          hw/block/dataplane/virtio-blk.c:282:5
        #6 0x5555589d5c09 in virtio_bus_stop_ioeventfd hw/virtio/virtio-bus.c:246:9
        #7 0x5555589e9917 in virtio_pci_stop_ioeventfd hw/virtio/virtio-pci.c:287:5
        #8 0x5555589e22bf in virtio_pci_vmstate_change hw/virtio/virtio-pci.c:1072:9
        #9 0x555557628931 in virtio_vmstate_change hw/virtio/virtio.c:2257:9
        #10 0x555557c36713 in vm_state_notify vl.c:1605:9
        #11 0x55555716ef53 in do_vm_stop cpus.c:1074:9
        #12 0x55555716eeff in vm_shutdown cpus.c:1092:12
        #13 0x555557c4283e in main vl.c:4617:5
        #14 0x7fffdfdb482f in __libc_start_main
          (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
        #15 0x555556ecb118 in _start (x86_64-softmmu/qemu-system-x86_64+0x1977118)

Would that be the same case that you mentioned here?  As vm_shutdown() is
indeed after migration_shutdown().

Even if so, two follow up comments..

(1) How did it help if process_incoming_migration_bh() took ref on
    MigrationState*?  I didn't even see it touching the object (instead, it
    uses the incoming object)?

(2) This is what I'm just wondering.. on whether we should clear
    current_migration to NULL in migration_shutdown() after we unref it.
    Maybe it'll make such issues abort in an even clearer way.

-- 
Peter Xu




* Re: [PATCH 1/5] migration: Fix use-after-free of migration state object
  2024-01-22 10:21     ` Peter Xu
@ 2024-01-22 16:55       ` Fabiano Rosas
  2024-01-23  1:56         ` Peter Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Fabiano Rosas @ 2024-01-22 16:55 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel

Peter Xu <peterx@redhat.com> writes:

> On Mon, Jan 22, 2024 at 05:49:01PM +0800, Peter Xu wrote:
>> On Fri, Jan 19, 2024 at 08:39:18PM -0300, Fabiano Rosas wrote:
>> > We're currently allowing the process_incoming_migration_bh bottom-half
>> > to run without holding a reference to the 'current_migration' object,
>> > which leads to a segmentation fault if the BH is still live after
>> > migration_shutdown() has dropped the last reference to
>> > current_migration.
>> > 
>> > In my system the bug manifests as migrate_multifd() returning true
>> > when it shouldn't and multifd_load_shutdown() calling
>> > multifd_recv_terminate_threads() which crashes due to an uninitialized
>> > multifd_recv_state.
>> > 
>> > Fix the issue by holding a reference to the object when scheduling the
>> > BH and dropping it before returning from the BH. The same is already
>> > done for the cleanup_bh at migrate_fd_cleanup_schedule().
>> > 
>> > Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> > ---
>> >  migration/migration.c | 2 ++
>> >  1 file changed, 2 insertions(+)
>> > 
>> > diff --git a/migration/migration.c b/migration/migration.c
>> > index 219447dea1..cf17b68e57 100644
>> > --- a/migration/migration.c
>> > +++ b/migration/migration.c
>> > @@ -648,6 +648,7 @@ static void process_incoming_migration_bh(void *opaque)
>> >                        MIGRATION_STATUS_COMPLETED);
>> >      qemu_bh_delete(mis->bh);
>> >      migration_incoming_state_destroy();
>> > +    object_unref(OBJECT(migrate_get_current()));
>> >  }
>> >  
>> >  static void coroutine_fn
>> > @@ -713,6 +714,7 @@ process_incoming_migration_co(void *opaque)
>> >      }
>> >  
>> >      mis->bh = qemu_bh_new(process_incoming_migration_bh, mis);
>> > +    object_ref(OBJECT(migrate_get_current()));
>> >      qemu_bh_schedule(mis->bh);
>> >      return;
>> >  fail:
>> > -- 
>> > 2.35.3
>> > 
>> 
> I know I missed something, but I'd better ask: use-after-free needs to
> happen only after migration_shutdown() / qemu_cleanup(), am I right?
> 
> If so, shouldn't qemu_main_loop() already returned?  Then how could any BH
> keep running (including migration's) without qemu_main_loop()?

The aio_bh_poll() -> aio_bh_call() sequence doesn't depend on
qemu_main_loop(). In the stack you found below it happens after
aio_wait_bh_oneshot() -> AIO_WAIT_WHILE -> aio_poll().

The stack I see is:

#0  0x00005625c97bffc0 in multifd_recv_terminate_threads (err=0x0) at ../migration/multifd.c:992
#1  0x00005625c97c0086 in multifd_load_shutdown () at ../migration/multifd.c:1012
#2  0x00005625c97b6163 in process_incoming_migration_bh (opaque=0x5625cbce59f0) at ../migration/migration.c:624
#3  0x00005625c9c740c2 in aio_bh_call (bh=0x5625cc9e1320) at ../util/async.c:169
#4  0x00005625c9c741de in aio_bh_poll (ctx=0x5625cba2a670) at ../util/async.c:216

here^

#5  0x00005625c9af0599 in bdrv_graph_wrunlock () at ../block/graph-lock.c:170
#6  0x00005625c9aba8bd in bdrv_close (bs=0x5625cbcb3d80) at ../block.c:5099
#7  0x00005625c9abb71a in bdrv_delete (bs=0x5625cbcb3d80) at ../block.c:5501
#8  0x00005625c9abe840 in bdrv_unref (bs=0x5625cbcb3d80) at ../block.c:7075
#9  0x00005625c9abe865 in bdrv_schedule_unref_bh (opaque=0x5625cbcb3d80) at ../block.c:7083
#10 0x00005625c9c740c2 in aio_bh_call (bh=0x5625cbde09d0) at ../util/async.c:169
#11 0x00005625c9c741de in aio_bh_poll (ctx=0x5625cba2a670) at ../util/async.c:216
#12 0x00005625c9af0599 in bdrv_graph_wrunlock () at ../block/graph-lock.c:170
#13 0x00005625c9ae05db in blk_remove_bs (blk=0x5625cbcc1070) at ../block/block-backend.c:907
#14 0x00005625c9adfb1b in blk_remove_all_bs () at ../block/block-backend.c:571
#15 0x00005625c9abab0d in bdrv_close_all () at ../block.c:5146
#16 0x00005625c97892e4 in qemu_cleanup (status=0) at ../system/runstate.c:880
#17 0x00005625c9a58081 in qemu_default_main () at ../system/main.c:38
#18 0x00005625c9a580af in main (argc=35, argv=0x7ffe30ab0578) at ../system/main.c:48

> Hmm, I saw a pretty old stack mentioned in commit fd392cfa8e6:
>
>     Original output:
>     qemu-system-x86_64: terminating on signal 15 from pid 31980 (<unknown process>)
>     =================================================================
>     ==31958==ERROR: AddressSanitizer: heap-use-after-free on address 0x61900001d210
>       at pc 0x555558a535ca bp 0x7fffffffb190 sp 0x7fffffffb188
>     READ of size 8 at 0x61900001d210 thread T0 (qemu-vm-0)
>         #0 0x555558a535c9 in migrate_fd_cleanup migration/migration.c:1502:23
>         #1 0x5555594fde0a in aio_bh_call util/async.c:90:5
>         #2 0x5555594fe522 in aio_bh_poll util/async.c:118:13
>         #3 0x555559524783 in aio_poll util/aio-posix.c:725:17
>         #4 0x555559504fb3 in aio_wait_bh_oneshot util/aio-wait.c:71:5
>         #5 0x5555573bddf6 in virtio_blk_data_plane_stop
>           hw/block/dataplane/virtio-blk.c:282:5
>         #6 0x5555589d5c09 in virtio_bus_stop_ioeventfd hw/virtio/virtio-bus.c:246:9
>         #7 0x5555589e9917 in virtio_pci_stop_ioeventfd hw/virtio/virtio-pci.c:287:5
>         #8 0x5555589e22bf in virtio_pci_vmstate_change hw/virtio/virtio-pci.c:1072:9
>         #9 0x555557628931 in virtio_vmstate_change hw/virtio/virtio.c:2257:9
>         #10 0x555557c36713 in vm_state_notify vl.c:1605:9
>         #11 0x55555716ef53 in do_vm_stop cpus.c:1074:9
>         #12 0x55555716eeff in vm_shutdown cpus.c:1092:12
>         #13 0x555557c4283e in main vl.c:4617:5
>         #14 0x7fffdfdb482f in __libc_start_main
>           (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
>         #15 0x555556ecb118 in _start (x86_64-softmmu/qemu-system-x86_64+0x1977118)
>
> Would that be the same case that you mentioned here?  As vm_shutdown() is
> indeed after migration_shutdown().
>
> Even if so, two follow up comments..
>
> (1) How did it help if process_incoming_migration_bh() took ref on
>     MigrationState*?  I didn't even see it touching the object (instead, it
>     uses the incoming object)?

We touch MigrationState every time we check for a capability. See the
stack I posted above: process_incoming_migration_bh() ->
multifd_load_shutdown().

void multifd_load_shutdown(void)
{
    if (migrate_multifd()) {        <-- HERE
        multifd_recv_terminate_threads(NULL);
    }
}

The bug reproduces *without* multifd, because that check passes and we
go into multifd code that has not been initialized.
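
For context, migrate_multifd() is (roughly, as in migration/options.c)
just a read of the capabilities array on the global MigrationState:

bool migrate_multifd(void)
{
    MigrationState *s = migrate_get_current();

    return s->capabilities[MIGRATION_CAPABILITY_MULTIFD];
}

Once 's' points at freed memory that read can return anything, which is
how we end up in multifd teardown for a migration that never enabled
multifd.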

side note: we should probably introduce a MigrationOutgoingState to pair
with MigrationIncomingState and have both inside a global MigrationState
that contains the common elements. If you agree I can add this to our
todo list.

> (2) This is what I'm just wondering.. on whether we should clear
>     current_migration to NULL in migration_shutdown() after we unref it.
>     Maybe it'll make such issues abort in an even clearer way.

It hits the assert at migrate_get_current():

#4  0x00005643006e22ae in migrate_get_current () at ../migration/migration.c:246
#5  0x00005643006f0415 in migrate_late_block_activate () at ../migration/options.c:275
#6  0x00005643006e30e0 in process_incoming_migration_bh (opaque=0x564303b279f0) at ../migration/migration.c:603
#7  0x0000564300ba10cd in aio_bh_call (bh=0x564304823320) at ../util/async.c:169
#8  0x0000564300ba11e9 in aio_bh_poll (ctx=0x56430386c670) at ../util/async.c:216
...
#20 0x00005643006b62e4 in qemu_cleanup (status=0) at ../system/runstate.c:880
#21 0x000056430098508c in qemu_default_main () at ../system/main.c:38
#22 0x00005643009850ba in main (argc=35, argv=0x7ffc8bf703c8) at ../system/main.c:48

Note that this breaks at migrate_late_block_activate(), which is even
earlier than the bug scenario at multifd_load_shutdown().
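
For reference, the assert it trips is the one in migrate_get_current(),
roughly:

MigrationState *migrate_get_current(void)
{
    /* only valid while current_migration is alive */
    assert(current_migration);
    return current_migration;
}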

However we cannot set NULL currently because the BHs are still running
after migration_shutdown(). I don't know of a safe way to cancel/delete
a BH after it has (potentially) been scheduled already.



* Re: [PATCH 1/5] migration: Fix use-after-free of migration state object
  2024-01-22 16:55       ` Fabiano Rosas
@ 2024-01-23  1:56         ` Peter Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2024-01-23  1:56 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel

On Mon, Jan 22, 2024 at 01:55:45PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Jan 22, 2024 at 05:49:01PM +0800, Peter Xu wrote:
> >> On Fri, Jan 19, 2024 at 08:39:18PM -0300, Fabiano Rosas wrote:
> >> > We're currently allowing the process_incoming_migration_bh bottom-half
> >> > to run without holding a reference to the 'current_migration' object,
> >> > which leads to a segmentation fault if the BH is still live after
> >> > migration_shutdown() has dropped the last reference to
> >> > current_migration.
> >> > 
> >> > In my system the bug manifests as migrate_multifd() returning true
> >> > when it shouldn't and multifd_load_shutdown() calling
> >> > multifd_recv_terminate_threads() which crashes due to an uninitialized
> >> > multifd_recv_state.
> >> > 
> >> > Fix the issue by holding a reference to the object when scheduling the
> >> > BH and dropping it before returning from the BH. The same is already
> >> > done for the cleanup_bh at migrate_fd_cleanup_schedule().
> >> > 
> >> > Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> > ---
> >> >  migration/migration.c | 2 ++
> >> >  1 file changed, 2 insertions(+)
> >> > 
> >> > diff --git a/migration/migration.c b/migration/migration.c
> >> > index 219447dea1..cf17b68e57 100644
> >> > --- a/migration/migration.c
> >> > +++ b/migration/migration.c
> >> > @@ -648,6 +648,7 @@ static void process_incoming_migration_bh(void *opaque)
> >> >                        MIGRATION_STATUS_COMPLETED);
> >> >      qemu_bh_delete(mis->bh);
> >> >      migration_incoming_state_destroy();
> >> > +    object_unref(OBJECT(migrate_get_current()));
> >> >  }
> >> >  
> >> >  static void coroutine_fn
> >> > @@ -713,6 +714,7 @@ process_incoming_migration_co(void *opaque)
> >> >      }
> >> >  
> >> >      mis->bh = qemu_bh_new(process_incoming_migration_bh, mis);
> >> > +    object_ref(OBJECT(migrate_get_current()));
> >> >      qemu_bh_schedule(mis->bh);
> >> >      return;
> >> >  fail:
> >> > -- 
> >> > 2.35.3
> >> > 
> >> 
> > I know I missed something, but I'd better ask: use-after-free needs to
> > happen only after migration_shutdown() / qemu_cleanup(), am I right?
> > 
> > If so, shouldn't qemu_main_loop() already returned?  Then how could any BH
> > keep running (including migration's) without qemu_main_loop()?
> 
> The aio_bh_poll() -> aio_bh_call() sequence doesn't depend on
> qemu_main_loop(). In the stack you found below it happens after
> aio_wait_bh_oneshot() -> AIO_WAIT_WHILE -> aio_poll().
> 
> The stack I see is:
> 
> #0  0x00005625c97bffc0 in multifd_recv_terminate_threads (err=0x0) at ../migration/multifd.c:992
> #1  0x00005625c97c0086 in multifd_load_shutdown () at ../migration/multifd.c:1012
> #2  0x00005625c97b6163 in process_incoming_migration_bh (opaque=0x5625cbce59f0) at ../migration/migration.c:624
> #3  0x00005625c9c740c2 in aio_bh_call (bh=0x5625cc9e1320) at ../util/async.c:169
> #4  0x00005625c9c741de in aio_bh_poll (ctx=0x5625cba2a670) at ../util/async.c:216
> 
> here^
> 
> #5  0x00005625c9af0599 in bdrv_graph_wrunlock () at ../block/graph-lock.c:170
> #6  0x00005625c9aba8bd in bdrv_close (bs=0x5625cbcb3d80) at ../block.c:5099
> #7  0x00005625c9abb71a in bdrv_delete (bs=0x5625cbcb3d80) at ../block.c:5501
> #8  0x00005625c9abe840 in bdrv_unref (bs=0x5625cbcb3d80) at ../block.c:7075
> #9  0x00005625c9abe865 in bdrv_schedule_unref_bh (opaque=0x5625cbcb3d80) at ../block.c:7083
> #10 0x00005625c9c740c2 in aio_bh_call (bh=0x5625cbde09d0) at ../util/async.c:169
> #11 0x00005625c9c741de in aio_bh_poll (ctx=0x5625cba2a670) at ../util/async.c:216
> #12 0x00005625c9af0599 in bdrv_graph_wrunlock () at ../block/graph-lock.c:170
> #13 0x00005625c9ae05db in blk_remove_bs (blk=0x5625cbcc1070) at ../block/block-backend.c:907
> #14 0x00005625c9adfb1b in blk_remove_all_bs () at ../block/block-backend.c:571
> #15 0x00005625c9abab0d in bdrv_close_all () at ../block.c:5146
> #16 0x00005625c97892e4 in qemu_cleanup (status=0) at ../system/runstate.c:880
> #17 0x00005625c9a58081 in qemu_default_main () at ../system/main.c:38
> #18 0x00005625c9a580af in main (argc=35, argv=0x7ffe30ab0578) at ../system/main.c:48
> 
> > Hmm, I saw a pretty old stack mentioned in commit fd392cfa8e6:
> >
> >     Original output:
> >     qemu-system-x86_64: terminating on signal 15 from pid 31980 (<unknown process>)
> >     =================================================================
> >     ==31958==ERROR: AddressSanitizer: heap-use-after-free on address 0x61900001d210
> >       at pc 0x555558a535ca bp 0x7fffffffb190 sp 0x7fffffffb188
> >     READ of size 8 at 0x61900001d210 thread T0 (qemu-vm-0)
> >         #0 0x555558a535c9 in migrate_fd_cleanup migration/migration.c:1502:23
> >         #1 0x5555594fde0a in aio_bh_call util/async.c:90:5
> >         #2 0x5555594fe522 in aio_bh_poll util/async.c:118:13
> >         #3 0x555559524783 in aio_poll util/aio-posix.c:725:17
> >         #4 0x555559504fb3 in aio_wait_bh_oneshot util/aio-wait.c:71:5
> >         #5 0x5555573bddf6 in virtio_blk_data_plane_stop
> >           hw/block/dataplane/virtio-blk.c:282:5
> >         #6 0x5555589d5c09 in virtio_bus_stop_ioeventfd hw/virtio/virtio-bus.c:246:9
> >         #7 0x5555589e9917 in virtio_pci_stop_ioeventfd hw/virtio/virtio-pci.c:287:5
> >         #8 0x5555589e22bf in virtio_pci_vmstate_change hw/virtio/virtio-pci.c:1072:9
> >         #9 0x555557628931 in virtio_vmstate_change hw/virtio/virtio.c:2257:9
> >         #10 0x555557c36713 in vm_state_notify vl.c:1605:9
> >         #11 0x55555716ef53 in do_vm_stop cpus.c:1074:9
> >         #12 0x55555716eeff in vm_shutdown cpus.c:1092:12
> >         #13 0x555557c4283e in main vl.c:4617:5
> >         #14 0x7fffdfdb482f in __libc_start_main
> >           (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
> >         #15 0x555556ecb118 in _start (x86_64-softmmu/qemu-system-x86_64+0x1977118)
> >
> > Would that be the same case that you mentioned here?  As vm_shutdown() is
> > indeed after migration_shutdown().
> >
> > Even if so, two follow up comments..
> >
> > (1) How did it help if process_incoming_migration_bh() took ref on
> >     MigrationState*?  I didn't even see it touching the object (instead, it
> >     uses the incoming object)?
> 
> We touch MigrationState every time we check for a capability. See the
> stack I posted above: process_incoming_migration_bh() ->
> multifd_load_shutdown().
> 
> void multifd_load_shutdown(void)
> {
>     if (migrate_multifd()) {        <-- HERE
>         multifd_recv_terminate_threads(NULL);
>     }
> }
> 
> The bug reproduces *without* multifd, because that check passes and we
> go into multifd code that has not been initialized.
> 
> side note: we should probably introduce a MigrationOutgoingState to pair
> with MigrationIncomingState and have both inside a global MigrationState
> that contains the common elements. If you agree I can add this to our
> todo list.

Yep, feel free to add an entry.

Though instead of MigrationOutgoingState, maybe it's easier to keep
MigrationState to be the "outgoing state", then move both caps/params (and
anything shared between src/dst) out of MigrationState?  That can be a
separate struct, e.g., MigrationGlobals.
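
Something like this, perhaps (just a sketch, the field list and the
MigrationGlobals name are of course up for grabs):

typedef struct {
    bool capabilities[MIGRATION_CAPABILITY__MAX];
    MigrationParameters parameters;
    /* ...whatever else is shared between src and dst... */
} MigrationGlobals;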

> 
> > (2) This is what I'm just wondering.. on whether we should clear
> >     current_migration to NULL in migration_shutdown() after we unref it.
> >     Maybe it'll make such issues abort in an even clearer way.
> 
> It hits the assert at migrate_get_current():
> 
> #4  0x00005643006e22ae in migrate_get_current () at ../migration/migration.c:246
> #5  0x00005643006f0415 in migrate_late_block_activate () at ../migration/options.c:275
> #6  0x00005643006e30e0 in process_incoming_migration_bh (opaque=0x564303b279f0) at ../migration/migration.c:603
> #7  0x0000564300ba10cd in aio_bh_call (bh=0x564304823320) at ../util/async.c:169
> #8  0x0000564300ba11e9 in aio_bh_poll (ctx=0x56430386c670) at ../util/async.c:216
> ...
> #20 0x00005643006b62e4 in qemu_cleanup (status=0) at ../system/runstate.c:880
> #21 0x000056430098508c in qemu_default_main () at ../system/main.c:38
> #22 0x00005643009850ba in main (argc=35, argv=0x7ffc8bf703c8) at ../system/main.c:48
> 
> Note that this breaks at migrate_late_block_activate(), which is even
> earlier than the bug scenario at multifd_load_shutdown().
> 
> However we cannot set NULL currently because the BHs are still running
> after migration_shutdown(). I don't know of a safe way to cancel/delete
> a BH after it has (potentially) been scheduled already.

Indeed..  I am not sure whether migration can avoid using BHs some day.
E.g., I am not sure whether those tasks can be done synchronously under bql
already.

-- 
Peter Xu




* Re: [PATCH 0/5] migration: Fix migration state reference counting
  2024-01-19 23:39 [PATCH 0/5] migration: Fix migration state reference counting Fabiano Rosas
                   ` (4 preceding siblings ...)
  2024-01-19 23:39 ` [PATCH 5/5] migration: Centralize BH creation and dispatch Fabiano Rosas
@ 2024-01-23  2:19 ` Peter Xu
  5 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2024-01-23  2:19 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel

On Fri, Jan 19, 2024 at 08:39:17PM -0300, Fabiano Rosas wrote:
> We currently have a bug when running migration code in bottom
> halves. The issue has already been reported in Gitlab[1] and it
> started happening very frequently on my machine for some reason.
> 
> The issue is that we're dropping the last reference to the
> MigrationState object while the cleanup bottom half is still running
> and it leads to a use-after-free. More details in the commit message.
> 
> This series fixes the issue and does some refactoring around migration
> BH scheduling, aiming to consolidate some code so that it is less
> error-prone.
> 
> 1- https://gitlab.com/qemu-project/qemu/-/issues/1969
> 
> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1144927625
> 
> Fabiano Rosas (5):
>   migration: Fix use-after-free of migration state object
>   migration: Take reference to migration state around
>     bg_migration_vm_start_bh
>   migration: Reference migration state around
>     loadvm_postcopy_handle_run_bh
>   migration: Add a wrapper to qemu_bh_schedule
>   migration: Centralize BH creation and dispatch
> 
>  migration/migration.c | 82 +++++++++++++++++++++++++------------------
>  migration/migration.h |  5 +--
>  migration/savevm.c    |  5 +--
>  3 files changed, 49 insertions(+), 43 deletions(-)

Looks all good now, queued.  I'll amend the "Resolves:" line in patch 1.

-- 
Peter Xu



