* [PATCH 0/6] colo: migration related bugfixes
@ 2020-05-11 11:10 Lukas Straub
  2020-05-11 11:10 ` [PATCH 1/6] migration/colo.c: Use event instead of semaphore Lukas Straub
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Lukas Straub @ 2020-05-11 11:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

Hello Everyone,
Here are fixes for bugs that I found in my tests.

Regards,
Lukas Straub

Lukas Straub (6):
  migration/colo.c: Use event instead of semaphore
  migration/colo.c: Use cpu_synchronize_all_states()
  migration/colo.c: Flush ram cache only after receiving device state
  migration/colo.c: Relaunch failover even if there was an error
  migration/qemu-file.c: Don't ratelimit a shutdown fd
  migration/colo.c: Move colo_notify_compares_event to the right place

 migration/colo.c      | 39 ++++++++++++++++++++++++---------------
 migration/migration.h |  4 ++--
 migration/qemu-file.c |  2 +-
 migration/ram.c       |  5 +----
 migration/ram.h       |  1 +
 5 files changed, 29 insertions(+), 22 deletions(-)

-- 
2.20.1

* [PATCH 1/6] migration/colo.c: Use event instead of semaphore
  2020-05-11 11:10 [PATCH 0/6] colo: migration related bugfixes Lukas Straub
@ 2020-05-11 11:10 ` Lukas Straub
  2020-05-13 11:31   ` Re: " Zhanghailiang
  2020-05-11 11:10 ` [PATCH 2/6] migration/colo.c: Use cpu_synchronize_all_states() Lukas Straub
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Lukas Straub @ 2020-05-11 11:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

If multiple packets miscompare in a short timeframe, the semaphore
value will be increased multiple times. This causes multiple
checkpoints even if one would be sufficient.

Fix this by using an event instead of a semaphore for triggering
checkpoints. Now, checkpoint requests will be ignored until the
checkpoint event is sent to colo-compare (which releases the
miscompared packets).
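
To illustrate the difference (a minimal sketch, assuming the usual
QemuSemaphore/QemuEvent semantics; sem and ev stand in for the fields
in MigrationState):

    /* A semaphore counts every post, so N miscompares cost N checkpoints: */
    qemu_sem_post(&sem);    /* miscompare #1 */
    qemu_sem_post(&sem);    /* miscompare #2 */
    qemu_sem_wait(&sem);    /* -> checkpoint */
    qemu_sem_wait(&sem);    /* -> redundant second checkpoint */

    /* An event coalesces: sets before the wait wake it only once: */
    qemu_event_set(&ev);    /* miscompare #1 */
    qemu_event_set(&ev);    /* miscompare #2, absorbed */
    qemu_event_wait(&ev);   /* -> single checkpoint */
    qemu_event_reset(&ev);  /* re-armed before colo-compare is notified */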

Benchmark results (iperf3):
Client-to-server tcp:
without patch: ~66 Mbit/s
with patch: ~61 Mbit/s
Server-to-client tcp:
without patch: ~702 Kbit/s
with patch: ~16 Mbit/s

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
 migration/colo.c      | 9 +++++----
 migration/migration.h | 4 ++--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/migration/colo.c b/migration/colo.c
index a54ac84f41..09168627bc 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -430,6 +430,7 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
         goto out;
     }
 
+    qemu_event_reset(&s->colo_checkpoint_event);
     colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT, &local_err);
     if (local_err) {
         goto out;
@@ -580,7 +581,7 @@ static void colo_process_checkpoint(MigrationState *s)
             goto out;
         }
 
-        qemu_sem_wait(&s->colo_checkpoint_sem);
+        qemu_event_wait(&s->colo_checkpoint_event);
 
         if (s->state != MIGRATION_STATUS_COLO) {
             goto out;
@@ -628,7 +629,7 @@ out:
     colo_compare_unregister_notifier(&packets_compare_notifier);
     timer_del(s->colo_delay_timer);
     timer_free(s->colo_delay_timer);
-    qemu_sem_destroy(&s->colo_checkpoint_sem);
+    qemu_event_destroy(&s->colo_checkpoint_event);
 
     /*
      * Must be called after failover BH is completed,
@@ -645,7 +646,7 @@ void colo_checkpoint_notify(void *opaque)
     MigrationState *s = opaque;
     int64_t next_notify_time;
 
-    qemu_sem_post(&s->colo_checkpoint_sem);
+    qemu_event_set(&s->colo_checkpoint_event);
     s->colo_checkpoint_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
     next_notify_time = s->colo_checkpoint_time +
                     s->parameters.x_checkpoint_delay;
@@ -655,7 +656,7 @@ void colo_checkpoint_notify(void *opaque)
 void migrate_start_colo_process(MigrationState *s)
 {
     qemu_mutex_unlock_iothread();
-    qemu_sem_init(&s->colo_checkpoint_sem, 0);
+    qemu_event_init(&s->colo_checkpoint_event, false);
     s->colo_delay_timer =  timer_new_ms(QEMU_CLOCK_HOST,
                                 colo_checkpoint_notify, s);
 
diff --git a/migration/migration.h b/migration/migration.h
index 507284e563..f617960522 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -215,8 +215,8 @@ struct MigrationState
     /* The semaphore is used to notify COLO thread that failover is finished */
     QemuSemaphore colo_exit_sem;
 
-    /* The semaphore is used to notify COLO thread to do checkpoint */
-    QemuSemaphore colo_checkpoint_sem;
+    /* The event is used to notify COLO thread to do checkpoint */
+    QemuEvent colo_checkpoint_event;
     int64_t colo_checkpoint_time;
     QEMUTimer *colo_delay_timer;
 
-- 
2.20.1


* [PATCH 2/6] migration/colo.c: Use cpu_synchronize_all_states()
  2020-05-11 11:10 [PATCH 0/6] colo: migration related bugfixes Lukas Straub
  2020-05-11 11:10 ` [PATCH 1/6] migration/colo.c: Use event instead of semaphore Lukas Straub
@ 2020-05-11 11:10 ` Lukas Straub
  2020-05-13  9:47   ` Dr. David Alan Gilbert
  2020-05-11 11:10 ` [PATCH 3/6] migration/colo.c: Flush ram cache only after receiving device state Lukas Straub
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Lukas Straub @ 2020-05-11 11:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

cpu_synchronize_all_pre_loadvm() marks all vcpus as dirty, so the
registers are loaded from CPUState before we continue running
the vm. However, if we fail over during a checkpoint, CPUState is not
initialized and the registers are loaded with garbage. This causes
guest hangs and crashes.

Fix this by using cpu_synchronize_all_states(), which initializes
CPUState from the current cpu registers in addition to marking
the vcpus as dirty.
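
A minimal sketch of the resulting behavior (simplified from
colo_incoming_process_checkpoint(); locking and error handling elided):

    cpu_synchronize_all_states();   /* fill CPUState from the live registers,
                                     * then mark all vcpus dirty */
    ret = qemu_loadvm_state_main(mis->from_src_file, mis);
    if (ret < 0) {
        /* Failover during checkpoint: CPUState still holds the valid
         * pre-checkpoint registers, so the secondary resumes safely. */
    }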

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
 migration/colo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/colo.c b/migration/colo.c
index 09168627bc..6b2ad35aa4 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -696,7 +696,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
     }
 
     qemu_mutex_lock_iothread();
-    cpu_synchronize_all_pre_loadvm();
+    cpu_synchronize_all_states();
     ret = qemu_loadvm_state_main(mis->from_src_file, mis);
     qemu_mutex_unlock_iothread();
 
-- 
2.20.1


* [PATCH 3/6] migration/colo.c: Flush ram cache only after receiving device state
  2020-05-11 11:10 [PATCH 0/6] colo: migration related bugfixes Lukas Straub
  2020-05-11 11:10 ` [PATCH 1/6] migration/colo.c: Use event instead of semaphore Lukas Straub
  2020-05-11 11:10 ` [PATCH 2/6] migration/colo.c: Use cpu_synchronize_all_states() Lukas Straub
@ 2020-05-11 11:10 ` Lukas Straub
  2020-05-14 12:45   ` Re: " Zhanghailiang
  2020-05-11 11:10 ` [PATCH 4/6] migration/colo.c: Relaunch failover even if there was an error Lukas Straub
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Lukas Straub @ 2020-05-11 11:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

If we succeed in receiving the ram state but fail to receive the device
state, there will be a mismatch between the two.

Fix this by flushing the ram cache only after the vmstate has been
received.
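
The resulting receive order, sketched (simplified; buffer setup and
error handling elided):

    /* 1. ram pages land in the ram cache, not in SVM memory */
    qemu_loadvm_state_main(mis->from_src_file, mis);
    /* 2. the complete device state is read into the buffer fb */
    ....
    /* 3. only then commit the cached ram and load the device state */
    colo_flush_ram_cache();
    ret = qemu_load_device_state(fb);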

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
 migration/colo.c | 1 +
 migration/ram.c  | 5 +----
 migration/ram.h  | 1 +
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/migration/colo.c b/migration/colo.c
index 6b2ad35aa4..2947363ae5 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -739,6 +739,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
 
     qemu_mutex_lock_iothread();
     vmstate_loading = true;
+    colo_flush_ram_cache();
     ret = qemu_load_device_state(fb);
     if (ret < 0) {
         error_setg(errp, "COLO: load device state failed");
diff --git a/migration/ram.c b/migration/ram.c
index 04f13feb2e..5baec5fce9 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3313,7 +3313,7 @@ static bool postcopy_is_running(void)
  * Flush content of RAM cache into SVM's memory.
  * Only flush the pages that be dirtied by PVM or SVM or both.
  */
-static void colo_flush_ram_cache(void)
+void colo_flush_ram_cache(void)
 {
     RAMBlock *block = NULL;
     void *dst_host;
@@ -3585,9 +3585,6 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
     }
     trace_ram_load_complete(ret, seq_iter);
 
-    if (!ret  && migration_incoming_in_colo_state()) {
-        colo_flush_ram_cache();
-    }
     return ret;
 }
 
diff --git a/migration/ram.h b/migration/ram.h
index 5ceaff7cb4..2eeaacfa13 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -65,6 +65,7 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
 
 /* ram cache */
 int colo_init_ram_cache(void);
+void colo_flush_ram_cache(void);
 void colo_release_ram_cache(void);
 void colo_incoming_start_dirty_log(void);
 
-- 
2.20.1


* [PATCH 4/6] migration/colo.c: Relaunch failover even if there was an error
  2020-05-11 11:10 [PATCH 0/6] colo: migration related bugfixes Lukas Straub
                   ` (2 preceding siblings ...)
  2020-05-11 11:10 ` [PATCH 3/6] migration/colo.c: Flush ram cache only after receiving device state Lukas Straub
@ 2020-05-11 11:10 ` Lukas Straub
  2020-05-15  6:24   ` Zhanghailiang
  2020-05-11 11:10 ` [PATCH 5/6] migration/qemu-file.c: Don't ratelimit a shutdown fd Lukas Straub
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Lukas Straub @ 2020-05-11 11:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

If vmstate_loading is true, secondary_vm_do_failover will set failover
status to FAILOVER_STATUS_RELAUNCH and return success without initiating
failover. However, if there is an error during the vmstate_loading
section, failover isn't relaunched. Instead we then wait for
failover on colo_incoming_sem.

Fix this by relaunching failover even if there was an error. Also,
to make this work properly, set vmstate_loading to false when
returning during the vmstate_loading section.
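
The error-path pattern, sketched (one instance from the vmstate_loading
section, simplified):

    ret = qemu_load_device_state(fb);
    if (ret < 0) {
        error_setg(errp, "COLO: load device state failed");
        vmstate_loading = false;    /* leave the section before bailing out */
        qemu_mutex_unlock_iothread();
        return;                     /* the incoming thread's main loop then
                                     * sees FAILOVER_STATUS_RELAUNCH and
                                     * relaunches failover */
    }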

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
 migration/colo.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/migration/colo.c b/migration/colo.c
index 2947363ae5..a69782efc5 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -743,6 +743,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
     ret = qemu_load_device_state(fb);
     if (ret < 0) {
         error_setg(errp, "COLO: load device state failed");
+        vmstate_loading = false;
         qemu_mutex_unlock_iothread();
         return;
     }
@@ -751,6 +752,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
     replication_get_error_all(&local_err);
     if (local_err) {
         error_propagate(errp, local_err);
+        vmstate_loading = false;
         qemu_mutex_unlock_iothread();
         return;
     }
@@ -759,6 +761,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
     replication_do_checkpoint_all(&local_err);
     if (local_err) {
         error_propagate(errp, local_err);
+        vmstate_loading = false;
         qemu_mutex_unlock_iothread();
         return;
     }
@@ -770,6 +773,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
 
     if (local_err) {
         error_propagate(errp, local_err);
+        vmstate_loading = false;
         qemu_mutex_unlock_iothread();
         return;
     }
@@ -780,9 +784,6 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
     qemu_mutex_unlock_iothread();
 
     if (failover_get_state() == FAILOVER_STATUS_RELAUNCH) {
-        failover_set_state(FAILOVER_STATUS_RELAUNCH,
-                        FAILOVER_STATUS_NONE);
-        failover_request_active(NULL);
         return;
     }
 
@@ -881,6 +882,14 @@ void *colo_process_incoming_thread(void *opaque)
             error_report_err(local_err);
             break;
         }
+
+        if (failover_get_state() == FAILOVER_STATUS_RELAUNCH) {
+            failover_set_state(FAILOVER_STATUS_RELAUNCH,
+                            FAILOVER_STATUS_NONE);
+            failover_request_active(NULL);
+            break;
+        }
+
         if (failover_get_state() != FAILOVER_STATUS_NONE) {
             error_report("failover request");
             break;
@@ -888,8 +897,6 @@ void *colo_process_incoming_thread(void *opaque)
     }
 
 out:
-    vmstate_loading = false;
-
     /*
      * There are only two reasons we can get here, some error happened
      * or the user triggered failover.
-- 
2.20.1


* [PATCH 5/6] migration/qemu-file.c: Don't ratelimit a shutdown fd
  2020-05-11 11:10 [PATCH 0/6] colo: migration related bugfixes Lukas Straub
                   ` (3 preceding siblings ...)
  2020-05-11 11:10 ` [PATCH 4/6] migration/colo.c: Relaunch failover even if there was an error Lukas Straub
@ 2020-05-11 11:10 ` Lukas Straub
  2020-05-14 13:05   ` Re: " Zhanghailiang
  2020-05-11 11:11 ` [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event to the right place Lukas Straub
  2020-06-01 16:50 ` [PATCH 0/6] colo: migration related bugfixes Dr. David Alan Gilbert
  6 siblings, 1 reply; 22+ messages in thread
From: Lukas Straub @ 2020-05-11 11:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

This causes the migration thread to hang if we fail over during a
checkpoint. A shutdown fd won't cause network traffic anyway.
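
For context, callers treat a non-zero return from qemu_file_rate_limit()
as "hold off writing". A sketch of the old behavior (wait_for_bandwidth()
is a hypothetical stand-in for the callers' throttling paths):

    while (qemu_file_rate_limit(f)) {   /* was always 1 once f->shutdown */
        wait_for_bandwidth();           /* hypothetical; the real callers
                                         * sleep here, so a failed-over COLO
                                         * channel throttled forever instead
                                         * of surfacing the write error */
    }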

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
 migration/qemu-file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 1c3a358a14..0748b5810f 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -660,7 +660,7 @@ int64_t qemu_ftell(QEMUFile *f)
 int qemu_file_rate_limit(QEMUFile *f)
 {
     if (f->shutdown) {
-        return 1;
+        return 0;
     }
     if (qemu_file_get_error(f)) {
         return 1;
-- 
2.20.1


* [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event to the right place
  2020-05-11 11:10 [PATCH 0/6] colo: migration related bugfixes Lukas Straub
                   ` (4 preceding siblings ...)
  2020-05-11 11:10 ` [PATCH 5/6] migration/qemu-file.c: Don't ratelimit a shutdown fd Lukas Straub
@ 2020-05-11 11:11 ` Lukas Straub
  2020-05-14 13:27   ` Re: " Zhanghailiang
  2020-05-15  1:53   ` Zhanghailiang
  2020-06-01 16:50 ` [PATCH 0/6] colo: migration related bugfixes Dr. David Alan Gilbert
  6 siblings, 2 replies; 22+ messages in thread
From: Lukas Straub @ 2020-05-11 11:11 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

If the secondary has to fail over during checkpointing, it is still
in the old state (i.e. a different state than the primary). Thus we
can't expose the primary state until after the checkpoint is sent.

This fixes sporadic connection reset of client connections during
failover.
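
The new order, sketched (simplified from colo_do_checkpoint_transaction()):

    ....                                /* vmstate is sent to the secondary */
    qemu_event_reset(&s->colo_checkpoint_event);
    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT, &local_err);
    /* miscompared packets are released only now, after the checkpoint
     * has been sent */
    colo_receive_check_message(s->rp_state.from_dst_file,
                       COLO_MESSAGE_VMSTATE_LOADED, &local_err);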

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
 migration/colo.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/migration/colo.c b/migration/colo.c
index a69782efc5..a3fc21e86e 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -430,12 +430,6 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
         goto out;
     }
 
-    qemu_event_reset(&s->colo_checkpoint_event);
-    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT, &local_err);
-    if (local_err) {
-        goto out;
-    }
-
     /* Disable block migration */
     migrate_set_block_enabled(false, &local_err);
     qemu_mutex_lock_iothread();
@@ -494,6 +488,12 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
         goto out;
     }
 
+    qemu_event_reset(&s->colo_checkpoint_event);
+    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT, &local_err);
+    if (local_err) {
+        goto out;
+    }
+
     colo_receive_check_message(s->rp_state.from_dst_file,
                        COLO_MESSAGE_VMSTATE_LOADED, &local_err);
     if (local_err) {
-- 
2.20.1

* Re: [PATCH 2/6] migration/colo.c: Use cpu_synchronize_all_states()
  2020-05-11 11:10 ` [PATCH 2/6] migration/colo.c: Use cpu_synchronize_all_states() Lukas Straub
@ 2020-05-13  9:47   ` Dr. David Alan Gilbert
  2020-05-13 19:15     ` Lukas Straub
  0 siblings, 1 reply; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-13  9:47 UTC (permalink / raw)
  To: Lukas Straub; +Cc: Juan Quintela, qemu-devel, Hailiang Zhang

* Lukas Straub (lukasstraub2@web.de) wrote:
> cpu_synchronize_all_pre_loadvm() marks all vcpus as dirty, so the
> registers are loaded from CPUState before we continue running
> the vm. However, if we fail over during a checkpoint, CPUState is not
> initialized and the registers are loaded with garbage. This causes
> guest hangs and crashes.
> 
> Fix this by using cpu_synchronize_all_states(), which initializes
> CPUState from the current cpu registers in addition to marking
> the vcpus as dirty.
> 
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>

OK, so I think you're saying that if the qemu_loadvm_state_main fails
because we fail over, we now have duff CPU state, where we should just
carry on running on the secondary with the current state, so yes


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/colo.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/colo.c b/migration/colo.c
> index 09168627bc..6b2ad35aa4 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -696,7 +696,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
>      }
>  
>      qemu_mutex_lock_iothread();
> -    cpu_synchronize_all_pre_loadvm();
> +    cpu_synchronize_all_states();
>      ret = qemu_loadvm_state_main(mis->from_src_file, mis);
>      qemu_mutex_unlock_iothread();
>  
> -- 
> 2.20.1
> 


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [PATCH 1/6] migration/colo.c: Use event instead of semaphore
  2020-05-11 11:10 ` [PATCH 1/6] migration/colo.c: Use event instead of semaphore Lukas Straub
@ 2020-05-13 11:31   ` Zhanghailiang
  0 siblings, 0 replies; 22+ messages in thread
From: Zhanghailiang @ 2020-05-13 11:31 UTC (permalink / raw)
  To: Lukas Straub, qemu-devel; +Cc: Dr. David Alan Gilbert, Juan Quintela

> If multiple packets miscompare in a short timeframe, the semaphore value
> will be increased multiple times. This causes multiple checkpoints even if one
> would be sufficient.
> 

You're right, good catch ;)

Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>

> Fix this by using an event instead of a semaphore for triggering checkpoints.
> Now, checkpoint requests will be ignored until the checkpoint event is sent
> to colo-compare (which releases the miscompared packets).
> 
> Benchmark results (iperf3):
> Client-to-server tcp:
> without patch: ~66 Mbit/s
> with patch: ~61 Mbit/s
> Server-to-client tcp:
> without patch: ~702 Kbit/s
> with patch: ~16 Mbit/s
> 
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> ---
>  migration/colo.c      | 9 +++++----
>  migration/migration.h | 4 ++--
>  2 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/migration/colo.c b/migration/colo.c index
> a54ac84f41..09168627bc 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -430,6 +430,7 @@ static int
> colo_do_checkpoint_transaction(MigrationState *s,
>          goto out;
>      }
> 
> +    qemu_event_reset(&s->colo_checkpoint_event);
>      colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT,
> &local_err);
>      if (local_err) {
>          goto out;
> @@ -580,7 +581,7 @@ static void colo_process_checkpoint(MigrationState
> *s)
>              goto out;
>          }
> 
> -        qemu_sem_wait(&s->colo_checkpoint_sem);
> +        qemu_event_wait(&s->colo_checkpoint_event);
> 
>          if (s->state != MIGRATION_STATUS_COLO) {
>              goto out;
> @@ -628,7 +629,7 @@ out:
>      colo_compare_unregister_notifier(&packets_compare_notifier);
>      timer_del(s->colo_delay_timer);
>      timer_free(s->colo_delay_timer);
> -    qemu_sem_destroy(&s->colo_checkpoint_sem);
> +    qemu_event_destroy(&s->colo_checkpoint_event);
> 
>      /*
>       * Must be called after failover BH is completed, @@ -645,7 +646,7
> @@ void colo_checkpoint_notify(void *opaque)
>      MigrationState *s = opaque;
>      int64_t next_notify_time;
> 
> -    qemu_sem_post(&s->colo_checkpoint_sem);
> +    qemu_event_set(&s->colo_checkpoint_event);
>      s->colo_checkpoint_time =
> qemu_clock_get_ms(QEMU_CLOCK_HOST);
>      next_notify_time = s->colo_checkpoint_time +
>                      s->parameters.x_checkpoint_delay; @@ -655,7
> +656,7 @@ void colo_checkpoint_notify(void *opaque)  void
> migrate_start_colo_process(MigrationState *s)  {
>      qemu_mutex_unlock_iothread();
> -    qemu_sem_init(&s->colo_checkpoint_sem, 0);
> +    qemu_event_init(&s->colo_checkpoint_event, false);
>      s->colo_delay_timer =  timer_new_ms(QEMU_CLOCK_HOST,
>                                  colo_checkpoint_notify, s);
> 
> diff --git a/migration/migration.h b/migration/migration.h index
> 507284e563..f617960522 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -215,8 +215,8 @@ struct MigrationState
>      /* The semaphore is used to notify COLO thread that failover is finished
> */
>      QemuSemaphore colo_exit_sem;
> 
> -    /* The semaphore is used to notify COLO thread to do checkpoint */
> -    QemuSemaphore colo_checkpoint_sem;
> +    /* The event is used to notify COLO thread to do checkpoint */
> +    QemuEvent colo_checkpoint_event;
>      int64_t colo_checkpoint_time;
>      QEMUTimer *colo_delay_timer;
> 
> --
> 2.20.1


* Re: [PATCH 2/6] migration/colo.c: Use cpu_synchronize_all_states()
  2020-05-13  9:47   ` Dr. David Alan Gilbert
@ 2020-05-13 19:15     ` Lukas Straub
  0 siblings, 0 replies; 22+ messages in thread
From: Lukas Straub @ 2020-05-13 19:15 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, qemu-devel, Hailiang Zhang

On Wed, 13 May 2020 10:47:02 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Lukas Straub (lukasstraub2@web.de) wrote:
> > cpu_synchronize_all_pre_loadvm() marks all vcpus as dirty, so the
> > registers are loaded from CPUState before we continue running
> > the vm. However, if we fail over during a checkpoint, CPUState is not
> > initialized and the registers are loaded with garbage. This causes
> > guest hangs and crashes.
> > 
> > Fix this by using cpu_synchronize_all_states(), which initializes
> > CPUState from the current cpu registers in addition to marking
> > the vcpus as dirty.
> > 
> > Signed-off-by: Lukas Straub <lukasstraub2@web.de>  
> 
> OK, so I think you're saying that if the qemu_loadvm_state_main fails
> because we fail over, we now have duff CPU state, where we should just
> carry on running on the secondary with the current state, so yes

Exactly, sorry for my bad wording.

> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> > ---
> >  migration/colo.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/migration/colo.c b/migration/colo.c
> > index 09168627bc..6b2ad35aa4 100644
> > --- a/migration/colo.c
> > +++ b/migration/colo.c
> > @@ -696,7 +696,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
> >      }
> >  
> >      qemu_mutex_lock_iothread();
> > -    cpu_synchronize_all_pre_loadvm();
> > +    cpu_synchronize_all_states();
> >      ret = qemu_loadvm_state_main(mis->from_src_file, mis);
> >      qemu_mutex_unlock_iothread();
> >  
> > -- 
> > 2.20.1
> >   
> 
> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


* Re: [PATCH 3/6] migration/colo.c: Flush ram cache only after receiving device state
  2020-05-11 11:10 ` [PATCH 3/6] migration/colo.c: Flush ram cache only after receiving device state Lukas Straub
@ 2020-05-14 12:45   ` Zhanghailiang
  0 siblings, 0 replies; 22+ messages in thread
From: Zhanghailiang @ 2020-05-14 12:45 UTC (permalink / raw)
  To: Lukas Straub, qemu-devel; +Cc: Dr. David Alan Gilbert, Juan Quintela

Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>

> 
> If we succeed in receiving the ram state but fail to receive the device state, there
> will be a mismatch between the two.
> 
> Fix this by flushing the ram cache only after the vmstate has been received.
> 
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> ---
>  migration/colo.c | 1 +
>  migration/ram.c  | 5 +----
>  migration/ram.h  | 1 +
>  3 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/migration/colo.c b/migration/colo.c index
> 6b2ad35aa4..2947363ae5 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -739,6 +739,7 @@ static void
> colo_incoming_process_checkpoint(MigrationIncomingState *mis,
> 
>      qemu_mutex_lock_iothread();
>      vmstate_loading = true;
> +    colo_flush_ram_cache();
>      ret = qemu_load_device_state(fb);
>      if (ret < 0) {
>          error_setg(errp, "COLO: load device state failed"); diff --git
> a/migration/ram.c b/migration/ram.c index 04f13feb2e..5baec5fce9 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3313,7 +3313,7 @@ static bool postcopy_is_running(void)
>   * Flush content of RAM cache into SVM's memory.
>   * Only flush the pages that be dirtied by PVM or SVM or both.
>   */
> -static void colo_flush_ram_cache(void)
> +void colo_flush_ram_cache(void)
>  {
>      RAMBlock *block = NULL;
>      void *dst_host;
> @@ -3585,9 +3585,6 @@ static int ram_load(QEMUFile *f, void *opaque,
> int version_id)
>      }
>      trace_ram_load_complete(ret, seq_iter);
> 
> -    if (!ret  && migration_incoming_in_colo_state()) {
> -        colo_flush_ram_cache();
> -    }
>      return ret;
>  }
> 
> diff --git a/migration/ram.h b/migration/ram.h index 5ceaff7cb4..2eeaacfa13
> 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -65,6 +65,7 @@ int ram_dirty_bitmap_reload(MigrationState *s,
> RAMBlock *rb);
> 
>  /* ram cache */
>  int colo_init_ram_cache(void);
> +void colo_flush_ram_cache(void);
>  void colo_release_ram_cache(void);
>  void colo_incoming_start_dirty_log(void);
> 
> --
> 2.20.1


* Re: [PATCH 5/6] migration/qemu-file.c: Don't ratelimit a shutdown fd
  2020-05-11 11:10 ` [PATCH 5/6] migration/qemu-file.c: Don't ratelimit a shutdown fd Lukas Straub
@ 2020-05-14 13:05   ` Zhanghailiang
  2020-05-18 11:55     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 22+ messages in thread
From: Zhanghailiang @ 2020-05-14 13:05 UTC (permalink / raw)
  To: Lukas Straub, qemu-devel; +Cc: Dr. David Alan Gilbert, Juan Quintela

> This causes the migration thread to hang if we fail over during a checkpoint. A
> shutdown fd won't cause network traffic anyway.
> 

I'm not quite sure if this modification can have side effects on the normal migration process or not,
There are several places calling it.

Maybe Juan and Dave can help ;)

> Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> ---
>  migration/qemu-file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c index
> 1c3a358a14..0748b5810f 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -660,7 +660,7 @@ int64_t qemu_ftell(QEMUFile *f)  int
> qemu_file_rate_limit(QEMUFile *f)  {
>      if (f->shutdown) {
> -        return 1;
> +        return 0;
>      }
>      if (qemu_file_get_error(f)) {
>          return 1;
> --
> 2.20.1


* Re: [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event to the right place
  2020-05-11 11:11 ` [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event to the right place Lukas Straub
@ 2020-05-14 13:27   ` Zhanghailiang
  2020-05-14 14:31     ` Lukas Straub
  2020-05-15  1:53   ` Zhanghailiang
  1 sibling, 1 reply; 22+ messages in thread
From: Zhanghailiang @ 2020-05-14 13:27 UTC (permalink / raw)
  To: Lukas Straub, qemu-devel, Zhang Chen
  Cc: Dr. David Alan Gilbert, Juan Quintela

Cc: Zhang Chen <chen.zhang@intel.com>

> 
> If the secondary has to fail over during checkpointing, it is still in the old state
> (i.e. a different state than the primary). Thus we can't expose the primary state
> until after the checkpoint is sent.
> 

Hmm, do you mean we should not flush the net packets to the client connection until the
checkpointing process has almost succeeded, because it may fail during checkpointing?

> This fixes sporadic connection reset of client connections during failover.
> 
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> ---
>  migration/colo.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/migration/colo.c b/migration/colo.c index
> a69782efc5..a3fc21e86e 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -430,12 +430,6 @@ static int
> colo_do_checkpoint_transaction(MigrationState *s,
>          goto out;
>      }
> 
> -    qemu_event_reset(&s->colo_checkpoint_event);
> -    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT,
> &local_err);
> -    if (local_err) {
> -        goto out;
> -    }
> -
>      /* Disable block migration */
>      migrate_set_block_enabled(false, &local_err);
>      qemu_mutex_lock_iothread();
> @@ -494,6 +488,12 @@ static int
> colo_do_checkpoint_transaction(MigrationState *s,
>          goto out;
>      }
> 
> +    qemu_event_reset(&s->colo_checkpoint_event);
> +    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT,
> &local_err);
> +    if (local_err) {
> +        goto out;
> +    }
> +
>      colo_receive_check_message(s->rp_state.from_dst_file,
>                         COLO_MESSAGE_VMSTATE_LOADED, &local_err);
>      if (local_err) {
> --
> 2.20.1

* Re: [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event to the right place
  2020-05-14 13:27   ` Re: " Zhanghailiang
@ 2020-05-14 14:31     ` Lukas Straub
  2020-05-15  1:45       ` Zhanghailiang
  0 siblings, 1 reply; 22+ messages in thread
From: Lukas Straub @ 2020-05-14 14:31 UTC (permalink / raw)
  To: Zhanghailiang
  Cc: Zhang Chen, qemu-devel, Dr. David Alan Gilbert, Juan Quintela

On Thu, 14 May 2020 13:27:30 +0000
Zhanghailiang <zhang.zhanghailiang@huawei.com> wrote:

> Cc: Zhang Chen <chen.zhang@intel.com>
> 
> > 
> > If the secondary has to fail over during checkpointing, it is still in the old state
> > (i.e. a different state than the primary). Thus we can't expose the primary state
> > until after the checkpoint is sent.
> >   
> 
> Hmm, do you mean we should not flush the net packets to the client connection until the
> checkpointing process has almost succeeded, because it may fail during checkpointing?

No.
If the primary fails/crashes during checkpointing, the secondary is still in a different state than the primary because it didn't receive the full checkpoint. We can release the miscompared packets only after both primary and secondary are in the same state.

Example:
1. Client opens a TCP connection, sends SYN.
2. Primary accepts the connection with SYN-ACK, but due to nondeterministic execution the secondary is delayed.
3. Checkpoint happens, primary releases the SYN-ACK packet but then crashes while sending the checkpoint.
4. The secondary fails over. At this point it is still in the old state where it hasn't sent the SYN-ACK packet.
5. The client responds with ACK to the SYN-ACK packet.
6. Because it doesn't know the connection, the secondary responds with RST, connection reset.

Regards,
Lukas Straub

> > This fixes sporadic connection reset of client connections during failover.
> > 
> > Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> > ---
> >  migration/colo.c | 12 ++++++------
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> > 
> > diff --git a/migration/colo.c b/migration/colo.c index
> > a69782efc5..a3fc21e86e 100644
> > --- a/migration/colo.c
> > +++ b/migration/colo.c
> > @@ -430,12 +430,6 @@ static int
> > colo_do_checkpoint_transaction(MigrationState *s,
> >          goto out;
> >      }
> > 
> > -    qemu_event_reset(&s->colo_checkpoint_event);
> > -    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT,
> > &local_err);
> > -    if (local_err) {
> > -        goto out;
> > -    }
> > -
> >      /* Disable block migration */
> >      migrate_set_block_enabled(false, &local_err);
> >      qemu_mutex_lock_iothread();
> > @@ -494,6 +488,12 @@ static int
> > colo_do_checkpoint_transaction(MigrationState *s,
> >          goto out;
> >      }
> > 
> > +    qemu_event_reset(&s->colo_checkpoint_event);
> > +    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT,
> > &local_err);
> > +    if (local_err) {
> > +        goto out;
> > +    }
> > +
> >      colo_receive_check_message(s->rp_state.from_dst_file,
> >                         COLO_MESSAGE_VMSTATE_LOADED, &local_err);
> >      if (local_err) {
> > --
> > 2.20.1  


* RE: [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event to the right place
  2020-05-14 14:31     ` Lukas Straub
@ 2020-05-15  1:45       ` Zhanghailiang
  0 siblings, 0 replies; 22+ messages in thread
From: Zhanghailiang @ 2020-05-15  1:45 UTC (permalink / raw)
  To: Lukas Straub
  Cc: Zhang Chen, qemu-devel, Dr. David Alan Gilbert, Juan Quintela

> -----Original Message-----
> From: Lukas Straub [mailto:lukasstraub2@web.de]
> Sent: Thursday, May 14, 2020 10:31 PM
> To: Zhanghailiang <zhang.zhanghailiang@huawei.com>
> Cc: qemu-devel <qemu-devel@nongnu.org>; Zhang Chen
> <chen.zhang@intel.com>; Juan Quintela <quintela@redhat.com>; Dr. David
> Alan Gilbert <dgilbert@redhat.com>
> Subject: Re: [PATCH 6/6] migration/colo.c: Move
> colo_notify_compares_event to the right place
> 
> On Thu, 14 May 2020 13:27:30 +0000
> Zhanghailiang <zhang.zhanghailiang@huawei.com> wrote:
> 
> > Cc: Zhang Chen <chen.zhang@intel.com>
> >
> > >
> > > If the secondary has to fail over during checkpointing, it is still
> > > in the old state (i.e. a different state than the primary). Thus we
> > > can't expose the primary state until after the checkpoint is sent.
> > >
> >
> > Hmm, do you mean we should not flush the net packets to the client
> > connection until the checkpointing process has almost succeeded, because
> > it may fail during checkpointing?
> 
> No.
> If the primary fails/crashes during checkpointing, the secondary is still in
> a different state than the primary because it didn't receive the full checkpoint.
> We can release the miscompared packets only after both primary and
> secondary are in the same state.
> 
> Example:
> 1. Client opens a TCP connection, sends SYN.
> 2. Primary accepts the connection with SYN-ACK, but due to
> nondeterministic execution the secondary is delayed.
> 3. Checkpoint happens, primary releases the SYN-ACK packet but then
> crashes while sending the checkpoint.
> 4. The secondary fails over. At this point it is still in the old state where it
> hasn't sent the SYN-ACK packet.
> 5. The client responds with ACK to the SYN-ACK packet.
> 6. Because it doesn't know the connection, the secondary responds with RST,
> connection reset.
> 

Good example. For this patch it is OK; I will add my Reviewed-by to your original patch.


> Regards,
> Lukas Straub
> 
> > > This fixes sporadic connection reset of client connections during failover.
> > >
> > > Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> > > ---
> > >  migration/colo.c | 12 ++++++------
> > >  1 file changed, 6 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/migration/colo.c b/migration/colo.c index
> > > a69782efc5..a3fc21e86e 100644
> > > --- a/migration/colo.c
> > > +++ b/migration/colo.c
> > > @@ -430,12 +430,6 @@ static int
> > > colo_do_checkpoint_transaction(MigrationState *s,
> > >          goto out;
> > >      }
> > >
> > > -    qemu_event_reset(&s->colo_checkpoint_event);
> > > -    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT,
> > > &local_err);
> > > -    if (local_err) {
> > > -        goto out;
> > > -    }
> > > -
> > >      /* Disable block migration */
> > >      migrate_set_block_enabled(false, &local_err);
> > >      qemu_mutex_lock_iothread();
> > > @@ -494,6 +488,12 @@ static int
> > > colo_do_checkpoint_transaction(MigrationState *s,
> > >          goto out;
> > >      }
> > >
> > > +    qemu_event_reset(&s->colo_checkpoint_event);
> > > +    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT,
> > > &local_err);
> > > +    if (local_err) {
> > > +        goto out;
> > > +    }
> > > +
> > >      colo_receive_check_message(s->rp_state.from_dst_file,
> > >                         COLO_MESSAGE_VMSTATE_LOADED,
> &local_err);
> > >      if (local_err) {
> > > --
> > > 2.20.1



* RE: [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event to the right place
  2020-05-11 11:11 ` [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event to the right place Lukas Straub
  2020-05-14 13:27   ` Re: " Zhanghailiang
@ 2020-05-15  1:53   ` Zhanghailiang
  1 sibling, 0 replies; 22+ messages in thread
From: Zhanghailiang @ 2020-05-15  1:53 UTC (permalink / raw)
  To: Lukas Straub, qemu-devel; +Cc: Dr. David Alan Gilbert, Juan Quintela

Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>

> -----Original Message-----
> From: Lukas Straub [mailto:lukasstraub2@web.de]
> Sent: Monday, May 11, 2020 7:11 PM
> To: qemu-devel <qemu-devel@nongnu.org>
> Cc: Zhanghailiang <zhang.zhanghailiang@huawei.com>; Juan Quintela
> <quintela@redhat.com>; Dr. David Alan Gilbert <dgilbert@redhat.com>
> Subject: [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event
> to the right place
> 
> If the secondary has to fail over during checkpointing, it is still in the old state
> (i.e. a different state than the primary). Thus we can't expose the primary state
> until after the checkpoint is sent.
> 
> This fixes sporadic connection reset of client connections during failover.
> 
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> ---
>  migration/colo.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/migration/colo.c b/migration/colo.c index
> a69782efc5..a3fc21e86e 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -430,12 +430,6 @@ static int
> colo_do_checkpoint_transaction(MigrationState *s,
>          goto out;
>      }
> 
> -    qemu_event_reset(&s->colo_checkpoint_event);
> -    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT,
> &local_err);
> -    if (local_err) {
> -        goto out;
> -    }
> -
>      /* Disable block migration */
>      migrate_set_block_enabled(false, &local_err);
>      qemu_mutex_lock_iothread();
> @@ -494,6 +488,12 @@ static int
> colo_do_checkpoint_transaction(MigrationState *s,
>          goto out;
>      }
> 
> +    qemu_event_reset(&s->colo_checkpoint_event);
> +    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT,
> &local_err);
> +    if (local_err) {
> +        goto out;
> +    }
> +
>      colo_receive_check_message(s->rp_state.from_dst_file,
>                         COLO_MESSAGE_VMSTATE_LOADED, &local_err);
>      if (local_err) {
> --
> 2.20.1


* RE: [PATCH 4/6] migration/colo.c: Relaunch failover even if there was an error
  2020-05-11 11:10 ` [PATCH 4/6] migration/colo.c: Relaunch failover even if there was an error Lukas Straub
@ 2020-05-15  6:24   ` Zhanghailiang
  0 siblings, 0 replies; 22+ messages in thread
From: Zhanghailiang @ 2020-05-15  6:24 UTC (permalink / raw)
  To: Lukas Straub, qemu-devel; +Cc: Dr. David Alan Gilbert, Juan Quintela

Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>

> -----Original Message-----
> From: Lukas Straub [mailto:lukasstraub2@web.de]
> Sent: Monday, May 11, 2020 7:11 PM
> To: qemu-devel <qemu-devel@nongnu.org>
> Cc: Zhanghailiang <zhang.zhanghailiang@huawei.com>; Juan Quintela
> <quintela@redhat.com>; Dr. David Alan Gilbert <dgilbert@redhat.com>
> Subject: [PATCH 4/6] migration/colo.c: Relaunch failover even if there was an
> error
> 
> If vmstate_loading is true, secondary_vm_do_failover will set failover status
> to FAILOVER_STATUS_RELAUNCH and return success without initiating
> failover. However, if there is an error during the vmstate_loading section,
> failover isn't relaunched. Instead we then wait for failover on
> colo_incoming_sem.
> 
> Fix this by relaunching failover even if there was an error. Also, to make this
> work properly, set vmstate_loading to false when returning during the
> vmstate_loading section.
> 
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> ---
>  migration/colo.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/migration/colo.c b/migration/colo.c index
> 2947363ae5..a69782efc5 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -743,6 +743,7 @@ static void
> colo_incoming_process_checkpoint(MigrationIncomingState *mis,
>      ret = qemu_load_device_state(fb);
>      if (ret < 0) {
>          error_setg(errp, "COLO: load device state failed");
> +        vmstate_loading = false;
>          qemu_mutex_unlock_iothread();
>          return;
>      }
> @@ -751,6 +752,7 @@ static void
> colo_incoming_process_checkpoint(MigrationIncomingState *mis,
>      replication_get_error_all(&local_err);
>      if (local_err) {
>          error_propagate(errp, local_err);
> +        vmstate_loading = false;
>          qemu_mutex_unlock_iothread();
>          return;
>      }
> @@ -759,6 +761,7 @@ static void
> colo_incoming_process_checkpoint(MigrationIncomingState *mis,
>      replication_do_checkpoint_all(&local_err);
>      if (local_err) {
>          error_propagate(errp, local_err);
> +        vmstate_loading = false;
>          qemu_mutex_unlock_iothread();
>          return;
>      }
> @@ -770,6 +773,7 @@ static void
> colo_incoming_process_checkpoint(MigrationIncomingState *mis,
> 
>      if (local_err) {
>          error_propagate(errp, local_err);
> +        vmstate_loading = false;
>          qemu_mutex_unlock_iothread();
>          return;
>      }
> @@ -780,9 +784,6 @@ static void
> colo_incoming_process_checkpoint(MigrationIncomingState *mis,
>      qemu_mutex_unlock_iothread();
> 
>      if (failover_get_state() == FAILOVER_STATUS_RELAUNCH) {
> -        failover_set_state(FAILOVER_STATUS_RELAUNCH,
> -                        FAILOVER_STATUS_NONE);
> -        failover_request_active(NULL);
>          return;
>      }
> 
> @@ -881,6 +882,14 @@ void *colo_process_incoming_thread(void
> *opaque)
>              error_report_err(local_err);
>              break;
>          }
> +
> +        if (failover_get_state() == FAILOVER_STATUS_RELAUNCH) {
> +            failover_set_state(FAILOVER_STATUS_RELAUNCH,
> +                            FAILOVER_STATUS_NONE);
> +            failover_request_active(NULL);
> +            break;
> +        }
> +
>          if (failover_get_state() != FAILOVER_STATUS_NONE) {
>              error_report("failover request");
>              break;
> @@ -888,8 +897,6 @@ void *colo_process_incoming_thread(void *opaque)
>      }
> 
>  out:
> -    vmstate_loading = false;
> -
>      /*
>       * There are only two reasons we can get here, some error happened
>       * or the user triggered failover.
> --
> 2.20.1



* Re: [PATCH 5/6] migration/qemu-file.c: Don't ratelimit a shutdown fd
  2020-05-14 13:05   ` Re: " Zhanghailiang
@ 2020-05-18 11:55     ` Dr. David Alan Gilbert
  2020-05-19 13:08       ` Lukas Straub
  0 siblings, 1 reply; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-18 11:55 UTC (permalink / raw)
  To: Zhanghailiang; +Cc: Lukas Straub, qemu-devel, Juan Quintela

* Zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> > This causes the migration thread to hang if we fail over during a checkpoint. A
> > shutdown fd won't cause network traffic anyway.
> > 
> 
> I'm not quite sure if this modification can have side effects on the normal migration process or not,
> There are several places calling it.
> 
> Maybe Juan and Dave can help ;)
> 
> > Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> > ---
> >  migration/qemu-file.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/migration/qemu-file.c b/migration/qemu-file.c index
> > 1c3a358a14..0748b5810f 100644
> > --- a/migration/qemu-file.c
> > +++ b/migration/qemu-file.c
> > @@ -660,7 +660,7 @@ int64_t qemu_ftell(QEMUFile *f)  int
> > qemu_file_rate_limit(QEMUFile *f)  {
> >      if (f->shutdown) {
> > -        return 1;
> > +        return 0;
> >      }

This looks wrong to me; I'd be curious to understand how it's hanging
for you.
'1' means 'stop what you're doing', 0 means carry on; carrying on with a
shutdown fd sounds wrong.

If we look at ram.c we have:

        while ((ret = qemu_file_rate_limit(f)) == 0 ||
                !QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
            int pages;
        ....

so if it returns '1', as it does at the moment, it should cause it to
exit the ram_save_iterate loop - which is what we want if it's failing.
Thus I think you need to find the actual place it's stuck in this case -
I suspect it's repeatedly calling ram_save_iterate and then exiting it,
but if that's happening perhaps we're missing a qemu_file_get_error
check somewhere.

Dave

> >      if (qemu_file_get_error(f)) {
> >          return 1;
> > --
> > 2.20.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [PATCH 5/6] migration/qemu-file.c: Don't ratelimit a shutdown fd
  2020-05-18 11:55     ` Dr. David Alan Gilbert
@ 2020-05-19 13:08       ` Lukas Straub
  2020-05-19 14:50         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 22+ messages in thread
From: Lukas Straub @ 2020-05-19 13:08 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Zhanghailiang, qemu-devel

On Mon, 18 May 2020 12:55:34 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> > > This causes the migration thread to hang if we fail over during a checkpoint. A
> > > shutdown fd won't cause network traffic anyway.
> > >   
> > 
> > I'm not quite sure if this modification can have side effects on the normal migration process or not,
> > There are several places calling it.
> > 
> > Maybe Juan and Dave can help ;)
> >   
> > > Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> > > ---
> > >  migration/qemu-file.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c index
> > > 1c3a358a14..0748b5810f 100644
> > > --- a/migration/qemu-file.c
> > > +++ b/migration/qemu-file.c
> > > @@ -660,7 +660,7 @@ int64_t qemu_ftell(QEMUFile *f)  int
> > > qemu_file_rate_limit(QEMUFile *f)  {
> > >      if (f->shutdown) {
> > > -        return 1;
> > > +        return 0;
> > >      }  
> 
> This looks wrong to me; I'd be curious to understand how it's hanging
> for you.
> '1' means 'stop what you're doing', 0 means carry on; carrying on with a
> shutdown fd sounds wrong.
> 
> If we look at ram.c we have:
> 
>         while ((ret = qemu_file_rate_limit(f)) == 0 ||
>                 !QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
>             int pages;
>         ....
> 
> so if it returns '1', as it does at the moment it should cause it to
> exit the ram_save_iterate loop - which is what we want if it's failing.
> Thus I think you need to find the actual place it's stuck in this case -
> I suspect it's repeatedly calling ram_save_iterate and then exiting it,
> but if that's happening perhaps we're missing a qemu_file_get_error
> check somewhere.

Hi,
the problem is in ram_save_host_page and migration_rate_limit; here is a backtrace:

#0  0x00007f7b502921a8 in futex_abstimed_wait_cancelable (private=0, abstime=0x7f7ada7fb3f0, clockid=0, expected=0, futex_word=0x55bc358b9908) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
#1  do_futex_wait (sem=sem@entry=0x55bc358b9908, abstime=abstime@entry=0x7f7ada7fb3f0, clockid=0) at sem_waitcommon.c:112
#2  0x00007f7b502922d3 in __new_sem_wait_slow (sem=0x55bc358b9908, abstime=0x7f7ada7fb3f0, clockid=0) at sem_waitcommon.c:184
#3  0x000055bc3382b6c1 in qemu_sem_timedwait (sem=0x55bc358b9908, ms=100) at util/qemu-thread-posix.c:306
#4  0x000055bc3363950b in migration_rate_limit () at migration/migration.c:3365
#5  0x000055bc332b70d3 in ram_save_host_page (rs=0x7f7acc001a70, pss=0x7f7ada7fb4b0, last_stage=true) at /home/lukas/qemu/migration/ram.c:1696
#6  0x000055bc332b71fa in ram_find_and_save_block (rs=0x7f7acc001a70, last_stage=true) at /home/lukas/qemu/migration/ram.c:1750
#7  0x000055bc332b8bbd in ram_save_complete (f=0x55bc36661330, opaque=0x55bc33fbc678 <ram_state>) at /home/lukas/qemu/migration/ram.c:2606
#8  0x000055bc3364112c in qemu_savevm_state_complete_precopy_iterable (f=0x55bc36661330, in_postcopy=false) at migration/savevm.c:1344
#9  0x000055bc33641556 in qemu_savevm_state_complete_precopy (f=0x55bc36661330, iterable_only=true, inactivate_disks=false) at migration/savevm.c:1442
#10 0x000055bc33641982 in qemu_savevm_live_state (f=0x55bc36661330) at migration/savevm.c:1569
#11 0x000055bc33645407 in colo_do_checkpoint_transaction (s=0x55bc358b9840, bioc=0x7f7acc059990, fb=0x7f7acc4627b0) at migration/colo.c:464
#12 0x000055bc336457ca in colo_process_checkpoint (s=0x55bc358b9840) at migration/colo.c:589
#13 0x000055bc336459e4 in migrate_start_colo_process (s=0x55bc358b9840) at migration/colo.c:666
#14 0x000055bc336393d7 in migration_iteration_finish (s=0x55bc358b9840) at migration/migration.c:3312
#15 0x000055bc33639753 in migration_thread (opaque=0x55bc358b9840) at migration/migration.c:3477
#16 0x000055bc3382bbb5 in qemu_thread_start (args=0x55bc357c27c0) at util/qemu-thread-posix.c:519
#17 0x00007f7b50288f27 in start_thread (arg=<optimized out>) at pthread_create.c:479
#18 0x00007f7b501ba31f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

It hangs in ram_save_host_page for at least 10 minutes.

Regards,
Lukas Straub

> Dave
> 
> > >      if (qemu_file_get_error(f)) {
> > >          return 1;
> > > --
> > > 2.20.1  
> >   
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


* Re: [PATCH 5/6] migration/qemu-file.c: Don't ratelimit a shutdown fd
  2020-05-19 13:08       ` Lukas Straub
@ 2020-05-19 14:50         ` Dr. David Alan Gilbert
  2020-05-20 20:44           ` Lukas Straub
  0 siblings, 1 reply; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-19 14:50 UTC (permalink / raw)
  To: Lukas Straub; +Cc: Juan Quintela, Zhanghailiang, qemu-devel

* Lukas Straub (lukasstraub2@web.de) wrote:
> On Mon, 18 May 2020 12:55:34 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> > > > This causes the migration thread to hang if we fail over during a checkpoint. A
> > > > shutdown fd won't cause network traffic anyway.
> > > >   
> > > 
> > > I'm not quite sure if this modification can have side effects on the normal migration process or not,
> > > There are several places calling it.
> > > 
> > > Maybe Juan and Dave can help ;)
> > >   
> > > > Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> > > > ---
> > > >  migration/qemu-file.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c index
> > > > 1c3a358a14..0748b5810f 100644
> > > > --- a/migration/qemu-file.c
> > > > +++ b/migration/qemu-file.c
> > > > @@ -660,7 +660,7 @@ int64_t qemu_ftell(QEMUFile *f)  int
> > > > qemu_file_rate_limit(QEMUFile *f)  {
> > > >      if (f->shutdown) {
> > > > -        return 1;
> > > > +        return 0;
> > > >      }  
> > 
> > This looks wrong to me; I'd be curious to understand how it's hanging
> > for you.
> > '1' means 'stop what you're doing', 0 means carry on; carrying on with a
> > shutdown fd sounds wrong.
> > 
> > If we look at ram.c we have:
> > 
> >         while ((ret = qemu_file_rate_limit(f)) == 0 ||
> >                 !QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
> >             int pages;
> >         ....
> > 
> > so if it returns '1', as it does at the moment it should cause it to
> > exit the ram_save_iterate loop - which is what we want if it's failing.
> > Thus I think you need to find the actual place it's stuck in this case -
> > I suspect it's repeatedly calling ram_save_iterate and then exiting it,
> > but if that's happening perhaps we're missing a qemu_file_get_error
> > check somewhere.
> 
> Hi,
> the problem is in ram_save_host_page and migration_rate_limit; here is a backtrace:

Ah...

> #0  0x00007f7b502921a8 in futex_abstimed_wait_cancelable (private=0, abstime=0x7f7ada7fb3f0, clockid=0, expected=0, futex_word=0x55bc358b9908) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
> #1  do_futex_wait (sem=sem@entry=0x55bc358b9908, abstime=abstime@entry=0x7f7ada7fb3f0, clockid=0) at sem_waitcommon.c:112
> #2  0x00007f7b502922d3 in __new_sem_wait_slow (sem=0x55bc358b9908, abstime=0x7f7ada7fb3f0, clockid=0) at sem_waitcommon.c:184
> #3  0x000055bc3382b6c1 in qemu_sem_timedwait (sem=0x55bc358b9908, ms=100) at util/qemu-thread-posix.c:306
> #4  0x000055bc3363950b in migration_rate_limit () at migration/migration.c:3365

OK, so how about:

diff --git a/migration/migration.c b/migration/migration.c
index b6b662e016..4e885385a8 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3356,6 +3356,10 @@ bool migration_rate_limit(void)
     bool urgent = false;
     migration_update_counters(s, now);
     if (qemu_file_rate_limit(s->to_dst_file)) {
+
+        if (qemu_file_get_error(mis->from_src_file)) {
+            return false;
+        }
         /*
          * Wait for a delay to do rate limiting OR
          * something urgent to post the semaphore.

Does that work?
I wonder if we also need to kick the rate_limit_sem when we yank the
socket.

Dave

> #5  0x000055bc332b70d3 in ram_save_host_page (rs=0x7f7acc001a70, pss=0x7f7ada7fb4b0, last_stage=true) at /home/lukas/qemu/migration/ram.c:1696
> #6  0x000055bc332b71fa in ram_find_and_save_block (rs=0x7f7acc001a70, last_stage=true) at /home/lukas/qemu/migration/ram.c:1750
> #7  0x000055bc332b8bbd in ram_save_complete (f=0x55bc36661330, opaque=0x55bc33fbc678 <ram_state>) at /home/lukas/qemu/migration/ram.c:2606
> #8  0x000055bc3364112c in qemu_savevm_state_complete_precopy_iterable (f=0x55bc36661330, in_postcopy=false) at migration/savevm.c:1344
> #9  0x000055bc33641556 in qemu_savevm_state_complete_precopy (f=0x55bc36661330, iterable_only=true, inactivate_disks=false) at migration/savevm.c:1442
> #10 0x000055bc33641982 in qemu_savevm_live_state (f=0x55bc36661330) at migration/savevm.c:1569
> #11 0x000055bc33645407 in colo_do_checkpoint_transaction (s=0x55bc358b9840, bioc=0x7f7acc059990, fb=0x7f7acc4627b0) at migration/colo.c:464
> #12 0x000055bc336457ca in colo_process_checkpoint (s=0x55bc358b9840) at migration/colo.c:589
> #13 0x000055bc336459e4 in migrate_start_colo_process (s=0x55bc358b9840) at migration/colo.c:666
> #14 0x000055bc336393d7 in migration_iteration_finish (s=0x55bc358b9840) at migration/migration.c:3312
> #15 0x000055bc33639753 in migration_thread (opaque=0x55bc358b9840) at migration/migration.c:3477
> #16 0x000055bc3382bbb5 in qemu_thread_start (args=0x55bc357c27c0) at util/qemu-thread-posix.c:519
> #17 0x00007f7b50288f27 in start_thread (arg=<optimized out>) at pthread_create.c:479
> #18 0x00007f7b501ba31f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> 
> It hangs in ram_save_host_page for at least 10 minutes.
> 
> Regards,
> Lukas Straub
> 
> > Dave
> > 
> > > >      if (qemu_file_get_error(f)) {
> > > >          return 1;
> > > > --
> > > > 2.20.1  
> > >   
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply related	[flat|nested] 22+ messages in thread
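
To make the return-value convention in the exchange above concrete, here is a
minimal sketch of qemu_file_rate_limit() before the patch. The shutdown and
error checks appear in the quoted diff; the bandwidth check and its field
names (xfer_limit, bytes_xfer) are assumptions about the surrounding code and
may not match the actual source:

int qemu_file_rate_limit(QEMUFile *f)
{
    if (f->shutdown) {
        return 1;   /* '1' means "stop what you're doing" */
    }
    if (qemu_file_get_error(f)) {
        return 1;   /* an errored stream also stops the writer */
    }
    /* Assumed bandwidth check; field names are illustrative. */
    if (f->xfer_limit > 0 && f->bytes_xfer > f->xfer_limit) {
        return 1;   /* over the bandwidth budget for this iteration */
    }
    return 0;       /* '0' means "carry on" */
}

With f->shutdown stuck at true after a failover, ram_save_iterate() does exit
its loop as intended, but ram_save_host_page() instead calls
migration_rate_limit(), which sleeps in qemu_sem_timedwait() without ever
checking the stream for errors; that is the wait visible in frames #0-#4 of
the backtrace above.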

* Re: [PATCH 5/6] migration/qemu-file.c: Don't ratelimit a shutdown fd
  2020-05-19 14:50         ` Dr. David Alan Gilbert
@ 2020-05-20 20:44           ` Lukas Straub
  0 siblings, 0 replies; 22+ messages in thread
From: Lukas Straub @ 2020-05-20 20:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Zhanghailiang, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 5893 bytes --]

On Tue, 19 May 2020 15:50:20 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Lukas Straub (lukasstraub2@web.de) wrote:
> > On Mon, 18 May 2020 12:55:34 +0100
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >   
> > > * Zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:  
> > > > > This causes the migration thread to hang if we failover during checkpoint. A
> > > > > shutdown fd won't cause network traffic anyway.
> > > > >     
> > > > 
> > > > I'm not quite sure if this modification can take side effect on normal migration process or not,
> > > > There are several places calling it.
> > > > 
> > > > Maybe Juan and Dave can help ;)
> > > >     
> > > > > Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> > > > > ---
> > > > >  migration/qemu-file.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c index
> > > > > 1c3a358a14..0748b5810f 100644
> > > > > --- a/migration/qemu-file.c
> > > > > +++ b/migration/qemu-file.c
> > > > > @@ -660,7 +660,7 @@ int64_t qemu_ftell(QEMUFile *f)  int
> > > > > qemu_file_rate_limit(QEMUFile *f)  {
> > > > >      if (f->shutdown) {
> > > > > -        return 1;
> > > > > +        return 0;
> > > > >      }    
> > > 
> > > This looks wrong to me; I'd be curious to understand how it's hanging
> > > for you.
> > > '1' means 'stop what you're doing', 0 means carry on; carrying on with a
> > > shutdown fd sounds wrong.
> > > 
> > > If we look at ram.c we have:
> > > 
> > >         while ((ret = qemu_file_rate_limit(f)) == 0 ||
> > >                 !QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
> > >             int pages;
> > >         ....
> > > 
> > > so if it returns '1', as it does at the moment it should cause it to
> > > exit the ram_save_iterate loop - which is what we want if it's failing.
> > > Thus I think you need to find the actual place it's stuck in this case -
> > > I suspect it's repeatedly calling ram_save_iterate and then exiting it,
> > > but if that's happening perhaps we're missing a qemu_file_get_error
> > > check somewhere.  
> > 
> > Hi,
> > the problem is in ram_save_host_page and migration_rate_limit, here is a backtrace:  
> 
> Ah...
> 
> > #0  0x00007f7b502921a8 in futex_abstimed_wait_cancelable (private=0, abstime=0x7f7ada7fb3f0, clockid=0, expected=0, futex_word=0x55bc358b9908) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
> > #1  do_futex_wait (sem=sem@entry=0x55bc358b9908, abstime=abstime@entry=0x7f7ada7fb3f0, clockid=0) at sem_waitcommon.c:112
> > #2  0x00007f7b502922d3 in __new_sem_wait_slow (sem=0x55bc358b9908, abstime=0x7f7ada7fb3f0, clockid=0) at sem_waitcommon.c:184
> > #3  0x000055bc3382b6c1 in qemu_sem_timedwait (sem=0x55bc358b9908, ms=100) at util/qemu-thread-posix.c:306
> > #4  0x000055bc3363950b in migration_rate_limit () at migration/migration.c:3365  
> 
> OK, so how about:
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index b6b662e016..4e885385a8 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3356,6 +3356,10 @@ bool migration_rate_limit(void)
>      bool urgent = false;
>      migration_update_counters(s, now);
>      if (qemu_file_rate_limit(s->to_dst_file)) {
> +
> +        if (qemu_file_get_error(mis->from_src_file)) {
> +            return false;
> +        }
>          /*
>           * Wait for a delay to do rate limiting OR
>           * something urgent to post the semaphore.
> 
> Does that work?

Yes, this works well using s->to_dst_file instead of mis->from_src_file.

Regards,
Lukas Straub


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread
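
Putting the proposal and the correction together, the agreed shape of the fix
is the hunk below, using s->to_dst_file since mis is not in scope in
migration_rate_limit(). This is a sketch reconstructed from the thread and may
differ in detail from the patch that was eventually merged:

diff --git a/migration/migration.c b/migration/migration.c
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3356,6 +3356,11 @@ bool migration_rate_limit(void)
     bool urgent = false;
     migration_update_counters(s, now);
     if (qemu_file_rate_limit(s->to_dst_file)) {
+        /* A shutdown or errored stream reports itself as rate-limited
+         * forever; bail out instead of sleeping on it. */
+        if (qemu_file_get_error(s->to_dst_file)) {
+            return false;
+        }
         /*
          * Wait for a delay to do rate limiting OR
          * something urgent to post the semaphore.

Returning false skips the qemu_sem_timedwait() sleep, so ram_save_host_page()
finishes its loop promptly and the failure is then caught by the existing
qemu_file_get_error() checks further up the call chain.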

* Re: [PATCH 0/6] colo: migration related bugfixes
  2020-05-11 11:10 [PATCH 0/6] colo: migration related bugfixes Lukas Straub
                   ` (5 preceding siblings ...)
  2020-05-11 11:11 ` [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event to the right place Lukas Straub
@ 2020-06-01 16:50 ` Dr. David Alan Gilbert
  6 siblings, 0 replies; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-01 16:50 UTC (permalink / raw)
  To: Lukas Straub; +Cc: Juan Quintela, qemu-devel, Hailiang Zhang

* Lukas Straub (lukasstraub2@web.de) wrote:
> Hello Everyone,
> Here are fixes for bugs that I found in my tests.

I've queued 1, 2, 3, 4 and 6, but not 5, since that one was still under discussion.

Dave

> Regards,
> Lukas Straub
> 
> Lukas Straub (6):
>   migration/colo.c: Use event instead of semaphore
>   migration/colo.c: Use cpu_synchronize_all_states()
>   migration/colo.c: Flush ram cache only after receiving device state
>   migration/colo.c: Relaunch failover even if there was an error
>   migration/qemu-file.c: Don't ratelimit a shutdown fd
>   migration/colo.c: Move colo_notify_compares_event to the right place
> 
>  migration/colo.c      | 39 ++++++++++++++++++++++++---------------
>  migration/migration.h |  4 ++--
>  migration/qemu-file.c |  2 +-
>  migration/ram.c       |  5 +----
>  migration/ram.h       |  1 +
>  5 files changed, 29 insertions(+), 22 deletions(-)
> 
> -- 
> 2.20.1


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 22+ messages in thread
