[Qemu-devel] [Bug?] BQL about live migration

* [Qemu-devel] [Bug?] BQL about live migration
@ 2017-03-03  9:29 Gonglei (Arei)
  2017-03-03 10:42 ` Fam Zheng
  2017-03-03 12:00 ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 12+ messages in thread
From: Gonglei (Arei) @ 2017-03-03  9:29 UTC (permalink / raw)
  To: quintela, Dr. David Alan Gilbert, qemu-devel; +Cc: yanghongyang, Huangzhichao

Hello Juan & Dave,

We hit a bug in our test:
Network error occurs when migrating a guest, libvirt then rollback the
migration, causes qemu coredump
qemu log:
2017-03-01T12:54:33.904949+08:00|info|qemu[17672]|[33614]|monitor_qapi_event_emit[479]|: {"timestamp": {"seconds": 1488344073, "microseconds": 904914}, "event": "STOP"}
2017-03-01T12:54:37.522500+08:00|info|qemu[17672]|[17672]|handle_qmp_command[3930]|: qmp_cmd_name: migrate_cancel
2017-03-01T12:54:37.522607+08:00|info|qemu[17672]|[17672]|monitor_qapi_event_emit[479]|: {"timestamp": {"seconds": 1488344077, "microseconds": 522556}, "event": "MIGRATION", "data": {"status": "cancelling"}}
2017-03-01T12:54:37.524671+08:00|info|qemu[17672]|[17672]|handle_qmp_command[3930]|: qmp_cmd_name: cont
2017-03-01T12:54:37.524733+08:00|info|qemu[17672]|[17672]|virtio_set_status[725]|: virtio-balloon device status is 7 that means DRIVER OK
2017-03-01T12:54:37.525434+08:00|info|qemu[17672]|[17672]|virtio_set_status[725]|: virtio-net device status is 7 that means DRIVER OK
2017-03-01T12:54:37.525484+08:00|info|qemu[17672]|[17672]|virtio_set_status[725]|: virtio-blk device status is 7 that means DRIVER OK
2017-03-01T12:54:37.525562+08:00|info|qemu[17672]|[17672]|virtio_set_status[725]|: virtio-serial device status is 7 that means DRIVER OK
2017-03-01T12:54:37.527653+08:00|info|qemu[17672]|[17672]|vm_start[981]|: vm_state-notify:3ms
2017-03-01T12:54:37.528523+08:00|info|qemu[17672]|[17672]|monitor_qapi_event_emit[479]|: {"timestamp": {"seconds": 1488344077, "microseconds": 527699}, "event": "RESUME"}
2017-03-01T12:54:37.530680+08:00|info|qemu[17672]|[33614]|migration_bitmap_sync[720]|: this iteration cycle takes 3s, new dirtied data:0MB
2017-03-01T12:54:37.530909+08:00|info|qemu[17672]|[33614]|monitor_qapi_event_emit[479]|: {"timestamp": {"seconds": 1488344077, "microseconds": 530733}, "event": "MIGRATION_PASS", "data": {"pass": 3}}
2017-03-01T04:54:37.530997Z qemu-kvm: socket_writev_buffer: Got err=32 for (131583/18446744073709551615)
qemu-kvm: /home/abuild/rpmbuild/BUILD/qemu-kvm-2.6.0/hw/net/virtio_net.c:1519: virtio_net_save: Assertion `!n->vhost_started' failed.
2017-03-01 12:54:43.028: shutting down

>From qemu log, qemu received and processed migrate_cancel/cont qmp commands
after guest been stopped and entered the last round of migration. Then
migration thread try to save device state when guest is running(started by
cont command), causes assert and coredump.
This is because in last iter, we call cpu_synchronize_all_states() to
synchronize vcpu states, this call will release qemu_global_mutex and wait
for do_kvm_cpu_synchronize_state() to be executed on target vcpu:
(gdb) bt
#0  0x00007f763d1046d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f7643e51d7f in qemu_cond_wait (cond=0x7f764445eca0 <qemu_work_cond>, mutex=0x7f764445eba0 <qemu_global_mutex>) at util/qemu-thread-posix.c:132
#2  0x00007f7643a2e154 in run_on_cpu (cpu=0x7f7644e06d80, func=0x7f7643a46413 <do_kvm_cpu_synchronize_state>, data=0x7f7644e06d80) at /mnt/public/yanghy/qemu-kvm/cpus.c:995
#3  0x00007f7643a46487 in kvm_cpu_synchronize_state (cpu=0x7f7644e06d80) at /mnt/public/yanghy/qemu-kvm/kvm-all.c:1805
#4  0x00007f7643a2c700 in cpu_synchronize_state (cpu=0x7f7644e06d80) at /mnt/public/yanghy/qemu-kvm/include/sysemu/kvm.h:457
#5  0x00007f7643a2db0c in cpu_synchronize_all_states () at /mnt/public/yanghy/qemu-kvm/cpus.c:766
#6  0x00007f7643a67b5b in qemu_savevm_state_complete_precopy (f=0x7f76462f2d30, iterable_only=false) at /mnt/public/yanghy/qemu-kvm/migration/savevm.c:1051
#7  0x00007f7643d121e9 in migration_completion (s=0x7f76443e78c0 <current_migration.37571>, current_active_state=4, old_vm_running=0x7f74343fda00, start_time=0x7f74343fda08) at migration/migration.c:1753
#8  0x00007f7643d126c5 in migration_thread (opaque=0x7f76443e78c0 <current_migration.37571>) at migration/migration.c:1922
#9  0x00007f763d100dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f763ce2e71d in clone () from /lib64/libc.so.6
(gdb) p iothread_locked
$1 = true

and then, qemu main thread been executed, it won't block because migration
thread released the qemu_global_mutex:
(gdb) thr 1
[Switching to thread 1 (Thread 0x7fe298e08bc0 (LWP 30767))]
#0  os_host_main_loop_wait (timeout=931565) at main-loop.c:270
270                 QEMU_LOG(LOG_INFO,"***** after qemu_pool_ns: timeout %d\n", timeout);
(gdb) p iothread_locked
$2 = true
(gdb) l 268
263
264         ret = qemu_poll_ns((GPollFD *)gpollfds->data, gpollfds->len, timeout);
265
266
267         if (timeout) {
268             qemu_mutex_lock_iothread();
269             if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
270                 QEMU_LOG(LOG_INFO,"***** after qemu_pool_ns: timeout %d\n", timeout);
271             }
272         }
(gdb)

So, although we've hold iothread_lock in stop&copy phase of migration, we
can't guarantee the iothread been locked all through the stop & copy phase,
any thoughts on how to solve this problem?

Thanks,
-Gonglei

^ permalink raw reply	[flat|nested] 12+ messages in thread