* [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Jason J. Herne @ 2016-05-23 18:54 UTC
  To: qemu-block, qemu-devel, famz, stefanha, jcody, quintela

Using libvirt to migrate a guest and one guest disk that uses an 
iothread causes QEMU to crash with the message:
Co-routine re-entered recursively

I've looked into this one a bit but I have not seen anything that 
immediately stands out.
Here is what I have found:

In qemu_coroutine_enter:
     if (co->caller) {
         fprintf(stderr, "Co-routine re-entered recursively\n");
         abort();
     }

The value of co->caller is actually changing between the time "if 
(co->caller)" is evaluated and the time I print some debug statements 
directly under the existing fprintf. I confirmed this by saving the 
value in a local variable and printing both the local variable and 
co->caller immediately after the existing fprintf. This certainly 
indicates some kind of concurrency issue. However, it does not 
necessarily explain how we ended up inside this if statement, because 
co->caller was already non-NULL before it was trashed. Perhaps it was 
trashed more than once? I figured the problem might be with coroutine 
pools, so I disabled them (--disable-coroutine-pool) and still 
hit the bug.
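
To make the race concrete, here is a minimal self-contained model of 
that check-then-use pattern (plain pthreads, not QEMU code; all names 
are invented for the illustration). One thread rewrites a shared 
"caller" field while another compares a snapshot of it against the 
live value, which is the kind of mismatch the debug statements showed:

/* Model of the race -- NOT QEMU code. Compile with: gcc -pthread */
#include <pthread.h>
#include <stdio.h>

struct coroutine {
    void *volatile caller;    /* written by one thread, read by another */
};

static struct coroutine co;

static void *writer(void *arg)
{
    (void)arg;
    /* Unsynchronized writes, standing in for another thread
     * entering/leaving the coroutine concurrently. */
    for (long i = 0; i < 100000000L; i++) {
        co.caller = (void *)0x1;
        co.caller = NULL;
    }
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, writer, NULL);

    for (long i = 0; i < 100000000L; i++) {
        void *snapshot = co.caller;      /* what "if (co->caller)" saw */

        if (snapshot != co.caller) {     /* field changed under us */
            printf("saved caller=%p, co->caller now=%p\n",
                   snapshot, co.caller);
            break;
        }
    }
    pthread_join(t, NULL);
    return 0;
}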

The backtrace is not always identical. Here is one instance:
(gdb) bt
#0  0x000003ffa78be2c0 in raise () from /lib64/libc.so.6
#1  0x000003ffa78bfc26 in abort () from /lib64/libc.so.6
#2  0x0000000080427d80 in qemu_coroutine_enter (co=0xa2cf2b40, opaque=0x0) at /root/kvmdev/qemu/util/qemu-coroutine.c:112
#3  0x000000008032246e in nbd_restart_write (opaque=0xa2d0cd40) at /root/kvmdev/qemu/block/nbd-client.c:114
#4  0x00000000802b3a1c in aio_dispatch (ctx=0xa2c907a0) at /root/kvmdev/qemu/aio-posix.c:341
#5  0x00000000802b4332 in aio_poll (ctx=0xa2c907a0, blocking=true) at /root/kvmdev/qemu/aio-posix.c:479
#6  0x0000000080155aba in iothread_run (opaque=0xa2c90260) at /root/kvmdev/qemu/iothread.c:46
#7  0x000003ffa7a87c2c in start_thread () from /lib64/libpthread.so.0
#8  0x000003ffa798ec9a in thread_start () from /lib64/libc.so.6

I've also noticed that co->entry sometimes (maybe always?) points to 
mirror_run. Though, given that co->caller changes unexpectedly, I don't 
know whether we can trust co->entry.

I do not see the bug when I perform the same migration without migrating 
the disk.
I also do not see the bug when I remove the iothread from the guest.

I tested this scenario as far back as tag v2.4.0 and hit the bug every 
time. I was unable to test v2.3.0 due to unresolved guest hangs. I did, 
however, manage to get as far as this commit:

commit ca96ac44dcd290566090b2435bc828fded356ad9
Author: Stefan Hajnoczi <stefanha@redhat.com>
Date:   Tue Jul 28 18:34:09 2015 +0200
AioContext: force event loop iteration using BH

This commit fixes a hang that my test scenario experiences. I was able 
to test even further back by cherry-picking ca96ac44 on top of the 
earlier commits, but at that point I could not be sure whether the bug 
itself was introduced by ca96ac44, so I stopped.

I am willing to run tests or collect any info needed. I'll keep 
investigating but I won't turn down any help :).

QEMU command line as taken from the libvirt log:
qemu-system-s390x
     -name kvm1 -S -machine s390-ccw-virtio-2.6,accel=kvm,usb=off
     -m 6144 -realtime mlock=off
     -smp 1,sockets=1,cores=1,threads=1
     -object iothread,id=iothread1
     -uuid 3796d9f0-8555-4a1e-9d5c-fac56b8cbf56
     -nographic -no-user-config -nodefaults
     -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-kvm1/monitor.sock,server,nowait
     -mon chardev=charmonitor,id=monitor,mode=control
     -rtc base=utc -no-shutdown
     -boot strict=on -kernel /data/vms/kvm1/kvm1-image
     -initrd /data/vms/kvm1/kvm1-initrd -append 'hvc_iucv=8 TERM=dumb'
     -drive file=/dev/disk/by-path/ccw-0.0.c22b,format=raw,if=none,id=drive-virtio-disk0,cache=none
     -device virtio-blk-ccw,scsi=off,devno=fe.0.0000,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
     -drive file=/data/vms/kvm1/kvm1.qcow,format=qcow2,if=none,id=drive-virtio-disk1,cache=none
     -device virtio-blk-ccw,iothread=iothread1,scsi=off,devno=fe.0.0008,drive=drive-virtio-disk1,id=virtio-disk1
     -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27
     -device virtio-net-ccw,netdev=hostnet0,id=net0,mac=52:54:00:c9:86:2b,devno=fe.0.0001
     -chardev pty,id=charconsole0
     -device sclpconsole,chardev=charconsole0,id=console0
     -device virtio-balloon-ccw,id=balloon0,devno=fe.0.0002
     -msg timestamp=on

Libvirt migration command:
virsh migrate --live --persistent --copy-storage-all --migrate-disks vdb kvm1 qemu+ssh://dev1/system

-- 
-- Jason J. Herne (jjherne@linux.vnet.ibm.com)


* Re: [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Fam Zheng @ 2016-05-24  2:12 UTC
  To: Jason J. Herne; +Cc: qemu-block, qemu-devel, stefanha, jcody, quintela

On Mon, 05/23 14:54, Jason J. Herne wrote:
> Using libvirt to migrate a guest and one guest disk that is using iothreads
> causes Qemu to crash with the message:
> Co-routine re-entered recursively
> 
> I've looked into this one a bit but I have not seen anything that
> immediately stands out.
> Here is what I have found:
> 
> In qemu_coroutine_enter:
>     if (co->caller) {
>         fprintf(stderr, "Co-routine re-entered recursively\n");
>         abort();
>     }
> 
> The value of co->caller is actually changing between the time "if
> (co->caller)" is evaluated and the time I print some debug statements
> directly under the existing fprintf. I confirmed this by saving the value in
> a local variable and printing both the new local variable and co->caller
> immediately after the existing fprintf. This would certainly indicate some
> kind of concurrency issue. However, it does not necessarily point to the
> reason we ended up inside this if statement because co->caller was not NULL
> before it was trashed. Perhaps it was trashed more than once then? I figured
> maybe the problem was with coroutine pools so I disabled them
> (--disable-coroutine-pool) and still hit the bug.

Which coroutine backend are you using?

> 
> The backtrace is not always identical. Here is one instance:
> (gdb) bt
> #0  0x000003ffa78be2c0 in raise () from /lib64/libc.so.6
> #1  0x000003ffa78bfc26 in abort () from /lib64/libc.so.6
> #2  0x0000000080427d80 in qemu_coroutine_enter (co=0xa2cf2b40, opaque=0x0)
> at /root/kvmdev/qemu/util/qemu-coroutine.c:112
> #3  0x000000008032246e in nbd_restart_write	 (opaque=0xa2d0cd40) at
> /root/kvmdev/qemu/block/nbd-client.c:114
> #4  0x00000000802b3a1c in aio_dispatch (ctx=0xa2c907a0) at
> /root/kvmdev/qemu/aio-posix.c:341
> #5  0x00000000802b4332 in aio_poll (ctx=0xa2c907a0, blocking=true) at
> /root/kvmdev/qemu/aio-posix.c:479
> #6  0x0000000080155aba in iothread_run (opaque=0xa2c90260) at
> /root/kvmdev/qemu/iothread.c:46
> #7  0x000003ffa7a87c2c in start_thread () from /lib64/libpthread.so.0
> #8  0x000003ffa798ec9a in thread_start () from /lib64/libc.so.6

It may be worth looking at the backtraces of all threads, especially the
monitor thread (the main thread).

> 
> I've also noticed that co->entry sometimes (maybe always?) points to
> mirror_run. Though, given that co->caller changes unexpectedly I don't know
> if we can trust co->entry.
> 
> I do not see the bug when I perform the same migration without migrating the
> disk.
> I also do not see the bug when I remove the iothread from the guest.
> 
> I tested this scenario as far back as tag v2.4.0 and hit the bug every time.
> I was unable to test v2.3.0 due to unresolved guest hangs. I did, however,
> manage to get as far as this commit:
> 
> commit ca96ac44dcd290566090b2435bc828fded356ad9
> Author: Stefan Hajnoczi <stefanha@redhat.com>
> Date:   Tue Jul 28 18:34:09 2015 +0200
> AioContext: force event loop iteration using BH
> 
> This commit fixes a hang that my test scenario experiences. I was able to
> test even further back by cherry-picking ca96ac44 on top of the earlier
> commits but at this point I cannot be sure if the bug was introduced by
> ca96ac44 so I stopped.
> 
> I am willing to run tests or collect any info needed. I'll keep
> investigating but I won't turn down any help :).
> 
> Qemu command line as taken from Libvirt log:
> qemu-system-s390x
>     -name kvm1 -S -machine s390-ccw-virtio-2.6,accel=kvm,usb=off
>     -m 6144 -realtime mlock=off
>     -smp 1,sockets=1,cores=1,threads=1
>     -object iothread,id=iothread1
>     -uuid 3796d9f0-8555-4a1e-9d5c-fac56b8cbf56
>     -nographic -no-user-config -nodefaults
>     -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-kvm1/monitor.sock,server,nowait
>     -mon chardev=charmonitor,id=monitor,mode=control
>     -rtc base=utc -no-shutdown
>     -boot strict=on -kernel /data/vms/kvm1/kvm1-image
>     -initrd /data/vms/kvm1/kvm1-initrd -append 'hvc_iucv=8 TERM=dumb'
>     -drive file=/dev/disk/by-path/ccw-0.0.c22b,format=raw,if=none,id=drive-virtio-disk0,cache=none
>     -device virtio-blk-ccw,scsi=off,devno=fe.0.0000,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
>     -drive file=/data/vms/kvm1/kvm1.qcow,format=qcow2,if=none,id=drive-virtio-disk1,cache=none
>     -device virtio-blk-ccw,iothread=iothread1,scsi=off,devno=fe.0.0008,drive=drive-virtio-disk1,id=virtio-disk1
>     -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27
>     -device
> virtio-net-ccw,netdev=hostnet0,id=net0,mac=52:54:00:c9:86:2b,devno=fe.0.0001
>     -chardev pty,id=charconsole0 -device
> sclpconsole,chardev=charconsole0,id=console0
>     -device virtio-balloon-ccw,id=balloon0,devno=fe.0.0002 -msg timestamp=on
> 
> Libvirt migration command:
> virsh migrate --live --persistent --copy-storage-all --migrate-disks vdb
> kvm1 qemu+ssh://dev1/system
> 
> -- 
> -- Jason J. Herne (jjherne@linux.vnet.ibm.com)
> 


* Re: [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Jason J. Herne @ 2016-05-24 15:05 UTC
  To: Fam Zheng; +Cc: jcody, stefanha, qemu-devel, qemu-block, quintela

On 05/23/2016 10:12 PM, Fam Zheng wrote:
> On Mon, 05/23 14:54, Jason J. Herne wrote:
>> Using libvirt to migrate a guest and one guest disk that is using iothreads
>> causes Qemu to crash with the message:
>> Co-routine re-entered recursively
>>
>> I've looked into this one a bit but I have not seen anything that
>> immediately stands out.
>> Here is what I have found:
>>
>> In qemu_coroutine_enter:
>>      if (co->caller) {
>>          fprintf(stderr, "Co-routine re-entered recursively\n");
>>          abort();
>>      }
>>
>> The value of co->caller is actually changing between the time "if
>> (co->caller)" is evaluated and the time I print some debug statements
>> directly under the existing fprintf. I confirmed this by saving the value in
>> a local variable and printing both the new local variable and co->caller
>> immediately after the existing fprintf. This would certainly indicate some
>> kind of concurrency issue. However, it does not necessarily point to the
>> reason we ended up inside this if statement because co->caller was not NULL
>> before it was trashed. Perhaps it was trashed more than once then? I figured
>> maybe the problem was with coroutine pools so I disabled them
>> (--disable-coroutine-pool) and still hit the bug.
>
> Which coroutine backend are you using?
>

ucontext normally. I've also reproduced the problem with sigaltstack.
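
For reference, the backend is chosen at QEMU build time. A typical way 
to switch backends for a test build (standard QEMU source tree assumed) 
is:

     ./configure --with-coroutine=sigaltstack --disable-coroutine-pool
     make

where --disable-coroutine-pool is the same switch mentioned earlier in 
the thread.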

>>
>> The backtrace is not always identical. Here is one instance:
>> (gdb) bt
>> #0  0x000003ffa78be2c0 in raise () from /lib64/libc.so.6
>> #1  0x000003ffa78bfc26 in abort () from /lib64/libc.so.6
>> #2  0x0000000080427d80 in qemu_coroutine_enter (co=0xa2cf2b40, opaque=0x0)
>> at /root/kvmdev/qemu/util/qemu-coroutine.c:112
>> #3  0x000000008032246e in nbd_restart_write	 (opaque=0xa2d0cd40) at
>> /root/kvmdev/qemu/block/nbd-client.c:114
>> #4  0x00000000802b3a1c in aio_dispatch (ctx=0xa2c907a0) at
>> /root/kvmdev/qemu/aio-posix.c:341
>> #5  0x00000000802b4332 in aio_poll (ctx=0xa2c907a0, blocking=true) at
>> /root/kvmdev/qemu/aio-posix.c:479
>> #6  0x0000000080155aba in iothread_run (opaque=0xa2c90260) at
>> /root/kvmdev/qemu/iothread.c:46
>> #7  0x000003ffa7a87c2c in start_thread () from /lib64/libpthread.so.0
>> #8  0x000003ffa798ec9a in thread_start () from /lib64/libc.so.6
>
> It may be worth looking at backtrace of all threads especially the monitor
> thread (main thread).
>

Here is a complete backtrace:
(gdb) thread apply all bt
Thread 1 is the main thread; thread 13 is the crashing thread.

Thread 50 (Thread 0x3fdeb1f1910 (LWP 29570)):
#0  0x000003ff99c901fc in do_futex_wait () from /lib64/libpthread.so.0
#1  0x000003ff99c90302 in __new_sem_wait_slow () from /lib64/libpthread.so.0
#2  0x00000000804097a4 in qemu_sem_timedwait (sem=0x3ff8c000ce8, ms=10000) at /root/kvmdev/qemu/util/qemu-thread-posix.c:245
#3  0x00000000802a215e in worker_thread (opaque=0x3ff8c000c80) at /root/kvmdev/qemu/thread-pool.c:92
#4  0x000003ff99c87c2c in start_thread () from /lib64/libpthread.so.0
#5  0x000003ff99b8ec9a in thread_start () from /lib64/libc.so.6

Thread 49 (Thread 0x3fdb47ff910 (LWP 29569)):
#0  0x000003ff99c901fc in do_futex_wait () from /lib64/libpthread.so.0
#1  0x000003ff99c90302 in __new_sem_wait_slow () from /lib64/libpthread.so.0
#2  0x00000000804097a4 in qemu_sem_timedwait (sem=0x9c54c8d8, ms=10000) at /root/kvmdev/qemu/util/qemu-thread-posix.c:245
#3  0x00000000802a215e in worker_thread (opaque=0x9c54c870) at /root/kvmdev/qemu/thread-pool.c:92
#4  0x000003ff99c87c2c in start_thread () from /lib64/libpthread.so.0
#5  0x000003ff99b8ec9a in thread_start () from /lib64/libc.so.6

Thread 15 (Thread 0x3ff999ff910 (LWP 29449)):
#0  0x000003ff99b8841e in syscall () from /lib64/libc.so.6
#1  0x00000000804099c6 in futex_wait (ev=0x80ac597c <rcu_call_ready_event>, val=4294967295) at /root/kvmdev/qemu/util/qemu-thread-posix.c:292
#2  0x0000000080409c56 in qemu_event_wait (ev=0x80ac597c <rcu_call_ready_event>) at /root/kvmdev/qemu/util/qemu-thread-posix.c:399
#3  0x00000000804271ec in call_rcu_thread (opaque=0x0) at /root/kvmdev/qemu/util/rcu.c:250
#4  0x000003ff99c87c2c in start_thread () from /lib64/libpthread.so.0
#5  0x000003ff99b8ec9a in thread_start () from /lib64/libc.so.6

Thread 14 (Thread 0x3ff991ff910 (LWP 29451)):
#0  0x000003ff99b8841e in syscall () from /lib64/libc.so.6
#1  0x000003ff9a19e330 in g_cond_wait () from /lib64/libglib-2.0.so.0
#2  0x000000008039d936 in wait_for_trace_records_available () at /root/kvmdev/qemu/trace/simple.c:147
#3  0x000000008039d9c6 in writeout_thread (opaque=0x0) at /root/kvmdev/qemu/trace/simple.c:165
#4  0x000003ff9a17c4cc in g_thread_proxy () from /lib64/libglib-2.0.so.0
#5  0x000003ff99c87c2c in start_thread () from /lib64/libpthread.so.0
#6  0x000003ff99b8ec9a in thread_start () from /lib64/libc.so.6

Thread 13 (Thread 0x3ff989ff910 (LWP 29452)):
#0  0x000003ff99abe2c0 in raise () from /lib64/libc.so.6
#1  0x000003ff99abfc26 in abort () from /lib64/libc.so.6
#2  0x0000000080427d80 in qemu_coroutine_enter (co=0x9c5a4120, opaque=0x0) at /root/kvmdev/qemu/util/qemu-coroutine.c:112
#3  0x000000008032246e in nbd_restart_write (opaque=0x9c5897b0) at /root/kvmdev/qemu/block/nbd-client.c:114
#4  0x00000000802b3a1c in aio_dispatch (ctx=0x9c530770) at /root/kvmdev/qemu/aio-posix.c:341
#5  0x00000000802b4332 in aio_poll (ctx=0x9c530770, blocking=true) at /root/kvmdev/qemu/aio-posix.c:479
#6  0x0000000080155aba in iothread_run (opaque=0x9c530200) at /root/kvmdev/qemu/iothread.c:46
#7  0x000003ff99c87c2c in start_thread () from /lib64/libpthread.so.0
#8  0x000003ff99b8ec9a in thread_start () from /lib64/libc.so.6

Thread 12 (Thread 0x3ff91b5c910 (LWP 29456)):
#0  0x000003ff99c8d68a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000008040932e in qemu_cond_wait (cond=0x9c904690, mutex=0x8065aab0 <qemu_global_mutex>) at /root/kvmdev/qemu/util/qemu-thread-posix.c:123
#2  0x000000008005c1d6 in qemu_kvm_wait_io_event (cpu=0x9c8c8e80) at /root/kvmdev/qemu/cpus.c:1030
#3  0x000000008005c37a in qemu_kvm_cpu_thread_fn (arg=0x9c8c8e80) at /root/kvmdev/qemu/cpus.c:1069
#4  0x000003ff99c87c2c in start_thread () from /lib64/libpthread.so.0
#5  0x000003ff99b8ec9a in thread_start () from /lib64/libc.so.6

Thread 1 (Thread 0x3ff9a6f2a90 (LWP 29433)):
#0  0x000003ff99c8d68a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000008040932e in qemu_cond_wait (cond=0x9c530800, mutex=0x9c5307d0) at /root/kvmdev/qemu/util/qemu-thread-posix.c:123
#2  0x0000000080426a38 in rfifolock_lock (r=0x9c5307d0) at /root/kvmdev/qemu/util/rfifolock.c:59
#3  0x00000000802a1f72 in aio_context_acquire (ctx=0x9c530770) at /root/kvmdev/qemu/async.c:373
#4  0x00000000802b3f54 in aio_poll (ctx=0x9c530770, blocking=true) at /root/kvmdev/qemu/aio-posix.c:415
#5  0x000000008031e7ac in bdrv_flush (bs=0x9c59b5c0) at /root/kvmdev/qemu/block/io.c:2470
#6  0x00000000802a8e6e in bdrv_close (bs=0x9c59b5c0) at /root/kvmdev/qemu/block.c:2134
#7  0x00000000802a9966 in bdrv_delete (bs=0x9c59b5c0) at /root/kvmdev/qemu/block.c:2341
#8  0x00000000802ac7c6 in bdrv_unref (bs=0x9c59b5c0) at /root/kvmdev/qemu/block.c:3376
#9  0x0000000080315340 in mirror_exit (job=0x9c956ed0, opaque=0x9c9570d0) at /root/kvmdev/qemu/block/mirror.c:494
#10 0x00000000802afb52 in block_job_defer_to_main_loop_bh (opaque=0x9c90dc10) at /root/kvmdev/qemu/blockjob.c:476
#11 0x00000000802a10a8 in aio_bh_call (bh=0x9c9090a0) at /root/kvmdev/qemu/async.c:66
#12 0x00000000802a1206 in aio_bh_poll (ctx=0x9c51e6c0) at /root/kvmdev/qemu/async.c:94
#13 0x00000000802b389e in aio_dispatch (ctx=0x9c51e6c0) at /root/kvmdev/qemu/aio-posix.c:308
#14 0x00000000802a1854 in aio_ctx_dispatch (source=0x9c51e6c0, callback=0x0, user_data=0x0) at /root/kvmdev/qemu/async.c:233
#15 0x000003ff9a151c0a in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#16 0x00000000802b05ce in glib_pollfds_poll () at /root/kvmdev/qemu/main-loop.c:213
#17 0x00000000802b070a in os_host_main_loop_wait (timeout=0) at /root/kvmdev/qemu/main-loop.c:258
#18 0x00000000802b0816 in main_loop_wait (nonblocking=0) at /root/kvmdev/qemu/main-loop.c:506
#19 0x000000008016d434 in main_loop () at /root/kvmdev/qemu/vl.c:1934
#20 0x00000000801756b8 in main (argc=54, argv=0x3ffdcd7ee58, envp=0x3ffdcd7f010) at /root/kvmdev/qemu/vl.c:4656

-- 
-- Jason J. Herne (jjherne@linux.vnet.ibm.com)


* Re: [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Fam Zheng @ 2016-05-25  8:36 UTC
  To: Jason J. Herne; +Cc: jcody, stefanha, qemu-devel, qemu-block, quintela

On Tue, 05/24 11:05, Jason J. Herne wrote:
> Thread 13 (Thread 0x3ff989ff910 (LWP 29452)):
> #0  0x000003ff99abe2c0 in raise () from /lib64/libc.so.6
> #1  0x000003ff99abfc26 in abort () from /lib64/libc.so.6
> #2  0x0000000080427d80 in qemu_coroutine_enter (co=0x9c5a4120, opaque=0x0)
> at /root/kvmdev/qemu/util/qemu-coroutine.c:112
> #3  0x000000008032246e in nbd_restart_write (opaque=0x9c5897b0) at
> /root/kvmdev/qemu/block/nbd-client.c:114
> #4  0x00000000802b3a1c in aio_dispatch (ctx=0x9c530770) at
> /root/kvmdev/qemu/aio-posix.c:341
> #5  0x00000000802b4332 in aio_poll (ctx=0x9c530770, blocking=true) at
> /root/kvmdev/qemu/aio-posix.c:479
> #6  0x0000000080155aba in iothread_run (opaque=0x9c530200) at
> /root/kvmdev/qemu/iothread.c:46
> #7  0x000003ff99c87c2c in start_thread () from /lib64/libpthread.so.0
> #8  0x000003ff99b8ec9a in thread_start () from /lib64/libc.so.6

This is the continuation of a write request to the NBD target...

> Thread 1 (Thread 0x3ff9a6f2a90 (LWP 29433)):
> #0  0x000003ff99c8d68a in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x000000008040932e in qemu_cond_wait (cond=0x9c530800, mutex=0x9c5307d0)
> at /root/kvmdev/qemu/util/qemu-thread-posix.c:123
> #2  0x0000000080426a38 in rfifolock_lock (r=0x9c5307d0) at
> /root/kvmdev/qemu/util/rfifolock.c:59
> #3  0x00000000802a1f72 in aio_context_acquire (ctx=0x9c530770) at
> /root/kvmdev/qemu/async.c:373
> #4  0x00000000802b3f54 in aio_poll (ctx=0x9c530770, blocking=true) at
> /root/kvmdev/qemu/aio-posix.c:415
> #5  0x000000008031e7ac in bdrv_flush (bs=0x9c59b5c0) at
> /root/kvmdev/qemu/block/io.c:2470
> #6  0x00000000802a8e6e in bdrv_close (bs=0x9c59b5c0) at
> /root/kvmdev/qemu/block.c:2134
> #7  0x00000000802a9966 in bdrv_delete (bs=0x9c59b5c0) at
> /root/kvmdev/qemu/block.c:2341
> #8  0x00000000802ac7c6 in bdrv_unref (bs=0x9c59b5c0) at
> /root/kvmdev/qemu/block.c:3376
> #9  0x0000000080315340 in mirror_exit (job=0x9c956ed0, opaque=0x9c9570d0) at
> /root/kvmdev/qemu/block/mirror.c:494
> #10 0x00000000802afb52 in block_job_defer_to_main_loop_bh
> (opaque=0x9c90dc10) at /root/kvmdev/qemu/blockjob.c:476

... while this is the completion of the mirror job. They are not
supposed to happen together: either the job completed too early, or the
nbd_restart_write function is being invoked incorrectly.
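
For context, nbd_restart_write() is tiny; in the QEMU sources of this 
era it looks roughly like the following (paraphrased from memory, so 
field and helper names may differ slightly; the backtraces above only 
confirm the qemu_coroutine_enter() call at nbd-client.c:114):

static void nbd_restart_write(void *opaque)
{
    BlockDriverState *bs = opaque;

    /* Unconditionally re-enter the coroutine that was waiting to send;
     * if that coroutine is already active somewhere else, the
     * "re-entered recursively" check in qemu_coroutine_enter() fires. */
    qemu_coroutine_enter(nbd_get_client_session(bs)->send_coroutine, NULL);
}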

I'll see if I can reproduce it here.

Fam


* Re: [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Jason J. Herne @ 2016-06-06 18:55 UTC
  To: Fam Zheng; +Cc: jcody, stefanha, qemu-devel, qemu-block, quintela

On 05/25/2016 04:36 AM, Fam Zheng wrote:
> On Tue, 05/24 11:05, Jason J. Herne wrote:
>> Thread 13 (Thread 0x3ff989ff910 (LWP 29452)):
>> #0  0x000003ff99abe2c0 in raise () from /lib64/libc.so.6
>> #1  0x000003ff99abfc26 in abort () from /lib64/libc.so.6
>> #2  0x0000000080427d80 in qemu_coroutine_enter (co=0x9c5a4120, opaque=0x0)
>> at /root/kvmdev/qemu/util/qemu-coroutine.c:112
>> #3  0x000000008032246e in nbd_restart_write (opaque=0x9c5897b0) at
>> /root/kvmdev/qemu/block/nbd-client.c:114
>> #4  0x00000000802b3a1c in aio_dispatch (ctx=0x9c530770) at
>> /root/kvmdev/qemu/aio-posix.c:341
>> #5  0x00000000802b4332 in aio_poll (ctx=0x9c530770, blocking=true) at
>> /root/kvmdev/qemu/aio-posix.c:479
>> #6  0x0000000080155aba in iothread_run (opaque=0x9c530200) at
>> /root/kvmdev/qemu/iothread.c:46
>> #7  0x000003ff99c87c2c in start_thread () from /lib64/libpthread.so.0
>> #8  0x000003ff99b8ec9a in thread_start () from /lib64/libc.so.6
>
> This is the continuation of write request to the NBD target....
>
>> Thread 1 (Thread 0x3ff9a6f2a90 (LWP 29433)):
>> #0  0x000003ff99c8d68a in pthread_cond_wait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>> #1  0x000000008040932e in qemu_cond_wait (cond=0x9c530800, mutex=0x9c5307d0)
>> at /root/kvmdev/qemu/util/qemu-thread-posix.c:123
>> #2  0x0000000080426a38 in rfifolock_lock (r=0x9c5307d0) at
>> /root/kvmdev/qemu/util/rfifolock.c:59
>> #3  0x00000000802a1f72 in aio_context_acquire (ctx=0x9c530770) at
>> /root/kvmdev/qemu/async.c:373
>> #4  0x00000000802b3f54 in aio_poll (ctx=0x9c530770, blocking=true) at
>> /root/kvmdev/qemu/aio-posix.c:415
>> #5  0x000000008031e7ac in bdrv_flush (bs=0x9c59b5c0) at
>> /root/kvmdev/qemu/block/io.c:2470
>> #6  0x00000000802a8e6e in bdrv_close (bs=0x9c59b5c0) at
>> /root/kvmdev/qemu/block.c:2134
>> #7  0x00000000802a9966 in bdrv_delete (bs=0x9c59b5c0) at
>> /root/kvmdev/qemu/block.c:2341
>> #8  0x00000000802ac7c6 in bdrv_unref (bs=0x9c59b5c0) at
>> /root/kvmdev/qemu/block.c:3376
>> #9  0x0000000080315340 in mirror_exit (job=0x9c956ed0, opaque=0x9c9570d0) at
>> /root/kvmdev/qemu/block/mirror.c:494
>> #10 0x00000000802afb52 in block_job_defer_to_main_loop_bh
>> (opaque=0x9c90dc10) at /root/kvmdev/qemu/blockjob.c:476
>
> ... while this is the completion of mirror. They are not supposed to happen
> together. Either the job is completed too early, or the nbd_restart_write
> function is invoked incorrectly.
>
> I'll see if I can reproduce it here.
>
> Fam
>

Hi Fam,
Have you had any luck reproducing this?


-- 
-- Jason J. Herne (jjherne@linux.vnet.ibm.com)


* Re: [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Fam Zheng @ 2016-06-07  2:44 UTC
  To: Jason J. Herne; +Cc: jcody, stefanha, qemu-devel, qemu-block, quintela

On Mon, 06/06 14:55, Jason J. Herne wrote:
> > I'll see if I can reproduce it here.
> > 
> > Fam
> > 
> 
> Hi Fam,
> Have you had any luck reproducing this?

No, I have not been able to reproduce it so far.


* Re: [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Jason J. Herne @ 2016-06-07 12:42 UTC
  To: Fam Zheng; +Cc: jcody, qemu-block, qemu-devel, stefanha, quintela

On 06/06/2016 10:44 PM, Fam Zheng wrote:
> On Mon, 06/06 14:55, Jason J. Herne wrote:
>>> I'll see if I can reproduce it here.
>>>
>>> Fam
>>>
>>
>> Hi Fam,
>> Have you had any luck reproducing this?
>
> No I cannot reproduce so far.
>

I can hit the problem 100% of the time. Is there any info I can provide 
to help, either with debugging or reproducing?

-- 
-- Jason J. Herne (jjherne@linux.vnet.ibm.com)


* Re: [Qemu-devel] [Qemu-block] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Stefan Hajnoczi @ 2016-06-08 15:30 UTC
  To: Jason J. Herne
  Cc: Fam Zheng, Stefan Hajnoczi, qemu-devel, qemu block, Juan Quintela, Paolo Bonzini

On Tue, Jun 7, 2016 at 1:42 PM, Jason J. Herne
<jjherne@linux.vnet.ibm.com> wrote:
> On 06/06/2016 10:44 PM, Fam Zheng wrote:
>>
>> On Mon, 06/06 14:55, Jason J. Herne wrote:
>>>>
>>>> I'll see if I can reproduce it here.
>>>>
>>>> Fam
>>>>
>>>
>>> Hi Fam,
>>> Have you had any luck reproducing this?
>>
>>
>> No I cannot reproduce so far.
>>
>
> I can hit the problem 100% of the time. Is there any info I can provide to
> help? Either with debugging or reproducing?

I have a related scenario that is reproducible on qemu.git/master:

Launch an x86 guest with virtio-blk-pci,iothread= and perform
drive_mirror to NBD.  Now live migrate (without block migration
because you already mirrored) and the source QEMU will eventually
abort.

The backtrace shows the coroutine is running in the IOThread.  It
shouldn't be doing that, since virtio-blk.c stops dataplane when live
migration/savevm begins.

What happens is that the mirror block job's coroutine continues to run
in the IOThread but the BlockDriverState's AioContext has switched
back to the main loop.

blk_add_aio_context_notifier() calls are missing for block jobs.  They
need to attach/detach when the AioContext changes.
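
A rough sketch of that direction (a sketch only, not the eventual fix:
blk_add_aio_context_notifier() is the real API of this era, while the
callback names and bodies here are illustrative):

#include "sysemu/block-backend.h"
#include "block/blockjob.h"

/* Illustrative: let a block job follow its BlockBackend's AioContext. */
static void job_attached_aio_context(AioContext *new_context, void *opaque)
{
    BlockJob *job = opaque;

    (void)new_context;  /* the job picks up the new context on resume */

    /* Resume the job so its coroutine is rescheduled in the new
     * context rather than in the context the BDS just left. */
    block_job_resume(job);
}

static void job_detach_aio_context(void *opaque)
{
    BlockJob *job = opaque;

    /* Quiesce the job before the old AioContext goes away. */
    block_job_pause(job);
}

static void job_follow_aio_context(BlockJob *job, BlockBackend *blk)
{
    blk_add_aio_context_notifier(blk, job_attached_aio_context,
                                 job_detach_aio_context, job);
}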

This needs to be fixed.  I believe Paolo has a patch to continue using
dataplane during savevm but that's a workaround rather than a solution
to this problem.

Stefan


* Re: [Qemu-devel] [Qemu-block] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Paolo Bonzini @ 2016-06-08 16:03 UTC
  To: Stefan Hajnoczi, Jason J. Herne
  Cc: Fam Zheng, Stefan Hajnoczi, qemu-devel, qemu block, Juan Quintela

On 08/06/2016 17:30, Stefan Hajnoczi wrote:
> This needs to be fixed.  I believe Paolo has a patch to continue using
> dataplane during savevm but that's a workaround rather than a solution
> to this problem.

That is already in.  Dataplane during migration has been enabled since
we've gotten rid of vring.

Paolo


* Re: [Qemu-devel] [Qemu-block] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Stefan Hajnoczi @ 2016-06-09  7:35 UTC
  To: Paolo Bonzini
  Cc: Jason J. Herne, Fam Zheng, Stefan Hajnoczi, qemu-devel, qemu block, Juan Quintela

On Wed, Jun 8, 2016 at 5:03 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 08/06/2016 17:30, Stefan Hajnoczi wrote:
>> This needs to be fixed.  I believe Paolo has a patch to continue using
>> dataplane during savevm but that's a workaround rather than a solution
>> to this problem.
>
> That is already in.  Dataplane during migration has been enabled since
> we've gotten rid of vring.

You're right.  This means the problem isn't that dataplane disables
itself during the live migration phase.

The issue occurs because the mirror block job has finished syncing but
is still alive when vmsave happens:

static void virtio_blk_save(QEMUFile *f, void *opaque)
{
    VirtIODevice *vdev = VIRTIO_DEVICE(opaque);
    VirtIOBlock *s = VIRTIO_BLK(vdev);

    if (s->dataplane) {
        virtio_blk_data_plane_stop(s->dataplane);
    }

    virtio_save(vdev, f);
}

We reach the situation I described, where the BDS AioContext changes but
mirror_run() is still in the IOThread AioContext.

Stefan


* Re: [Qemu-devel] [Qemu-block] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Paolo Bonzini @ 2016-06-09  8:25 UTC
  To: Stefan Hajnoczi
  Cc: Fam Zheng, Jason J. Herne, qemu block, Juan Quintela, qemu-devel, Stefan Hajnoczi

On 09/06/2016 09:35, Stefan Hajnoczi wrote:
> 
>     if (s->dataplane) {
>         virtio_blk_data_plane_stop(s->dataplane);
>     }
> 
>     virtio_save(vdev, f);
> }
> 
> We reach the situation I described where BDS AioContext changes but
> mirror_run() is still in the IOThread AioContext.

Why is the virtio_blk_data_plane_stop() call necessary?  It used to sync
between vring and virtio state, but that's not required anymore, and we
know that the virtqueues are quiescent at this point...

Paolo


* Re: [Qemu-devel] [Qemu-block] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Stefan Hajnoczi @ 2016-06-09  8:47 UTC
  To: Paolo Bonzini
  Cc: Fam Zheng, Jason J. Herne, qemu block, Juan Quintela, qemu-devel, Stefan Hajnoczi

On Thu, Jun 9, 2016 at 9:25 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 09/06/2016 09:35, Stefan Hajnoczi wrote:
>>
>>     if (s->dataplane) {
>>         virtio_blk_data_plane_stop(s->dataplane);
>>     }
>>
>>     virtio_save(vdev, f);
>> }
>>
>> We reach the situation I described where BDS AioContext changes but
>> mirror_run() is still in the IOThread AioContext.
>
> Why is the virtio_blk_data_plane_stop necessary?  It used to sync
> between vring and virtio state, but that's not required anymore---and we
> know that the virtqueues are quiescent at this point...

I agree.  It doesn't seem necessary anymore unless I'm missing a race
condition with ioeventfd.

It was introduced so that the savevm command (without live migration!) works:

commit 10a06fd65f667a972848ebbbcac11bdba931b544
Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
Date:   Mon Oct 26 14:42:57 2015 +0300

    virtio: sync the dataplane vring state to the virtqueue before virtio_save

We can remove it now.

Stefan


* Re: [Qemu-devel] [Qemu-block] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Stefan Hajnoczi @ 2016-06-09  8:48 UTC
  To: Paolo Bonzini
  Cc: Fam Zheng, Jason J. Herne, qemu block, Juan Quintela, qemu-devel, Stefan Hajnoczi

On Thu, Jun 9, 2016 at 9:47 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Thu, Jun 9, 2016 at 9:25 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 09/06/2016 09:35, Stefan Hajnoczi wrote:
>>>
>>>     if (s->dataplane) {
>>>         virtio_blk_data_plane_stop(s->dataplane);
>>>     }
>>>
>>>     virtio_save(vdev, f);
>>> }
>>>
>>> We reach the situation I described where BDS AioContext changes but
>>> mirror_run() is still in the IOThread AioContext.
>>
>> Why is the virtio_blk_data_plane_stop necessary?  It used to sync
>> between vring and virtio state, but that's not required anymore---and we
>> know that the virtqueues are quiescent at this point...
>
> I agree.  It doesn't seem necessary anymore unless I'm missing a race
> condition with ioeventfd.
>
> It was introduced so that the savevm command (without live migration!) works:
>
> commit 10a06fd65f667a972848ebbbcac11bdba931b544
> Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
> Date:   Mon Oct 26 14:42:57 2015 +0300
>
>     virtio: sync the dataplane vring state to the virtqueue before virtio_save
>
> We can remove it now.

BTW this doesn't mean that the block jobs' AioContext-following problem
is solved.  AioContexts can still change if the guest resets the
virtio device, for example.

I suspect we'll still hit undefined behavior when that happens.

Stefan


* Re: [Qemu-devel] [Qemu-block] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Paolo Bonzini @ 2016-06-09 10:02 UTC
  To: Stefan Hajnoczi
  Cc: Fam Zheng, Jason J. Herne, qemu block, Juan Quintela, qemu-devel, Stefan Hajnoczi

> BTW this doesn't mean that the blockjobs AioContext following problem
> is solved.  AioContexts can still change if the guest resets the
> virtio device, for example.
> 
> I suspect we'll still hit undefined behavior when that happens.

I agree... :/

Paolo


* Re: [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Stefan Hajnoczi @ 2016-06-09 16:31 UTC
  To: Jason J. Herne
  Cc: qemu block, qemu-devel, Fam Zheng, Stefan Hajnoczi, Jeff Cody, Juan Quintela

On Mon, May 23, 2016 at 7:54 PM, Jason J. Herne
<jjherne@linux.vnet.ibm.com> wrote:
> Libvirt migration command:
> virsh migrate --live --persistent --copy-storage-all --migrate-disks vdb
> kvm1 qemu+ssh://dev1/system

I guess that this is the same scenario as a manual drive_mirror +
migrate test I have been doing.  I also get the "Co-routine re-entered
recursively" error message.

I've CCed you on an (unfinished) patch that solves the abort.  Please
test it and let me know!

Thanks,
Stefan


* Re: [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads
From: Jason J. Herne @ 2016-06-09 18:19 UTC
  To: Stefan Hajnoczi
  Cc: Fam Zheng, qemu block, Juan Quintela, Jeff Cody, qemu-devel, Stefan Hajnoczi

On 06/09/2016 12:31 PM, Stefan Hajnoczi wrote:
> On Mon, May 23, 2016 at 7:54 PM, Jason J. Herne
> <jjherne@linux.vnet.ibm.com> wrote:
>> Libvirt migration command:
>> virsh migrate --live --persistent --copy-storage-all --migrate-disks vdb
>> kvm1 qemu+ssh://dev1/system
>
> I guess that this is the same scenario as a manual drive_mirror +
> migrate test I have been doing.  I also get the "Co-routine re-entered
> recursively" error message.
>
> I've CCed you on an (unfinished) patch that solves the abort.  Please
> test it and let me know!
>

Stefan, Yes!! This patch fixes the problem in our environment :) Thank 
you for digging into it and finding a solution. We'll patiently await a 
completed patch.

-- 
-- Jason J. Herne (jjherne@linux.vnet.ibm.com)


Thread overview: 16 messages
2016-05-23 18:54 [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads Jason J. Herne
2016-05-24  2:12 ` Fam Zheng
2016-05-24 15:05   ` Jason J. Herne
2016-05-25  8:36     ` Fam Zheng
2016-06-06 18:55       ` Jason J. Herne
2016-06-07  2:44         ` Fam Zheng
2016-06-07 12:42           ` Jason J. Herne
2016-06-08 15:30             ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2016-06-08 16:03               ` Paolo Bonzini
2016-06-09  7:35                 ` Stefan Hajnoczi
2016-06-09  8:25                   ` Paolo Bonzini
2016-06-09  8:47                     ` Stefan Hajnoczi
2016-06-09  8:48                       ` Stefan Hajnoczi
2016-06-09 10:02                         ` Paolo Bonzini
2016-06-09 16:31 ` [Qemu-devel] " Stefan Hajnoczi
2016-06-09 18:19   ` Jason J. Herne
