* [Qemu-devel] another locking issue in current dataplane code?
@ 2014-07-07 11:58 Christian Borntraeger
  2014-07-08  7:19 ` Christian Borntraeger
  2014-07-08 15:59 ` Stefan Hajnoczi
  0 siblings, 2 replies; 13+ messages in thread
From: Christian Borntraeger @ 2014-07-07 11:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Cornelia Huck, Kevin Wolf, ming.lei, qemu-devel, Dominik Dingel

Folks,

with current 2.1-rc0 (
+  dataplane: do not free VirtQueueElement in vring_push()
+  virtio-blk: avoid dataplane VirtIOBlockReq early free
+ some not-ready yet s390 patches for migration
)

I am still having issues with dataplane during managedsave (without dataplane everything seems to work fine):

With 1 CPU and 1 disk (and some workload, e.g. a simple dd on the disk) I get:


Thread 3 (Thread 0x3fff90fd910 (LWP 27218)):
#0  0x000003fffcdb7ba0 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000003fffcdbac0c in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x000003fffcdb399a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000801fff06 in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x8037f788 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:135
#4  0x00000000800472f4 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:843
#5  qemu_kvm_cpu_thread_fn (arg=0x809ad6b0) at /home/cborntra/REPOS/qemu/cpus.c:879
#6  0x000003fffcdaf412 in start_thread () from /lib64/libpthread.so.0
#7  0x000003fffba350ae in thread_start () from /lib64/libc.so.6

Thread 2 (Thread 0x3fff88fd910 (LWP 27219)):
#0  0x000003fffba2a8e0 in ppoll () from /lib64/libc.so.6
#1  0x00000000801af250 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=fds@entry=0x3fff40010c0, nfds=nfds@entry=3, timeout=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:314
#3  0x00000000801b0702 in aio_poll (ctx=0x807f2230, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:221
#4  0x00000000800be3c4 in iothread_run (opaque=0x807f20d8) at /home/cborntra/REPOS/qemu/iothread.c:41
#5  0x000003fffcdaf412 in start_thread () from /lib64/libpthread.so.0
#6  0x000003fffba350ae in thread_start () from /lib64/libc.so.6

Thread 1 (Thread 0x3fff9c529b0 (LWP 27215)):
#0  0x000003fffcdb38f0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000801fff06 in qemu_cond_wait (cond=cond@entry=0x807f22c0, mutex=mutex@entry=0x807f2290) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:135
#2  0x0000000080212906 in rfifolock_lock (r=r@entry=0x807f2290) at /home/cborntra/REPOS/qemu/util/rfifolock.c:59
#3  0x000000008019e536 in aio_context_acquire (ctx=ctx@entry=0x807f2230) at /home/cborntra/REPOS/qemu/async.c:295
#4  0x00000000801a34e6 in bdrv_drain_all () at /home/cborntra/REPOS/qemu/block.c:1907
#5  0x0000000080048e24 in do_vm_stop (state=RUN_STATE_PAUSED) at /home/cborntra/REPOS/qemu/cpus.c:538
#6  vm_stop (state=state@entry=RUN_STATE_PAUSED) at /home/cborntra/REPOS/qemu/cpus.c:1221
#7  0x00000000800e6338 in qmp_stop (errp=errp@entry=0x3ffffa9dc00) at /home/cborntra/REPOS/qemu/qmp.c:98
#8  0x00000000800e1314 in qmp_marshal_input_stop (mon=<optimized out>, qdict=<optimized out>, ret=<optimized out>) at qmp-marshal.c:2806
#9  0x000000008004b91a in qmp_call_cmd (cmd=<optimized out>, params=0x8096cf50, mon=0x8080b8a0) at /home/cborntra/REPOS/qemu/monitor.c:5038
#10 handle_qmp_command (parser=<optimized out>, tokens=<optimized out>) at /home/cborntra/REPOS/qemu/monitor.c:5104
#11 0x00000000801faf16 in json_message_process_token (lexer=0x8080b7c0, token=0x808f2610, type=<optimized out>, x=<optimized out>, y=6) at /home/cborntra/REPOS/qemu/qobject/json-streamer.c:87
#12 0x0000000080212bac in json_lexer_feed_char (lexer=lexer@entry=0x8080b7c0, ch=<optimized out>, flush=flush@entry=false) at /home/cborntra/REPOS/qemu/qobject/json-lexer.c:303
#13 0x0000000080212cfe in json_lexer_feed (lexer=0x8080b7c0, buffer=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/qobject/json-lexer.c:356
#14 0x00000000801fb10e in json_message_parser_feed (parser=<optimized out>, buffer=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/qobject/json-streamer.c:110
#15 0x0000000080049f28 in monitor_control_read (opaque=<optimized out>, buf=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/monitor.c:5125
#16 0x00000000800c8636 in qemu_chr_be_write (len=1, buf=0x3ffffa9e010 "}[B\377\373\251\372\b", s=0x807f5af0) at /home/cborntra/REPOS/qemu/qemu-char.c:213
#17 tcp_chr_read (chan=<optimized out>, cond=<optimized out>, opaque=0x807f5af0) at /home/cborntra/REPOS/qemu/qemu-char.c:2690
#18 0x000003fffcc9f05a in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#19 0x00000000801ae3e0 in glib_pollfds_poll () at /home/cborntra/REPOS/qemu/main-loop.c:190
#20 os_host_main_loop_wait (timeout=<optimized out>) at /home/cborntra/REPOS/qemu/main-loop.c:235
#21 main_loop_wait (nonblocking=<optimized out>) at /home/cborntra/REPOS/qemu/main-loop.c:484
#22 0x00000000800169e2 in main_loop () at /home/cborntra/REPOS/qemu/vl.c:2024
#23 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:4551

Now, if aio_poll never returns, we have a deadlock here.
To me it looks like aio_poll could be called from iothread_run even if there are no outstanding requests.
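
For reference, the iothread_run() loop in question (iothread.c as of 2.1-rc0) looks roughly like the sketch below. This is a simplified reconstruction that matches the pre-patch code quoted later in this thread; the init_done handshake and other details are omitted.

static void *iothread_run(void *opaque)
{
    IOThread *iothread = opaque;

    while (!iothread->stopping) {
        aio_context_acquire(iothread->ctx);
        /* keep doing blocking aio_poll() as long as progress is made;
         * the AioContext is only released once aio_poll() returns false */
        while (!iothread->stopping && aio_poll(iothread->ctx, true)) {
            /* Progress was made, keep going */
        }
        aio_context_release(iothread->ctx);
    }
    return NULL;
}

If that blocking aio_poll() never returns, the AioContext is never released, and the aio_context_acquire() in bdrv_drain_all() (thread 1 above) waits forever.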
Opinions?

Christian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-07 11:58 [Qemu-devel] another locking issue in current dataplane code? Christian Borntraeger
@ 2014-07-08  7:19 ` Christian Borntraeger
  2014-07-08  7:43   ` Ming Lei
  2014-07-08 15:59 ` Stefan Hajnoczi
  1 sibling, 1 reply; 13+ messages in thread
From: Christian Borntraeger @ 2014-07-08  7:19 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Cornelia Huck, Kevin Wolf, ming.lei, qemu-devel, Dominik Dingel

Ping.

Has anyone seen a similar hang on x86?



On 07/07/14 13:58, Christian Borntraeger wrote:
> Folks,
> 
> with current 2.1-rc0 (
> +  dataplane: do not free VirtQueueElement in vring_push()
> +  virtio-blk: avoid dataplane VirtIOBlockReq early free
> + some not-ready yet s390 patches for migration
> )
> 
> I am still having issues with dataplane during managedsave (without dataplane everything seems to work fine):
> 
> With 1 CPU and 1 disk (and some workload, e.g. a simple dd on the disk) I get:
> 
> 
> Thread 3 (Thread 0x3fff90fd910 (LWP 27218)):
> #0  0x000003fffcdb7ba0 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x000003fffcdbac0c in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
> #2  0x000003fffcdb399a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #3  0x00000000801fff06 in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x8037f788 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:135
> #4  0x00000000800472f4 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:843
> #5  qemu_kvm_cpu_thread_fn (arg=0x809ad6b0) at /home/cborntra/REPOS/qemu/cpus.c:879
> #6  0x000003fffcdaf412 in start_thread () from /lib64/libpthread.so.0
> #7  0x000003fffba350ae in thread_start () from /lib64/libc.so.6
> 
> Thread 2 (Thread 0x3fff88fd910 (LWP 27219)):
> #0  0x000003fffba2a8e0 in ppoll () from /lib64/libc.so.6
> #1  0x00000000801af250 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
> #2  qemu_poll_ns (fds=fds@entry=0x3fff40010c0, nfds=nfds@entry=3, timeout=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:314
> #3  0x00000000801b0702 in aio_poll (ctx=0x807f2230, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:221
> #4  0x00000000800be3c4 in iothread_run (opaque=0x807f20d8) at /home/cborntra/REPOS/qemu/iothread.c:41
> #5  0x000003fffcdaf412 in start_thread () from /lib64/libpthread.so.0
> #6  0x000003fffba350ae in thread_start () from /lib64/libc.so.6
> 
> Thread 1 (Thread 0x3fff9c529b0 (LWP 27215)):
> #0  0x000003fffcdb38f0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00000000801fff06 in qemu_cond_wait (cond=cond@entry=0x807f22c0, mutex=mutex@entry=0x807f2290) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:135
> #2  0x0000000080212906 in rfifolock_lock (r=r@entry=0x807f2290) at /home/cborntra/REPOS/qemu/util/rfifolock.c:59
> #3  0x000000008019e536 in aio_context_acquire (ctx=ctx@entry=0x807f2230) at /home/cborntra/REPOS/qemu/async.c:295
> #4  0x00000000801a34e6 in bdrv_drain_all () at /home/cborntra/REPOS/qemu/block.c:1907
> #5  0x0000000080048e24 in do_vm_stop (state=RUN_STATE_PAUSED) at /home/cborntra/REPOS/qemu/cpus.c:538
> #6  vm_stop (state=state@entry=RUN_STATE_PAUSED) at /home/cborntra/REPOS/qemu/cpus.c:1221
> #7  0x00000000800e6338 in qmp_stop (errp=errp@entry=0x3ffffa9dc00) at /home/cborntra/REPOS/qemu/qmp.c:98
> #8  0x00000000800e1314 in qmp_marshal_input_stop (mon=<optimized out>, qdict=<optimized out>, ret=<optimized out>) at qmp-marshal.c:2806
> #9  0x000000008004b91a in qmp_call_cmd (cmd=<optimized out>, params=0x8096cf50, mon=0x8080b8a0) at /home/cborntra/REPOS/qemu/monitor.c:5038
> #10 handle_qmp_command (parser=<optimized out>, tokens=<optimized out>) at /home/cborntra/REPOS/qemu/monitor.c:5104
> #11 0x00000000801faf16 in json_message_process_token (lexer=0x8080b7c0, token=0x808f2610, type=<optimized out>, x=<optimized out>, y=6) at /home/cborntra/REPOS/qemu/qobject/json-streamer.c:87
> #12 0x0000000080212bac in json_lexer_feed_char (lexer=lexer@entry=0x8080b7c0, ch=<optimized out>, flush=flush@entry=false) at /home/cborntra/REPOS/qemu/qobject/json-lexer.c:303
> #13 0x0000000080212cfe in json_lexer_feed (lexer=0x8080b7c0, buffer=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/qobject/json-lexer.c:356
> #14 0x00000000801fb10e in json_message_parser_feed (parser=<optimized out>, buffer=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/qobject/json-streamer.c:110
> #15 0x0000000080049f28 in monitor_control_read (opaque=<optimized out>, buf=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/monitor.c:5125
> #16 0x00000000800c8636 in qemu_chr_be_write (len=1, buf=0x3ffffa9e010 "}[B\377\373\251\372\b", s=0x807f5af0) at /home/cborntra/REPOS/qemu/qemu-char.c:213
> #17 tcp_chr_read (chan=<optimized out>, cond=<optimized out>, opaque=0x807f5af0) at /home/cborntra/REPOS/qemu/qemu-char.c:2690
> #18 0x000003fffcc9f05a in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #19 0x00000000801ae3e0 in glib_pollfds_poll () at /home/cborntra/REPOS/qemu/main-loop.c:190
> #20 os_host_main_loop_wait (timeout=<optimized out>) at /home/cborntra/REPOS/qemu/main-loop.c:235
> #21 main_loop_wait (nonblocking=<optimized out>) at /home/cborntra/REPOS/qemu/main-loop.c:484
> #22 0x00000000800169e2 in main_loop () at /home/cborntra/REPOS/qemu/vl.c:2024
> #23 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:4551
> 
> Now, if aio_poll never returns, we have a deadlock here.
> To me it looks like aio_poll could be called from iothread_run even if there are no outstanding requests.
> Opinions?
> 
> Christian
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-08  7:19 ` Christian Borntraeger
@ 2014-07-08  7:43   ` Ming Lei
  2014-07-08  8:38     ` Christian Borntraeger
  2014-07-08  9:09     ` Christian Borntraeger
  0 siblings, 2 replies; 13+ messages in thread
From: Ming Lei @ 2014-07-08  7:43 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Cornelia Huck, Kevin Wolf, Dominik Dingel, qemu-devel, Stefan Hajnoczi

On Tue, Jul 8, 2014 at 3:19 PM, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> Ping.
>
> has anyone seen a similar hang on x86?
>
>
>
> On 07/07/14 13:58, Christian Borntraeger wrote:
>> Folks,
>>
>> with current 2.1-rc0 (
>> +  dataplane: do not free VirtQueueElement in vring_push()
>> +  virtio-blk: avoid dataplane VirtIOBlockReq early free
>> + some not-ready yet s390 patches for migration
>> )
>>
>> I am still having issues with dataplane during managedsave (without dataplane everything seems to work fine):
>>
>> With 1 CPU and 1 disk (and some workload, e.g. a simple dd on the disk) I get:
>>
>>
>> Thread 3 (Thread 0x3fff90fd910 (LWP 27218)):
>> #0  0x000003fffcdb7ba0 in __lll_lock_wait () from /lib64/libpthread.so.0
>> #1  0x000003fffcdbac0c in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
>> #2  0x000003fffcdb399a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #3  0x00000000801fff06 in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x8037f788 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:135
>> #4  0x00000000800472f4 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:843
>> #5  qemu_kvm_cpu_thread_fn (arg=0x809ad6b0) at /home/cborntra/REPOS/qemu/cpus.c:879
>> #6  0x000003fffcdaf412 in start_thread () from /lib64/libpthread.so.0
>> #7  0x000003fffba350ae in thread_start () from /lib64/libc.so.6
>>
>> Thread 2 (Thread 0x3fff88fd910 (LWP 27219)):
>> #0  0x000003fffba2a8e0 in ppoll () from /lib64/libc.so.6
>> #1  0x00000000801af250 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
>> #2  qemu_poll_ns (fds=fds@entry=0x3fff40010c0, nfds=nfds@entry=3, timeout=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:314
>> #3  0x00000000801b0702 in aio_poll (ctx=0x807f2230, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:221
>> #4  0x00000000800be3c4 in iothread_run (opaque=0x807f20d8) at /home/cborntra/REPOS/qemu/iothread.c:41
>> #5  0x000003fffcdaf412 in start_thread () from /lib64/libpthread.so.0
>> #6  0x000003fffba350ae in thread_start () from /lib64/libc.so.6
>>
>> Thread 1 (Thread 0x3fff9c529b0 (LWP 27215)):
>> #0  0x000003fffcdb38f0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1  0x00000000801fff06 in qemu_cond_wait (cond=cond@entry=0x807f22c0, mutex=mutex@entry=0x807f2290) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:135
>> #2  0x0000000080212906 in rfifolock_lock (r=r@entry=0x807f2290) at /home/cborntra/REPOS/qemu/util/rfifolock.c:59
>> #3  0x000000008019e536 in aio_context_acquire (ctx=ctx@entry=0x807f2230) at /home/cborntra/REPOS/qemu/async.c:295
>> #4  0x00000000801a34e6 in bdrv_drain_all () at /home/cborntra/REPOS/qemu/block.c:1907
>> #5  0x0000000080048e24 in do_vm_stop (state=RUN_STATE_PAUSED) at /home/cborntra/REPOS/qemu/cpus.c:538
>> #6  vm_stop (state=state@entry=RUN_STATE_PAUSED) at /home/cborntra/REPOS/qemu/cpus.c:1221
>> #7  0x00000000800e6338 in qmp_stop (errp=errp@entry=0x3ffffa9dc00) at /home/cborntra/REPOS/qemu/qmp.c:98
>> #8  0x00000000800e1314 in qmp_marshal_input_stop (mon=<optimized out>, qdict=<optimized out>, ret=<optimized out>) at qmp-marshal.c:2806
>> #9  0x000000008004b91a in qmp_call_cmd (cmd=<optimized out>, params=0x8096cf50, mon=0x8080b8a0) at /home/cborntra/REPOS/qemu/monitor.c:5038
>> #10 handle_qmp_command (parser=<optimized out>, tokens=<optimized out>) at /home/cborntra/REPOS/qemu/monitor.c:5104
>> #11 0x00000000801faf16 in json_message_process_token (lexer=0x8080b7c0, token=0x808f2610, type=<optimized out>, x=<optimized out>, y=6) at /home/cborntra/REPOS/qemu/qobject/json-streamer.c:87
>> #12 0x0000000080212bac in json_lexer_feed_char (lexer=lexer@entry=0x8080b7c0, ch=<optimized out>, flush=flush@entry=false) at /home/cborntra/REPOS/qemu/qobject/json-lexer.c:303
>> #13 0x0000000080212cfe in json_lexer_feed (lexer=0x8080b7c0, buffer=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/qobject/json-lexer.c:356
>> #14 0x00000000801fb10e in json_message_parser_feed (parser=<optimized out>, buffer=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/qobject/json-streamer.c:110
>> #15 0x0000000080049f28 in monitor_control_read (opaque=<optimized out>, buf=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/monitor.c:5125
>> #16 0x00000000800c8636 in qemu_chr_be_write (len=1, buf=0x3ffffa9e010 "}[B\377\373\251\372\b", s=0x807f5af0) at /home/cborntra/REPOS/qemu/qemu-char.c:213
>> #17 tcp_chr_read (chan=<optimized out>, cond=<optimized out>, opaque=0x807f5af0) at /home/cborntra/REPOS/qemu/qemu-char.c:2690
>> #18 0x000003fffcc9f05a in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>> #19 0x00000000801ae3e0 in glib_pollfds_poll () at /home/cborntra/REPOS/qemu/main-loop.c:190
>> #20 os_host_main_loop_wait (timeout=<optimized out>) at /home/cborntra/REPOS/qemu/main-loop.c:235
>> #21 main_loop_wait (nonblocking=<optimized out>) at /home/cborntra/REPOS/qemu/main-loop.c:484
>> #22 0x00000000800169e2 in main_loop () at /home/cborntra/REPOS/qemu/vl.c:2024
>> #23 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:4551
>>
>> Now, if aio_poll never returns, we have a deadlock here.
>> To me it looks like aio_poll could be called from iothread_run even if there are no outstanding requests.
>> Opinions?

I have sent out a patch to fix the issue; the title is
"virtio-blk: data-plane: fix save/set .complete_request in start".

Please try this patch to see if it fixes your issue.

thanks

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-08  7:43   ` Ming Lei
@ 2014-07-08  8:38     ` Christian Borntraeger
  2014-07-08  9:09     ` Christian Borntraeger
  1 sibling, 0 replies; 13+ messages in thread
From: Christian Borntraeger @ 2014-07-08  8:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Cornelia Huck, Kevin Wolf, Dominik Dingel, qemu-devel, Stefan Hajnoczi

On 08/07/14 09:43, Ming Lei wrote:
> On Tue, Jul 8, 2014 at 3:19 PM, Christian Borntraeger
> <borntraeger@de.ibm.com> wrote:
>> Ping.
>>
>> has anyone seen a similar hang on x86?
>>
>>
>>
>> On 07/07/14 13:58, Christian Borntraeger wrote:
>>> Folks,
>>>
>>> with current 2.1-rc0 (
>>> +  dataplane: do not free VirtQueueElement in vring_push()
>>> +  virtio-blk: avoid dataplane VirtIOBlockReq early free
>>> + some not-ready yet s390 patches for migration
>>> )
>>>
>>> I am still having issues with dataplane during managedsave (without dataplane everything seems to work fine):
>>>
>>> With 1 CPU and 1 disk (and some workload, e.g. a simple dd on the disk) I get:
>>>
>>>
>>> Thread 3 (Thread 0x3fff90fd910 (LWP 27218)):
>>> #0  0x000003fffcdb7ba0 in __lll_lock_wait () from /lib64/libpthread.so.0
>>> #1  0x000003fffcdbac0c in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
>>> #2  0x000003fffcdb399a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>> #3  0x00000000801fff06 in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x8037f788 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:135
>>> #4  0x00000000800472f4 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:843
>>> #5  qemu_kvm_cpu_thread_fn (arg=0x809ad6b0) at /home/cborntra/REPOS/qemu/cpus.c:879
>>> #6  0x000003fffcdaf412 in start_thread () from /lib64/libpthread.so.0
>>> #7  0x000003fffba350ae in thread_start () from /lib64/libc.so.6
>>>
>>> Thread 2 (Thread 0x3fff88fd910 (LWP 27219)):
>>> #0  0x000003fffba2a8e0 in ppoll () from /lib64/libc.so.6
>>> #1  0x00000000801af250 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
>>> #2  qemu_poll_ns (fds=fds@entry=0x3fff40010c0, nfds=nfds@entry=3, timeout=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:314
>>> #3  0x00000000801b0702 in aio_poll (ctx=0x807f2230, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:221
>>> #4  0x00000000800be3c4 in iothread_run (opaque=0x807f20d8) at /home/cborntra/REPOS/qemu/iothread.c:41
>>> #5  0x000003fffcdaf412 in start_thread () from /lib64/libpthread.so.0
>>> #6  0x000003fffba350ae in thread_start () from /lib64/libc.so.6
>>>
>>> Thread 1 (Thread 0x3fff9c529b0 (LWP 27215)):
>>> #0  0x000003fffcdb38f0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>> #1  0x00000000801fff06 in qemu_cond_wait (cond=cond@entry=0x807f22c0, mutex=mutex@entry=0x807f2290) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:135
>>> #2  0x0000000080212906 in rfifolock_lock (r=r@entry=0x807f2290) at /home/cborntra/REPOS/qemu/util/rfifolock.c:59
>>> #3  0x000000008019e536 in aio_context_acquire (ctx=ctx@entry=0x807f2230) at /home/cborntra/REPOS/qemu/async.c:295
>>> #4  0x00000000801a34e6 in bdrv_drain_all () at /home/cborntra/REPOS/qemu/block.c:1907
>>> #5  0x0000000080048e24 in do_vm_stop (state=RUN_STATE_PAUSED) at /home/cborntra/REPOS/qemu/cpus.c:538
>>> #6  vm_stop (state=state@entry=RUN_STATE_PAUSED) at /home/cborntra/REPOS/qemu/cpus.c:1221
>>> #7  0x00000000800e6338 in qmp_stop (errp=errp@entry=0x3ffffa9dc00) at /home/cborntra/REPOS/qemu/qmp.c:98
>>> #8  0x00000000800e1314 in qmp_marshal_input_stop (mon=<optimized out>, qdict=<optimized out>, ret=<optimized out>) at qmp-marshal.c:2806
>>> #9  0x000000008004b91a in qmp_call_cmd (cmd=<optimized out>, params=0x8096cf50, mon=0x8080b8a0) at /home/cborntra/REPOS/qemu/monitor.c:5038
>>> #10 handle_qmp_command (parser=<optimized out>, tokens=<optimized out>) at /home/cborntra/REPOS/qemu/monitor.c:5104
>>> #11 0x00000000801faf16 in json_message_process_token (lexer=0x8080b7c0, token=0x808f2610, type=<optimized out>, x=<optimized out>, y=6) at /home/cborntra/REPOS/qemu/qobject/json-streamer.c:87
>>> #12 0x0000000080212bac in json_lexer_feed_char (lexer=lexer@entry=0x8080b7c0, ch=<optimized out>, flush=flush@entry=false) at /home/cborntra/REPOS/qemu/qobject/json-lexer.c:303
>>> #13 0x0000000080212cfe in json_lexer_feed (lexer=0x8080b7c0, buffer=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/qobject/json-lexer.c:356
>>> #14 0x00000000801fb10e in json_message_parser_feed (parser=<optimized out>, buffer=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/qobject/json-streamer.c:110
>>> #15 0x0000000080049f28 in monitor_control_read (opaque=<optimized out>, buf=<optimized out>, size=<optimized out>) at /home/cborntra/REPOS/qemu/monitor.c:5125
>>> #16 0x00000000800c8636 in qemu_chr_be_write (len=1, buf=0x3ffffa9e010 "}[B\377\373\251\372\b", s=0x807f5af0) at /home/cborntra/REPOS/qemu/qemu-char.c:213
>>> #17 tcp_chr_read (chan=<optimized out>, cond=<optimized out>, opaque=0x807f5af0) at /home/cborntra/REPOS/qemu/qemu-char.c:2690
>>> #18 0x000003fffcc9f05a in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>>> #19 0x00000000801ae3e0 in glib_pollfds_poll () at /home/cborntra/REPOS/qemu/main-loop.c:190
>>> #20 os_host_main_loop_wait (timeout=<optimized out>) at /home/cborntra/REPOS/qemu/main-loop.c:235
>>> #21 main_loop_wait (nonblocking=<optimized out>) at /home/cborntra/REPOS/qemu/main-loop.c:484
>>> #22 0x00000000800169e2 in main_loop () at /home/cborntra/REPOS/qemu/vl.c:2024
>>> #23 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:4551
>>>
>>> Now, if aio_poll never returns, we have a deadlock here.
>>> To me it looks like aio_poll could be called from iothread_run even if there are no outstanding requests.
>>> Opinions?
> 
> I have sent out a patch to fix the issue; the title is
> "virtio-blk: data-plane: fix save/set .complete_request in start".
> 
> Please try this patch to see if it fixes your issue.

Yes, I have seen that patch. Unfortunately it does not make a difference for the managedsave case.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-08  7:43   ` Ming Lei
  2014-07-08  8:38     ` Christian Borntraeger
@ 2014-07-08  9:09     ` Christian Borntraeger
  2014-07-08 10:12       ` Christian Borntraeger
  1 sibling, 1 reply; 13+ messages in thread
From: Christian Borntraeger @ 2014-07-08  9:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Cornelia Huck, Kevin Wolf, Dominik Dingel, qemu-devel, Stefan Hajnoczi

On 08/07/14 09:43, Ming Lei wrote:
> On Tue, Jul 8, 2014 at 3:19 PM, Christian Borntraeger
> <borntraeger@de.ibm.com> wrote:
>> Ping.
>>
>> has anyone seen a similar hang on x86?

The problem seems to be that for managedsave we do a VM stop before we call the migration_state_notifier. To be verified.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-08  9:09     ` Christian Borntraeger
@ 2014-07-08 10:12       ` Christian Borntraeger
  2014-07-08 10:37         ` Christian Borntraeger
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Borntraeger @ 2014-07-08 10:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Cornelia Huck, Kevin Wolf, Dominik Dingel, qemu-devel, Stefan Hajnoczi

On 08/07/14 11:09, Christian Borntraeger wrote:
> On 08/07/14 09:43, Ming Lei wrote:
>> On Tue, Jul 8, 2014 at 3:19 PM, Christian Borntraeger
>> <borntraeger@de.ibm.com> wrote:
>>> Ping.
>>>
>>> has anyone seen a similar hang on x86?
> 
> The problem seems to be that for managedsave we do a VM stop before we call the migration_state_notifier. To be verified.

Yes. virsh suspend also hangs. Any ideas?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-08 10:12       ` Christian Borntraeger
@ 2014-07-08 10:37         ` Christian Borntraeger
  2014-07-08 11:03           ` Christian Borntraeger
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Borntraeger @ 2014-07-08 10:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Cornelia Huck, Kevin Wolf, Dominik Dingel, qemu-devel, Stefan Hajnoczi

On 08/07/14 12:12, Christian Borntraeger wrote:
> On 08/07/14 11:09, Christian Borntraeger wrote:
>> On 08/07/14 09:43, Ming Lei wrote:
>>> On Tue, Jul 8, 2014 at 3:19 PM, Christian Borntraeger
>>> <borntraeger@de.ibm.com> wrote:
>>>> Ping.
>>>>
>>>> has anyone seen a similar hang on x86?
>>
>> The problem seems to be that for managedsave we do a VM stop before we call the migration_state_notifier. To be verified.
> 
> Yes. virsh suspend also hangs. Any ideas?
> 

Finally found a solution. Merging the merge request below from upstream seems to fix my issues. I guess the plugging/unplugging fixed this implicitly, but I don't understand it yet.
Since the problem is gone, I will no longer investigate...


Bug fixes for QEMU 2.1-rc1.

The following changes since commit 9d9de254c2b81b68cd48f2324cc753a570a4cdd8:

  MAINTAINERS: seccomp: change email contact for Eduardo Otubo (2014-07-03 12:36:15 +0100)

are available in the git repository at:

  git://github.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to f4eb32b590bf58c1c67570775eb78beb09964fad:

  qmp: show QOM properties in device-list-properties (2014-07-07 11:10:05 +0200)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-08 10:37         ` Christian Borntraeger
@ 2014-07-08 11:03           ` Christian Borntraeger
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Borntraeger @ 2014-07-08 11:03 UTC (permalink / raw)
  To: Ming Lei
  Cc: Cornelia Huck, Kevin Wolf, Dominik Dingel, qemu-devel, Stefan Hajnoczi

On 08/07/14 12:37, Christian Borntraeger wrote:
> On 08/07/14 12:12, Christian Borntraeger wrote:
>> On 08/07/14 11:09, Christian Borntraeger wrote:
>>> On 08/07/14 09:43, Ming Lei wrote:
>>>> On Tue, Jul 8, 2014 at 3:19 PM, Christian Borntraeger
>>>> <borntraeger@de.ibm.com> wrote:
>>>>> Ping.
>>>>>
>>>>> has anyone seen a similar hang on x86?
>>>
>>> The problem seems to be that for managedsave we do a VM stop before we call the migration_state_notifier. To be verified.
>>
>> Yes. virsh suspend also hangs. Any ideas?
>>
> 
> Finally found a solution. Merging the merge request below from upstream seems to fix my issues. I guess the plugging/unplugging fixed this implicitly, but I don't understand it yet.
> Since the problem is gone, I will no longer investigate...

Sigh. This merge just made the bug less likely to occur.


> 
> 
> Bug fixes for QEMU 2.1-rc1.
> 
> The following changes since commit 9d9de254c2b81b68cd48f2324cc753a570a4cdd8:
> 
>   MAINTAINERS: seccomp: change email contact for Eduardo Otubo (2014-07-03 12:36:15 +0100)
> 
> are available in the git repository at:
> 
>   git://github.com/stefanha/qemu.git tags/block-pull-request
> 
> for you to fetch changes up to f4eb32b590bf58c1c67570775eb78beb09964fad:
> 
>   qmp: show QOM properties in device-list-properties (2014-07-07 11:10:05 +0200)
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-07 11:58 [Qemu-devel] another locking issue in current dataplane code? Christian Borntraeger
  2014-07-08  7:19 ` Christian Borntraeger
@ 2014-07-08 15:59 ` Stefan Hajnoczi
  2014-07-08 17:08   ` Paolo Bonzini
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Hajnoczi @ 2014-07-08 15:59 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Cornelia Huck, Kevin Wolf, ming.lei, qemu-devel, Dominik Dingel

[-- Attachment #1: Type: text/plain, Size: 738 bytes --]

On Mon, Jul 07, 2014 at 01:58:01PM +0200, Christian Borntraeger wrote:
> Now, if aio_poll never returns, we have a deadlock here.
> To me it looks like aio_poll could be called from iothread_run even if there are no outstanding requests.
> Opinions?

Christian pointed out that iothread_run() can miss aio_notify() if a
file descriptor becomes readable/writeable at the same time as the
AioContext->notifier.  aio_poll() will return true since progress was
made and we are left with a hung QEMU.
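
To make the described hang pattern concrete, here is a minimal standalone model in plain pthreads. It is an illustration, not QEMU code, and the mapping to aio_poll()/aio_context_acquire() is only an analogy: a worker thread holds a lock and keeps doing blocking waits as long as it reports progress, so it never releases the lock, and the main thread then blocks forever trying to acquire it.

/* Build with: gcc -pthread hang_model.c -o hang_model */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for aio_poll(ctx, true): report progress once (as if a data fd
 * became ready together with the notifier), then block forever. */
static bool fake_aio_poll(void)
{
    static int progress = 1;

    if (progress) {
        progress = 0;
        return true;
    }
    pause();                            /* nothing will ever wake us again */
    return false;
}

/* Stand-in for the dataplane iothread: take the "AioContext" and poll in a
 * loop that only ends when no progress is made. */
static void *worker(void *opaque)
{
    (void)opaque;
    pthread_mutex_lock(&ctx_lock);      /* aio_context_acquire() */
    while (fake_aio_poll()) {
        /* progress was made, keep going */
    }
    pthread_mutex_unlock(&ctx_lock);    /* never reached */
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, worker, NULL);
    sleep(1);
    printf("main: acquiring the context, like bdrv_drain_all()...\n");
    pthread_mutex_lock(&ctx_lock);      /* hangs, like thread 1 in the backtrace */
    printf("main: acquired (never printed)\n");
    pthread_mutex_unlock(&ctx_lock);
    return 0;
}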

I sent Christian an initial patch to fix this but now both threads are
stuck in rfifolock_lock() inside cond wait.  That's very strange and
should never happen.

Still trying to figure out what is going on...

Stefan

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-08 15:59 ` Stefan Hajnoczi
@ 2014-07-08 17:08   ` Paolo Bonzini
  2014-07-08 19:07     ` Christian Borntraeger
  0 siblings, 1 reply; 13+ messages in thread
From: Paolo Bonzini @ 2014-07-08 17:08 UTC (permalink / raw)
  To: Stefan Hajnoczi, Christian Borntraeger
  Cc: Cornelia Huck, Kevin Wolf, ming.lei, qemu-devel, Dominik Dingel

On 08/07/2014 17:59, Stefan Hajnoczi wrote:
> I sent Christian an initial patch to fix this but now both threads are
> stuck in rfifolock_lock() inside cond wait.  That's very strange and
> should never happen.

I had this patch pending for 2.2:

commit 6c81e31615c3cda5ea981a998ba8b1b8ed17de6f
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Mon Jul 7 10:39:49 2014 +0200

    iothread: do not rely on aio_poll(ctx, true) result to end a loop
    
    Currently, whenever aio_poll(ctx, true) has completed all pending
    work it returns true *and* the next call to aio_poll(ctx, true)
    will not block.
    
    This invariant has its roots in qemu_aio_flush()'s implementation
    as "while (qemu_aio_wait()) {}".  However, qemu_aio_flush() does
    not exist anymore and bdrv_drain_all() is implemented differently;
    and this invariant is complicated to maintain and subtly different
    from the return value of GMainLoop's g_main_context_iteration.
    
    All calls to aio_poll(ctx, true) except one are guarded by a
    while() loop checking for a request to be incomplete, or a
    BlockDriverState to be idle.  Modify that one exception in
    iothread.c.
    
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/iothread.c b/iothread.c
index 1fbf9f1..d9403cf 100644
--- a/iothread.c
+++ b/iothread.c
@@ -30,6 +30,7 @@ typedef ObjectClass IOThreadClass;
 static void *iothread_run(void *opaque)
 {
     IOThread *iothread = opaque;
+    bool blocking;
 
     qemu_mutex_lock(&iothread->init_done_lock);
     iothread->thread_id = qemu_get_thread_id();
@@ -38,8 +39,10 @@ static void *iothread_run(void *opaque)
 
     while (!iothread->stopping) {
         aio_context_acquire(iothread->ctx);
-        while (!iothread->stopping && aio_poll(iothread->ctx, true)) {
+        blocking = true;
+        while (!iothread->stopping && aio_poll(iothread->ctx, blocking)) {
             /* Progress was made, keep going */
+            blocking = false;
         }
         aio_context_release(iothread->ctx);
     }

Christian, can you test it?

Paolo

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-08 17:08   ` Paolo Bonzini
@ 2014-07-08 19:07     ` Christian Borntraeger
  2014-07-08 19:50       ` Paolo Bonzini
  2014-07-09  7:56       ` Stefan Hajnoczi
  0 siblings, 2 replies; 13+ messages in thread
From: Christian Borntraeger @ 2014-07-08 19:07 UTC (permalink / raw)
  To: Paolo Bonzini, Stefan Hajnoczi
  Cc: Cornelia Huck, Kevin Wolf, ming.lei, qemu-devel, Dominik Dingel

On 08/07/14 19:08, Paolo Bonzini wrote:
> On 08/07/2014 17:59, Stefan Hajnoczi wrote:
>> I sent Christian an initial patch to fix this but now both threads are
>> stuck in rfifolock_lock() inside cond wait.  That's very strange and
>> should never happen.
> 
> I had this patch pending for 2.2:
> 
> commit 6c81e31615c3cda5ea981a998ba8b1b8ed17de6f
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date:   Mon Jul 7 10:39:49 2014 +0200
> 
>     iothread: do not rely on aio_poll(ctx, true) result to end a loop
>     
>     Currently, whenever aio_poll(ctx, true) has completed all pending
>     work it returns true *and* the next call to aio_poll(ctx, true)
>     will not block.
>     
>     This invariant has its roots in qemu_aio_flush()'s implementation
>     as "while (qemu_aio_wait()) {}".  However, qemu_aio_flush() does
>     not exist anymore and bdrv_drain_all() is implemented differently;
>     and this invariant is complicated to maintain and subtly different
>     from the return value of GMainLoop's g_main_context_iteration.
>     
>     All calls to aio_poll(ctx, true) except one are guarded by a
>     while() loop checking for a request to be incomplete, or a
>     BlockDriverState to be idle.  Modify that one exception in
>     iothread.c.
>     
>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

The hangs are gone. Looks like 2.1 material now...

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>




> 
> diff --git a/iothread.c b/iothread.c
> index 1fbf9f1..d9403cf 100644
> --- a/iothread.c
> +++ b/iothread.c
> @@ -30,6 +30,7 @@ typedef ObjectClass IOThreadClass;
>  static void *iothread_run(void *opaque)
>  {
>      IOThread *iothread = opaque;
> +    bool blocking;
> 
>      qemu_mutex_lock(&iothread->init_done_lock);
>      iothread->thread_id = qemu_get_thread_id();
> @@ -38,8 +39,10 @@ static void *iothread_run(void *opaque)
> 
>      while (!iothread->stopping) {
>          aio_context_acquire(iothread->ctx);
> -        while (!iothread->stopping && aio_poll(iothread->ctx, true)) {
> +        blocking = true;
> +        while (!iothread->stopping && aio_poll(iothread->ctx, blocking)) {
>              /* Progress was made, keep going */
> +            blocking = false;
>          }
>          aio_context_release(iothread->ctx);
>      }
> 
> Christian, can you test it?
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-08 19:07     ` Christian Borntraeger
@ 2014-07-08 19:50       ` Paolo Bonzini
  2014-07-09  7:56       ` Stefan Hajnoczi
  1 sibling, 0 replies; 13+ messages in thread
From: Paolo Bonzini @ 2014-07-08 19:50 UTC (permalink / raw)
  To: Christian Borntraeger, Stefan Hajnoczi
  Cc: Cornelia Huck, Kevin Wolf, ming.lei, qemu-devel, Dominik Dingel

On 08/07/2014 21:07, Christian Borntraeger wrote:
> On 08/07/14 19:08, Paolo Bonzini wrote:
>> On 08/07/2014 17:59, Stefan Hajnoczi wrote:
>>> I sent Christian an initial patch to fix this but now both threads are
>>> stuck in rfifolock_lock() inside cond wait.  That's very strange and
>>> should never happen.
>>
>> I had this patch pending for 2.2:
>>
>> commit 6c81e31615c3cda5ea981a998ba8b1b8ed17de6f
>> Author: Paolo Bonzini <pbonzini@redhat.com>
>> Date:   Mon Jul 7 10:39:49 2014 +0200
>>
>>     iothread: do not rely on aio_poll(ctx, true) result to end a loop
>>
>>     Currently, whenever aio_poll(ctx, true) has completed all pending
>>     work it returns true *and* the next call to aio_poll(ctx, true)
>>     will not block.
>>
>>     This invariant has its roots in qemu_aio_flush()'s implementation
>>     as "while (qemu_aio_wait()) {}".  However, qemu_aio_flush() does
>>     not exist anymore and bdrv_drain_all() is implemented differently;
>>     and this invariant is complicated to maintain and subtly different
>>     from the return value of GMainLoop's g_main_context_iteration.
>>
>>     All calls to aio_poll(ctx, true) except one are guarded by a
>>     while() loop checking for a request to be incomplete, or a
>>     BlockDriverState to be idle.  Modify that one exception in
>>     iothread.c.
>>
>>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>
> The hangs are gone. Looks like 2.1 material now...
>
> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>

Great, I'll send it out tomorrow morning.

Paolo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] another locking issue in current dataplane code?
  2014-07-08 19:07     ` Christian Borntraeger
  2014-07-08 19:50       ` Paolo Bonzini
@ 2014-07-09  7:56       ` Stefan Hajnoczi
  1 sibling, 0 replies; 13+ messages in thread
From: Stefan Hajnoczi @ 2014-07-09  7:56 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Kevin Wolf, Ming Lei, qemu-devel, Dominik Dingel,
	Stefan Hajnoczi, Cornelia Huck, Paolo Bonzini

On Tue, Jul 8, 2014 at 9:07 PM, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> On 08/07/14 19:08, Paolo Bonzini wrote:
>> On 08/07/2014 17:59, Stefan Hajnoczi wrote:
>>> I sent Christian an initial patch to fix this but now both threads are
>>> stuck in rfifolock_lock() inside cond wait.  That's very strange and
>>> should never happen.
>>
>> I had this patch pending for 2.2:
>>
>> commit 6c81e31615c3cda5ea981a998ba8b1b8ed17de6f
>> Author: Paolo Bonzini <pbonzini@redhat.com>
>> Date:   Mon Jul 7 10:39:49 2014 +0200
>>
>>     iothread: do not rely on aio_poll(ctx, true) result to end a loop
>>
>>     Currently, whenever aio_poll(ctx, true) has completed all pending
>>     work it returns true *and* the next call to aio_poll(ctx, true)
>>     will not block.
>>
>>     This invariant has its roots in qemu_aio_flush()'s implementation
>>     as "while (qemu_aio_wait()) {}".  However, qemu_aio_flush() does
>>     not exist anymore and bdrv_drain_all() is implemented differently;
>>     and this invariant is complicated to maintain and subtly different
>>     from the return value of GMainLoop's g_main_context_iteration.
>>
>>     All calls to aio_poll(ctx, true) except one are guarded by a
>>     while() loop checking for a request to be incomplete, or a
>>     BlockDriverState to be idle.  Modify that one exception in
>>     iothread.c.
>>
>>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>
> The hangs are gone. Looks like 2.1 material now...
>
> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
>
>
>
>
>>
>> diff --git a/iothread.c b/iothread.c
>> index 1fbf9f1..d9403cf 100644
>> --- a/iothread.c
>> +++ b/iothread.c
>> @@ -30,6 +30,7 @@ typedef ObjectClass IOThreadClass;
>>  static void *iothread_run(void *opaque)
>>  {
>>      IOThread *iothread = opaque;
>> +    bool blocking;
>>
>>      qemu_mutex_lock(&iothread->init_done_lock);
>>      iothread->thread_id = qemu_get_thread_id();
>> @@ -38,8 +39,10 @@ static void *iothread_run(void *opaque)
>>
>>      while (!iothread->stopping) {
>>          aio_context_acquire(iothread->ctx);
>> -        while (!iothread->stopping && aio_poll(iothread->ctx, true)) {
>> +        blocking = true;
>> +        while (!iothread->stopping && aio_poll(iothread->ctx, blocking)) {
>>              /* Progress was made, keep going */
>> +            blocking = false;
>>          }
>>          aio_context_release(iothread->ctx);
>>      }
>>
>> Christian, can you test it?

This could affect performance because of the extra poll/release/acquire,
but it is a clean solution for the broken iothread_run().

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Good for 2.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-07-09  7:56 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-07 11:58 [Qemu-devel] another locking issue in current dataplane code? Christian Borntraeger
2014-07-08  7:19 ` Christian Borntraeger
2014-07-08  7:43   ` Ming Lei
2014-07-08  8:38     ` Christian Borntraeger
2014-07-08  9:09     ` Christian Borntraeger
2014-07-08 10:12       ` Christian Borntraeger
2014-07-08 10:37         ` Christian Borntraeger
2014-07-08 11:03           ` Christian Borntraeger
2014-07-08 15:59 ` Stefan Hajnoczi
2014-07-08 17:08   ` Paolo Bonzini
2014-07-08 19:07     ` Christian Borntraeger
2014-07-08 19:50       ` Paolo Bonzini
2014-07-09  7:56       ` Stefan Hajnoczi
