On 2/11/19 12:24 PM, Daniel P. Berrangé wrote: > This is a followup to > > v1: https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg03344.html > v2: http://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg05947.html > > This series comes out of a discussion between myself & Yongji Xie in: > > https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg01881.html > > I eventually understood that the problem faced was that > tcp_chr_wait_connected was racing with the background connection attempt > previously started, causing two connections to be established. This > broke because some vhost user servers only allow a single connection. > > After messing around with the code alot the final solution was in fact > very easy. We simply have to delay the first background connection > attempt until the main loop is running. It will then automatically > turn into a no-op if tcp_chr_wait_connected has been run. This is > dealt with in the last patch in this series > > I believe this should solve the problem Yongji Xie faced, and thus not > require us to add support for "nowait" option with client sockets at > all. The reconnect=1 option effectively already implements nowait > semantics, and now plays nicely with tcp_chr_wait_connected. > > In investigating this I found various other bugs that needed fixing and > identified some useful refactoring to simplify / clarify the code, hence > this very long series. Even with this series applied, I'm still seeing sporadic failures of iotest 169. Max posted a hack patch a while back that tries to work around the race: https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg05907.html which he originally diagnosed in iotest 147: https://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg05579.html but as it was a hack, he has not pursued it further, and so the symptoms are still there, although not completely reproducible: 169 10s ... - output mismatch (see 169.out.bad) --- /home/eblake/qemu/tests/qemu-iotests/169.out 2018-11-16 15:48:12.018526748 -0600 +++ /home/eblake/qemu/tests/qemu-iotests/169.out.bad 2019-04-22 09:38:45.481517132 -0500 @@ -1,3 +1,5 @@ +WARNING:qemu:qemu received signal 11: /home/eblake/qemu/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64 -chardev socket,id=mon,path=/home/eblake/qemu/tests/qemu-iotests/scratch/tmp4clmPF/qemua-26803-monitor.sock -mon chardev=mon,mode=control -display none -vga none -qtest unix:path=/home/eblake/qemu/tests/qemu-iotests/scratch/qemua-26803-qtest.sock -machine accel=qtest -nodefaults -machine accel=qtest -drive if=virtio,id=drive0,file=/home/eblake/qemu/tests/qemu-iotests/scratch/disk_a,format=qcow2,cache=writeback Any chance you can take a look as to what a non-hack fix should be? -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org