On Mon, 11 May 2020 12:49:47 +0100
Daniel P. Berrangé wrote:

> On Mon, May 11, 2020 at 01:14:34PM +0200, Lukas Straub wrote:
> > Hello Everyone,
> > In many cases, if qemu has a network connection (qmp, migration, chardev, etc.)
> > to some other server and that server dies or hangs, qemu hangs too.
>
> If qemu as a whole hangs due to a stalled network connection, that is a
> bug in QEMU that we should be fixing IMHO. QEMU should be doing non-blocking
> I/O in general, such that if the network connection or remote server stalls,
> we simply stop sending I/O - we shouldn't ever hang the QEMU process or main
> loop.
>
> There are places in QEMU code which are not well behaved in this respect,
> but many are, and others are getting fixed where found to be important.
>
> Arguably any place in QEMU code which can result in a hang of QEMU in the
> event of a stalled network should be considered a security flaw, because
> the network is untrusted in general.

The fact that out-of-band qmp commands exist at all shows that we have to make
tradeoffs of developer time vs. doing things right. Sure, the migration code can
be rewritten to use non-blocking i/o and finegrained locks. But as a hobbyist I
don't have time to fix this.

> > These patches introduce the new 'yank' out-of-band qmp command to recover from
> > these kinds of hangs. The different subsystems register callbacks which get
> > executed with the yank command. For example the callback can shutdown() a
> > socket. This is intended for the colo use-case, but it can be used for other
> > things too of course.
>
> IIUC, invoking the "yank" command unconditionally kills every single
> network connection in QEMU that has registered with the "yank" subsystem.
> IMHO this is way too big of a hammer, even if we accept there are bugs in
> QEMU not handling stalled networking well.
>
> eg if a chardev hangs QEMU, and we tear down everything, killing the NBD
> connection used for the guest disk, we needlessly break I/O.

Yeah, these patches are intended to solve the problems with the colo use-case
where all external connections (migration, chardevs, nbd) are just for
replication. In other use-cases you'd enable the yank feature only on the
non-essential connections.

> eg doing this in the chardev backend is not desirable, because the bugs
> with hanging QEMU are typically caused by the way the frontend device
> uses the chardev blocking I/O calls, instead of non-blocking I/O calls.
>
>
> Regards,
> Daniel
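
For readers who want to see the mechanism discussed in this thread in isolation
(subsystems register callbacks, the yank command runs them, and a callback can
shutdown() a socket), here is a minimal Python sketch. It is not the code from
the patches and not QEMU's API: YankRegistry, register, yank and demo are
hypothetical names chosen for illustration, and a socketpair whose far end
never sends stands in for the hung remote server. It assumes a platform where
shutdown() wakes a thread blocked in recv() on the same socket, as Linux does.

```python
import socket
import threading


class YankRegistry:
    """Hypothetical registry of recovery callbacks (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._callbacks = []

    def register(self, callback):
        # A subsystem opts in by registering a callback for its connection.
        with self._lock:
            self._callbacks.append(callback)

    def yank(self):
        # Copy under the lock, then call outside it so a slow callback
        # cannot block concurrent registration.
        with self._lock:
            callbacks = list(self._callbacks)
        for callback in callbacks:
            callback()
        return len(callbacks)


def demo():
    registry = YankRegistry()
    # 'peer' never sends anything, so it plays the role of a hung server.
    local, peer = socket.socketpair()
    registry.register(lambda: local.shutdown(socket.SHUT_RDWR))

    result = {}

    def reader():
        # Blocking read: stalls for as long as the peer stays silent.
        result["data"] = local.recv(4096)

    thread = threading.Thread(target=reader)
    thread.start()
    thread.join(timeout=0.2)
    blocked = thread.is_alive()        # still stuck in recv()

    called = registry.yank()           # runs shutdown() on the socket
    thread.join(timeout=5)
    recovered = not thread.is_alive()  # recv() returned, thread finished

    local.close()
    peer.close()
    return blocked, called, recovered, result.get("data")


if __name__ == "__main__":
    print(demo())
```

In this sketch a connection that never calls register() is left untouched by
yank(), which corresponds to enabling the feature only on the non-essential
connections.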