On Mon, Sep 21, 2020 at 11:14:35AM +0000, Fam Zheng wrote:
> On 2020-09-19 10:22, Zhenyu Ye wrote:
> > On 2020/9/18 22:06, Fam Zheng wrote:
> > >
> > > I can see how blocking in a slow io_submit can cause trouble for main
> > > thread. I think one way to fix it (until it's made truly async in new
> > > kernels) is moving the io_submit call to thread pool, and wrapped in a
> > > coroutine, perhaps.
> > >
> >
> > I'm not sure if any other operation will block the main thread, other
> > than io_submit().
>
> Then that's a problem with io_submit which should be fixed. Or more
> precisely, that is a long held lock that we should avoid in QEMU's event
> loops.
>
> > >
> > > I'm not sure qmp timeout is a complete solution because we would still
> > > suffer from a blocked state for a period, in this exact situation before
> > > the timeout.
> >
> > Anyway, the qmp timeout may be the last measure to prevent the VM
> > soft lockup.
>
> Maybe, but I don't think baking such a workaround into the QMP API is a
> good idea. No QMP command should be synchronously long running, so
> having a timeout parameter is just a wrong design.

Sorry, I lost track of this on-going email thread.

Thanks for the backtrace. It shows the io_submit call is done while the
AioContext lock is held. The monitor thread is waiting for the
IOThread's AioContext lock. vcpus threads can get stuck waiting on the
big QEMU lock (BQL) that is held by the monitor in the meantime.

Please collect the kernel backtrace for io_submit so we can understand
why multi-second io_submit latencies happen.

I also suggest trying aio=io_uring to check if Linux io_uring avoids the
latency problem.

Stefan
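
A toy illustration of the lock chain described above, in case it helps
when reading the backtrace. This is not QEMU code. It is a minimal
Python sketch in which two plain locks stand in for the AioContext lock
and the BQL, and a sleep stands in for a slow io_submit. All names
(simulate, submit_seconds, and so on) are made up for the example.

```python
import threading
import time


def simulate(submit_seconds=0.2):
    """Return how long the toy 'vcpu' thread waited for the toy BQL."""
    aio_context_lock = threading.Lock()  # stand-in for the IOThread's AioContext lock
    bql = threading.Lock()               # stand-in for the big QEMU lock
    submit_started = threading.Event()
    monitor_has_bql = threading.Event()
    waited = {}

    def iothread():
        # The slow call happens while the AioContext lock is held.
        with aio_context_lock:
            submit_started.set()
            time.sleep(submit_seconds)   # stand-in for a slow io_submit()

    def monitor():
        # The monitor holds the BQL and then waits for the AioContext lock.
        submit_started.wait()
        with bql:
            monitor_has_bql.set()
            with aio_context_lock:       # blocks until the "io_submit" returns
                pass

    def vcpu():
        # The vcpu only needs the BQL, but the monitor is sitting on it.
        monitor_has_bql.wait()
        start = time.monotonic()
        with bql:
            waited["vcpu"] = time.monotonic() - start

    threads = [threading.Thread(target=f) for f in (iothread, monitor, vcpu)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waited["vcpu"]


if __name__ == "__main__":
    print("vcpu waited %.3f s for the BQL" % simulate())
```

The vcpu thread never touches the AioContext lock, yet in this sketch
its wait tracks the duration of the slow call, because the monitor keeps
the BQL while it is blocked.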