On Thu, Oct 22, 2020 at 05:29:16PM +0100, Fam Zheng wrote:
> On Tue, 2020-10-20 at 09:34 +0800, Zhenyu Ye wrote:
> > On 2020/10/19 21:25, Paolo Bonzini wrote:
> > > On 19/10/20 14:40, Zhenyu Ye wrote:
> > > > The kernel backtrace for io_submit in GUEST is:
> > > >
> > > > guest# ./offcputime -K -p `pgrep -nx fio`
> > > >     b'finish_task_switch'
> > > >     b'__schedule'
> > > >     b'schedule'
> > > >     b'io_schedule'
> > > >     b'blk_mq_get_tag'
> > > >     b'blk_mq_get_request'
> > > >     b'blk_mq_make_request'
> > > >     b'generic_make_request'
> > > >     b'submit_bio'
> > > >     b'blkdev_direct_IO'
> > > >     b'generic_file_read_iter'
> > > >     b'aio_read'
> > > >     b'io_submit_one'
> > > >     b'__x64_sys_io_submit'
> > > >     b'do_syscall_64'
> > > >     b'entry_SYSCALL_64_after_hwframe'
> > > >     -                fio (1464)
> > > >         40031912
> > > >
> > > > And Linux io_uring can avoid the latency problem.
>
> Thanks for the info. What this tells us is basically the inflight
> requests are high. It's sad that the linux-aio is in practice
> implemented as a blocking API.
>
> Host side backtrace will be of more help. Can you get that too?

I guess Linux AIO didn't set the BLK_MQ_REQ_NOWAIT flag so the task went
to sleep when it ran out of blk-mq tags.

The easiest solution is to move to io_uring. Linux AIO is broken - it's
not AIO :).

If we know that no other process is writing to the host block device
then maybe we can determine the blk-mq tags limit (the queue depth) and
avoid sending more requests. That way QEMU doesn't block, but I don't
think this approach works when other processes are submitting I/O to the
same host block device :(.

Fam's original suggestion of invoking io_submit(2) from a worker thread
is an option, but I'm afraid it will slow down the uncontended case.

I'm CCing Glauber in case he battled this in the past in ScyllaDB.

Stefan
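
For illustration, the queue-depth idea (cap the number of in-flight
requests so the submitter never runs out of blk-mq tags) could be
sketched roughly as below. This is a hypothetical Python sketch, not
QEMU code: InflightLimiter, read_queue_depth and submit_fn are made-up
names, and using /sys/block/<dev>/queue/nr_requests as a stand-in for
the tag limit is an assumption. It also only helps when no other
process submits I/O to the same host block device.

```python
import os
from collections import deque


def read_queue_depth(dev, sysfs_root="/sys/block"):
    """Return <sysfs_root>/<dev>/queue/nr_requests as an int.

    Assumption: nr_requests is used here as an approximation of the
    blk-mq tag limit; the real limit may differ per driver/scheduler.
    """
    path = os.path.join(sysfs_root, dev, "queue", "nr_requests")
    with open(path) as f:
        return int(f.read().strip())


class InflightLimiter:
    """Hypothetical helper: keep at most `depth` requests submitted."""

    def __init__(self, depth, submit_fn):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self.depth = depth
        self.submit_fn = submit_fn  # stand-in for the real io_submit(2) call
        self.inflight = 0
        self.pending = deque()      # requests held back in userspace

    def submit(self, req):
        """Submit now if a slot is free, else queue it.

        Returns True if the request was passed to submit_fn immediately.
        """
        if self.inflight < self.depth:
            self.inflight += 1
            self.submit_fn(req)
            return True
        self.pending.append(req)
        return False

    def complete(self):
        """Call once per completed request; reuses the freed slot."""
        if self.inflight == 0:
            raise RuntimeError("completion without a matching submission")
        self.inflight -= 1
        if self.pending:
            self.inflight += 1
            self.submit_fn(self.pending.popleft())
```

The point of the design is that the excess requests wait in a userspace
queue instead of putting the submitting thread to sleep inside
io_submit(2); whether the extra bookkeeping is worth it compared to
io_uring is a separate question.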