On Thu, Oct 22, 2020 at 05:29:16PM +0100, Fam Zheng wrote:
> On Tue, 2020-10-20 at 09:34 +0800, Zhenyu Ye wrote:
> > On 2020/10/19 21:25, Paolo Bonzini wrote:
> > > On 19/10/20 14:40, Zhenyu Ye wrote:
> > > > The kernel backtrace for io_submit in GUEST is:
> > > > 
> > > > 	guest# ./offcputime -K -p `pgrep -nx fio`
> > > > 	    b'finish_task_switch'
> > > > 	    b'__schedule'
> > > > 	    b'schedule'
> > > > 	    b'io_schedule'
> > > > 	    b'blk_mq_get_tag'
> > > > 	    b'blk_mq_get_request'
> > > > 	    b'blk_mq_make_request'
> > > > 	    b'generic_make_request'
> > > > 	    b'submit_bio'
> > > > 	    b'blkdev_direct_IO'
> > > > 	    b'generic_file_read_iter'
> > > > 	    b'aio_read'
> > > > 	    b'io_submit_one'
> > > > 	    b'__x64_sys_io_submit'
> > > > 	    b'do_syscall_64'
> > > > 	    b'entry_SYSCALL_64_after_hwframe'
> > > > 	    -                fio (1464)
> > > > 		40031912
> > > > 
> > > > And Linux io_uring can avoid the latency problem.
> 
> Thanks for the info. What this tells us is basically the inflight
> requests are high. It's sad that the linux-aio is in practice
> implemented as a blocking API.
> 
> Host side backtrace will be of more help. Can you get that too?

I guess Linux AIO didn't set the BLK_MQ_REQ_NOWAIT flag so the task went
to sleep when it ran out of blk-mq tags. The easiest solution is to move
to io_uring. Linux AIO is broken - it's not AIO :).

If we know that no other process is submitting I/O to the host block
device then maybe we can determine the blk-mq tags limit (the queue
depth) and avoid sending more requests than that. That way QEMU doesn't
block, but I don't think this approach works when other processes are
submitting I/O to the same host block device :(.
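
A rough sketch of the queue-depth idea, in Python only for brevity (QEMU
itself is C). It assumes the sysfs attribute queue/nr_requests is a
usable stand-in for the blk-mq tag limit, which may not hold exactly for
every driver or I/O scheduler. read_queue_depth() and InflightLimiter
are hypothetical names, not existing QEMU code.

```python
import os


def read_queue_depth(dev, sysfs_root="/sys/block"):
    """Return queue/nr_requests for a block device, or None if unreadable.

    Assumption: nr_requests approximates the number of blk-mq tags.
    """
    path = os.path.join(sysfs_root, dev, "queue", "nr_requests")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


class InflightLimiter:
    """Hypothetical helper: stop submitting once `limit` requests are in flight."""

    def __init__(self, limit):
        self.limit = limit
        self.inflight = 0

    def try_acquire(self):
        # Caller should keep the request queued in userspace when this
        # returns False, instead of calling io_submit() and risking a sleep.
        if self.inflight >= self.limit:
            return False
        self.inflight += 1
        return True

    def release(self):
        # Call once per completed request.
        assert self.inflight > 0
        self.inflight -= 1
```

The limiter only counts requests from this one process, so it has the
same weakness as described above: another process using the device can
still exhaust the tags.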

Fam's original suggestion of invoking io_submit(2) from a worker thread
is an option, but I'm afraid it will slow down the uncontended case.
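
For illustration, the worker-thread approach could look roughly like the
sketch below (Python for brevity; SubmitWorker and submit_fn are
hypothetical names, and submit_fn stands in for the call that may block,
e.g. io_submit(2)). The caller only enqueues and returns, so it never
sleeps on tag allocation.

```python
import queue
import threading


class SubmitWorker:
    """Hypothetical sketch: run a possibly-blocking submit call off-thread."""

    def __init__(self, submit_fn):
        self._submit_fn = submit_fn      # stand-in for the blocking submit
        self._q = queue.Queue()          # unbounded, so put() does not block
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, request):
        # Called from the event loop: hand off and return immediately.
        self._q.put(request)

    def _run(self):
        while True:
            request = self._q.get()
            if request is None:          # sentinel: shut down
                break
            self._submit_fn(request)     # may sleep; only this thread waits

    def stop(self):
        self._q.put(None)
        self._thread.join()
```

Every request pays for the queue handoff and a thread wakeup even when
the submit call would have returned immediately, which is the
uncontended-case cost mentioned above.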

I'm CCing Glauber in case he battled this in the past in ScyllaDB.

Stefan