* io_uring question
From: Filipp Mikoian @ 2019-07-04 11:21 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel, linux-aio; +Cc: axboe, jmoyer, viro

Dear io_uring developers,

Recently I started playing with io_uring, and the main difference I expected
to see compared to the old AIO (io_submit(), etc.) was that the submission
syscall (io_uring_enter()) would not block when submission might take a long
time, e.g. when it has to wait for a slot in the block device's request
queue. AFAIU, the 'workers' machinery exists solely to submit requests in an
async context, so the calling thread is never forced to block for a
significant time. At worst, EAGAIN is expected.

However, when I installed a fresh 5.2.0-rc7 kernel on a machine with an HDD
whose request queue is 64 requests deep, I noticed a significant increase in
the time spent in io_uring_enter() once the request queue became full. Below
is the output of a program that submits random (within a 1 GB range) 4K read
requests in batches of 32. Although O_DIRECT is used here, the same
phenomenon is observed when going through the page cache. The source code
can be found here:
https://github.com/Phikimon/io_uring_question
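
The core of the submission loop looks roughly like this (a minimal sketch
against the liburing API rather than the exact code from the repository
above; error handling, cleanup, and memory barriers are omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <sys/uio.h>
    #include <liburing.h>

    #define BATCH 32
    #define BLOCK 4096
    #define RANGE (1024ULL * 1024 * 1024)   /* 1 GB */

    int main(int argc, char **argv)
    {
            struct io_uring ring;
            struct timeval t1, t2;
            int fd, ret, submitted = 0;

            if (argc < 2)
                    return 1;
            fd = open(argv[1], O_RDONLY | O_DIRECT);
            io_uring_queue_init(256, &ring, 0);

            for (int batch = 0; batch < 7; batch++) {
                    for (int i = 0; i < BATCH; i++) {
                            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                            struct iovec *iov = malloc(sizeof(*iov));
                            off_t off = (rand() % (RANGE / BLOCK)) * BLOCK;

                            posix_memalign(&iov->iov_base, BLOCK, BLOCK);
                            iov->iov_len = BLOCK;
                            io_uring_prep_readv(sqe, fd, iov, 1, off);
                    }
                    gettimeofday(&t1, NULL);
                    ret = io_uring_submit(&ring);  /* io_uring_enter() inside */
                    gettimeofday(&t2, NULL);
                    printf("submitted_already = %3d, submitted_now = %3d, "
                           "submit_time = %7ld us\n", submitted, ret,
                           (t2.tv_sec - t1.tv_sec) * 1000000L +
                           (t2.tv_usec - t1.tv_usec));
                    submitted += ret;
            }
            return 0;
    }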

While analyzing the stack dump, I found that having IOCB_NOWAIT set does not
prevent generic_file_read_iter() from calling blkdev_direct_IO(), so the
thread gets stuck for hundreds of milliseconds. However, I am not a Linux
kernel expert, so I cannot be sure this is actually related to the issue
described above.

Is io_uring actually expected to sleep when there is no free slot in the
block device's request queue, or is this a bug in the current implementation?

root@localhost:~/io_uring# uname -msr
Linux 5.2.0-rc7 x86_64
root@localhost:~/io_uring# hdparm -I /dev/sda | grep Model
Model Number:       Hitachi HTS541075A9E680
root@localhost:~/io_uring# cat /sys/block/sda/queue/nr_requests
64
root@localhost:~/io_uring# ./io_uring_read_blkdev /dev/sda8
submitted_already =   0, submitted_now =  32, submit_time =     246 us
submitted_already =  32, submitted_now =  32, submit_time =     130 us
submitted_already =  64, submitted_now =  32, submit_time =  189548 us
submitted_already =  96, submitted_now =  32, submit_time =  121542 us
submitted_already = 128, submitted_now =  32, submit_time =  128314 us
submitted_already = 160, submitted_now =  32, submit_time =  136345 us
submitted_already = 192, submitted_now =  32, submit_time =  162320 us
root@localhost:~/io_uring# cat pstack_output # This is where process slept
[<0>] io_schedule+0x16/0x40
[<0>] blk_mq_get_tag+0x166/0x280
[<0>] blk_mq_get_request+0xde/0x380
[<0>] blk_mq_make_request+0x11e/0x5b0
[<0>] generic_make_request+0x191/0x3c0
[<0>] submit_bio+0x75/0x140
[<0>] blkdev_direct_IO+0x3f8/0x4a0
[<0>] generic_file_read_iter+0xbf/0xdc0
[<0>] blkdev_read_iter+0x37/0x40
[<0>] io_read+0xf6/0x180
[<0>] __io_submit_sqe+0x1cd/0x6a0
[<0>] io_submit_sqe+0xea/0x4b0
[<0>] io_ring_submit+0x86/0x120
[<0>] __x64_sys_io_uring_enter+0x241/0x2d0
[<0>] do_syscall_64+0x60/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

P.S. There are also several suspicious places in liburing and in io_uring's
     kernel part. I'm not sure whether these are really bugs, so please point
     out which of them, if any, need a fix. Among them:
1. Inaccurate error handling in liburing's __io_uring_submit(). Because
   liburing currently ignores the queue head that the kernel sets, it cannot
   know how many entries have actually been consumed. If, e.g.,
   io_uring_enter() returns EAGAIN and consumes none of the SQEs,
   sq->sqe_head still advances in __io_uring_submit(), which can eventually
   cause both io_uring_submit() and io_uring_get_sqe() to return 0 forever.
2. There is also a related issue -- when using IORING_SETUP_SQPOLL, if the
   polling kernel thread has already gone to sleep (IORING_SQ_NEED_WAKEUP is
   set), io_uring_enter() just wakes it up and immediately reports that all
   @to_submit requests were consumed, while this is not true until the
   awakened thread manages to handle them. At the very least this contradicts
   the man page, which states:
   > When the system call returns that a certain amount of SQEs have been
   > consumed and submitted, it's safe to reuse SQE entries in the ring.
   It is easy to reproduce this bug -- just change, e.g., the ->offset field
   of an SQE immediately after io_uring_enter() successfully returns and you
   will see that the IO happens at the new offset (see the sketch after this
   list).
3. Again, due to the lack of synchronization between io_sq_thread() and
   io_uring_enter(), when the ring is full and IORING_SETUP_SQPOLL is used,
   there seems to be no way for the application to wait for SQ slots to
   become available other than busy waiting for *sq->khead to advance. Thus
   from one busy-waiting thread we get two. Is this the expected behavior?
   Should the user of IORING_SETUP_SQPOLL busy-wait for SQ slots?
4. A minor one: if sq_thread_idle is set to a ridiculously large value (e.g.
   100 sec), the kernel watchdog starts reporting this as a bug.
   > Message from syslogd@centos-linux at Jun 21 20:00:04 ...
   >  kernel:watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [io_uring-sq:10691]
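
Regarding item 2, roughly the following is enough to see the problem (a
minimal sketch, with names chosen only for illustration; it assumes the ring
was created with IORING_SETUP_SQPOLL and the target fd was registered as
fixed file index 0, since SQPOLL submission requires fixed files, and that
iov points to a prepared buffer):

    #include <liburing.h>

    /* Sketch for item 2: modify an SQE right after io_uring_enter()
     * "successfully" returns and watch the IO land at the new offset. */
    static void show_offset_race(struct io_uring *ring, struct iovec *iov)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            io_uring_prep_readv(sqe, 0 /* fixed file index */, iov, 1, 0);
            sqe->flags |= IOSQE_FIXED_FILE;

            io_uring_submit(ring);  /* may only wake the sq thread and return */

            /* If the sq thread has not consumed the SQE yet, this write races
             * with the kernel and the read may happen at the new offset: */
            sqe->off = 1024ULL * 1024 * 1024;
    }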

Looking forward to your reply and thank you in advance.

* Re: io_uring question
From: Jens Axboe @ 2019-07-16 17:04 UTC (permalink / raw)
  To: Filipp Mikoian, linux-block, linux-fsdevel, linux-kernel, linux-aio
  Cc: jmoyer, viro

[-- Attachment #1: Type: text/plain, Size: 5616 bytes --]

On 7/4/19 5:21 AM, Filipp Mikoian wrote:
> Dear io_uring developers,
> 
> Recently I started playing with io_uring, and the main difference I
> expected to see compared to the old AIO (io_submit(), etc.) was that the
> submission syscall (io_uring_enter()) would not block when submission might
> take a long time, e.g. when it has to wait for a slot in the block device's
> request queue. AFAIU, the 'workers' machinery exists solely to submit
> requests in an async context, so the calling thread is never forced to
> block for a significant time. At worst, EAGAIN is expected.
> 
> However, when I installed a fresh 5.2.0-rc7 kernel on a machine with an HDD
> whose request queue is 64 requests deep, I noticed a significant increase
> in the time spent in io_uring_enter() once the request queue became full.
> Below is the output of a program that submits random (within a 1 GB range)
> 4K read requests in batches of 32. Although O_DIRECT is used here, the same
> phenomenon is observed when going through the page cache. The source code
> can be found here:
> https://github.com/Phikimon/io_uring_question
> 
> While analyzing the stack dump, I found that having IOCB_NOWAIT set does
> not prevent generic_file_read_iter() from calling blkdev_direct_IO(), so
> the thread gets stuck for hundreds of milliseconds. However, I am not a
> Linux kernel expert, so I cannot be sure this is actually related to the
> issue described above.
> 
> Is io_uring actually expected to sleep when there is no free slot in the
> block device's request queue, or is this a bug in the current implementation?
> 
> root@localhost:~/io_uring# uname -msr
> Linux 5.2.0-rc7 x86_64
> root@localhost:~/io_uring# hdparm -I /dev/sda | grep Model
> Model Number:       Hitachi HTS541075A9E680
> root@localhost:~/io_uring# cat /sys/block/sda/queue/nr_requests
> 64
> root@localhost:~/io_uring# ./io_uring_read_blkdev /dev/sda8
> submitted_already =   0, submitted_now =  32, submit_time =     246 us
> submitted_already =  32, submitted_now =  32, submit_time =     130 us
> submitted_already =  64, submitted_now =  32, submit_time =  189548 us
> submitted_already =  96, submitted_now =  32, submit_time =  121542 us
> submitted_already = 128, submitted_now =  32, submit_time =  128314 us
> submitted_already = 160, submitted_now =  32, submit_time =  136345 us
> submitted_already = 192, submitted_now =  32, submit_time =  162320 us
> root@localhost:~/io_uring# cat pstack_output # This is where process slept
> [<0>] io_schedule+0x16/0x40
> [<0>] blk_mq_get_tag+0x166/0x280
> [<0>] blk_mq_get_request+0xde/0x380
> [<0>] blk_mq_make_request+0x11e/0x5b0
> [<0>] generic_make_request+0x191/0x3c0
> [<0>] submit_bio+0x75/0x140
> [<0>] blkdev_direct_IO+0x3f8/0x4a0
> [<0>] generic_file_read_iter+0xbf/0xdc0
> [<0>] blkdev_read_iter+0x37/0x40
> [<0>] io_read+0xf6/0x180
> [<0>] __io_submit_sqe+0x1cd/0x6a0
> [<0>] io_submit_sqe+0xea/0x4b0
> [<0>] io_ring_submit+0x86/0x120
> [<0>] __x64_sys_io_uring_enter+0x241/0x2d0
> [<0>] do_syscall_64+0x60/0x1a0
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [<0>] 0xffffffffffffffff

Sorry for originally missing this one! For this particular issue, it
looks like you are right: we don't honor NOWAIT for request allocation.
That's a bug; I'm pondering how best to fix it. Can you try the attached
patch and see if it fixes it for you?

> 1. Inaccurate error handling in liburing's __io_uring_submit(). Because
> liburing currently ignores the queue head that the kernel sets, it cannot
> know how many entries have actually been consumed. If, e.g.,
> io_uring_enter() returns EAGAIN and consumes none of the SQEs, sq->sqe_head
> still advances in __io_uring_submit(), which can eventually cause both
> io_uring_submit() and io_uring_get_sqe() to return 0 forever.

I'll look into that one.

> 2. There is also a related issue -- when using IORING_SETUP_SQPOLL, if the
> polling kernel thread has already gone to sleep (IORING_SQ_NEED_WAKEUP is
> set), io_uring_enter() just wakes it up and immediately reports that all
> @to_submit requests were consumed, while this is not true until the
> awakened thread manages to handle them. At the very least this contradicts
> the man page, which states:
>     > When the system call returns that a certain amount of SQEs have
>     > been consumed and submitted, it's safe to reuse SQE entries in
>     > the ring.
>     It is easy to reproduce this bug -- just change, e.g., the ->offset
>     field of an SQE immediately after io_uring_enter() successfully
>     returns and you will see that the IO happens at the new offset.

Not sure how best to convey that bit of information. If you're using
the sq thread for submission, then we cannot reliably tell the
application when an sqe has been consumed. The application must look for
completions (successful or errors) in the CQ ring.
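
Roughly along these lines, just to illustrate (a liburing sketch, where
'inflight' is whatever counter the application keeps of submitted but not
yet completed requests):

    #include <stdio.h>
    #include <liburing.h>

    /* Sketch: with SQPOLL, treat the CQ ring as the source of truth and
     * reap completions instead of trusting io_uring_enter()'s count. */
    static void reap_completions(struct io_uring *ring, int inflight)
    {
            struct io_uring_cqe *cqe;

            while (inflight) {
                    if (io_uring_wait_cqe(ring, &cqe) < 0)  /* waits for one CQE */
                            break;
                    if (cqe->res < 0)
                            fprintf(stderr, "request failed: %d\n", cqe->res);
                    io_uring_cqe_seen(ring, cqe);           /* advance the CQ head */
                    inflight--;
            }
    }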

> 3. Again, due to the lack of synchronization between io_sq_thread() and
> io_uring_enter(), when the ring is full and IORING_SETUP_SQPOLL is used,
> there seems to be no way for the application to wait for SQ slots to
> become available other than busy waiting for *sq->khead to advance. Thus
> from one busy-waiting thread we get two. Is this the expected behavior?
> Should the user of IORING_SETUP_SQPOLL busy-wait for SQ slots?

You could wait on CQ ring completions; each sqe should trigger one.

> 4. A minor one: if sq_thread_idle is set to a ridiculously large
> value (e.g. 100 sec), the kernel watchdog starts reporting this as a bug.
>     > Message from syslogd@centos-linux at Jun 21 20:00:04 ...
>     > kernel:watchdog: BUG: soft lockup - CPU#0 stuck for 21s!
>     > [io_uring-sq:10691]

Ah yes, cosmetic issue, I'll address that one as well.

-- 
Jens Axboe


[-- Attachment #2: io_uring-direct-nowait.patch --]
[-- Type: text/x-patch, Size: 4027 bytes --]

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b038ec680e84..81409162e662 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1959,10 +1959,14 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	data.cmd_flags = bio->bi_opf;
 	rq = blk_mq_get_request(q, bio, &data);
 	if (unlikely(!rq)) {
+		blk_qc_t ret = BLK_QC_T_NONE;
+
 		rq_qos_cleanup(q, bio);
-		if (bio->bi_opf & REQ_NOWAIT)
+		if (bio->bi_opf & REQ_NOWAIT_RET)
+			ret = BLK_QC_T_EAGAIN;
+		else if (bio->bi_opf & REQ_NOWAIT)
 			bio_wouldblock_error(bio);
-		return BLK_QC_T_NONE;
+		return ret;
 	}
 
 	trace_block_getrq(q, bio, bio->bi_opf);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index f00b569a9f89..6a09f483f15c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -344,15 +344,24 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 	struct bio *bio;
 	bool is_poll = (iocb->ki_flags & IOCB_HIPRI) != 0;
 	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+	bool nowait = (iocb->ki_flags & IOCB_NOWAIT) != 0;
 	loff_t pos = iocb->ki_pos;
 	blk_qc_t qc = BLK_QC_T_NONE;
+	gfp_t gfp;
 	int ret = 0;
 
 	if ((pos | iov_iter_alignment(iter)) &
 	    (bdev_logical_block_size(bdev) - 1))
 		return -EINVAL;
 
-	bio = bio_alloc_bioset(GFP_KERNEL, nr_pages, &blkdev_dio_pool);
+	if (nowait)
+		gfp = GFP_NOWAIT;
+	else
+		gfp = GFP_KERNEL;
+
+	bio = bio_alloc_bioset(gfp, nr_pages, &blkdev_dio_pool);
+	if (!bio)
+		return -EAGAIN;
 
 	dio = container_of(bio, struct blkdev_dio, bio);
 	dio->is_sync = is_sync = is_sync_kiocb(iocb);
@@ -397,6 +406,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			bio->bi_opf = dio_bio_write_op(iocb);
 			task_io_account_write(bio->bi_iter.bi_size);
 		}
+		if (nowait)
+			bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_RET);
 
 		dio->size += bio->bi_iter.bi_size;
 		pos += bio->bi_iter.bi_size;
@@ -411,6 +422,10 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			}
 
 			qc = submit_bio(bio);
+			if (qc == BLK_QC_T_EAGAIN) {
+				ret = -EAGAIN;
+				goto err;
+			}
 
 			if (polled)
 				WRITE_ONCE(iocb->ki_cookie, qc);
@@ -431,7 +446,12 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			atomic_inc(&dio->ref);
 		}
 
-		submit_bio(bio);
+		qc = submit_bio(bio);
+		if (qc == BLK_QC_T_EAGAIN) {
+			ret = -EAGAIN;
+			goto err;
+		}
+
 		bio = bio_alloc(GFP_KERNEL, nr_pages);
 	}
 
@@ -452,6 +472,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 	}
 	__set_current_state(TASK_RUNNING);
 
+out:
 	if (!ret)
 		ret = blk_status_to_errno(dio->bio.bi_status);
 	if (likely(!ret))
@@ -459,6 +480,10 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 
 	bio_put(&dio->bio);
 	return ret;
+err:
+	if (!is_poll)
+		blk_finish_plug(&plug);
+	goto out;
 }
 
 static ssize_t
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index feff3fe4467e..3d89a3c04977 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -311,6 +311,7 @@ enum req_flag_bits {
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_BACKGROUND,	/* background IO */
 	__REQ_NOWAIT,           /* Don't wait if request will block */
+	__REQ_NOWAIT_RET,	/* Return would-block error inline */
 	/*
 	 * When a shared kthread needs to issue a bio for a cgroup, doing
 	 * so synchronously can lead to priority inversions as the kthread
@@ -345,6 +346,7 @@ enum req_flag_bits {
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
+#define REQ_NOWAIT_RET		(1ULL << __REQ_NOWAIT_RET)
 #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
@@ -418,6 +420,7 @@ static inline int op_stat_group(unsigned int op)
 
 typedef unsigned int blk_qc_t;
 #define BLK_QC_T_NONE		-1U
+#define BLK_QC_T_EAGAIN		-2U
 #define BLK_QC_T_SHIFT		16
 #define BLK_QC_T_INTERNAL	(1U << 31)
 

* Re: io_uring question
From: Filipp Mikoian @ 2019-07-17 14:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel, linux-kernel, linux-aio
  Cc: jmoyer, viro

> Can you try the attached patch and see if it fixes it for you?

Thank you very much, that worked like a charm for both O_DIRECT and the page
cache. Below is the output for O_DIRECT read submission on the same machine:

root@localhost:~/io_uring# ./io_uring_read_blkdev /dev/sda8
submitted_already =   0, submitted_now =  32, submit_time =     277 us
submitted_already =  32, submitted_now =  32, submit_time =     131 us
submitted_already =  64, submitted_now =  32, submit_time =     213 us
submitted_already =  96, submitted_now =  32, submit_time =     170 us
submitted_already = 128, submitted_now =  32, submit_time =     161 us
submitted_already = 160, submitted_now =  32, submit_time =     169 us
submitted_already = 192, submitted_now =  32, submit_time =     184 us

> Not sure how best to convery that bit of information. If you're using
> the sq thread for submission, then we cannot reliably tell the
> application when an sqe has been consumed. The application must look for
> completions (successful or errors) in the CQ ring.

I know that SQPOLL support is not yet fully implemented in liburing, so for
now it seems that io_uring_get_sqe() can return an SQE that has not actually
been submitted yet, and editing it can lead to a race between the kernel
polling thread and user space. I just think this fact is worth mentioning in
the documentation.

> You could wait on cq ring completions, each sqe should trigger one.

Unfortunately, a few issues seem to arise if this approach is taken in an
IO-intensive application. As a disclaimer, I should note that SQ ring
overflow is a rare event given enough entries; nevertheless, applications,
especially those using SQPOLL, should handle this situation gracefully and
in a performant manner.

So what we have is a highly IO-intensive application that submits very slow
IOs*** (that is why it uses async IO in the first place) and cares a lot
about the progress of the submitting threads (the most probable reason to
use the SQPOLL feature). Given such prerequisites, the following scenario is
probable:

*** by 'very slow' I mean IOs whose completion takes significantly more time
    than their submission

1. Put @sq_entries SQEs with very slow IOs into the SQ...
  PENDING      SQ     INFLIGHT       CQ
   +---+     +---+     +---+     +---+---+
============>| X |     |   |     |   |   |
   +---+     +---+     +---+     +---+---+
   ...which will be submitted by the polling thread
  PENDING      SQ     INFLIGHT       CQ
   +---+     +---+     +---+     +---+---+
   |   |     |   |====>| X |     |   |   |
   +---+     +---+     +---+     +---+---+
2. Then try to add (@sq_entries + @pending) entries to the SQ, but only
   succeed with @sq_entries.
  PENDING      SQ     INFLIGHT       CQ
   +---+     +---+     +---+     +---+---+
==>| X |====>| X |     | X |     |   |   |
   +---+     +---+     +---+     +---+---+
3. Wait a very long time in io_uring_enter(GETEVENTS) for a CQ ring
   completion...
  PENDING      SQ     INFLIGHT       CQ
   +---+     +---+     +---+     +---+---+
   | X |     | X |     |   |====>| X |   |
   +---+     +---+     +---+     +---+---+
   ...and still there is no guarantee that a slot in the SQ ring has become
   available. Should we call
       io_uring_enter(GETEVENTS, min_complete = 1);
   in a loop, checking (*khead == *ktail) at every iteration?
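
In code, I imagine something like the following sketch (raw syscall, assuming
__NR_io_uring_enter is available in <sys/syscall.h>; proper load-acquire of
the ring pointers is omitted, and here I wait for at least one free SQ slot
rather than for *khead == *ktail):

    #include <sys/syscall.h>
    #include <unistd.h>
    #include <liburing.h>

    /* Sketch: block on completions until the sq thread frees an SQ slot. */
    static void wait_for_sq_space(struct io_uring *ring)
    {
            struct io_uring_sq *sq = &ring->sq;

            while (*sq->ktail - *sq->khead == *sq->kring_entries) {
                    /* SQ still full: wait for one completion and re-check */
                    syscall(__NR_io_uring_enter, ring->ring_fd, 0, 1,
                            IORING_ENTER_GETEVENTS, NULL, 0);
            }
    }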

Concluding, it seems reasonable to instruct applications using SQPOLL to
submit SQEs until the queue is full, and then call io_uring_enter(), probably
with some flag, to wait for a slot in the submission queue rather than for
completions, since
1) The time needed to complete an IO tends to be much greater than the time
   needed to submit it.
2) A CQ ring completion does not imply that a slot became available in the SQ
   (see the diagram above).
3) Busy waiting in the submitting thread is probably not what SQPOLL users
   want.

Side note: event-loop-driven applications could find it convenient to
epoll() the io_uring fd with EPOLLOUT to wait for an available entry in the
SQ. Do I understand correctly that spurious wakeups are currently possible,
since io_uring_poll() is only woken on io_commit_cqring(), which, as shown
above, does not guarantee that EPOLLOUT can be set?
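
For reference, the epoll()-based variant I have in mind is roughly this
sketch (whether EPOLLOUT can be relied upon here is exactly the question
above):

    #include <sys/epoll.h>
    #include <unistd.h>
    #include <liburing.h>

    /* Sketch: wait for the io_uring fd to report EPOLLOUT (SQ space)
     * instead of busy waiting; may wake up spuriously, as discussed. */
    static void wait_for_epollout(struct io_uring *ring)
    {
            struct epoll_event ev = {
                    .events = EPOLLOUT,
                    .data.fd = ring->ring_fd,
            };
            int epfd = epoll_create1(0);

            epoll_ctl(epfd, EPOLL_CTL_ADD, ring->ring_fd, &ev);
            epoll_wait(epfd, &ev, 1, -1);
            close(epfd);
    }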

Thank you again!
--
Best regards,
Filipp Mikoian

