Re: [PATCH v3] io_uring: implement async hybrid mode for pollable requests

From: Pavel Begunkov <asml.silence@gmail.com>
To: Hao Xu <haoxu@linux.alibaba.com>, Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org, Joseph Qi <joseph.qi@linux.alibaba.com>
Subject: Re: [PATCH v3] io_uring: implement async hybrid mode for pollable requests
Date: Fri, 22 Oct 2021 20:49:24 +0100	[thread overview]
Message-ID: <47b4a1bb-7918-8803-90a7-3cf95b99d84f@gmail.com> (raw)
In-Reply-To: <20211018133445.103438-1-haoxu@linux.alibaba.com>

On 10/18/21 14:34, Hao Xu wrote:
> The current logic of requests with IOSQE_ASYNC is first queueing it to
> io-worker, then execute it in a synchronous way. For unbound works like
> pollable requests(e.g. read/write a socketfd), the io-worker may stuck
> there waiting for events for a long time. And thus other works wait in
> the list for a long time too.
> Let's introduce a new way for unbound works (currently pollable
> requests), with this a request will first be queued to io-worker, then
> executed in a nonblock try rather than a synchronous way. Failure of
> that leads it to arm poll stuff and then the worker can begin to handle
> other works.
> The detail process of this kind of requests is:

Looks good, but there are a couple of moments that might be better.
Would be good to have it for 5.16, if it's rolling out soon we can
queue this and do post clean up.

nits below

> step1: original context:
>             queue it to io-worker
> step2: io-worker context:
>             nonblock try(the old logic is a synchronous try here)
>                 |
>                 |--fail--> arm poll
>                              |
>                              |--(fail/ready)-->synchronous issue
>                              |
>                              |--(succeed)-->worker finish it's job, tw
>                                             take over the req
> 
> This works much better than the old IOSQE_ASYNC logic in cases where
> unbound max_worker is relatively small. In this case, number of
> io-worker eazily increments to max_worker, new worker cannot be created
> and running workers stuck there handling old works in IOSQE_ASYNC mode.
> 
> In my 64-core machine, set unbound max_worker to 20, run echo-server,
> turns out:
> (arguments: register_file, connetion number is 1000, message size is 12
> Byte)
> original IOSQE_ASYNC: 76664.151 tps
> after this patch: 166934.985 tps
> 
> Suggested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
> 
> v1-->v2:
>   - tweak added code in io_wq_submit_work to reduce overhead
> v2-->v3:
>   - remove redundant IOSQE_ASYNC_HYBRID stuff
> 
> 
>   fs/io_uring.c | 36 +++++++++++++++++++++++++++++++++++-
>   1 file changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index b3546eef0289..86819c7917df 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -6747,8 +6747,18 @@ static void io_wq_submit_work(struct io_wq_work *work)
>   		ret = -ECANCELED;

Too much of indention, it's hard to read. I'd suggest to do

if (unlikely(work->flags & IO_WQ_WORK_CANCEL)) {
	io_req_task_queue_fail(req, -ECANCELED);
	return;
}

and kill following "if (!ret)" with extra indention

>   
>   	if (!ret) {
> +		bool needs_poll = false;
> +		unsigned int issue_flags = IO_URING_F_UNLOCKED;
> +
> +		if (req->flags & REQ_F_FORCE_ASYNC) {
> +			needs_poll = req->file && file_can_poll(req->file);

It goes through a weird flow if opcode doesn't use polling,

needs_poll = needs_poll && (def->pollin || def->pollout);

We can even split out checks from io_arm_poll_handler() into
an inline handler.

if (!req->file || !file_can_poll(req->file))
	return IO_APOLL_ABORTED;
if (req->flags & REQ_F_POLLED)
	return IO_APOLL_ABORTED;
if (!def->pollin && !def->pollout)
	return IO_APOLL_ABORTED;

> +			if (needs_poll)
> +				issue_flags |= IO_URING_F_NONBLOCK;
> +		}
> +
>   		do {
> -			ret = io_issue_sqe(req, IO_URING_F_UNLOCKED);
> +issue_sqe:
> +			ret = io_issue_sqe(req, issue_flags);
>   			/*
>   			 * We can get EAGAIN for polled IO even though we're
>   			 * forcing a sync submission from here, since we can't
> @@ -6756,6 +6766,30 @@ static void io_wq_submit_work(struct io_wq_work *work)
>   			 */
>   			if (ret != -EAGAIN)
>   				break;
> +			if (needs_poll) {

How about inverting it and killing yet another indention level
for the polling stuff?

if (!needs_poll) {
	cond_resched();
	continue;
}

/* arm polling / etc. */

And that's the only place we need cond_resched(), so the one
at the end can be deleted.

> +				bool armed = false;
> +
> +				ret = 0;
> +				needs_poll = false;
> +				issue_flags &= ~IO_URING_F_NONBLOCK;
> +
> +				switch (io_arm_poll_handler(req)) {

1) Not a big fun of back hopping jumps
2) don't like armed logic

What we can do is replace switch with if's, and then you can freely
use break instead of @armed and continue instead of goto. Also,
considering a cond_reshed() comment it can be:

if (io_arm_poll_handler(req) == IO_APOLL_OK)
	break;
continue;

> +				case IO_APOLL_READY:
> +					goto issue_sqe;
> +				case IO_APOLL_ABORTED:
> +					/*
> +					 * somehow we failed to arm the poll infra,
> +					 * fallback it to a normal async worker try.
> +					 */
> +					break;
> +				case IO_APOLL_OK:
> +					armed = true;
> +					break;
> +				}
> +
> +				if (armed)
> +					break;
> +			}
>   			cond_resched();
>   		} while (1);
>   	}
> 

With all but (pollin || pollout) changes looks for me like
below (not tested), I guess can be brushed further.

static void io_wq_submit_work(struct io_wq_work *work)
{
	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
	unsigned int issue_flags = IO_URING_F_UNLOCKED;
	bool needs_poll = false;
	struct io_kiocb *timeout;
	int ret;

	/* one will be dropped by ->io_free_work() after returning to io-wq */
	if (!(req->flags & REQ_F_REFCOUNT))
		__io_req_set_refcount(req, 2);
	else
		req_ref_get(req);

	timeout = io_prep_linked_timeout(req);
	if (timeout)
		io_queue_linked_timeout(timeout);

	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
	if (unlikely(work->flags & IO_WQ_WORK_CANCEL)) {
		io_req_task_queue_fail(req, -ECANCELED);
		return;
	}

	if (req->flags & REQ_F_FORCE_ASYNC) {
		needs_poll = req->file && file_can_poll(req->file);
		if (needs_poll)
			issue_flags |= IO_URING_F_NONBLOCK;
	}

	do {
		ret = io_issue_sqe(req, issue_flags);
		/*
		 * We can get EAGAIN for polled IO even though we're
		 * forcing a sync submission from here, since we can't
		 * wait for request slots on the block side.
		 */
		if (ret != -EAGAIN)
			break;
		if (!needs_poll) {
			cond_resched();
			continue;
		}

		if (io_arm_poll_handler(req) == IO_APOLL_OK)
			return;
		/* ready or aborted, in either case fall back to blocking */
		needs_poll = false;
		issue_flags &= ~IO_URING_F_NONBLOCK;
	} while (1);

	/* avoid locking problems by failing it from a clean context */
	if (ret)
		io_req_task_queue_fail(req, ret);
}

-- 
Pavel Begunkov