Re: io_uring: not good enough for release

From: Jens Axboe <axboe@kernel.dk>
To: "Stefan Bühler" <source@stbuehler.de>,
	linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: io_uring: not good enough for release
Date: Tue, 23 Apr 2019 14:31:51 -0600
Message-ID: <37071226-375a-07a6-d3d3-21323145de71@kernel.dk>
In-Reply-To: <366484f9-cc5b-e477-6cc5-6c65f21afdcb@stbuehler.de>

On 4/23/19 1:06 PM, Stefan Bühler wrote:
> Hi,
> 
> now that I've got some of my rust code running with io_uring I don't
> think io_uring is ready.
> 
> If marking it as EXPERIMENTAL (and not "default y") is considered a
> clear flag for "API might still change" I'd recommend going for that.

That might be an option, but I don't think we need to do that. We've
still got at least a few weeks, and the only issue mentioned below that
would really warrant something like that is a change we can easily make
now. All it needs is agreement.

> Here is my current issue list:
> 
> ---
> 
> 1. An error for a submission should be returned as completion for that
> submission.  Please don't break my main event loop with strange error
> codes just because a single operation is broken/not supported/...

So that's the case I was referring to above. We can just make that change,
there's absolutely no reason to have errors passed back through a different
channel.
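
For illustration, here's roughly what that buys the application side. This
is only a sketch: struct sketch_cqe is a hypothetical stand-in with the same
three fields as struct io_uring_cqe, and reap_completions() is a made-up
helper, not part of any API. The point is that a failed request shows up as
a negative res on its own completion, and the loop carries on.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in with the same three fields as struct io_uring_cqe. */
struct sketch_cqe {
	uint64_t user_data;	/* copied from the submission */
	int32_t res;		/* result, or -errno on failure */
	uint32_t flags;
};

/*
 * Walk n completions. A failed request is reported and counted, but it
 * never aborts the loop. Returns the number of successful requests.
 */
static unsigned int reap_completions(const struct sketch_cqe *cqes,
				     unsigned int n, unsigned int *failed)
{
	unsigned int i, ok = 0;

	assert(cqes != NULL && failed != NULL);
	*failed = 0;
	for (i = 0; i < n; i++) {
		if (cqes[i].res < 0) {
			fprintf(stderr, "request %llu failed: %s\n",
				(unsigned long long)cqes[i].user_data,
				strerror(-cqes[i].res));
			(*failed)++;
		} else {
			ok++;
		}
	}
	return ok;
}
```

With that, an unsupported opcode is just one more entry with res < 0, and
the submit path only has to report problems with the ring itself.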

> 2. {read,write}_iter and FMODE_NOWAIT / IOCB_NOWAIT is broken at the vfs
> layer: vfs_{read,write} should set IOCB_NOWAIT if O_NONBLOCK is set when
> they call {read,write}_iter (i.e. init_sync_kiocb/iocb_flags needs to
> convert the flag).
> 
> And all {read,write}_iter should check IOCB_NOWAIT instead of O_NONBLOCK
> (hi there pipe.c!), and set FMODE_NOWAIT if they support IOCB_NOWAIT.
> 
> {read,write}_iter should only queue the IOCB though if is_sync_kiocb()
> returns false (i.e. if ki_callback is set).

That's a trivial fix. I agree that it should be done.

> Because right now an IORING_OP_READV on a blocking pipe *blocks*
> io_uring_enter, and on a non-blocking pipe completes with EAGAIN all the
> time.
> 
> So io_uring (readv) doesn't even work on a pipe!  (At least
> IORING_OP_POLL_ADD is working...)

It works, but it blocks. That can be argued as broken, and I agree that
it is, but it's important to make the distinction!
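
The non-blocking half of that is easy to see from plain userspace, without
io_uring involved at all. A minimal sketch, assuming a Linux/POSIX system;
read_nowait() is a hypothetical helper name, and it uses read(2) on an
O_NONBLOCK descriptor rather than anything io_uring specific:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Set O_NONBLOCK on fd and attempt a single read. Returns the number of
 * bytes read, or -errno on failure (so -EAGAIN if nothing is ready yet).
 */
static ssize_t read_nowait(int fd, void *buf, size_t len)
{
	int flags;
	ssize_t ret;

	assert(buf != NULL && len > 0);
	flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		return -errno;
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -errno;
	ret = read(fd, buf, len);
	return ret < 0 ? -errno : ret;
}
```

On an empty pipe whose write end is still open this returns -EAGAIN right
away, and once a byte has been written it returns 1. Without O_NONBLOCK the
same read() would sit there until data arrives, which is the blocking case.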

> As another side note: timerfd doesn't have read_iter, so needs
> IORING_OP_POLL_ADD too... :(
> 
> (Also RWF_NOWAIT doesn't work in io_uring right now: IOCB_NOWAIT is
> always removed in the workqueue context, and I don't see an early EAGAIN
> completion).

That's a case I didn't consider, that you'd want to see EAGAIN after
it's been punted. Once punted, we're not going to return EAGAIN since
we can now block. Not sure how you'd want to handle that any better...

> 3. io_file_supports_async should check for FMODE_NOWAIT instead of using
> some hard-coded magic checks.

We probably just need to err on the side of caution there, and suffer
the extra async punts.

> 4. io_prep_rw shouldn't disable force_nonblock if FMODE_NOWAIT isn't
> available; it should return EAGAIN instead and let the workqueue handle it.

Agree
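
As a userspace model of what's being proposed (explicitly not the kernel
code: sketch_prep_rw() and SKETCH_FMODE_NOWAIT are made-up names, and the
flag value is arbitrary), the decision would look something like this:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag bit standing in for FMODE_NOWAIT. */
#define SKETCH_FMODE_NOWAIT	0x1u

/*
 * Model of the proposed check. Returns 0 if the request may be issued
 * in the current context, or -EAGAIN if it has to be punted to the
 * workqueue because the file can't promise not to block.
 */
static int sketch_prep_rw(unsigned int f_mode, bool force_nonblock)
{
	/* Worker context: blocking is fine, just issue it. */
	if (!force_nonblock)
		return 0;

	/* Inline attempt on a file without NOWAIT support: punt. */
	if (!(f_mode & SKETCH_FMODE_NOWAIT))
		return -EAGAIN;

	return 0;
}
```

The inline attempt never silently turns into a blocking one; the cost is an
extra async punt for files that don't advertise NOWAIT support.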

> I'm guessing especially 2. has something to do with why aio never took
> off - so maybe it's time to fix the underlying issues first.

It only really works for a subset of it, but we should ensure that it's
caught and always punted so we don't end up with io_uring_enter() blocking.
That should be the key goal. For regular file writes, it should be easy
enough to do. But it should end up being an optimization to what we have,
getting rid of an unnecessary async indirection, instead of having cases
where io_uring_enter() blocks.

> I'd be happy to contribute a few patches to those issues if there is an
> agreement what the result should look like :)

Pretty sure folks would be happy to see that :-)

> I have one other question: is there any way to cancel an IO read/write
> operation? I don't think closing io_uring has any effect, what about
> closing the files I'm reading/writing?  (Adding cancelation to kiocb
> sounds like a non-trivial task; and I don't think it already supports it.)

There is no way to do that. If you look at existing aio, nobody supports
that either. Hence io_uring doesn't export any sort of cancellation outside
of the poll case where we can handle it internally to io_uring.

If you look at storage, then generally IO doesn't wait around in the stack,
it's issued. Most hardware only supports queue-abort-like cancellation,
which isn't useful at all.

So I don't think that will ever happen.

> So cleanup in general seems hard to me: do I have to wait for all
> read/write operations to complete so I can safely free all buffers
> before I close the event loop?

The ring exit waits for IO to complete already.

-- 
Jens Axboe