linux-block.vger.kernel.org archive mirror
* io_uring: not good enough for release
@ 2019-04-23 19:06 Stefan Bühler
  2019-04-23 20:31 ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Bühler @ 2019-04-23 19:06 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Hi,

now that I've got some of my rust code running with io_uring I don't
think io_uring is ready.

If marking it as EXPERIMENTAL (and not "default y") is considered a
clear flag for "API might still change" I'd recommend going for that.

Here is my current issue list:

---

1. An error for a submission should be returned as completion for that
submission.  Please don't break my main event loop with strange error
codes just because a single operation is broken/not supported/...


2. {read,write}_iter and FMODE_NOWAIT / IOCB_NOWAIT is broken at the vfs
layer: vfs_{read,write} should set IOCB_NOWAIT if O_NONBLOCK is set when
they call {read,write}_iter (i.e. init_sync_kiocb/iocb_flags needs to
convert the flag).
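
Something like this for the conversion (untested sketch; the iocb_flags
body is quoted from memory, so details may differ - the O_NONBLOCK check
is the new part):

static inline int iocb_flags(struct file *file)
{
	int res = 0;

	if (file->f_flags & O_APPEND)
		res |= IOCB_APPEND;
	if (file->f_flags & O_NONBLOCK)	/* new: map O_NONBLOCK to IOCB_NOWAIT */
		res |= IOCB_NOWAIT;
	if (io_is_direct(file))
		res |= IOCB_DIRECT;
	if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
		res |= IOCB_DSYNC;
	if (file->f_flags & __O_SYNC)
		res |= IOCB_SYNC;
	return res;
}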

And all {read,write}_iter should check IOCB_NOWAIT instead of O_NONBLOCK
(hi there pipe.c!), and set FMODE_NOWAIT if they support IOCB_NOWAIT.

{read,write}_iter should only queue the IOCB though if is_sync_kiocb()
returns false (i.e. if ki_complete is set).


Because right now an IORING_OP_READV on a blocking pipe *blocks*
io_uring_enter, and on a non-blocking pipe completes with EAGAIN all the
time.

So io_uring (readv) doesn't even work on a pipe!  (At least
IORING_OP_POLL_ADD is working...)

As another side note: timerfd doesn't have read_iter, so needs
IORING_OP_POLL_ADD too... :(
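
Right now that means something like the following, in liburing terms
(just a sketch, assuming an already set up ring and timer_fd); the actual
read() of the expiration count then happens outside the ring once the cqe
arrives:

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	io_uring_prep_poll_add(sqe, timer_fd, POLLIN);
	sqe->user_data = 42;	/* arbitrary token to match the cqe later */
	io_uring_submit(&ring);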

(Also RWF_NOWAIT doesn't work in io_uring right now: IOCB_NOWAIT is
always removed in the workqueue context, and I don't see an early EAGAIN
completion).


3. io_file_supports_async should check for FMODE_NOWAIT instead of using
some hard-coded magic checks.
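
I.e. roughly this (sketch; just trust whatever the file advertises):

static bool io_file_supports_async(struct file *file)
{
	/* a file sets FMODE_NOWAIT iff its ->read_iter/->write_iter
	 * honour IOCB_NOWAIT */
	return file->f_mode & FMODE_NOWAIT;
}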


4. io_prep_rw shouldn't disable force_nonblock if FMODE_NOWAIT isn't
available; it should return EAGAIN instead and let the workqueue handle it.

Because otherwise the event loop is blocking again.

---

I'm guessing especially 2. has something to do with why aio never took
off - so maybe it's time to fix the underlying issues first.

I'd be happy to contribute a few patches for those issues if there is
agreement on what the result should look like :)


I have one other question: is there any way to cancel an IO read/write
operation? I don't think closing io_uring has any effect, what about
closing the files I'm reading/writing?  (Adding cancelation to kiocb
sounds like a non-trivial task; and I don't think it already supports it.)

So cleanup in general seems hard to me: do I have to wait for all
read/write operations to complete so I can safely free all buffers
before I close the event loop?


cheers,
Stefan


* Re: io_uring: not good enough for release
  2019-04-23 19:06 io_uring: not good enough for release Stefan Bühler
@ 2019-04-23 20:31 ` Jens Axboe
  2019-04-23 22:07   ` Jens Axboe
                     ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Jens Axboe @ 2019-04-23 20:31 UTC (permalink / raw)
  To: Stefan Bühler, linux-block, linux-fsdevel

On 4/23/19 1:06 PM, Stefan Bühler wrote:
> Hi,
> 
> now that I've got some of my rust code running with io_uring I don't
> think io_uring is ready.
> 
> If marking it as EXPERIMENTAL (and not "default y") is considered a
> clear flag for "API might still change" I'd recommend going for that.

That might be an option, but I don't think we need to do that. We've
still got at least a few weeks, and the only issue mentioned below that's
really a change that would warrant something like that is easily doable
now. All it needs is agreement.

> Here is my current issue list:
> 
> ---
> 
> 1. An error for a submission should be returned as completion for that
> submission.  Please don't break my main event loop with strange error
> codes just because a single operation is broken/not supported/...

So that's the case I was referring to above. We can just make that change,
there's absolutely no reason to have errors passed back through a different
channel.

> 2. {read,write}_iter and FMODE_NOWAIT / IOCB_NOWAIT is broken at the vfs
> layer: vfs_{read,write} should set IOCB_NOWAIT if O_NONBLOCK is set when
> they call {read,write}_iter (i.e. init_sync_kiocb/iocb_flags needs to
> convert the flag).
> 
> And all {read,write}_iter should check IOCB_NOWAIT instead of O_NONBLOCK
> (hi there pipe.c!), and set FMODE_NOWAIT if they support IOCB_NOWAIT.
> 
> {read,write}_iter should only queue the IOCB though if is_sync_kiocb()
> returns false (i.e. if ki_callback is set).

That's a trivial fix. I agree that it should be done.

> Because right now an IORING_OP_READV on a blocking pipe *blocks*
> io_uring_enter, and on a non-blocking pipe completes with EAGAIN all the
> time.
> 
> So io_uring (readv) doesn't even work on a pipe!  (At least
> IORING_OP_POLL_ADD is working...)

It works, but it blocks. That can be argued as broken, and I agree that
it is, but it's important to make the distinction!

> As another side note: timerfd doesn't have read_iter, so needs
> IORING_OP_POLL_ADD too... :(
> 
> (Also RWF_NOWAIT doesn't work in io_uring right now: IOCB_NOWAIT is
> always removed in the workqueue context, and I don't see an early EAGAIN
> completion).

That's a case I didn't consider, that you'd want to see EAGAIN after
it's been punted. Once punted, we're not going to return EAGAIN since
we can now block. Not sure how you'd want to handle that any better...

> 3. io_file_supports_async should check for FMODE_NOWAIT instead of using
> some hard-coded magic checks.

We probably just need to err on the side of caution there, and suffer
the extra async punts.

> 4. io_prep_rw shouldn't disable force_nonblock if FMODE_NOWAIT isn't
> available; it should return EAGAIN instead and let the workqueue handle it.

Agree

> I'm guessing especially 2. has something to do with why aio never took
> off - so maybe it's time to fix the underlying issues first.

Nonblocking submission only really works for a subset of file types, but
we should ensure the rest is caught and always punted so we don't end up
with io_uring_enter() blocking. That should be the key goal. For regular
file writes it should be easy enough to do. But it should end up being an
optimization over what we have, getting rid of an unnecessary async
indirection, instead of having cases where io_uring_enter() blocks.

> I'd be happy to contribute a few patches to those issues if there is an
> agreement what the result should look like :)

Pretty sure folks would be happy to see that :-)

> I have one other question: is there any way to cancel an IO read/write
> operation? I don't think closing io_uring has any effect, what about
> closing the files I'm reading/writing?  (Adding cancelation to kiocb
> sounds like a non-trivial task; and I don't think it already supports it.)

There is no way to do that. If you look at existing aio, nobody supports
that either. Hence io_uring doesn't export any sort of cancellation outside
of the poll case where we can handle it internally to io_uring.

If you look at storage, then generally IO doesn't wait around in the stack,
it's issued. Most hardware only supports queue-abort-style cancellation,
which isn't useful at all.

So I don't think that will ever happen.

> So cleanup in general seems hard to me: do I have to wait for all
> read/write operations to complete so I can safely free all buffers
> before I close the event loop?

The ring exit waits for IO to complete already.

-- 
Jens Axboe



* Re: io_uring: not good enough for release
  2019-04-23 20:31 ` Jens Axboe
@ 2019-04-23 22:07   ` Jens Axboe
  2019-04-24 16:09     ` Jens Axboe
  2019-04-27 15:50     ` io_uring: submission error handling Stefan Bühler
  2019-04-27 21:07   ` io_uring: closing / release Stefan Bühler
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 25+ messages in thread
From: Jens Axboe @ 2019-04-23 22:07 UTC (permalink / raw)
  To: Stefan Bühler, linux-block, linux-fsdevel

On 4/23/19 2:31 PM, Jens Axboe wrote:
>> 1. An error for a submission should be returned as completion for that
>> submission.  Please don't break my main event loop with strange error
>> codes just because a single operation is broken/not supported/...
> 
> So that's the case I was referring to above. We can just make that change,
> there's absolutely no reason to have errors passed back through a different
> channel.

Thinking about this a bit more, I think the current approach is the
best one. The issue is that only submission side events tied to an sqe
can return a cqe; the rest have to be returned through the system call
value. So I think it's cleaner to keep it as-is, honestly.

>> (Also RWF_NOWAIT doesn't work in io_uring right now: IOCB_NOWAIT is
>> always removed in the workqueue context, and I don't see an early EAGAIN
>> completion).
> 
> That's a case I didn't consider, that you'd want to see EAGAIN after
> it's been punted. Once punted, we're not going to return EAGAIN since
> we can now block. Not sure how you'd want to handle that any better...

I think I grok this one too now - what you're saying is that if the
caller has RWF_NOWAIT set, then the EAGAIN should be returned instead of
being punted to the workqueue? I totally agree with that, that's a bug.

-- 
Jens Axboe



* Re: io_uring: not good enough for release
  2019-04-23 22:07   ` Jens Axboe
@ 2019-04-24 16:09     ` Jens Axboe
  2019-04-27 16:05       ` io_uring: RWF_NOWAIT support Stefan Bühler
  2019-04-27 15:50     ` io_uring: submission error handling Stefan Bühler
  1 sibling, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2019-04-24 16:09 UTC (permalink / raw)
  To: Stefan Bühler, linux-block, linux-fsdevel

On 4/23/19 4:07 PM, Jens Axboe wrote:
>>> (Also RWF_NOWAIT doesn't work in io_uring right now: IOCB_NOWAIT is
>>> always removed in the workqueue context, and I don't see an early EAGAIN
>>> completion).
>>
>> That's a case I didn't consider, that you'd want to see EAGAIN after
>> it's been punted. Once punted, we're not going to return EAGAIN since
>> we can now block. Not sure how you'd want to handle that any better...
> 
> I think I grok this one too now - what you're saying is that if the
> caller has RWF_NOWAIT set, then the EAGAIN should be returned instead of
> being punted to the workqueue? I totally agree with that, that's a bug.

This should do it for the EAGAIN part: if the user has set RWF_NOWAIT
in the sqe, we don't do the automatic punt to the workqueue. We just
return the EAGAIN instead.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 58ec6e449fd8..6c0d49c3736b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -822,7 +822,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
 	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
 	if (unlikely(ret))
 		return ret;
-	if (force_nonblock) {
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_NOWAIT)) {
 		kiocb->ki_flags |= IOCB_NOWAIT;
 		req->flags |= REQ_F_FORCE_NONBLOCK;
 	}
@@ -1828,7 +1828,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
 	}
 
 	ret = __io_submit_sqe(ctx, req, s, true);
-	if (ret == -EAGAIN) {
+	if (ret == -EAGAIN && (req->flags & REQ_F_FORCE_NONBLOCK)) {
 		struct io_uring_sqe *sqe_copy;
 
 		sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL);

-- 
Jens Axboe



* Re: io_uring: submission error handling
  2019-04-23 22:07   ` Jens Axboe
  2019-04-24 16:09     ` Jens Axboe
@ 2019-04-27 15:50     ` Stefan Bühler
  2019-04-30 16:02       ` Jens Axboe
  1 sibling, 1 reply; 25+ messages in thread
From: Stefan Bühler @ 2019-04-27 15:50 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Hi,

On 24.04.19 00:07, Jens Axboe wrote:
> On 4/23/19 2:31 PM, Jens Axboe wrote:
>>> 1. An error for a submission should be returned as completion for that
>>> submission.  Please don't break my main event loop with strange error
>>> codes just because a single operation is broken/not supported/...
>>
>> So that's the case I was referring to above. We can just make that change,
>> there's absolutely no reason to have errors passed back through a different
>> channel.
> 
> Thinking about this a bit more, and I think the current approach is the
> best one. The issue is that only submission side events tied to an sqe
> can return an cqe, the rest have to be returned through the system call
> value. So I think it's cleaner to keep it as-is, honestly.

Not sure we're talking about the same thing.

I'm talking about the errors returned by io_submit_sqe: io_submit_sqes
(called by the SQ thread) calls io_cqring_add_event if there was an
error, but io_ring_submit (called by io_uring_enter) doesn't: instead,
if there were successfully submitted entries before, it will just return
those (and "undo" the current SQE), otherwise it will return the error,
which will then be returned by io_uring_enter.

But if I get an error from io_uring_enter I have no idea whether it was
some generic error (say EINVAL for broken flags or EBADF for a
non-io-uring filedescriptor) or an error related to a single submission.

I think io_ring_submit should call io_cqring_add_event on errors too
(like io_submit_sqes), and not stop handling submissions (and never
return an error).

Maybe io_cqring_add_event could then even be moved to io_submit_sqe and
just return whether the job is already done or not (io_submit_sqes
returns the new "inflight" jobs, and io_ring_submit the total number of
submitted jobs).

cheers,
Stefan


* Re: io_uring: RWF_NOWAIT support
  2019-04-24 16:09     ` Jens Axboe
@ 2019-04-27 16:05       ` Stefan Bühler
  2019-04-27 18:34         ` [PATCH v1 1/1] [io_uring] fix handling SQEs requesting NOWAIT Stefan Bühler
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Bühler @ 2019-04-27 16:05 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Hi,

On 24.04.19 18:09, Jens Axboe wrote:
> On 4/23/19 4:07 PM, Jens Axboe wrote:
>>>> (Also RWF_NOWAIT doesn't work in io_uring right now: IOCB_NOWAIT is
>>>> always removed in the workqueue context, and I don't see an early EAGAIN
>>>> completion).
>>>
>>> That's a case I didn't consider, that you'd want to see EAGAIN after
>>> it's been punted. Once punted, we're not going to return EAGAIN since
>>> we can now block. Not sure how you'd want to handle that any better...
>>
>> I think I grok this one too now - what you're saying is that if the
>> caller has RWF_NOWAIT set, then the EAGAIN should be returned instead of
>> being punted to the workqueue? I totally agree with that, that's a bug.
> 
> This should do it for the EAGAIN part, if the user has set RWF_NOWAIT
> in the sqe, then we don't do the automatic punt to workqueue. We just
> return the EAGAIN instead.
> 
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 58ec6e449fd8..6c0d49c3736b 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -822,7 +822,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
>  	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
>  	if (unlikely(ret))
>  		return ret;
> -	if (force_nonblock) {
> +	if (force_nonblock && !(kiocb->ki_flags & IOCB_NOWAIT)) {
>  		kiocb->ki_flags |= IOCB_NOWAIT;
>  		req->flags |= REQ_F_FORCE_NONBLOCK;
>  	}
> @@ -1828,7 +1828,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
>  	}
>  
>  	ret = __io_submit_sqe(ctx, req, s, true);
> -	if (ret == -EAGAIN) {
> +	if (ret == -EAGAIN && (req->flags & REQ_F_FORCE_NONBLOCK)) {
>  		struct io_uring_sqe *sqe_copy;
>  
>  		sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL);
> 

I think this breaks other request types.

Only io_prep_rw ever sets REQ_F_FORCE_NONBLOCK, but e.g. IORING_OP_FSYNC
always wants to punt.

Given that REQ_F_FORCE_NONBLOCK wasn't actually used before (never read
afaict), maybe remove it and insert a new flag REQ_F_NOWAIT that will
prevent the punt (inverted semantics!), e.g:

> -	if (ret == -EAGAIN) {
> +	if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {

And set this flag in io_prep_rw if RWF_NOWAIT is set.

cheers,
Stefan


* [PATCH v1 1/1] [io_uring] fix handling SQEs requesting NOWAIT
  2019-04-27 16:05       ` io_uring: RWF_NOWAIT support Stefan Bühler
@ 2019-04-27 18:34         ` Stefan Bühler
  2019-04-30 15:40           ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Bühler @ 2019-04-27 18:34 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Not all request types set REQ_F_FORCE_NONBLOCK when they needed async
punting; reverse logic instead and set REQ_F_NOWAIT if request mustn't
be punted.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 25632e399a78..77b247b5d10b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -322,7 +322,7 @@ struct io_kiocb {
 	struct list_head	list;
 	unsigned int		flags;
 	refcount_t		refs;
-#define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
+#define REQ_F_NOWAIT		1	/* must not punt to workers */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
 #define REQ_F_FIXED_FILE	4	/* ctx owns file */
 #define REQ_F_SEQ_PREV		8	/* sequential with previous */
@@ -872,11 +872,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
 	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
 	if (unlikely(ret))
 		return ret;
-	/* only force async punt if the sqe didn't ask for NOWAIT */
-	if (force_nonblock && !(kiocb->ki_flags & IOCB_NOWAIT)) {
+
+	/* don't allow async punt if RWF_NOWAIT was requested */
+	if (kiocb->ki_flags & IOCB_NOWAIT)
+		req->flags |= REQ_F_NOWAIT;
+
+	if (force_nonblock)
 		kiocb->ki_flags |= IOCB_NOWAIT;
-		req->flags |= REQ_F_FORCE_NONBLOCK;
-	}
+
 	if (ctx->flags & IORING_SETUP_IOPOLL) {
 		if (!(kiocb->ki_flags & IOCB_DIRECT) ||
 		    !kiocb->ki_filp->f_op->iopoll)
@@ -1535,8 +1538,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		struct sqe_submit *s = &req->submit;
 		const struct io_uring_sqe *sqe = s->sqe;
 
-		/* Ensure we clear previously set forced non-block flag */
-		req->flags &= ~REQ_F_FORCE_NONBLOCK;
+		/* Ensure we clear previously set non-block flag */
 		req->rw.ki_flags &= ~IOCB_NOWAIT;
 
 		ret = 0;
@@ -1722,7 +1724,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
 		goto out;
 
 	ret = __io_submit_sqe(ctx, req, s, true);
-	if (ret == -EAGAIN && (req->flags & REQ_F_FORCE_NONBLOCK)) {
+	if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
 		struct io_uring_sqe *sqe_copy;
 
 		sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL);
-- 
2.20.1



* Re: io_uring: closing / release
  2019-04-23 20:31 ` Jens Axboe
  2019-04-23 22:07   ` Jens Axboe
@ 2019-04-27 21:07   ` Stefan Bühler
  2019-05-11 16:26     ` Stefan Bühler
  2019-04-28 15:54   ` io_uring: O_NONBLOCK/IOCB_NOWAIT/RWF_NOWAIT mess Stefan Bühler
  2019-05-03  9:47   ` [PATCH 1/2] io_uring: restructure io_{read,write} control flow Stefan Bühler
  3 siblings, 1 reply; 25+ messages in thread
From: Stefan Bühler @ 2019-04-27 21:07 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Hi,

On 23.04.19 22:31, Jens Axboe wrote:
> On 4/23/19 1:06 PM, Stefan Bühler wrote:
>> I have one other question: is there any way to cancel an IO read/write
>> operation? I don't think closing io_uring has any effect, what about
>> closing the files I'm reading/writing?  (Adding cancelation to kiocb
>> sounds like a non-trivial task; and I don't think it already supports it.)
> 
> There is no way to do that. If you look at existing aio, nobody supports
> that either. Hence io_uring doesn't export any sort of cancellation outside
> of the poll case where we can handle it internally to io_uring.
> 
> If you look at storage, then generally IO doesn't wait around in the stack,
> it's issued. Most hardware only supports queue abort like cancellation,
> which isn't useful at all.
> 
> So I don't think that will ever happen.
> 
>> So cleanup in general seems hard to me: do I have to wait for all
>> read/write operations to complete so I can safely free all buffers
>> before I close the event loop?
> 
> The ring exit waits for IO to complete already.

I now understand at least how that part is working;
io_ring_ctx_wait_and_kill calls wait_for_completion(&ctx->ctx_done),
which only completes after all references are gone; each pending job
keeps a reference.

But wait_for_completion is not interruptible; so if there are "stuck"
jobs even root can't kill the task (afaict) anymore.

Once e.g. readv is working on pipes/sockets (I have some local patches
here for that), you can easily end up in a situation where a
socketpair() or a pipe() is still alive, but the read will never finish
(I can trigger this problem with an explicit close(uring_fd), not sure
how to observe this on process exit).

For a socketpair() even both ends could be kept alive by never-ending
read jobs.

Using wait_for_completion seems like a really bad idea to me; this is a
problem for io_uring_register too.

cheers,
Stefan


* Re: io_uring: O_NONBLOCK/IOCB_NOWAIT/RWF_NOWAIT mess
  2019-04-23 20:31 ` Jens Axboe
  2019-04-23 22:07   ` Jens Axboe
  2019-04-27 21:07   ` io_uring: closing / release Stefan Bühler
@ 2019-04-28 15:54   ` Stefan Bühler
  2019-05-11 16:34     ` Stefan Bühler
  2019-05-03  9:47   ` [PATCH 1/2] io_uring: restructure io_{read,write} control flow Stefan Bühler
  3 siblings, 1 reply; 25+ messages in thread
From: Stefan Bühler @ 2019-04-28 15:54 UTC (permalink / raw)
  To: Jens Axboe, Alexander Viro, linux-block, linux-fsdevel

Hi,

On 23.04.19 22:31, Jens Axboe wrote:
> On 4/23/19 1:06 PM, Stefan Bühler wrote:
>> 2. {read,write}_iter and FMODE_NOWAIT / IOCB_NOWAIT is broken at the vfs
>> layer: vfs_{read,write} should set IOCB_NOWAIT if O_NONBLOCK is set when
>> they call {read,write}_iter (i.e. init_sync_kiocb/iocb_flags needs to
>> convert the flag).
>>
>> And all {read,write}_iter should check IOCB_NOWAIT instead of O_NONBLOCK
>> (hi there pipe.c!), and set FMODE_NOWAIT if they support IOCB_NOWAIT.
>>
>> {read,write}_iter should only queue the IOCB though if is_sync_kiocb()
>> returns false (i.e. if ki_callback is set).
> 
> That's a trivial fix. I agree that it should be done.

Doesn't look trivial to me.

Various functions take rwf_t flags, e.g. do_readv, which is called with
0 from readv and with flags from userspace in preadv2.

Now is preadv2() supposed to be non-blocking if the file has O_NONBLOCK,
or only if RWF_NOWAIT was passed?

Other places seem (at least to me) to explicitly mean "please block" if
they don't pass RWF_NOWAIT, e.g. ovl_read_iter from fs/overlayfs, which
uses ovl_iocb_to_rwf to convert iocb flags back to rwf.

Imho the clean way is to ignore O_NONBLOCK when there are rwf_t flags;
e.g. kiocb_set_rw_flags should unset IOCB_NOWAIT if RWF_NOWAIT was not set.
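
Roughly like this (sketch only, the other RWF_* bits left out; assumes
iocb_flags mapped O_NONBLOCK to IOCB_NOWAIT beforehand, as suggested
above):

static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
{
	if (unlikely(flags & ~RWF_SUPPORTED))
		return -EOPNOTSUPP;

	if (flags & RWF_NOWAIT) {
		if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
			return -EOPNOTSUPP;
		ki->ki_flags |= IOCB_NOWAIT;
	} else {
		/* new: no RWF_NOWAIT, so don't inherit IOCB_NOWAIT from O_NONBLOCK */
		ki->ki_flags &= ~IOCB_NOWAIT;
	}
	/* ... RWF_HIPRI/RWF_DSYNC/RWF_SYNC handling unchanged ... */
	return 0;
}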

But then various functions (like readv) will need to create rwf_t
"default" flags from a file (similar to iocb_flags) instead of using 0.
And ovl_iocb_to_rwf should probably be used in more places as well.

There is also generic_file_splice_read, which should use
SPLICE_F_NONBLOCK to trigger IOCB_NOWAIT; again it is unclear whether
O_NONBLOCK should trigger IOCB_NOWAIT too (do_sendfile explicitly does
NOT with a "need to debate" comment).


I don't think I'm the right person to do this - I think it requires a
deeper understanding of all the code involved.

I do have patches for pipe.c and socket.c to ignore O_NONBLOCK, use
IOCB_NOWAIT and set FMODE_NOWAIT after the fs part is ready.

cheers,
Stefan


* Re: [PATCH v1 1/1] [io_uring] fix handling SQEs requesting NOWAIT
  2019-04-27 18:34         ` [PATCH v1 1/1] [io_uring] fix handling SQEs requesting NOWAIT Stefan Bühler
@ 2019-04-30 15:40           ` Jens Axboe
  0 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2019-04-30 15:40 UTC (permalink / raw)
  To: Stefan Bühler, linux-block, linux-fsdevel

On 4/27/19 12:34 PM, Stefan Bühler wrote:
> Not all request types set REQ_F_FORCE_NONBLOCK when they needed async
> punting; reverse logic instead and set REQ_F_NOWAIT if request mustn't
> be punted.

I like doing it this way, so we don't have to touch the other callers.
I've merged this one with my patch, no need to have two separate fixes
for it.

-- 
Jens Axboe



* Re: io_uring: submission error handling
  2019-04-27 15:50     ` io_uring: submission error handling Stefan Bühler
@ 2019-04-30 16:02       ` Jens Axboe
  2019-04-30 16:15         ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2019-04-30 16:02 UTC (permalink / raw)
  To: Stefan Bühler, linux-block, linux-fsdevel

On 4/27/19 9:50 AM, Stefan Bühler wrote:
> Hi,
> 
> On 24.04.19 00:07, Jens Axboe wrote:
>> On 4/23/19 2:31 PM, Jens Axboe wrote:
>>>> 1. An error for a submission should be returned as completion for that
>>>> submission.  Please don't break my main event loop with strange error
>>>> codes just because a single operation is broken/not supported/...
>>>
>>> So that's the case I was referring to above. We can just make that change,
>>> there's absolutely no reason to have errors passed back through a different
>>> channel.
>>
>> Thinking about this a bit more, and I think the current approach is the
>> best one. The issue is that only submission side events tied to an sqe
>> can return an cqe, the rest have to be returned through the system call
>> value. So I think it's cleaner to keep it as-is, honestly.
> 
> Not sure we're talking about the same.
> 
> I'm talking about the errors returned by io_submit_sqe: io_submit_sqes
> (called by the SQ thread) calls io_cqring_add_event if there was an
> error, but io_ring_submit (called by io_uring_enter) doesn't: instead,
> if there were successfully submitted entries before, it will just return
> those (and "undo" the current SQE), otherwise it will return the error,
> which will then be returned by io_uring_enter.
> 
> But if I get an error from io_uring_enter I have no idea whether it was
> some generic error (say EINVAL for broken flags or EBADF for a
> non-io-uring filedescriptor) or an error related to a single submission.
> 
> I think io_ring_submit should call io_cqring_add_event on errors too
> (like io_submit_sqes), and not stop handling submissions (and never
> return an error).
> 
> Maybe io_cqring_add_event could then even be moved to io_submit_sqe and
> just return whether the job is already done or not (io_submit_sqes
> returns the new "inflight" jobs, and io_ring_submit the total number of
> submitted jobs).

I think we are talking about the same thing, actually. See below patch.
This changes it so that any error that occurs on behalf of a specific
sqe WILL trigger a completion event, instead of returning it through
io_uring_enter(). io_uring_enter() can still return -ERROR for errors
that aren't specific to an sqe.

I think this is what you had in mind?

Totally untested, will do so now.

-- 
Jens Axboe



* Re: io_uring: submission error handling
  2019-04-30 16:02       ` Jens Axboe
@ 2019-04-30 16:15         ` Jens Axboe
  2019-04-30 18:15           ` Stefan Bühler
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2019-04-30 16:15 UTC (permalink / raw)
  To: Stefan Bühler, linux-block, linux-fsdevel

On 4/30/19 10:02 AM, Jens Axboe wrote:
> On 4/27/19 9:50 AM, Stefan Bühler wrote:
>> Hi,
>>
>> On 24.04.19 00:07, Jens Axboe wrote:
>>> On 4/23/19 2:31 PM, Jens Axboe wrote:
>>>>> 1. An error for a submission should be returned as completion for that
>>>>> submission.  Please don't break my main event loop with strange error
>>>>> codes just because a single operation is broken/not supported/...
>>>>
>>>> So that's the case I was referring to above. We can just make that change,
>>>> there's absolutely no reason to have errors passed back through a different
>>>> channel.
>>>
>>> Thinking about this a bit more, and I think the current approach is the
>>> best one. The issue is that only submission side events tied to an sqe
>>> can return an cqe, the rest have to be returned through the system call
>>> value. So I think it's cleaner to keep it as-is, honestly.
>>
>> Not sure we're talking about the same.
>>
>> I'm talking about the errors returned by io_submit_sqe: io_submit_sqes
>> (called by the SQ thread) calls io_cqring_add_event if there was an
>> error, but io_ring_submit (called by io_uring_enter) doesn't: instead,
>> if there were successfully submitted entries before, it will just return
>> those (and "undo" the current SQE), otherwise it will return the error,
>> which will then be returned by io_uring_enter.
>>
>> But if I get an error from io_uring_enter I have no idea whether it was
>> some generic error (say EINVAL for broken flags or EBADF for a
>> non-io-uring filedescriptor) or an error related to a single submission.
>>
>> I think io_ring_submit should call io_cqring_add_event on errors too
>> (like io_submit_sqes), and not stop handling submissions (and never
>> return an error).
>>
>> Maybe io_cqring_add_event could then even be moved to io_submit_sqe and
>> just return whether the job is already done or not (io_submit_sqes
>> returns the new "inflight" jobs, and io_ring_submit the total number of
>> submitted jobs).
> 
> I think we are talking about the same thing, actually. See below patch.
> This changes it so that any error that occurs on behalf of a specific
> sqe WILL trigger a completion event, instead of returning it through
> io_uring_enter(). io_uring_enter() can still return -ERROR for errors
> that aren't specific to an sqe.
> 
> I think this is what you had in mind?
> 
> Totally untested, will do so now.

Seems to work for me, just needed to adjust the -EAGAIN test case in
liburing.

I forgot to mention, but this will still stall the submission sequence.
Before, if you wanted to queue 8 sqes and we had an error on the 5th, we'd
return ret == 4 and the application would have to look at the sqring to
figure out what is wrong with the head entry. Now we'd return 5, and
have a cqe posted for the 5th entry that had an error. The app can then
decide if it needs to do anything about this. If it doesn't, it just
needs to call io_uring_enter() again to submit the remaining 3 entries.
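
In app pseudo-code the submit loop then just becomes something like this
(sketch only, using the raw syscall for illustration; assumes
__NR_io_uring_enter from current headers):

	unsigned left = 8;

	while (left) {
		int ret = syscall(__NR_io_uring_enter, ring_fd, left, 0, 0, NULL, 0);

		if (ret <= 0)
			break;		/* ret < 0: error not tied to any sqe */
		left -= ret;
		/* reap cqes; any cqe with res < 0 is a per-sqe failure */
	}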

I do like this change. Any error on an sqe will result in a cqe being
posted, instead of having the submission be slightly different and have
the cqe be dependent on where the error occurred.

-- 
Jens Axboe



* Re: io_uring: submission error handling
  2019-04-30 16:15         ` Jens Axboe
@ 2019-04-30 18:15           ` Stefan Bühler
  2019-04-30 18:42             ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Bühler @ 2019-04-30 18:15 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Hi,

On 30.04.19 18:15, Jens Axboe wrote:
> On 4/30/19 10:02 AM, Jens Axboe wrote:
>> On 4/27/19 9:50 AM, Stefan Bühler wrote:
>>> Hi,
>>>
>>> On 24.04.19 00:07, Jens Axboe wrote:
>>>> On 4/23/19 2:31 PM, Jens Axboe wrote:
>>>>>> 1. An error for a submission should be returned as completion for that
>>>>>> submission.  Please don't break my main event loop with strange error
>>>>>> codes just because a single operation is broken/not supported/...
>>>>>
>>>>> So that's the case I was referring to above. We can just make that change,
>>>>> there's absolutely no reason to have errors passed back through a different
>>>>> channel.
>>>>
>>>> Thinking about this a bit more, and I think the current approach is the
>>>> best one. The issue is that only submission side events tied to an sqe
>>>> can return an cqe, the rest have to be returned through the system call
>>>> value. So I think it's cleaner to keep it as-is, honestly.
>>>
>>> Not sure we're talking about the same.
>>>
>>> I'm talking about the errors returned by io_submit_sqe: io_submit_sqes
>>> (called by the SQ thread) calls io_cqring_add_event if there was an
>>> error, but io_ring_submit (called by io_uring_enter) doesn't: instead,
>>> if there were successfully submitted entries before, it will just return
>>> those (and "undo" the current SQE), otherwise it will return the error,
>>> which will then be returned by io_uring_enter.
>>>
>>> But if I get an error from io_uring_enter I have no idea whether it was
>>> some generic error (say EINVAL for broken flags or EBADF for a
>>> non-io-uring filedescriptor) or an error related to a single submission.
>>>
>>> I think io_ring_submit should call io_cqring_add_event on errors too
>>> (like io_submit_sqes), and not stop handling submissions (and never
>>> return an error).
>>>
>>> Maybe io_cqring_add_event could then even be moved to io_submit_sqe and
>>> just return whether the job is already done or not (io_submit_sqes
>>> returns the new "inflight" jobs, and io_ring_submit the total number of
>>> submitted jobs).
>>
>> I think we are talking about the same thing, actually. See below patch.
>> This changes it so that any error that occurs on behalf of a specific
>> sqe WILL trigger a completion event, instead of returning it through
>> io_uring_enter(). io_uring_enter() can still return -ERROR for errors
>> that aren't specific to an sqe.
>>
>> I think this is what you had in mind?
>>
>> Totally untested, will do so now.
> 
> Seems to work for me, just needed to adjust the -EAGAIN test case in
> liburing.
> 
> I forgot to mention, but this will still stall the submission sequence.
> Before, if you wanted to queue 8 sqes and we had an error on the 5th, we'd
> return ret == 4 and the application would have to look at the sqring to
> figure out what is wrong with the head entry. Now we'd return 5, and
> have a cqe posted for the 5th entry that had an error. The app can then
> decide if it needs to do anything about this. If it doesn't, it just
> needs to call io_uring_enter() again to submit the remaining 3 entries.
> 
> I do like this change. Any error on an sqe will result in a cqe being
> posted, instead of having the submission be slightly different and have
> the cqe be dependent on where the error occurred.

I think you forgot to attach the patch - I can't find it :)

Without seeing the patch I'd like to point out that the stalling is buggy
right now: if you stall, you must not wait for an event (unless you
decrease min_complete by the number of sqes you didn't submit; the
current min(...) is wrong), because the event that might wake up the loop
may not have been submitted in the first place.

The patch I was working on locally didn't stall so it also indirectly
fixed that bug; the challenge with stalling will be how to detect it:
"submitted < to_submit" does not indicate a stall if we accept that
applications are allowed to pass "to_submit" values larger than what is
actually in the queue.

While I agree that for special applications stalling on errors might be
an interesting feature, for more "general purpose" libraries stalling is
probably not useful (also note that the SQ thread won't stall either).

Maybe other solutions can be found: e.g. mark submissions as depending
on other submissions (including a barrier), and fail if the previous one
fails.

One last thing about stalling: the ring documentation so far describes
pending SQ elements as "owned by the kernel"; allowing the application to
update them afterwards might make the documentation more complex (I don't
see a technical problem: without an SQ thread the kernel never accesses
the submission queue outside io_uring_enter; but the application mustn't
modify the SQ head).

cheers,
Stefan


* Re: io_uring: submission error handling
  2019-04-30 18:15           ` Stefan Bühler
@ 2019-04-30 18:42             ` Jens Axboe
  2019-05-01 11:49               ` [PATCH v1 1/1] [io_uring] don't stall on submission errors Stefan Bühler
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2019-04-30 18:42 UTC (permalink / raw)
  To: Stefan Bühler, linux-block, linux-fsdevel

On 4/30/19 12:15 PM, Stefan Bühler wrote:
> Hi,
> 
> On 30.04.19 18:15, Jens Axboe wrote:
>> On 4/30/19 10:02 AM, Jens Axboe wrote:
>>> On 4/27/19 9:50 AM, Stefan Bühler wrote:
>>>> Hi,
>>>>
>>>> On 24.04.19 00:07, Jens Axboe wrote:
>>>>> On 4/23/19 2:31 PM, Jens Axboe wrote:
>>>>>>> 1. An error for a submission should be returned as completion for that
>>>>>>> submission.  Please don't break my main event loop with strange error
>>>>>>> codes just because a single operation is broken/not supported/...
>>>>>>
>>>>>> So that's the case I was referring to above. We can just make that change,
>>>>>> there's absolutely no reason to have errors passed back through a different
>>>>>> channel.
>>>>>
>>>>> Thinking about this a bit more, and I think the current approach is the
>>>>> best one. The issue is that only submission side events tied to an sqe
>>>>> can return an cqe, the rest have to be returned through the system call
>>>>> value. So I think it's cleaner to keep it as-is, honestly.
>>>>
>>>> Not sure we're talking about the same.
>>>>
>>>> I'm talking about the errors returned by io_submit_sqe: io_submit_sqes
>>>> (called by the SQ thread) calls io_cqring_add_event if there was an
>>>> error, but io_ring_submit (called by io_uring_enter) doesn't: instead,
>>>> if there were successfully submitted entries before, it will just return
>>>> those (and "undo" the current SQE), otherwise it will return the error,
>>>> which will then be returned by io_uring_enter.
>>>>
>>>> But if I get an error from io_uring_enter I have no idea whether it was
>>>> some generic error (say EINVAL for broken flags or EBADF for a
>>>> non-io-uring filedescriptor) or an error related to a single submission.
>>>>
>>>> I think io_ring_submit should call io_cqring_add_event on errors too
>>>> (like io_submit_sqes), and not stop handling submissions (and never
>>>> return an error).
>>>>
>>>> Maybe io_cqring_add_event could then even be moved to io_submit_sqe and
>>>> just return whether the job is already done or not (io_submit_sqes
>>>> returns the new "inflight" jobs, and io_ring_submit the total number of
>>>> submitted jobs).
>>>
>>> I think we are talking about the same thing, actually. See below patch.
>>> This changes it so that any error that occurs on behalf of a specific
>>> sqe WILL trigger a completion event, instead of returning it through
>>> io_uring_enter(). io_uring_enter() can still return -ERROR for errors
>>> that aren't specific to an sqe.
>>>
>>> I think this is what you had in mind?
>>>
>>> Totally untested, will do so now.
>>
>> Seems to work for me, just needed to adjust the -EAGAIN test case in
>> liburing.
>>
>> I forgot to mention, but this will still stall the submission sequence.
>> Before, if you wanted to queue 8 sqes and we had an error on the 5th, we'd
>> return ret == 4 and the application would have to look at the sqring to
>> figure out what is wrong with the head entry. Now we'd return 5, and
>> have a cqe posted for the 5th entry that had an error. The app can then
>> decide if it needs to do anything about this. If it doesn't, it just
>> needs to call io_uring_enter() again to submit the remaining 3 entries.
>>
>> I do like this change. Any error on an sqe will result in a cqe being
>> posted, instead of having the submission be slightly different and have
>> the cqe be dependent on where the error occurred.
> 
> I think you forgot to attach the patch - I can't find it :)

Oops, here it is...

> Without seeing the patch I'd like to point out the stalling is buggy
> right now: if you stall you must not wait for an event (unless you
> decrease min_complete by the number of events you didn't submit,
> min(...) is wrong), as the event that possibly might wake up the loop
> might not have been submitted in the first place.
> 
> The patch I was working on locally didn't stall so it also indirectly
> fixed that bug; the challenge with stalling will be how to detect it:
> "submitted < to_submit" does not indicate a stall if we accept that
> applications are allowed to pass "to_submit" values larger than what is
> actually in the queue.
> 
> While I agree that for special applications stalling on errors might be
> an interesting feature, for more "general purpose" libraries stalling is
> probably not useful (also note that the SQ thread won't stall either).

I think we can get away with not stalling now, with the intent to add
the barrier type feature/flag for the queue. Might be safer to just
never wait if we do end up posting an event and quitting out earlier.

> Maybe other solutions can be found: e.g. mark submissions as depending
> on other submissions (including a barrier), and fail if the previous one
> fails.
> 
> One last thing about stalling: the ring documentation so far describes
> pending SQ elements as "owned by kernel"; allowing the application to
> update it afterwards might make the documentation more complex (I don't
> see a technical problem: without SQ thread the kernel never accesses the
> submission queue outside io_uring_enter; but the application mustn't
> modify SQ head).

If we post a cqe for it on error, then the event is consumed and gone.
A new entry would have to get added instead, which is different than
allowing the application to fiddle with the stalled one and resubmit
again from there.


commit fd25de20d38dd07e2ce7108986f1fd08f9bf03d4
Author: Jens Axboe <axboe@kernel.dk>
Date:   Tue Apr 30 10:16:07 2019 -0600

    io_uring: have submission side sqe errors post a cqe
    
    Currently we only post a cqe if we get an error OUTSIDE of submission.
    For submission, we return the error directly through io_uring_enter().
    This is a bit awkward for applications, and it makes more sense to
    always post a cqe with an error, if the error happens on behalf of an
    sqe.
    
    This changes submission behavior a bit. io_uring_enter() returns -ERROR
    for an error, and > 0 for number of sqes submitted. Before this change,
    if you wanted to submit 8 entries and had an error on the 5th entry,
    io_uring_enter() would return 4 (for number of entries successfully
    submitted) and rewind the sqring. The application would then have to
    peek at the sqring and figure out what was wrong with the head sqe, and
    then skip it itself. With this change, we'll return 5 since we did
    consume 5 sqes, and the last sqe (with the error) will result in a cqe
    being posted with the error.
    
    This makes the logic easier to handle in the application, and it cleans
    up the submission part.
    
    Suggested-by: Stefan Bühler <source@stbuehler.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 77b247b5d10b..376ae44f0748 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1801,14 +1801,6 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
 	}
 }
 
-/*
- * Undo last io_get_sqring()
- */
-static void io_drop_sqring(struct io_ring_ctx *ctx)
-{
-	ctx->cached_sq_head--;
-}
-
 /*
  * Fetch an sqe, if one is available. Note that s->sqe will point to memory
  * that is mapped by userspace. This means that care needs to be taken to
@@ -2018,7 +2010,7 @@ static int io_sq_thread(void *data)
 static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 {
 	struct io_submit_state state, *statep = NULL;
-	int i, ret = 0, submit = 0;
+	int i, submit = 0;
 
 	if (to_submit > IO_PLUG_THRESHOLD) {
 		io_submit_state_start(&state, ctx, to_submit);
@@ -2027,6 +2019,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 
 	for (i = 0; i < to_submit; i++) {
 		struct sqe_submit s;
+		int ret;
 
 		if (!io_get_sqring(ctx, &s))
 			break;
@@ -2034,21 +2027,20 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 		s.has_user = true;
 		s.needs_lock = false;
 		s.needs_fixed_file = false;
+		submit++;
 
 		ret = io_submit_sqe(ctx, &s, statep);
 		if (ret) {
-			io_drop_sqring(ctx);
+			io_cqring_add_event(ctx, s.sqe->user_data, ret, 0);
 			break;
 		}
-
-		submit++;
 	}
 	io_commit_sqring(ctx);
 
 	if (statep)
 		io_submit_state_end(statep);
 
-	return submit ? submit : ret;
+	return submit;
 }
 
 static unsigned io_cqring_events(struct io_cq_ring *ring)
@@ -2779,9 +2771,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		mutex_lock(&ctx->uring_lock);
 		submitted = io_ring_submit(ctx, to_submit);
 		mutex_unlock(&ctx->uring_lock);
-
-		if (submitted < 0)
-			goto out_ctx;
 	}
 	if (flags & IORING_ENTER_GETEVENTS) {
 		unsigned nr_events = 0;

-- 
Jens Axboe



* [PATCH v1 1/1] [io_uring] don't stall on submission errors
  2019-04-30 18:42             ` Jens Axboe
@ 2019-05-01 11:49               ` Stefan Bühler
  2019-05-01 12:43                 ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Bühler @ 2019-05-01 11:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Stalling is buggy right now, because it might wait for too many events;
for proper stalling we'd need to know how many submissions we ignored,
and reduce min_complete by that amount.

Easier not to stall at all.

Also fix some local variable names.

Should probably be merged with commit
fd25de20d38dd07e2ce7108986f1fd08f9bf03d4:
    io_uring: have submission side sqe errors post a cqe

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 22 ++++++----------------
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e614a675a1e0..17eae94a54fc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1846,7 +1846,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
 			  unsigned int nr, bool has_user, bool mm_fault)
 {
 	struct io_submit_state state, *statep = NULL;
-	int ret, i, submitted = 0;
+	int ret, i, inflight = 0;
 
 	if (nr > IO_PLUG_THRESHOLD) {
 		io_submit_state_start(&state, ctx, nr);
@@ -1863,7 +1863,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
 			ret = io_submit_sqe(ctx, &sqes[i], statep);
 		}
 		if (!ret) {
-			submitted++;
+			inflight++;
 			continue;
 		}
 
@@ -1873,7 +1873,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
 	if (statep)
 		io_submit_state_end(&state);
 
-	return submitted;
+	return inflight;
 }
 
 static int io_sq_thread(void *data)
@@ -2011,7 +2011,7 @@ static int io_sq_thread(void *data)
 static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 {
 	struct io_submit_state state, *statep = NULL;
-	int i, submit = 0;
+	int i, submitted = 0;
 
 	if (to_submit > IO_PLUG_THRESHOLD) {
 		io_submit_state_start(&state, ctx, to_submit);
@@ -2028,12 +2028,11 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 		s.has_user = true;
 		s.needs_lock = false;
 		s.needs_fixed_file = false;
-		submit++;
+		submitted++;
 
 		ret = io_submit_sqe(ctx, &s, statep);
 		if (ret) {
 			io_cqring_add_event(ctx, s.sqe->user_data, ret, 0);
-			break;
 		}
 	}
 	io_commit_sqring(ctx);
@@ -2041,7 +2040,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 	if (statep)
 		io_submit_state_end(statep);
 
-	return submit;
+	return submitted;
 }
 
 static unsigned io_cqring_events(struct io_cq_ring *ring)
@@ -2779,15 +2778,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 
 		min_complete = min(min_complete, ctx->cq_entries);
 
-		/*
-		 * The application could have included the 'to_submit' count
-		 * in how many events it wanted to wait for. If we failed to
-		 * submit the desired count, we may need to adjust the number
-		 * of events to poll/wait for.
-		 */
-		if (submitted < to_submit)
-			min_complete = min_t(unsigned, submitted, min_complete);
-
 		if (ctx->flags & IORING_SETUP_IOPOLL) {
 			mutex_lock(&ctx->uring_lock);
 			ret = io_iopoll_check(ctx, &nr_events, min_complete);
-- 
2.20.1



* Re: [PATCH v1 1/1] [io_uring] don't stall on submission errors
  2019-05-01 11:49               ` [PATCH v1 1/1] [io_uring] don't stall on submission errors Stefan Bühler
@ 2019-05-01 12:43                 ` Jens Axboe
  0 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2019-05-01 12:43 UTC (permalink / raw)
  To: Stefan Bühler, linux-block, linux-fsdevel

On 5/1/19 5:49 AM, Stefan Bühler wrote:
> Stalling is buggy right now, because it might wait for too many events;
> for proper stalling we'd need to know how many submissions we ignored,
> and reduce min_complete by that amount.
> 
> Easier not to stall at all.
> 
> Also fix some local variable names.

I folded this in, but removed the renaming. Thanks!

-- 
Jens Axboe



* [PATCH 1/2] io_uring: restructure io_{read,write} control flow
  2019-04-23 20:31 ` Jens Axboe
                     ` (2 preceding siblings ...)
  2019-04-28 15:54   ` io_uring: O_NONBLOCK/IOCB_NOWAIT/RWF_NOWAIT mess Stefan Bühler
@ 2019-05-03  9:47   ` Stefan Bühler
  2019-05-03  9:47     ` [PATCH 2/2] io_uring: punt to workers if file doesn't support async Stefan Bühler
  3 siblings, 1 reply; 25+ messages in thread
From: Stefan Bühler @ 2019-05-03  9:47 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Call io_async_list_note at the end if -EAGAIN is going to be returned;
we need iov_count for that, which we have (almost) at the same time as
we need to free iovec.

Instead of using a second return value reset the normal one after
passing it to io_rw_done.

Unless rw_verify_area returns -EAGAIN this shouldn't result in different
behavior.

This change should make it easier to punt a request to the workers by
returning -EAGAIN and still calling io_async_list_note if needed.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 89 ++++++++++++++++++++++-----------------------------
 1 file changed, 39 insertions(+), 50 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 84efb8956734..52e435a72b6f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1062,26 +1062,24 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s,
 	ret = io_import_iovec(req->ctx, READ, s, &iovec, &iter);
 	if (ret)
 		return ret;
-
 	iov_count = iov_iter_count(&iter);
+
 	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_count);
-	if (!ret) {
-		ssize_t ret2;
+	if (ret)
+		goto out_free;
 
-		/* Catch -EAGAIN return for forced non-blocking submission */
-		ret2 = call_read_iter(file, kiocb, &iter);
-		if (!force_nonblock || ret2 != -EAGAIN) {
-			io_rw_done(kiocb, ret2);
-		} else {
-			/*
-			 * If ->needs_lock is true, we're already in async
-			 * context.
-			 */
-			if (!s->needs_lock)
-				io_async_list_note(READ, req, iov_count);
-			ret = -EAGAIN;
-		}
+	/* Passthrough -EAGAIN return for forced non-blocking submission */
+	ret = call_read_iter(file, kiocb, &iter);
+	if (!(force_nonblock && ret == -EAGAIN)) {
+		io_rw_done(kiocb, ret);
+		ret = 0;
 	}
+
+out_free:
+	/* If ->needs_lock is true, we're already in async context. */
+	if (ret == -EAGAIN && !s->needs_lock)
+		io_async_list_note(READ, req, iov_count);
+
 	kfree(iovec);
 	return ret;
 }
@@ -1109,50 +1107,41 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	ret = io_import_iovec(req->ctx, WRITE, s, &iovec, &iter);
 	if (ret)
 		return ret;
-
 	iov_count = iov_iter_count(&iter);
 
 	ret = -EAGAIN;
-	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) {
-		/* If ->needs_lock is true, we're already in async context. */
-		if (!s->needs_lock)
-			io_async_list_note(WRITE, req, iov_count);
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT))
 		goto out_free;
-	}
 
 	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, iov_count);
-	if (!ret) {
-		ssize_t ret2;
+	if (ret)
+		goto out_free;
 
-		/*
-		 * Open-code file_start_write here to grab freeze protection,
-		 * which will be released by another thread in
-		 * io_complete_rw().  Fool lockdep by telling it the lock got
-		 * released so that it doesn't complain about the held lock when
-		 * we return to userspace.
-		 */
-		if (S_ISREG(file_inode(file)->i_mode)) {
-			__sb_start_write(file_inode(file)->i_sb,
-						SB_FREEZE_WRITE, true);
-			__sb_writers_release(file_inode(file)->i_sb,
-						SB_FREEZE_WRITE);
-		}
-		kiocb->ki_flags |= IOCB_WRITE;
+	/*
+	 * Open-code file_start_write here to grab freeze protection,
+	 * which will be released by another thread in
+	 * io_complete_rw().  Fool lockdep by telling it the lock got
+	 * released so that it doesn't complain about the held lock when
+	 * we return to userspace.
+	 */
+	if (S_ISREG(file_inode(file)->i_mode)) {
+		__sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
+		__sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE);
+	}
+	kiocb->ki_flags |= IOCB_WRITE;
 
-		ret2 = call_write_iter(file, kiocb, &iter);
-		if (!force_nonblock || ret2 != -EAGAIN) {
-			io_rw_done(kiocb, ret2);
-		} else {
-			/*
-			 * If ->needs_lock is true, we're already in async
-			 * context.
-			 */
-			if (!s->needs_lock)
-				io_async_list_note(WRITE, req, iov_count);
-			ret = -EAGAIN;
-		}
+	/* Passthrough -EAGAIN return for forced non-blocking submission */
+	ret = call_write_iter(file, kiocb, &iter);
+	if (!(force_nonblock && ret == -EAGAIN)) {
+		io_rw_done(kiocb, ret);
+		ret = 0;
 	}
+
 out_free:
+	/* If ->needs_lock is true, we're already in async context. */
+	if (ret == -EAGAIN && !s->needs_lock)
+		io_async_list_note(WRITE, req, iov_count);
+
 	kfree(iovec);
 	return ret;
 }
-- 
2.20.1



* [PATCH 2/2] io_uring: punt to workers if file doesn't support async
  2019-05-03  9:47   ` [PATCH 1/2] io_uring: restructure io_{read,write} control flow Stefan Bühler
@ 2019-05-03  9:47     ` Stefan Bühler
  0 siblings, 0 replies; 25+ messages in thread
From: Stefan Bühler @ 2019-05-03  9:47 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

If force_nonblock is set we must not try io operations that don't
support async (i.e. are missing the IOCB_NOWAIT flag).

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 52e435a72b6f..1be7fdb65c3f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1064,6 +1064,11 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s,
 		return ret;
 	iov_count = iov_iter_count(&iter);
 
+	/* async punt if file doesn't support non-blocking */
+	ret = -EAGAIN;
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_NOWAIT))
+		goto out_free;
+
 	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_count);
 	if (ret)
 		goto out_free;
@@ -1109,6 +1114,11 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s,
 		return ret;
 	iov_count = iov_iter_count(&iter);
 
+	/* async punt if file doesn't support non-blocking */
+	ret = -EAGAIN;
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_NOWAIT))
+		goto out_free;
+
 	ret = -EAGAIN;
 	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT))
 		goto out_free;
-- 
2.20.1



* Re: io_uring: closing / release
  2019-04-27 21:07   ` io_uring: closing / release Stefan Bühler
@ 2019-05-11 16:26     ` Stefan Bühler
  0 siblings, 0 replies; 25+ messages in thread
From: Stefan Bühler @ 2019-05-11 16:26 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Hi,

On 27.04.19 23:07, Stefan Bühler wrote:
> Hi,
> 
> On 23.04.19 22:31, Jens Axboe wrote:
>> On 4/23/19 1:06 PM, Stefan Bühler wrote:
>>> I have one other question: is there any way to cancel an IO read/write
>>> operation? I don't think closing io_uring has any effect, what about
>>> closing the files I'm reading/writing?  (Adding cancelation to kiocb
>>> sounds like a non-trivial task; and I don't think it already supports it.)
>>
>> There is no way to do that. If you look at existing aio, nobody supports
>> that either. Hence io_uring doesn't export any sort of cancellation outside
>> of the poll case where we can handle it internally to io_uring.
>>
>> If you look at storage, then generally IO doesn't wait around in the stack,
>> it's issued. Most hardware only supports queue abort like cancellation,
>> which isn't useful at all.
>>
>> So I don't think that will ever happen.
>>
>>> So cleanup in general seems hard to me: do I have to wait for all
>>> read/write operations to complete so I can safely free all buffers
>>> before I close the event loop?
>>
>> The ring exit waits for IO to complete already.
> 
> I now understand at least how that part is working;
> io_ring_ctx_wait_and_kill calls wait_for_completion(&ctx->ctx_done),
> which only completes after all references are gone; each pending job
> keeps a reference.
> 
> But wait_for_completion is not interruptible; so if there are "stuck"
> jobs even root can't kill the task (afaict) anymore.
> 
> Once e.g. readv is working on pipes/sockets (I have some local patches
> here for that), you can easily end up in a situation where a
> socketpair() or a pipe() is still alive, but the read will never finish
> (I can trigger this problem with an explicit close(uring_fd), not sure
> how to observe this on process exit).
> 
> For a socketpair() even both ends could be kept alive by never ending
> read jobs.
> 
> Using wait_for_completion seems like a really bad idea to me; this is a
> problem for io_uring_register too.

As far as I know this is not a problem yet in 5.1 as reads on pipes and
sockets are still blocking the submission, and SQPOLL is only for
CAP_SYS_ADMIN.

But once my "punt to workers if file doesn't support async" patch (or
something similar) goes in, this will become a real problem.

My current trigger looks like this: create socketpair(), submit read
from both ends (using the same iovec... who cares), wait for the workers
to pick up the reads, munmap everything and exit.
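
A rough sketch of that trigger, written against liburing's basic
helpers for brevity (io_uring_queue_init/get_sqe/prep_readv/submit are
liburing names, not from this mail; untested):

#include <liburing.h>
#include <sys/socket.h>
#include <sys/uio.h>

int main(void)
{
	struct io_uring ring;
	struct iovec iov;
	static char buf[4096];
	int sv[2], i;

	iov.iov_base = buf;
	iov.iov_len = sizeof(buf);

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return 1;
	if (io_uring_queue_init(4, &ring, 0) < 0)
		return 1;

	/* queue a read on each end of the pair; neither can ever complete */
	for (i = 0; i < 2; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (!sqe)
			return 1;
		io_uring_prep_readv(sqe, sv[i], &iov, 1, 0);
	}
	io_uring_submit(&ring);

	/* exit without reaping: once userspace drops its references, the
	 * socket ends are kept alive only by the pending reads, and the
	 * ring release path waits on them forever */
	return 0;
}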

The kernel will then clean up the files, but the sockets are still in
use by io_uring, and it will only close the io_uring context, which
will then get stuck in io_ring_ctx_wait_and_kill.

In my qemu test environment "nobody" can leak 16 contexts before hitting
some resource limits.

cheers,
Stefan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: io_uring: O_NONBLOCK/IOCB_NOWAIT/RWF_NOWAIT mess
  2019-04-28 15:54   ` io_uring: O_NONBLOCK/IOCB_NOWAIT/RWF_NOWAIT mess Stefan Bühler
@ 2019-05-11 16:34     ` Stefan Bühler
  2019-05-11 16:57       ` [PATCH 1/5] fs: RWF flags override default IOCB flags from file flags Stefan Bühler
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Bühler @ 2019-05-11 16:34 UTC (permalink / raw)
  To: Jens Axboe, Alexander Viro, linux-block, linux-fsdevel

Hi,

I am a little bit disappointed that this seems to be getting ignored.

I will reply to this mail with some patches, but I really think someone
with a deeper understanding needs to think this through.

If there were some clear documentation/specification of how those flags
are supposed to work (especially in combination), it would be much
easier to tell what is broken and how to fix it...

cheers,
Stefan

On 28.04.19 17:54, Stefan Bühler wrote:
> Hi,
> 
> On 23.04.19 22:31, Jens Axboe wrote:
>> On 4/23/19 1:06 PM, Stefan Bühler wrote:
>>> 2. {read,write}_iter and FMODE_NOWAIT / IOCB_NOWAIT is broken at the vfs
>>> layer: vfs_{read,write} should set IOCB_NOWAIT if O_NONBLOCK is set when
>>> they call {read,write}_iter (i.e. init_sync_kiocb/iocb_flags needs to
>>> convert the flag).
>>>
>>> And all {read,write}_iter should check IOCB_NOWAIT instead of O_NONBLOCK
>>> (hi there pipe.c!), and set FMODE_NOWAIT if they support IOCB_NOWAIT.
>>>
>>> {read,write}_iter should only queue the IOCB though if is_sync_kiocb()
>>> returns false (i.e. if ki_callback is set).
>>
>> That's a trivial fix. I agree that it should be done.
> 
> Doesn't look trivial to me.
> 
> Various functions take rwf_t flags, e.g. do_readv, which is called with
> 0 from readv and with flags from userspace in preadv2.
> 
> Now is preadv2() supposed to be non-blocking if the file has O_NONBLOCK,
> or only if RWF_NOWAIT was passed?
> 
> Other places seem (at least to me) to explicitly mean "please block"
> if they don't pass RWF_NOWAIT, e.g. ovl_read_iter from fs/overlayfs,
> which uses ovl_iocb_to_rwf to convert iocb flags back to rwf.
> 
> Imho the clean way is to ignore O_NONBLOCK when there are rwf_t flags;
> e.g. kiocb_set_rw_flags should unset IOCB_NOWAIT if RWF_NOWAIT was not set.
> 
> But then various functions (like readv) will need to create rwf_t
> "default" flags from a file (similar to iocb_flags) instead of using 0.
> And ovl_iocb_to_rwf should probably be used in more places as well.
> 
> There is also generic_file_splice_read, which should use
> SPLICE_F_NONBLOCK to trigger IOCB_NOWAIT; again it is unclear whether
> O_NONBLOCK should trigger IOCB_NOWAIT too (do_sendfile explicitly does
> NOT with a "need to debate" comment).
> 
> 
> I don't think I'm the right person to do this - I think it requires a
> deeper understanding of all the code involved.
> 
> I do have patches for pipe.c and socket.c to ignore O_NONBLOCK, use
> IOCB_NOWAIT and set FMODE_NOWAIT after the fs part is ready.
> 
> cheers,
> Stefan
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/5] fs: RWF flags override default IOCB flags from file flags
  2019-05-11 16:34     ` Stefan Bühler
@ 2019-05-11 16:57       ` Stefan Bühler
  2019-05-11 16:57         ` [PATCH 2/5] tcp: handle SPLICE_F_NONBLOCK in tcp_splice_read Stefan Bühler
                           ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Stefan Bühler @ 2019-05-11 16:57 UTC (permalink / raw)
  To: Jens Axboe, Alexander Viro, linux-block, linux-fsdevel

Mapping between IOCB, RWF and file flags:

IOCB_APPEND | RWF_APPEND | O_APPEND
IOCB_HIPRI  | RWF_HIPRI  | -
IOCB_DSYNC  | RWF_DSYNC  | O_DSYNC || IS_SYNC(f_mapping->host)
IOCB_SYNC   | RWF_SYNC   | __O_SYNC
IOCB_NOWAIT | RWF_NOWAIT | O_NONBLOCK && (f_mode & FMODE_NOWAIT)
IOCB_DIRECT | -          | io_is_direct()

Most internal kernel functions taking rwf_t flags support taking
_RWF_DEFAULT instead of individual flags to signal "use default flags
from file flags"; APIs exposed to userspace should return EOPNOTSUPP
when they are passed _RWF_DEFAULT.

Also convert O_NONBLOCK file flag to IOCB_NOWAIT if supported by the
file (i.e. FMODE_NOWAIT is set), so read_iter/write_iter implementations
can consistently check for IOCB_NOWAIT instead of O_NONBLOCK.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 drivers/block/loop.c              |  8 +++--
 drivers/staging/android/ashmem.c  |  2 +-
 drivers/target/target_core_file.c |  6 ++--
 fs/coda/file.c                    |  6 ++--
 fs/nfsd/vfs.c                     |  4 +--
 fs/overlayfs/file.c               | 21 ++-----------
 fs/read_write.c                   | 40 ++++++++++++++++++------
 fs/splice.c                       | 11 +++++--
 include/linux/fs.h                | 51 +++++++++++++++++++++++++++++++
 9 files changed, 107 insertions(+), 42 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 102d79575895..cb06eef7d0d2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -280,7 +280,7 @@ static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos)
 	loop_iov_iter_bvec(&i, WRITE, bvec, 1, bvec->bv_len);
 
 	file_start_write(file);
-	bw = vfs_iter_write(file, &i, ppos, 0);
+	bw = vfs_iter_write(file, &i, ppos, _RWF_DEFAULT);
 	file_end_write(file);
 
 	if (likely(bw ==  bvec->bv_len))
@@ -356,7 +356,8 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq,
 
 	rq_for_each_segment(bvec, rq, iter) {
 		loop_iov_iter_bvec(&i, READ, &bvec, 1, bvec.bv_len);
-		len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0);
+		len = vfs_iter_read(lo->lo_backing_file, &i, &pos,
+				    _RWF_DEFAULT);
 		if (len < 0)
 			return len;
 
@@ -397,7 +398,8 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
 		b.bv_len = bvec.bv_len;
 
 		loop_iov_iter_bvec(&i, READ, &b, 1, b.bv_len);
-		len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0);
+		len = vfs_iter_read(lo->lo_backing_file, &i, &pos,
+				    _RWF_DEFAULT);
 		if (len < 0) {
 			ret = len;
 			goto out_free_page;
diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 74d497d39c5a..813353d74eaf 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -307,7 +307,7 @@ static ssize_t ashmem_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	 * ashmem_release is called.
 	 */
 	mutex_unlock(&ashmem_mutex);
-	ret = vfs_iter_read(asma->file, iter, &iocb->ki_pos, 0);
+	ret = vfs_iter_read(asma->file, iter, &iocb->ki_pos, _RWF_DEFAULT);
 	mutex_lock(&ashmem_mutex);
 	if (ret > 0)
 		asma->file->f_pos = iocb->ki_pos;
diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 49b110d1b972..8ebfbf6bc56d 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -355,9 +355,9 @@ static int fd_do_rw(struct se_cmd *cmd, struct file *fd,
 
 	iov_iter_bvec(&iter, READ, bvec, sgl_nents, len);
 	if (is_write)
-		ret = vfs_iter_write(fd, &iter, &pos, 0);
+		ret = vfs_iter_write(fd, &iter, &pos, _RWF_DEFAULT);
 	else
-		ret = vfs_iter_read(fd, &iter, &pos, 0);
+		ret = vfs_iter_read(fd, &iter, &pos, _RWF_DEFAULT);
 
 	if (is_write) {
 		if (ret < 0 || ret != data_length) {
@@ -491,7 +491,7 @@ fd_execute_write_same(struct se_cmd *cmd)
 	}
 
 	iov_iter_bvec(&iter, READ, bvec, nolb, len);
-	ret = vfs_iter_write(fd_dev->fd_file, &iter, &pos, 0);
+	ret = vfs_iter_write(fd_dev->fd_file, &iter, &pos, _RWF_DEFAULT);
 
 	kfree(bvec);
 	if (ret < 0 || ret != len) {
diff --git a/fs/coda/file.c b/fs/coda/file.c
index 1cbc1f2298ee..65d1cc7017e4 100644
--- a/fs/coda/file.c
+++ b/fs/coda/file.c
@@ -35,7 +35,8 @@ coda_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	BUG_ON(!cfi || cfi->cfi_magic != CODA_MAGIC);
 
-	return vfs_iter_read(cfi->cfi_container, to, &iocb->ki_pos, 0);
+	return vfs_iter_read(cfi->cfi_container, to, &iocb->ki_pos,
+			     _RWF_DEFAULT);
 }
 
 static ssize_t
@@ -52,7 +53,8 @@ coda_file_write_iter(struct kiocb *iocb, struct iov_iter *to)
 	host_file = cfi->cfi_container;
 	file_start_write(host_file);
 	inode_lock(coda_inode);
-	ret = vfs_iter_write(cfi->cfi_container, to, &iocb->ki_pos, 0);
+	ret = vfs_iter_write(cfi->cfi_container, to, &iocb->ki_pos,
+			     _RWF_DEFAULT);
 	coda_inode->i_size = file_inode(host_file)->i_size;
 	coda_inode->i_blocks = (coda_inode->i_size + 511) >> 9;
 	coda_inode->i_mtime = coda_inode->i_ctime = current_time(coda_inode);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 7dc98e14655d..2d563d916b5c 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -943,7 +943,7 @@ __be32 nfsd_readv(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
 	iov_iter_kvec(&iter, READ, vec, vlen, *count);
-	host_err = vfs_iter_read(file, &iter, &offset, 0);
+	host_err = vfs_iter_read(file, &iter, &offset, _RWF_DEFAULT);
 	return nfsd_finish_read(rqstp, fhp, file, offset, count, host_err);
 }
 
@@ -996,7 +996,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 	int			use_wgather;
 	loff_t			pos = offset;
 	unsigned int		pflags = current->flags;
-	rwf_t			flags = 0;
+	rwf_t			flags = rwf_flags(file);
 
 	trace_nfsd_write_opened(rqstp, fhp, offset, *cnt);
 
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 84dd957efa24..8567f2cc189f 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -174,23 +174,6 @@ static void ovl_file_accessed(struct file *file)
 	touch_atime(&file->f_path);
 }
 
-static rwf_t ovl_iocb_to_rwf(struct kiocb *iocb)
-{
-	int ifl = iocb->ki_flags;
-	rwf_t flags = 0;
-
-	if (ifl & IOCB_NOWAIT)
-		flags |= RWF_NOWAIT;
-	if (ifl & IOCB_HIPRI)
-		flags |= RWF_HIPRI;
-	if (ifl & IOCB_DSYNC)
-		flags |= RWF_DSYNC;
-	if (ifl & IOCB_SYNC)
-		flags |= RWF_SYNC;
-
-	return flags;
-}
-
 static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
@@ -207,7 +190,7 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
-			    ovl_iocb_to_rwf(iocb));
+			    rwf_from_iocb_flags(iocb));
 	revert_creds(old_cred);
 
 	ovl_file_accessed(file);
@@ -242,7 +225,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	file_start_write(real.file);
 	ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
-			     ovl_iocb_to_rwf(iocb));
+			     rwf_from_iocb_flags(iocb));
 	file_end_write(real.file);
 	revert_creds(old_cred);
 
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..1bb37364cb44 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -670,6 +670,8 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
+	if (flags == _RWF_DEFAULT)
+		flags = rwf_flags(filp);
 	ret = kiocb_set_rw_flags(&kiocb, flags);
 	if (ret)
 		return ret;
@@ -690,7 +692,7 @@ static ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter,
 {
 	ssize_t ret = 0;
 
-	if (flags & ~RWF_HIPRI)
+	if ((flags != _RWF_DEFAULT) && (flags & ~RWF_HIPRI))
 		return -EOPNOTSUPP;
 
 	while (iov_iter_count(iter)) {
@@ -1101,13 +1103,13 @@ static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
 SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen)
 {
-	return do_readv(fd, vec, vlen, 0);
+	return do_readv(fd, vec, vlen, _RWF_DEFAULT);
 }
 
 SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen)
 {
-	return do_writev(fd, vec, vlen, 0);
+	return do_writev(fd, vec, vlen, _RWF_DEFAULT);
 }
 
 SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
@@ -1115,7 +1117,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 {
 	loff_t pos = pos_from_hilo(pos_h, pos_l);
 
-	return do_preadv(fd, vec, vlen, pos, 0);
+	return do_preadv(fd, vec, vlen, pos, _RWF_DEFAULT);
 }
 
 SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
@@ -1124,6 +1126,9 @@ SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
 {
 	loff_t pos = pos_from_hilo(pos_h, pos_l);
 
+	if (flags == _RWF_DEFAULT)
+		return -EOPNOTSUPP;
+
 	if (pos == -1)
 		return do_readv(fd, vec, vlen, flags);
 
@@ -1135,7 +1140,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 {
 	loff_t pos = pos_from_hilo(pos_h, pos_l);
 
-	return do_pwritev(fd, vec, vlen, pos, 0);
+	return do_pwritev(fd, vec, vlen, pos, _RWF_DEFAULT);
 }
 
 SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
@@ -1144,6 +1149,9 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 {
 	loff_t pos = pos_from_hilo(pos_h, pos_l);
 
+	if (flags == _RWF_DEFAULT)
+		return -EOPNOTSUPP;
+
 	if (pos == -1)
 		return do_writev(fd, vec, vlen, flags);
 
@@ -1221,7 +1229,7 @@ COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd,
 		const struct compat_iovec __user *,vec,
 		unsigned long, vlen, loff_t, pos)
 {
-	return do_compat_preadv64(fd, vec, vlen, pos, 0);
+	return do_compat_preadv64(fd, vec, vlen, pos, _RWF_DEFAULT);
 }
 #endif
 
@@ -1231,7 +1239,7 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
 {
 	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
 
-	return do_compat_preadv64(fd, vec, vlen, pos, 0);
+	return do_compat_preadv64(fd, vec, vlen, pos, _RWF_DEFAULT);
 }
 
 #ifdef __ARCH_WANT_COMPAT_SYS_PREADV64V2
@@ -1239,6 +1247,9 @@ COMPAT_SYSCALL_DEFINE5(preadv64v2, unsigned long, fd,
 		const struct compat_iovec __user *,vec,
 		unsigned long, vlen, loff_t, pos, rwf_t, flags)
 {
+	if (flags == _RWF_DEFAULT)
+		return -EOPNOTSUPP;
+
 	if (pos == -1)
 		return do_compat_readv(fd, vec, vlen, flags);
 
@@ -1253,6 +1264,9 @@ COMPAT_SYSCALL_DEFINE6(preadv2, compat_ulong_t, fd,
 {
 	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
 
+	if (flags == _RWF_DEFAULT)
+		return -EOPNOTSUPP;
+
 	if (pos == -1)
 		return do_compat_readv(fd, vec, vlen, flags);
 
@@ -1303,7 +1317,7 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
 		const struct compat_iovec __user *, vec,
 		compat_ulong_t, vlen)
 {
-	return do_compat_writev(fd, vec, vlen, 0);
+	return do_compat_writev(fd, vec, vlen, _RWF_DEFAULT);
 }
 
 static long do_compat_pwritev64(unsigned long fd,
@@ -1330,7 +1344,7 @@ COMPAT_SYSCALL_DEFINE4(pwritev64, unsigned long, fd,
 		const struct compat_iovec __user *,vec,
 		unsigned long, vlen, loff_t, pos)
 {
-	return do_compat_pwritev64(fd, vec, vlen, pos, 0);
+	return do_compat_pwritev64(fd, vec, vlen, pos, _RWF_DEFAULT);
 }
 #endif
 
@@ -1340,7 +1354,7 @@ COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd,
 {
 	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
 
-	return do_compat_pwritev64(fd, vec, vlen, pos, 0);
+	return do_compat_pwritev64(fd, vec, vlen, pos, _RWF_DEFAULT);
 }
 
 #ifdef __ARCH_WANT_COMPAT_SYS_PWRITEV64V2
@@ -1348,6 +1362,9 @@ COMPAT_SYSCALL_DEFINE5(pwritev64v2, unsigned long, fd,
 		const struct compat_iovec __user *,vec,
 		unsigned long, vlen, loff_t, pos, rwf_t, flags)
 {
+	if (flags == _RWF_DEFAULT)
+		return -EOPNOTSUPP;
+
 	if (pos == -1)
 		return do_compat_writev(fd, vec, vlen, flags);
 
@@ -1361,6 +1378,9 @@ COMPAT_SYSCALL_DEFINE6(pwritev2, compat_ulong_t, fd,
 {
 	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
 
+	if (flags == _RWF_DEFAULT)
+		return -EOPNOTSUPP;
+
 	if (pos == -1)
 		return do_compat_writev(fd, vec, vlen, flags);
 
diff --git a/fs/splice.c b/fs/splice.c
index 25212dcca2df..1b7fdcce5d6b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -302,6 +302,12 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	iov_iter_pipe(&to, READ, pipe, len);
 	idx = to.idx;
 	init_sync_kiocb(&kiocb, in);
+	/*
+	 * SPLICE_F_NONBLOCK is only used on the pipe/socket end???
+	 *
+	 * Don't set IOCB_NOWAIT based on it here (O_NONBLOCK might set
+	 * IOCB_NOWAIT though).
+	 */
 	kiocb.ki_pos = *ppos;
 	ret = call_read_iter(in, &kiocb, &to);
 	if (ret > 0) {
@@ -355,7 +361,8 @@ static ssize_t kernel_readv(struct file *file, const struct kvec *vec,
 	old_fs = get_fs();
 	set_fs(KERNEL_DS);
 	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
+	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos,
+			_RWF_DEFAULT);
 	set_fs(old_fs);
 
 	return res;
@@ -742,7 +749,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		}
 
 		iov_iter_bvec(&from, WRITE, array, n, sd.total_len - left);
-		ret = vfs_iter_write(out, &from, &sd.pos, 0);
+		ret = vfs_iter_write(out, &from, &sd.pos, _RWF_DEFAULT);
 		if (ret <= 0)
 			break;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2f66e247ecba..bc159cb87ea5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -83,6 +83,12 @@ extern int sysctl_protected_fifos;
 extern int sysctl_protected_regular;
 
 typedef __kernel_rwf_t rwf_t;
+/*
+ * kernel internal only value to signal using default flags from file;
+ * can't be combined with other flags and only used to simplify kernel
+ * APIs.
+ */
+#define _RWF_DEFAULT ((__force __kernel_rwf_t)-1)
 
 struct buffer_head;
 typedef int (get_block_t)(struct inode *inode, sector_t iblock,
@@ -3332,11 +3338,21 @@ static inline int iocb_flags(struct file *file)
 		res |= IOCB_DSYNC;
 	if (file->f_flags & __O_SYNC)
 		res |= IOCB_SYNC;
+	if ((file->f_flags & O_NONBLOCK) && (file->f_mode & FMODE_NOWAIT))
+		res |= IOCB_NOWAIT;
 	return res;
 }
 
+/* does NOT handle _RWF_DEFAULT */
 static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 {
+	/*
+	 * unset all flags controlled through RWF, right now all but:
+	 * IOCB_EVENTFD, IOCB_DIRECT, IOCB_WRITE
+	 */
+	ki->ki_flags &= ~(IOCB_APPEND | IOCB_HIPRI | IOCB_DSYNC | IOCB_SYNC |
+			  IOCB_NOWAIT);
+
 	if (unlikely(flags & ~RWF_SUPPORTED))
 		return -EOPNOTSUPP;
 
@@ -3356,6 +3372,41 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 	return 0;
 }
 
+static inline rwf_t rwf_flags(struct file *file)
+{
+	rwf_t flags = 0;
+
+	if (file->f_flags & O_APPEND)
+		flags |= RWF_APPEND;
+	if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
+		flags |= RWF_DSYNC;
+	if (file->f_flags & __O_SYNC)
+		flags |= RWF_SYNC;
+	if ((file->f_flags & O_NONBLOCK) && (file->f_mode & FMODE_NOWAIT))
+		flags |= RWF_NOWAIT;
+
+	return flags;
+}
+
+static inline rwf_t rwf_from_iocb_flags(struct kiocb *iocb)
+{
+	int ifl = iocb->ki_flags;
+	rwf_t flags = 0;
+
+	if (ifl & IOCB_APPEND)
+		flags |= RWF_APPEND;
+	if (ifl & IOCB_NOWAIT)
+		flags |= RWF_NOWAIT;
+	if (ifl & IOCB_HIPRI)
+		flags |= RWF_HIPRI;
+	if (ifl & IOCB_DSYNC)
+		flags |= RWF_DSYNC;
+	if (ifl & IOCB_SYNC)
+		flags |= RWF_SYNC;
+
+	return flags;
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
-- 
2.20.1
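
To illustrate the intended calling convention (a rough sketch based on
the description above, not code from the patch; assumes the usual
kernel headers, linux/fs.h and linux/uio.h):

/* how an in-kernel caller picks between the file's default flags and
 * explicit rwf_t flags */
static ssize_t example_read(struct file *file, struct iov_iter *iter,
			    loff_t *pos, bool force_nowait)
{
	if (force_nowait)
		/* explicit flags: O_NONBLOCK on the file is ignored,
		 * only RWF_NOWAIT makes this call non-blocking */
		return vfs_iter_read(file, iter, pos, RWF_NOWAIT);

	/* honour the file's own O_APPEND / O_(D)SYNC / O_NONBLOCK defaults */
	return vfs_iter_read(file, iter, pos, _RWF_DEFAULT);
}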


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/5] tcp: handle SPLICE_F_NONBLOCK in tcp_splice_read
  2019-05-11 16:57       ` [PATCH 1/5] fs: RWF flags override default IOCB flags from file flags Stefan Bühler
@ 2019-05-11 16:57         ` Stefan Bühler
  2019-05-11 16:57         ` [PATCH 3/5] pipe: use IOCB_NOWAIT instead of O_NONBLOCK Stefan Bühler
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Stefan Bühler @ 2019-05-11 16:57 UTC (permalink / raw)
  To: Jens Axboe, Alexander Viro, linux-block, linux-fsdevel

Now both O_NONBLOCK and SPLICE_F_NONBLOCK will trigger non-blocking
behavior.

The splice man page is unclear about the exact semantics:

First it says splice may still block if SPLICE_F_NONBLOCK is set but
O_NONBLOCK isn't.

Then it says it might return EAGAIN if one or the other is set (and on
my Debian system it says EAGAIN can only be returned if
SPLICE_F_NONBLOCK was set).

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 net/ipv4/tcp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6baa6dc1b13b..65f9917ed8ca 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -784,6 +784,8 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 	long timeo;
 	ssize_t spliced;
 	int ret;
+	bool noblock = (sock->file->f_flags & O_NONBLOCK) ||
+		       (flags & SPLICE_F_NONBLOCK);
 
 	sock_rps_record_flow(sk);
 	/*
@@ -796,7 +798,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 
 	lock_sock(sk);
 
-	timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
+	timeo = sock_rcvtimeo(sk, noblock);
 	while (tss.len) {
 		ret = __tcp_splice_read(sk, &tss);
 		if (ret < 0)
-- 
2.20.1
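
A minimal sketch of the userspace behaviour this is meant to enable
(untested assumption based on the commit message; splice(2) only, no
further error handling):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <errno.h>

/* move up to 4096 bytes from a connected TCP socket into a pipe without
 * sleeping, even if the socket was not opened with O_NONBLOCK */
static ssize_t splice_nonblocking(int tcp_fd, int pipe_wr_fd)
{
	ssize_t n = splice(tcp_fd, NULL, pipe_wr_fd, NULL, 4096,
			   SPLICE_F_NONBLOCK);

	if (n < 0 && errno == EAGAIN)
		return 0;	/* nothing queued right now */
	return n;
}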


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/5] pipe: use IOCB_NOWAIT instead of O_NONBLOCK
  2019-05-11 16:57       ` [PATCH 1/5] fs: RWF flags override default IOCB flags from file flags Stefan Bühler
  2019-05-11 16:57         ` [PATCH 2/5] tcp: handle SPLICE_F_NONBLOCK in tcp_splice_read Stefan Bühler
@ 2019-05-11 16:57         ` Stefan Bühler
  2019-05-11 16:57         ` [PATCH 4/5] socket: " Stefan Bühler
  2019-05-11 16:57         ` [PATCH 5/5] io_uring: use FMODE_NOWAIT to detect files supporting IOCB_NOWAIT Stefan Bühler
  3 siblings, 0 replies; 25+ messages in thread
From: Stefan Bühler @ 2019-05-11 16:57 UTC (permalink / raw)
  To: Jens Axboe, Alexander Viro, linux-block, linux-fsdevel

Fix the pipe read_iter/write_iter implementations to handle IOCB_NOWAIT;
for plain synchronous reads, IOCB_NOWAIT will be set if O_NONBLOCK was
set on the file.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/pipe.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 41065901106b..a122331cf30f 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -335,13 +335,13 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 			break;
 		if (!pipe->waiting_writers) {
 			/* syscall merging: Usually we must not sleep
-			 * if O_NONBLOCK is set, or if we got some data.
+			 * if IOCB_NOWAIT is set, or if we got some data.
 			 * But if a writer sleeps in kernel space, then
 			 * we can wait for that data without violating POSIX.
 			 */
 			if (ret)
 				break;
-			if (filp->f_flags & O_NONBLOCK) {
+			if (iocb->ki_flags & IOCB_NOWAIT) {
 				ret = -EAGAIN;
 				break;
 			}
@@ -446,7 +446,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 				pipe->tmp_page = page;
 			}
 			/* Always wake up, even if the copy fails. Otherwise
-			 * we lock up (O_NONBLOCK-)readers that sleep due to
+			 * we lock up (IOCB_NOWAIT-)readers that sleep due to
 			 * syscall merging.
 			 * FIXME! Is this really true?
 			 */
@@ -477,7 +477,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		}
 		if (bufs < pipe->buffers)
 			continue;
-		if (filp->f_flags & O_NONBLOCK) {
+		if (iocb->ki_flags & IOCB_NOWAIT) {
 			if (!ret)
 				ret = -EAGAIN;
 			break;
@@ -780,6 +780,7 @@ int create_pipe_files(struct file **res, int flags)
 		iput(inode);
 		return PTR_ERR(f);
 	}
+	f->f_mode |= FMODE_NOWAIT;
 
 	f->private_data = inode->i_pipe;
 
@@ -791,6 +792,7 @@ int create_pipe_files(struct file **res, int flags)
 		return PTR_ERR(res[0]);
 	}
 	res[0]->private_data = inode->i_pipe;
+	res[0]->f_mode |= FMODE_NOWAIT;
 	res[1] = f;
 	return 0;
 }
@@ -996,6 +998,8 @@ static int fifo_open(struct inode *inode, struct file *filp)
 		goto err;
 	}
 
+	filp->f_mode |= FMODE_NOWAIT;
+
 	/* Ok! */
 	__pipe_unlock(pipe);
 	return 0;
-- 
2.20.1
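
A quick way to observe the change from userspace (rough sketch; assumes
a glibc with the preadv2 wrapper and RWF_NOWAIT, otherwise the raw
syscall would be needed):

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
	int p[2];
	char c;
	struct iovec iov = { .iov_base = &c, .iov_len = 1 };
	ssize_t r;

	if (pipe(p) < 0)
		return 1;

	/* empty pipe: with FMODE_NOWAIT now set on pipes, RWF_NOWAIT is
	 * accepted and this should fail with EAGAIN instead of EOPNOTSUPP */
	r = preadv2(p[0], &iov, 1, -1, RWF_NOWAIT);
	printf("preadv2 = %zd, errno = %d\n", r, r < 0 ? errno : 0);
	return 0;
}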


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/5] socket: use IOCB_NOWAIT instead of O_NONBLOCK
  2019-05-11 16:57       ` [PATCH 1/5] fs: RWF flags override default IOCB flags from file flags Stefan Bühler
  2019-05-11 16:57         ` [PATCH 2/5] tcp: handle SPLICE_F_NONBLOCK in tcp_splice_read Stefan Bühler
  2019-05-11 16:57         ` [PATCH 3/5] pipe: use IOCB_NOWAIT instead of O_NONBLOCK Stefan Bühler
@ 2019-05-11 16:57         ` Stefan Bühler
  2019-05-11 16:57         ` [PATCH 5/5] io_uring: use FMODE_NOWAIT to detect files supporting IOCB_NOWAIT Stefan Bühler
  3 siblings, 0 replies; 25+ messages in thread
From: Stefan Bühler @ 2019-05-11 16:57 UTC (permalink / raw)
  To: Jens Axboe, Alexander Viro, linux-block, linux-fsdevel

Fix the socket read_iter/write_iter implementations to handle
IOCB_NOWAIT; for plain synchronous reads, IOCB_NOWAIT will be set if
O_NONBLOCK was set on the file.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 net/socket.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 8255f5bda0aa..1e2f6819ea2b 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -410,6 +410,7 @@ struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
 		sock_release(sock);
 		return file;
 	}
+	file->f_mode |= FMODE_NOWAIT;
 
 	sock->file = file;
 	file->private_data = sock;
@@ -954,7 +955,7 @@ static ssize_t sock_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			     .msg_iocb = iocb};
 	ssize_t res;
 
-	if (file->f_flags & O_NONBLOCK)
+	if (iocb->ki_flags & IOCB_NOWAIT)
 		msg.msg_flags = MSG_DONTWAIT;
 
 	if (iocb->ki_pos != 0)
@@ -979,7 +980,7 @@ static ssize_t sock_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos != 0)
 		return -ESPIPE;
 
-	if (file->f_flags & O_NONBLOCK)
+	if (iocb->ki_flags & IOCB_NOWAIT)
 		msg.msg_flags = MSG_DONTWAIT;
 
 	if (sock->type == SOCK_SEQPACKET)
-- 
2.20.1
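
Sketch of the per-call behaviour this gives userspace on stream sockets
(untested assumption based on the diff; the socket layer turns
IOCB_NOWAIT into MSG_DONTWAIT internally):

#define _GNU_SOURCE
#include <sys/uio.h>

static ssize_t read_nowait(int sock_fd, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	/* comparable to recv(sock_fd, buf, len, MSG_DONTWAIT) */
	return preadv2(sock_fd, &iov, 1, -1, RWF_NOWAIT);
}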


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/5] io_uring: use FMODE_NOWAIT to detect files supporting IOCB_NOWAIT
  2019-05-11 16:57       ` [PATCH 1/5] fs: RWF flags override default IOCB flags from file flags Stefan Bühler
                           ` (2 preceding siblings ...)
  2019-05-11 16:57         ` [PATCH 4/5] socket: " Stefan Bühler
@ 2019-05-11 16:57         ` Stefan Bühler
  3 siblings, 0 replies; 25+ messages in thread
From: Stefan Bühler @ 2019-05-11 16:57 UTC (permalink / raw)
  To: Jens Axboe, Alexander Viro, linux-block, linux-fsdevel

This replaces the magic check looking for S_ISBLK(mode) ||
S_ISCHR(mode); given that the io_uring file doesn't support read/write,
the check for io_uring_fops is useless anyway.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 25 +++++++------------------
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e1c6ab63628f..396ce6804977 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -867,23 +867,6 @@ static struct file *io_file_get(struct io_submit_state *state, int fd)
 	return state->file;
 }
 
-/*
- * If we tracked the file through the SCM inflight mechanism, we could support
- * any file. For now, just ensure that anything potentially problematic is done
- * inline.
- */
-static bool io_file_supports_async(struct file *file)
-{
-	umode_t mode = file_inode(file)->i_mode;
-
-	if (S_ISBLK(mode) || S_ISCHR(mode))
-		return true;
-	if (S_ISREG(mode) && file->f_op != &io_uring_fops)
-		return true;
-
-	return false;
-}
-
 static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
 		      bool force_nonblock)
 {
@@ -896,7 +879,13 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
 	if (!req->file)
 		return -EBADF;
 
-	if (force_nonblock && !io_file_supports_async(req->file))
+	/*
+	 * don't set IOCB_NOWAIT if not supported (forces async punt)
+	 *
+	 * we don't punt if NOWAIT is not supported but requested as
+	 * kiocb_set_rw_flags will return EOPNOTSUPP
+	 */
+	if (force_nonblock && !(req->file->f_mode & FMODE_NOWAIT))
 		force_nonblock = false;
 
 	kiocb->ki_pos = READ_ONCE(sqe->off);
-- 
2.20.1
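
Taken together with the FMODE_NOWAIT patches earlier in this series,
the contract a file implementation has to honour before io_uring will
issue non-blocking I/O against it looks roughly like this (rough
sketch; foo_*, data_ready(), wait_for_data() and copy_foo_data() are
made-up placeholders, not existing kernel code):

static int foo_open(struct inode *inode, struct file *filp)
{
	/* promise that read_iter/write_iter honour IOCB_NOWAIT */
	filp->f_mode |= FMODE_NOWAIT;
	return 0;
}

static ssize_t foo_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct foo_dev *dev = iocb->ki_filp->private_data;

	if (!data_ready(dev)) {
		if (iocb->ki_flags & IOCB_NOWAIT)
			return -EAGAIN;	/* never sleep when asked not to */
		wait_for_data(dev);
	}
	return copy_foo_data(dev, to);
}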


^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2019-05-11 16:57 UTC | newest]

Thread overview: 25+ messages
2019-04-23 19:06 io_uring: not good enough for release Stefan Bühler
2019-04-23 20:31 ` Jens Axboe
2019-04-23 22:07   ` Jens Axboe
2019-04-24 16:09     ` Jens Axboe
2019-04-27 16:05       ` io_uring: RWF_NOWAIT support Stefan Bühler
2019-04-27 18:34         ` [PATCH v1 1/1] [io_uring] fix handling SQEs requesting NOWAIT Stefan Bühler
2019-04-30 15:40           ` Jens Axboe
2019-04-27 15:50     ` io_uring: submission error handling Stefan Bühler
2019-04-30 16:02       ` Jens Axboe
2019-04-30 16:15         ` Jens Axboe
2019-04-30 18:15           ` Stefan Bühler
2019-04-30 18:42             ` Jens Axboe
2019-05-01 11:49               ` [PATCH v1 1/1] [io_uring] don't stall on submission errors Stefan Bühler
2019-05-01 12:43                 ` Jens Axboe
2019-04-27 21:07   ` io_uring: closing / release Stefan Bühler
2019-05-11 16:26     ` Stefan Bühler
2019-04-28 15:54   ` io_uring: O_NONBLOCK/IOCB_NOWAIT/RWF_NOWAIT mess Stefan Bühler
2019-05-11 16:34     ` Stefan Bühler
2019-05-11 16:57       ` [PATCH 1/5] fs: RWF flags override default IOCB flags from file flags Stefan Bühler
2019-05-11 16:57         ` [PATCH 2/5] tcp: handle SPLICE_F_NONBLOCK in tcp_splice_read Stefan Bühler
2019-05-11 16:57         ` [PATCH 3/5] pipe: use IOCB_NOWAIT instead of O_NONBLOCK Stefan Bühler
2019-05-11 16:57         ` [PATCH 4/5] socket: " Stefan Bühler
2019-05-11 16:57         ` [PATCH 5/5] io_uring: use FMODE_NOWAIT to detect files supporting IOCB_NOWAIT Stefan Bühler
2019-05-03  9:47   ` [PATCH 1/2] io_uring: restructure io_{read,write} control flow Stefan Bühler
2019-05-03  9:47     ` [PATCH 2/2] io_uring: punt to workers if file doesn't support async Stefan Bühler
