linux-man.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH 05/18] Add io_uring IO interface
       [not found] ` <20190123153536.7081-6-axboe@kernel.dk>
@ 2019-01-28 14:57   ` Christoph Hellwig
  2019-01-28 16:26     ` Jens Axboe
                       ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Christoph Hellwig @ 2019-01-28 14:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-fsdevel, linux-aio, linux-block, hch, jmoyer, avi,
	linux-api, linux-man

[please make sure linux-api and linux-man are CCed on new syscalls
so that we get API experts to review them]

> io_uring_enter(fd, to_submit, min_complete, flags)
> 	Initiates IO against the rings mapped to this fd, or waits for
> 	them to complete, or both. The behavior is controlled by the
> 	parameters passed in. If 'to_submit' is non-zero, then we'll
> 	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
> 	kernel will wait for 'min_complete' events, if they aren't
> 	already available. It's valid to set IORING_ENTER_GETEVENTS
> 	and 'min_complete' == 0 at the same time, this allows the
> 	kernel to return already completed events without waiting
> 	for them. This is useful only for polling, as for IRQ
> 	driven IO, the application can just check the CQ ring
> 	without entering the kernel.

Especially with poll support now in the series, don't we need a ѕigmask
argument similar to pselect/ppoll/io_pgetevents now to deal with signal
blocking during waiting for events?

> +struct sqe_submit {
> +	const struct io_uring_sqe *sqe;
> +	unsigned index;
> +};

Can you make sure all the structs use tab indentation for their
field names?  Maybe even the same for all structs just to be nice
to my eyes?

> +static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
> +			   const struct io_uring_sqe *sqe,
> +			   struct iovec **iovec, struct iov_iter *iter)
> +{
> +	void __user *buf = u64_to_user_ptr(sqe->addr);
> +
> +#ifdef CONFIG_COMPAT
> +	if (ctx->compat)
> +		return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
> +						iovec, iter);
> +#endif

I think we can just check in_compat_syscall() here, which means we
can kill the ->compat member, and the separate compat version of the
setup syscall.

> +/*
> + * IORING_OP_NOP just posts a completion event, nothing else.
> + */
> +static int io_nop(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> +	struct io_ring_ctx *ctx = req->ctx;
> +
> +	__io_cqring_add_event(ctx, sqe->user_data, 0, 0);

Can you explain why not taking the completion lock is safe here?  And
why we want to have such a somewhat dangerous special case just for the
no-op benchmarking aid?

> +static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
> +{
> +	struct io_sq_ring *ring = ctx->sq_ring;
> +	unsigned head;
> +
> +	head = ctx->cached_sq_head;
> +	smp_rmb();
> +	if (head == READ_ONCE(ring->r.tail))
> +		return false;

Do we really need to optimize the sq_head == tail case so much? Or
am I missing why we are using the cached sq head case here?  Maybe
add some more comments for a start.

> +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
> +			    unsigned min_complete, unsigned flags)
> +{
> +	int ret = 0;
> +
> +	if (to_submit) {
> +		ret = io_ring_submit(ctx, to_submit);
> +		if (ret < 0)
> +			return ret;
> +	}
> +	if (flags & IORING_ENTER_GETEVENTS) {
> +		int get_ret;
> +
> +		if (!ret && to_submit)
> +			min_complete = 0;

Why do we have this special case?  Does it need some documentation?

> +
> +		get_ret = io_cqring_wait(ctx, min_complete);
> +		if (get_ret < 0 && !ret)
> +			ret = get_ret;
> +	}
> +
> +	return ret;

Maybe using different names and slightly different semantics for the
return values would clear some of this up?

	if (to_submit) {
		submitted = io_ring_submit(ctx, to_submit);
		if (submitted < 0)
			return submitted;
	}
	if (flags & IORING_ENTER_GETEVENTS) {
		...
		ret = io_cqring_wait(ctx, min_complete);
	}

	return submitted ? submitted : ret;

> +static int io_sq_offload_start(struct io_ring_ctx *ctx)

> +static void io_sq_offload_stop(struct io_ring_ctx *ctx)

Can we just merge these two functions into the callers?  Currently
the flow is a little odd with these helpers that don't seem to be
too clear about their responsibilities.

> +static void io_free_scq_urings(struct io_ring_ctx *ctx)
> +{
> +	if (ctx->sq_ring) {
> +		page_frag_free(ctx->sq_ring);
> +		ctx->sq_ring = NULL;
> +	}
> +	if (ctx->sq_sqes) {
> +		page_frag_free(ctx->sq_sqes);
> +		ctx->sq_sqes = NULL;
> +	}
> +	if (ctx->cq_ring) {
> +		page_frag_free(ctx->cq_ring);
> +		ctx->cq_ring = NULL;
> +	}

Why is this using the page_frag helpers?  Also the callers just free
these ctx structure, so there isn't much of a point zeroing them out.

Also I'd be tempted to open code the freeing in io_allocate_scq_urings
instead of caling the helper, which would avoid the NULL checks and
make the error handling code a little more obvious.

> +	if (mutex_trylock(&ctx->uring_lock)) {
> +		ret = __io_uring_enter(ctx, to_submit, min_complete, flags);

do we even need the separate __io_uring_enter helper?

> +static void io_fill_offsets(struct io_uring_params *p)

Do we really need this as a separate helper?

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 14:57   ` [PATCH 05/18] Add io_uring IO interface Christoph Hellwig
@ 2019-01-28 16:26     ` Jens Axboe
  2019-01-28 16:34       ` Christoph Hellwig
  2019-01-28 18:25     ` Jens Axboe
  2019-01-29  0:47     ` Andy Lutomirski
  2 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2019-01-28 16:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-aio, linux-block, jmoyer, avi, linux-api, linux-man

On 1/28/19 7:57 AM, Christoph Hellwig wrote:
> [please make sure linux-api and linux-man are CCed on new syscalls
> so that we get API experts to review them]

I already did review with Arnd on those parts, I'll add linux-api and
linux-man for the next posting.

>> io_uring_enter(fd, to_submit, min_complete, flags)
>> 	Initiates IO against the rings mapped to this fd, or waits for
>> 	them to complete, or both. The behavior is controlled by the
>> 	parameters passed in. If 'to_submit' is non-zero, then we'll
>> 	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
>> 	kernel will wait for 'min_complete' events, if they aren't
>> 	already available. It's valid to set IORING_ENTER_GETEVENTS
>> 	and 'min_complete' == 0 at the same time, this allows the
>> 	kernel to return already completed events without waiting
>> 	for them. This is useful only for polling, as for IRQ
>> 	driven IO, the application can just check the CQ ring
>> 	without entering the kernel.
> 
> Especially with poll support now in the series, don't we need a ѕigmask
> argument similar to pselect/ppoll/io_pgetevents now to deal with signal
> blocking during waiting for events?

I guess we do.

>> +struct sqe_submit {
>> +	const struct io_uring_sqe *sqe;
>> +	unsigned index;
>> +};
> 
> Can you make sure all the structs use tab indentation for their
> field names?  Maybe even the same for all structs just to be nice
> to my eyes?

Sure, fixed.

>> +static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
>> +			   const struct io_uring_sqe *sqe,
>> +			   struct iovec **iovec, struct iov_iter *iter)
>> +{
>> +	void __user *buf = u64_to_user_ptr(sqe->addr);
>> +
>> +#ifdef CONFIG_COMPAT
>> +	if (ctx->compat)
>> +		return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
>> +						iovec, iter);
>> +#endif
> 
> I think we can just check in_compat_syscall() here, which means we
> can kill the ->compat member, and the separate compat version of the
> setup syscall.

Good point, I'll switch to using that so we don't have to track it.

>> +/*
>> + * IORING_OP_NOP just posts a completion event, nothing else.
>> + */
>> +static int io_nop(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>> +{
>> +	struct io_ring_ctx *ctx = req->ctx;
>> +
>> +	__io_cqring_add_event(ctx, sqe->user_data, 0, 0);
> 
> Can you explain why not taking the completion lock is safe here?  And
> why we want to have such a somewhat dangerous special case just for the
> no-op benchmarking aid?

Was going to say it's safe because we always fill the ring inside the
ring mutex, but that won't work if we intermingle with non-polled IO.
I'll switch it to using the normal locked variant.

>> +static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
>> +{
>> +	struct io_sq_ring *ring = ctx->sq_ring;
>> +	unsigned head;
>> +
>> +	head = ctx->cached_sq_head;
>> +	smp_rmb();
>> +	if (head == READ_ONCE(ring->r.tail))
>> +		return false;
> 
> Do we really need to optimize the sq_head == tail case so much? Or
> am I missing why we are using the cached sq head case here?  Maybe
> add some more comments for a start.

It basically serves two purposes:

1) When we grab multiple events, only have to update the actual ring
   tail once. This is a big deal, especially on archs where the barriers
   are more expensive.

2) It means the kernel tracks the sq tail and cq head, instead of
   completely relying on the application. This seems a much saner
   choice.

>> +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
>> +			    unsigned min_complete, unsigned flags)
>> +{
>> +	int ret = 0;
>> +
>> +	if (to_submit) {
>> +		ret = io_ring_submit(ctx, to_submit);
>> +		if (ret < 0)
>> +			return ret;
>> +	}
>> +	if (flags & IORING_ENTER_GETEVENTS) {
>> +		int get_ret;
>> +
>> +		if (!ret && to_submit)
>> +			min_complete = 0;
> 
> Why do we have this special case?  Does it need some documentation?

At least for polled IO, if we don't submit what we were asked to, then
we can't reliably poll for the passed in number of events. The
min_complete from the application could very well include the
expectation that to_submit were submitted as well.

I'll add a comment.

>> +
>> +		get_ret = io_cqring_wait(ctx, min_complete);
>> +		if (get_ret < 0 && !ret)
>> +			ret = get_ret;
>> +	}
>> +
>> +	return ret;
> 
> Maybe using different names and slightly different semantics for the
> return values would clear some of this up?
> 
> 	if (to_submit) {
> 		submitted = io_ring_submit(ctx, to_submit);
> 		if (submitted < 0)
> 			return submitted;
> 	}
> 	if (flags & IORING_ENTER_GETEVENTS) {
> 		...
> 		ret = io_cqring_wait(ctx, min_complete);
> 	}
> 
> 	return submitted ? submitted : ret;

That would probably make it more readable, I'll make this change.

>> +static int io_sq_offload_start(struct io_ring_ctx *ctx)
> 
>> +static void io_sq_offload_stop(struct io_ring_ctx *ctx)
> 
> Can we just merge these two functions into the callers?  Currently
> the flow is a little odd with these helpers that don't seem to be
> too clear about their responsibilities.

In the initial patch I agree, but with the later thread addition, I like
having it in a separate helper. At least for the start, the top side is
more trivial.

>> +static void io_free_scq_urings(struct io_ring_ctx *ctx)
>> +{
>> +	if (ctx->sq_ring) {
>> +		page_frag_free(ctx->sq_ring);
>> +		ctx->sq_ring = NULL;
>> +	}
>> +	if (ctx->sq_sqes) {
>> +		page_frag_free(ctx->sq_sqes);
>> +		ctx->sq_sqes = NULL;
>> +	}
>> +	if (ctx->cq_ring) {
>> +		page_frag_free(ctx->cq_ring);
>> +		ctx->cq_ring = NULL;
>> +	}
> 
> Why is this using the page_frag helpers?  Also the callers just free
> these ctx structure, so there isn't much of a point zeroing them out.

Why not use the page frag helpers? No point in open-coding it. I can
kill the zeroing, double call would be a bug anyway.

> Also I'd be tempted to open code the freeing in io_allocate_scq_urings
> instead of caling the helper, which would avoid the NULL checks and
> make the error handling code a little more obvious.

OK

>> +	if (mutex_trylock(&ctx->uring_lock)) {
>> +		ret = __io_uring_enter(ctx, to_submit, min_complete, flags);
> 
> do we even need the separate __io_uring_enter helper?

I like having it nicely separated.

>> +static void io_fill_offsets(struct io_uring_params *p)
> 
> Do we really need this as a separate helper?

That does seem pointless, folded in.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 16:26     ` Jens Axboe
@ 2019-01-28 16:34       ` Christoph Hellwig
  2019-01-28 19:32         ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2019-01-28 16:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, linux-block, jmoyer,
	avi, linux-api, linux-man

On Mon, Jan 28, 2019 at 09:26:42AM -0700, Jens Axboe wrote:
> >> +static void io_free_scq_urings(struct io_ring_ctx *ctx)
> >> +{
> >> +	if (ctx->sq_ring) {
> >> +		page_frag_free(ctx->sq_ring);
> >> +		ctx->sq_ring = NULL;
> >> +	}
> >> +	if (ctx->sq_sqes) {
> >> +		page_frag_free(ctx->sq_sqes);
> >> +		ctx->sq_sqes = NULL;
> >> +	}
> >> +	if (ctx->cq_ring) {
> >> +		page_frag_free(ctx->cq_ring);
> >> +		ctx->cq_ring = NULL;
> >> +	}
> > 
> > Why is this using the page_frag helpers?  Also the callers just free
> > these ctx structure, so there isn't much of a point zeroing them out.
> 
> Why not use the page frag helpers? No point in open-coding it. I can
> kill the zeroing, double call would be a bug anyway.

Because they are at a different level of abstraction, and someone
might change the implementation, and is unlikely to catch the io_uring
mix of interfaces.  If you think this is really useful we should also
export the helpers under a different name and with documentation.
(and add a __get_free_pages version that returns a pointer..)

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 14:57   ` [PATCH 05/18] Add io_uring IO interface Christoph Hellwig
  2019-01-28 16:26     ` Jens Axboe
@ 2019-01-28 18:25     ` Jens Axboe
  2019-01-29  6:30       ` Christoph Hellwig
  2019-01-29  0:47     ` Andy Lutomirski
  2 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2019-01-28 18:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-aio, linux-block, jmoyer, avi, linux-api, linux-man

On 1/28/19 7:57 AM, Christoph Hellwig wrote:
> [please make sure linux-api and linux-man are CCed on new syscalls
> so that we get API experts to review them]
> 
>> io_uring_enter(fd, to_submit, min_complete, flags)
>> 	Initiates IO against the rings mapped to this fd, or waits for
>> 	them to complete, or both. The behavior is controlled by the
>> 	parameters passed in. If 'to_submit' is non-zero, then we'll
>> 	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
>> 	kernel will wait for 'min_complete' events, if they aren't
>> 	already available. It's valid to set IORING_ENTER_GETEVENTS
>> 	and 'min_complete' == 0 at the same time, this allows the
>> 	kernel to return already completed events without waiting
>> 	for them. This is useful only for polling, as for IRQ
>> 	driven IO, the application can just check the CQ ring
>> 	without entering the kernel.
> 
> Especially with poll support now in the series, don't we need a ѕigmask
> argument similar to pselect/ppoll/io_pgetevents now to deal with signal
> blocking during waiting for events?

Is there any way to avoid passing in the sigset_t size? If it's just a
32-bit/64-bit thing, surely the in_compat_syscall() could cover it? Or
are there other cases that need to be catered to?

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 16:34       ` Christoph Hellwig
@ 2019-01-28 19:32         ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2019-01-28 19:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-aio, linux-block, jmoyer, avi, linux-api, linux-man

On 1/28/19 9:34 AM, Christoph Hellwig wrote:
> On Mon, Jan 28, 2019 at 09:26:42AM -0700, Jens Axboe wrote:
>>>> +static void io_free_scq_urings(struct io_ring_ctx *ctx)
>>>> +{
>>>> +	if (ctx->sq_ring) {
>>>> +		page_frag_free(ctx->sq_ring);
>>>> +		ctx->sq_ring = NULL;
>>>> +	}
>>>> +	if (ctx->sq_sqes) {
>>>> +		page_frag_free(ctx->sq_sqes);
>>>> +		ctx->sq_sqes = NULL;
>>>> +	}
>>>> +	if (ctx->cq_ring) {
>>>> +		page_frag_free(ctx->cq_ring);
>>>> +		ctx->cq_ring = NULL;
>>>> +	}
>>>
>>> Why is this using the page_frag helpers?  Also the callers just free
>>> these ctx structure, so there isn't much of a point zeroing them out.
>>
>> Why not use the page frag helpers? No point in open-coding it. I can
>> kill the zeroing, double call would be a bug anyway.
> 
> Because they are at a different level of abstraction, and someone
> might change the implementation, and is unlikely to catch the io_uring
> mix of interfaces.  If you think this is really useful we should also
> export the helpers under a different name and with documentation.
> (and add a __get_free_pages version that returns a pointer..)

Fair enough, I'll avoid using the page_frag_free().

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 14:57   ` [PATCH 05/18] Add io_uring IO interface Christoph Hellwig
  2019-01-28 16:26     ` Jens Axboe
  2019-01-28 18:25     ` Jens Axboe
@ 2019-01-29  0:47     ` Andy Lutomirski
  2019-01-29  1:20       ` Jens Axboe
  2 siblings, 1 reply; 49+ messages in thread
From: Andy Lutomirski @ 2019-01-29  0:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Linux FS Devel, linux-aio, linux-block, Jeff Moyer,
	Avi Kivity, Linux API, linux-man

On Mon, Jan 28, 2019 at 6:57 AM Christoph Hellwig <hch@lst.de> wrote:
>
> [please make sure linux-api and linux-man are CCed on new syscalls
> so that we get API experts to review them]

> > +static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
> > +                        const struct io_uring_sqe *sqe,
> > +                        struct iovec **iovec, struct iov_iter *iter)
> > +{
> > +     void __user *buf = u64_to_user_ptr(sqe->addr);
> > +
> > +#ifdef CONFIG_COMPAT
> > +     if (ctx->compat)
> > +             return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
> > +                                             iovec, iter);
> > +#endif
>
> I think we can just check in_compat_syscall() here, which means we
> can kill the ->compat member, and the separate compat version of the
> setup syscall.
>

Since this whole API is new, I don't suppose you could introduce a
struct iovec64 or similar and just make the ABI be identical for
64-bit and 32-bit code?

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  0:47     ` Andy Lutomirski
@ 2019-01-29  1:20       ` Jens Axboe
  2019-01-29  6:45         ` Christoph Hellwig
  2019-01-31  5:11         ` Andy Lutomirski
  0 siblings, 2 replies; 49+ messages in thread
From: Jens Axboe @ 2019-01-29  1:20 UTC (permalink / raw)
  To: Andy Lutomirski, Christoph Hellwig
  Cc: Linux FS Devel, linux-aio, linux-block, Jeff Moyer, Avi Kivity,
	Linux API, linux-man

On 1/28/19 5:47 PM, Andy Lutomirski wrote:
> On Mon, Jan 28, 2019 at 6:57 AM Christoph Hellwig <hch@lst.de> wrote:
>>
>> [please make sure linux-api and linux-man are CCed on new syscalls
>> so that we get API experts to review them]
> 
>>> +static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
>>> +                        const struct io_uring_sqe *sqe,
>>> +                        struct iovec **iovec, struct iov_iter *iter)
>>> +{
>>> +     void __user *buf = u64_to_user_ptr(sqe->addr);
>>> +
>>> +#ifdef CONFIG_COMPAT
>>> +     if (ctx->compat)
>>> +             return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
>>> +                                             iovec, iter);
>>> +#endif
>>
>> I think we can just check in_compat_syscall() here, which means we
>> can kill the ->compat member, and the separate compat version of the
>> setup syscall.
>>
> 
> Since this whole API is new, I don't suppose you could introduce a
> struct iovec64 or similar and just make the ABI be identical for
> 64-bit and 32-bit code?

Sure, that would be straight forward. Is there a strong reason to do
so outside of "that would be nice"? It's not like it's a huge amount
of code.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 18:25     ` Jens Axboe
@ 2019-01-29  6:30       ` Christoph Hellwig
  2019-01-29 11:58         ` Arnd Bergmann
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2019-01-29  6:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, linux-block, jmoyer,
	avi, linux-api, linux-man

On Mon, Jan 28, 2019 at 11:25:12AM -0700, Jens Axboe wrote:
> > Especially with poll support now in the series, don't we need a ѕigmask
> > argument similar to pselect/ppoll/io_pgetevents now to deal with signal
> > blocking during waiting for events?
> 
> Is there any way to avoid passing in the sigset_t size? If it's just a
> 32-bit/64-bit thing, surely the in_compat_syscall() could cover it? Or
> are there other cases that need to be catered to?

As far as I can tell we never look at it, never looked at it and don't
have any plans to look at it anytime soon.  But when I tried to omit
it for io_pgetevents I got stong pushback and thus had to add the
crazy double indirection calling convention.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  1:20       ` Jens Axboe
@ 2019-01-29  6:45         ` Christoph Hellwig
  2019-01-29 12:05           ` Arnd Bergmann
  2019-01-31  5:11         ` Andy Lutomirski
  1 sibling, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2019-01-29  6:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andy Lutomirski, Christoph Hellwig, Linux FS Devel, linux-aio,
	linux-block, Jeff Moyer, Avi Kivity, Linux API, linux-man

On Mon, Jan 28, 2019 at 06:20:08PM -0700, Jens Axboe wrote:
> Sure, that would be straight forward. Is there a strong reason to do
> so outside of "that would be nice"? It's not like it's a huge amount
> of code.

And it would be really painful for userspace.  Because now you
can't pass struct iovec through from a higher level, but will instead
of to copy the iovec to a different type in the submission path.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  6:30       ` Christoph Hellwig
@ 2019-01-29 11:58         ` Arnd Bergmann
  2019-01-29 15:20           ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Arnd Bergmann @ 2019-01-29 11:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Linux FS-devel Mailing List, linux-aio, linux-block,
	Jeff Moyer, Avi Kivity, Linux API, linux-man, Deepa Dinamani

On Tue, Jan 29, 2019 at 7:30 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Jan 28, 2019 at 11:25:12AM -0700, Jens Axboe wrote:
> > > Especially with poll support now in the series, don't we need a ѕigmask
> > > argument similar to pselect/ppoll/io_pgetevents now to deal with signal
> > > blocking during waiting for events?
> >
> > Is there any way to avoid passing in the sigset_t size? If it's just a
> > 32-bit/64-bit thing, surely the in_compat_syscall() could cover it? Or
> > are there other cases that need to be catered to?
>
> As far as I can tell we never look at it, never looked at it and don't
> have any plans to look at it anytime soon.  But when I tried to omit
> it for io_pgetevents I got stong pushback and thus had to add the
> crazy double indirection calling convention.

Deepa has recently reworked the handling for the sigset_t handling
to be more consistent. As I understand it, we only ever check the
size argument to ensure that user and kernel space agree on
the size, as it there had been some historic differences.

If you pass a signal mask to a syscall, you should now just use the
set_user_sigmask()/set_compat_user_sigmask()/restore_user_sigmask()
helpers. The compat version is required for the incompatible bit order
between 32-bit and 64-bit big-endian architectures (on little-endian, compat
and native signal masks are compatible), and to deal with the one
architecture that has an _NSIG defined to something other than 64:
MIPS uses 128 because of a historic accident.

I think Deepa originally suggested combining set_user_sigmask()
and set_compat_user_sigmask(), using an in_compat_syscall()
check. This would let us simplify a number of compat syscalls
(ppoll, ppoll_time32, pselect6, pselect6_time32, epoll_pwait()
io_pgetevents_time64, io_pgetevents_time32). I advised against
changing it at the time for consistency with other compat syscalls,
but it's something we can still do now.

There was a recent discussion about the size of sigset_t in glibc,
which is 1024 bits there instead of 64 bits in the kernel, the idea
being that the kernel might eventually grow more signals at
some point in the future, as we did when we extended from 32
to 64 a long time ago with the addition of the rt_sig* signals;
see the thread around
https://www.spinics.net/lists/linux-snps-arc/msg04860.html

    Arnd

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  6:45         ` Christoph Hellwig
@ 2019-01-29 12:05           ` Arnd Bergmann
  0 siblings, 0 replies; 49+ messages in thread
From: Arnd Bergmann @ 2019-01-29 12:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Andy Lutomirski, Linux FS Devel, linux-aio,
	linux-block, Jeff Moyer, Avi Kivity, Linux API, linux-man

On Tue, Jan 29, 2019 at 7:45 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Jan 28, 2019 at 06:20:08PM -0700, Jens Axboe wrote:
> > Sure, that would be straight forward. Is there a strong reason to do
> > so outside of "that would be nice"? It's not like it's a huge amount
> > of code.
>
> And it would be really painful for userspace.  Because now you
> can't pass struct iovec through from a higher level, but will instead
> of to copy the iovec to a different type in the submission path.

Agreed. However, if we decide to add the in_compat_syscall() check
to set_user_sigmask()/set_compat_user_sigmask(), we probably want
to do the same thing in import_iovec()/compat_import_iovec() and
rw_copy_check_uvector()/compat_rw_copy_check_uvector().

      Arnd

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29 11:58         ` Arnd Bergmann
@ 2019-01-29 15:20           ` Jens Axboe
  2019-01-29 16:18             ` Arnd Bergmann
  0 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2019-01-29 15:20 UTC (permalink / raw)
  To: Arnd Bergmann, Christoph Hellwig
  Cc: Linux FS-devel Mailing List, linux-aio, linux-block, Jeff Moyer,
	Avi Kivity, Linux API, linux-man, Deepa Dinamani

On 1/29/19 4:58 AM, Arnd Bergmann wrote:
> On Tue, Jan 29, 2019 at 7:30 AM Christoph Hellwig <hch@lst.de> wrote:
>>
>> On Mon, Jan 28, 2019 at 11:25:12AM -0700, Jens Axboe wrote:
>>>> Especially with poll support now in the series, don't we need a ѕigmask
>>>> argument similar to pselect/ppoll/io_pgetevents now to deal with signal
>>>> blocking during waiting for events?
>>>
>>> Is there any way to avoid passing in the sigset_t size? If it's just a
>>> 32-bit/64-bit thing, surely the in_compat_syscall() could cover it? Or
>>> are there other cases that need to be catered to?
>>
>> As far as I can tell we never look at it, never looked at it and don't
>> have any plans to look at it anytime soon.  But when I tried to omit
>> it for io_pgetevents I got stong pushback and thus had to add the
>> crazy double indirection calling convention.
> 
> Deepa has recently reworked the handling for the sigset_t handling
> to be more consistent. As I understand it, we only ever check the
> size argument to ensure that user and kernel space agree on
> the size, as it there had been some historic differences.
> 
> If you pass a signal mask to a syscall, you should now just use the
> set_user_sigmask()/set_compat_user_sigmask()/restore_user_sigmask()
> helpers. The compat version is required for the incompatible bit order
> between 32-bit and 64-bit big-endian architectures (on little-endian, compat
> and native signal masks are compatible), and to deal with the one
> architecture that has an _NSIG defined to something other than 64:
> MIPS uses 128 because of a historic accident.
> 
> I think Deepa originally suggested combining set_user_sigmask()
> and set_compat_user_sigmask(), using an in_compat_syscall()
> check. This would let us simplify a number of compat syscalls
> (ppoll, ppoll_time32, pselect6, pselect6_time32, epoll_pwait()
> io_pgetevents_time64, io_pgetevents_time32). I advised against
> changing it at the time for consistency with other compat syscalls,
> but it's something we can still do now.

That's good info. I am currently using set_user_sigmask() for it.
I'd really like to avoid having to pass in a sigset_t size for the
system call, however. What's the best way of achieving that? Can I get
away with doing something like this:

	if (in_compat_syscall()) {
		const compat_sigset_t __user *compat_sig;

		compat_sig = (const compat_sigset_t __user *) sig;
		ret = set_compat_user_sigmask(compat_sig, &ksigmask,
						&sigsaved, _NSIG_WORDS);
	} else {
		ret = set_user_sigmask(sig, &ksigmask, &sigsaved,
						_NSIG_WORDS);
	}

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29 15:20           ` Jens Axboe
@ 2019-01-29 16:18             ` Arnd Bergmann
  2019-01-29 16:19               ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Arnd Bergmann @ 2019-01-29 16:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Linux FS-devel Mailing List, linux-aio,
	linux-block, Jeff Moyer, Avi Kivity, Linux API, linux-man,
	Deepa Dinamani

On Tue, Jan 29, 2019 at 4:20 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/29/19 4:58 AM, Arnd Bergmann wrote:
> > On Tue, Jan 29, 2019 at 7:30 AM Christoph Hellwig <hch@lst.de> wrote:
> >>> On Mon, Jan 28, 2019 at 11:25:12AM -0700, Jens Axboe wrote:
> >>>> Especially with poll support now in the series, don't we need a ѕigmask
> >>>> argument similar to pselect/ppoll/io_pgetevents now to deal with signal
> >>>> blocking during waiting for events?
> >>>
> >>> Is there any way to avoid passing in the sigset_t size? If it's just a
> >>> 32-bit/64-bit thing, surely the in_compat_syscall() could cover it? Or
> >>> are there other cases that need to be catered to?
> >>
> >> As far as I can tell we never look at it, never looked at it and don't
> >> have any plans to look at it anytime soon.  But when I tried to omit
> >> it for io_pgetevents I got stong pushback and thus had to add the
> >> crazy double indirection calling convention.
>
> That's good info. I am currently using set_user_sigmask() for it.
> I'd really like to avoid having to pass in a sigset_t size for the
> system call, however.

I really wouldn't do it, given that all other signal handling interfaces
are prepared for longer signal masks. You /could/ probably extend
it later with a flags bit to signify a longer mask instead of using
the entire register to hold the bit length, it just seems really
inconsistent with all other system calls.

      Arnd




> What's the best way of achieving that? Can I get
> away with doing something like this:
>
>         if (in_compat_syscall()) {
>                 const compat_sigset_t __user *compat_sig;
>
>                 compat_sig = (const compat_sigset_t __user *) sig;
>                 ret = set_compat_user_sigmask(compat_sig, &ksigmask,
>                                                 &sigsaved, _NSIG_WORDS);
>         } else {
>                 ret = set_user_sigmask(sig, &ksigmask, &sigsaved,
>                                                 _NSIG_WORDS);
>         }

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29 16:18             ` Arnd Bergmann
@ 2019-01-29 16:19               ` Jens Axboe
  2019-01-29 16:26                 ` Arnd Bergmann
  0 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2019-01-29 16:19 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Christoph Hellwig, Linux FS-devel Mailing List, linux-aio,
	linux-block, Jeff Moyer, Avi Kivity, Linux API, linux-man,
	Deepa Dinamani

On 1/29/19 9:18 AM, Arnd Bergmann wrote:
> On Tue, Jan 29, 2019 at 4:20 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/29/19 4:58 AM, Arnd Bergmann wrote:
>>> On Tue, Jan 29, 2019 at 7:30 AM Christoph Hellwig <hch@lst.de> wrote:
>>>>> On Mon, Jan 28, 2019 at 11:25:12AM -0700, Jens Axboe wrote:
>>>>>> Especially with poll support now in the series, don't we need a ѕigmask
>>>>>> argument similar to pselect/ppoll/io_pgetevents now to deal with signal
>>>>>> blocking during waiting for events?
>>>>>
>>>>> Is there any way to avoid passing in the sigset_t size? If it's just a
>>>>> 32-bit/64-bit thing, surely the in_compat_syscall() could cover it? Or
>>>>> are there other cases that need to be catered to?
>>>>
>>>> As far as I can tell we never look at it, never looked at it and don't
>>>> have any plans to look at it anytime soon.  But when I tried to omit
>>>> it for io_pgetevents I got stong pushback and thus had to add the
>>>> crazy double indirection calling convention.
>>
>> That's good info. I am currently using set_user_sigmask() for it.
>> I'd really like to avoid having to pass in a sigset_t size for the
>> system call, however.
> 
> I really wouldn't do it, given that all other signal handling interfaces
> are prepared for longer signal masks. You /could/ probably extend
> it later with a flags bit to signify a longer mask instead of using
> the entire register to hold the bit length, it just seems really
> inconsistent with all other system calls.

Damnit! OK, I'll keep what I currently have then.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29 16:19               ` Jens Axboe
@ 2019-01-29 16:26                 ` Arnd Bergmann
  2019-01-29 16:28                   ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Arnd Bergmann @ 2019-01-29 16:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Linux FS-devel Mailing List, linux-aio,
	linux-block, Jeff Moyer, Avi Kivity, Linux API, linux-man,
	Deepa Dinamani

On Tue, Jan 29, 2019 at 5:19 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/29/19 9:18 AM, Arnd Bergmann wrote:
> > On Tue, Jan 29, 2019 at 4:20 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> That's good info. I am currently using set_user_sigmask() for it.
> >> I'd really like to avoid having to pass in a sigset_t size for the
> >> system call, however.
> >
> > I really wouldn't do it, given that all other signal handling interfaces
> > are prepared for longer signal masks. You /could/ probably extend
> > it later with a flags bit to signify a longer mask instead of using
> > the entire register to hold the bit length, it just seems really
> > inconsistent with all other system calls.
>
> Damnit! OK, I'll keep what I currently have then.

As long as you stay within the 6-argument syscall contraints,
the cost of passing the length is basically free, right?

Is there anything else you are worried about?

     Arnd

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29 16:26                 ` Arnd Bergmann
@ 2019-01-29 16:28                   ` Jens Axboe
  2019-01-29 16:46                     ` Arnd Bergmann
  0 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2019-01-29 16:28 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Christoph Hellwig, Linux FS-devel Mailing List, linux-aio,
	linux-block, Jeff Moyer, Avi Kivity, Linux API, linux-man,
	Deepa Dinamani

On 1/29/19 9:26 AM, Arnd Bergmann wrote:
> On Tue, Jan 29, 2019 at 5:19 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/29/19 9:18 AM, Arnd Bergmann wrote:
>>> On Tue, Jan 29, 2019 at 4:20 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> That's good info. I am currently using set_user_sigmask() for it.
>>>> I'd really like to avoid having to pass in a sigset_t size for the
>>>> system call, however.
>>>
>>> I really wouldn't do it, given that all other signal handling interfaces
>>> are prepared for longer signal masks. You /could/ probably extend
>>> it later with a flags bit to signify a longer mask instead of using
>>> the entire register to hold the bit length, it just seems really
>>> inconsistent with all other system calls.
>>
>> Damnit! OK, I'll keep what I currently have then.
> 
> As long as you stay within the 6-argument syscall contraints,
> the cost of passing the length is basically free, right?
> 
> Is there anything else you are worried about?

My main worry is not the extra argument, more that we're at capacity
for the system call. If we wanted to add a timeout parameter, then we'd
need to bundle them up, which sucks.

But I think we're fine, I'll go with what I have.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29 16:28                   ` Jens Axboe
@ 2019-01-29 16:46                     ` Arnd Bergmann
  0 siblings, 0 replies; 49+ messages in thread
From: Arnd Bergmann @ 2019-01-29 16:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Linux FS-devel Mailing List, linux-aio,
	linux-block, Jeff Moyer, Avi Kivity, Linux API, linux-man,
	Deepa Dinamani

On Tue, Jan 29, 2019 at 5:28 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 1/29/19 9:26 AM, Arnd Bergmann wrote:
> > On Tue, Jan 29, 2019 at 5:19 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 1/29/19 9:18 AM, Arnd Bergmann wrote:
> >>> On Tue, Jan 29, 2019 at 4:20 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> That's good info. I am currently using set_user_sigmask() for it.
> >>>> I'd really like to avoid having to pass in a sigset_t size for the
> >>>> system call, however.
> >>>
> >>> I really wouldn't do it, given that all other signal handling interfaces
> >>> are prepared for longer signal masks. You /could/ probably extend
> >>> it later with a flags bit to signify a longer mask instead of using
> >>> the entire register to hold the bit length, it just seems really
> >>> inconsistent with all other system calls.
> >>
> >> Damnit! OK, I'll keep what I currently have then.
> >
> > As long as you stay within the 6-argument syscall contraints,
> > the cost of passing the length is basically free, right?
> >
> > Is there anything else you are worried about?
>
> My main worry is not the extra argument, more that we're at capacity
> for the system call. If we wanted to add a timeout parameter, then we'd
> need to bundle them up, which sucks.

Ok, got it. If it turns out that we do need the timeout argument after all,
you could avoid one indirection level by grouping the sigset_t with
min_complete, timeout_ns and/or sigset_size, as long as the
sigset_t comes last (similar to struct sigaction).

Another (also awkward) trick might be to combine min_complete and
sigset_size into a 32-bit integer argument, as each of them can
fit into 16 bits.

      Arnd

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  1:20       ` Jens Axboe
  2019-01-29  6:45         ` Christoph Hellwig
@ 2019-01-31  5:11         ` Andy Lutomirski
  2019-01-31 16:37           ` Jens Axboe
  1 sibling, 1 reply; 49+ messages in thread
From: Andy Lutomirski @ 2019-01-31  5:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andy Lutomirski, Christoph Hellwig, Linux FS Devel, linux-aio,
	linux-block, Jeff Moyer, Avi Kivity, Linux API, linux-man

>>> On Jan 28, 2019, at 5:20 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 1/28/19 5:47 PM, Andy Lutomirski wrote:
>>> On Mon, Jan 28, 2019 at 6:57 AM Christoph Hellwig <hch@lst.de> wrote:
>>>
>>> [please make sure linux-api and linux-man are CCed on new syscalls
>>> so that we get API experts to review them]
>>
>>>> +static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
>>>> +                        const struct io_uring_sqe *sqe,
>>>> +                        struct iovec **iovec, struct iov_iter *iter)
>>>> +{
>>>> +     void __user *buf = u64_to_user_ptr(sqe->addr);
>>>> +
>>>> +#ifdef CONFIG_COMPAT
>>>> +     if (ctx->compat)
>>>> +             return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
>>>> +                                             iovec, iter);
>>>> +#endif
>>>
>>> I think we can just check in_compat_syscall() here, which means we
>>> can kill the ->compat member, and the separate compat version of the
>>> setup syscall.
>>
>> Since this whole API is new, I don't suppose you could introduce a
>> struct iovec64 or similar and just make the ABI be identical for
>> 64-bit and 32-bit code?
>
> Sure, that would be straight forward. Is there a strong reason to do
> so outside of "that would be nice"? It's not like it's a huge amount
> of code.

Here are some minor-ish benefits:

 - It avoids having a code path that is only used with 32 bit code on
64 bit kernels and is therefore rarely tested.  (In this particular
case, the code path doesn't diverge much, but for most compat
syscalls, it's almost an entirely separate implementation of the main
syscall code.)

 - It makes life easier for tools like strace.

 - It minimizes the chance of making a giant mess on x32, which isn't
really native or compat.

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-31  5:11         ` Andy Lutomirski
@ 2019-01-31 16:37           ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2019-01-31 16:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Linux FS Devel, linux-aio, linux-block,
	Jeff Moyer, Avi Kivity, Linux API, linux-man

On 1/30/19 10:11 PM, Andy Lutomirski wrote:
>>>> On Jan 28, 2019, at 5:20 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 1/28/19 5:47 PM, Andy Lutomirski wrote:
>>>> On Mon, Jan 28, 2019 at 6:57 AM Christoph Hellwig <hch@lst.de> wrote:
>>>>
>>>> [please make sure linux-api and linux-man are CCed on new syscalls
>>>> so that we get API experts to review them]
>>>
>>>>> +static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
>>>>> +                        const struct io_uring_sqe *sqe,
>>>>> +                        struct iovec **iovec, struct iov_iter *iter)
>>>>> +{
>>>>> +     void __user *buf = u64_to_user_ptr(sqe->addr);
>>>>> +
>>>>> +#ifdef CONFIG_COMPAT
>>>>> +     if (ctx->compat)
>>>>> +             return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
>>>>> +                                             iovec, iter);
>>>>> +#endif
>>>>
>>>> I think we can just check in_compat_syscall() here, which means we
>>>> can kill the ->compat member, and the separate compat version of the
>>>> setup syscall.
>>>
>>> Since this whole API is new, I don't suppose you could introduce a
>>> struct iovec64 or similar and just make the ABI be identical for
>>> 64-bit and 32-bit code?
>>
>> Sure, that would be straight forward. Is there a strong reason to do
>> so outside of "that would be nice"? It's not like it's a huge amount
>> of code.
> 
> Here are some minor-ish benefits:
> 
>  - It avoids having a code path that is only used with 32 bit code on
> 64 bit kernels and is therefore rarely tested.  (In this particular
> case, the code path doesn't diverge much, but for most compat
> syscalls, it's almost an entirely separate implementation of the main
> syscall code.)
> 
>  - It makes life easier for tools like strace.
> 
>  - It minimizes the chance of making a giant mess on x32, which isn't
> really native or compat.

Not really anything major here, at least not to the extent that
suffering the pain of having a different iovec for this is warranted.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-02-01 17:23             ` Jann Horn
@ 2019-02-01 18:05               ` Al Viro
  0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-02-01 18:05 UTC (permalink / raw)
  To: Jann Horn
  Cc: Matt Mullins, axboe, linux-fsdevel, linux-aio, linux-block,
	jmoyer, linux-api, hch, linux-man, avi

On Fri, Feb 01, 2019 at 06:23:27PM +0100, Jann Horn wrote:

> > Oh, yuck. Uuuh... can we make "struct files_struct" doubly-refcounted,
> > like "struct mm_struct"? One reference type to keep the contents
> > intact (the reference type you normally use, and the type used by
> > uring when the thread is running), and one reference type to just keep
> > the struct itself existing, but without preserving its contents
> > (reference held consistently by the uring thread)?
> 
> Something like this (completely untested); and then instead of the
> current get_files_struct(), you'd do get_files_struct_weak(), and
> while the thread is running, it protects the files_struct from dying
> with tryget_weak_files_struct() / put_files_struct().
> 
> Al, do you have opinions on this?

Yes, but they are not fit for polite company.  IMO the entire approach
is FUBAR; I'll post more detailed review, but what I'd seen so far is
a veto fodder.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-02-01 17:04           ` Jann Horn
@ 2019-02-01 17:23             ` Jann Horn
  2019-02-01 18:05               ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: Jann Horn @ 2019-02-01 17:23 UTC (permalink / raw)
  To: Matt Mullins, viro, axboe, linux-fsdevel
  Cc: linux-aio, linux-block, jmoyer, linux-api, hch, linux-man, avi

On Fri, Feb 1, 2019 at 6:04 PM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Feb 1, 2019 at 5:57 PM Matt Mullins <mmullins@fb.com> wrote:
> > On Tue, 2019-01-29 at 00:59 +0100, Jann Horn wrote:
> > > On Tue, Jan 29, 2019 at 12:47 AM Jens Axboe <axboe@kernel.dk> wrote:
> > > > On 1/28/19 3:32 PM, Jann Horn wrote:
> > > > > On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> > > > > > The submission queue (SQ) and completion queue (CQ) rings are shared
> > > > > > between the application and the kernel. This eliminates the need to
> > > > > > copy data back and forth to submit and complete IO.
> > > > > >
> > > > > > IO submissions use the io_uring_sqe data structure, and completions
> > > > > > are generated in the form of io_uring_sqe data structures. The SQ
> > > > > > ring is an index into the io_uring_sqe array, which makes it possible
> > > > > > to submit a batch of IOs without them being contiguous in the ring.
> > > > > > The CQ ring is always contiguous, as completion events are inherently
> > > > > > unordered, and hence any io_uring_cqe entry can point back to an
> > > > > > arbitrary submission.
> > > > > >
> > > > > > Two new system calls are added for this:
> > > > > >
> > > > > > io_uring_setup(entries, params)
> > > > > >         Sets up a context for doing async IO. On success, returns a file
> > > > > >         descriptor that the application can mmap to gain access to the
> > > > > >         SQ ring, CQ ring, and io_uring_sqes.
> > > > > >
> > > > > > io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
> > > > > >         Initiates IO against the rings mapped to this fd, or waits for
> > > > > >         them to complete, or both. The behavior is controlled by the
> > > > > >         parameters passed in. If 'to_submit' is non-zero, then we'll
> > > > > >         try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
> > > > > >         kernel will wait for 'min_complete' events, if they aren't
> > > > > >         already available. It's valid to set IORING_ENTER_GETEVENTS
> > > > > >         and 'min_complete' == 0 at the same time, this allows the
> > > > > >         kernel to return already completed events without waiting
> > > > > >         for them. This is useful only for polling, as for IRQ
> > > > > >         driven IO, the application can just check the CQ ring
> > > > > >         without entering the kernel.
> > > > > >
> > > > > > With this setup, it's possible to do async IO with a single system
> > > > > > call. Future developments will enable polled IO with this interface,
> > > > > > and polled submission as well. The latter will enable an application
> > > > > > to do IO without doing ANY system calls at all.
> > > > > >
> > > > > > For IRQ driven IO, an application only needs to enter the kernel for
> > > > > > completions if it wants to wait for them to occur.
> > > > > >
> > > > > > Each io_uring is backed by a workqueue, to support buffered async IO
> > > > > > as well. We will only punt to an async context if the command would
> > > > > > need to wait for IO on the device side. Any data that can be accessed
> > > > > > directly in the page cache is done inline. This avoids the slowness
> > > > > > issue of usual threadpools, since cached data is accessed as quickly
> > > > > > as a sync interface.
> > > > > >
> > > > > > Sample application: https://urldefense.proofpoint.com/v2/url?u=http-3A__git.kernel.dk_cgit_fio_plain_t_io-5Furing.c&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=pqM-eO4A2hNFhIFiX-7eGg&m=MGr14pOzNbC7Z-8_dV4GMiH3AbkkH0RSQoQ894Tu0yc&s=mgbcubzOMiCpFpnwW-HA3ey0YDYPkgMIZ7Bmy4w6Chc&e=
> > > > >
> > > > > [...]
> > > > > > +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> > > > > > +                     bool force_nonblock)
> > > > > > +{
> > > > > > +       struct kiocb *kiocb = &req->rw;
> > > > > > +       int ret;
> > > > > > +
> > > > > > +       kiocb->ki_filp = fget(sqe->fd);
> > > > > > +       if (unlikely(!kiocb->ki_filp))
> > > > > > +               return -EBADF;
> > > > > > +       kiocb->ki_pos = sqe->off;
> > > > > > +       kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
> > > > > > +       kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
> > > > > > +       if (sqe->ioprio) {
> > > > > > +               ret = ioprio_check_cap(sqe->ioprio);
> > > > > > +               if (ret)
> > > > > > +                       goto out_fput;
> > > > > > +
> > > > > > +               kiocb->ki_ioprio = sqe->ioprio;
> > > > > > +       } else
> > > > > > +               kiocb->ki_ioprio = get_current_ioprio();
> > > > > > +
> > > > > > +       ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
> > > > > > +       if (unlikely(ret))
> > > > > > +               goto out_fput;
> > > > > > +       if (force_nonblock) {
> > > > > > +               kiocb->ki_flags |= IOCB_NOWAIT;
> > > > > > +               req->flags |= REQ_F_FORCE_NONBLOCK;
> > > > > > +       }
> > > > > > +       if (kiocb->ki_flags & IOCB_HIPRI) {
> > > > > > +               ret = -EINVAL;
> > > > > > +               goto out_fput;
> > > > > > +       }
> > > > > > +
> > > > > > +       kiocb->ki_complete = io_complete_rw;
> > > > > > +       return 0;
> > > > > > +out_fput:
> > > > > > +       fput(kiocb->ki_filp);
> > > > > > +       return ret;
> > > > > > +}
> > > > >
> > > > > [...]
> > > > > > +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> > > > > > +                      bool force_nonblock)
> > > > > > +{
> > > > > > +       struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
> > > > > > +       struct kiocb *kiocb = &req->rw;
> > > > > > +       struct iov_iter iter;
> > > > > > +       struct file *file;
> > > > > > +       ssize_t ret;
> > > > > > +
> > > > > > +       ret = io_prep_rw(req, sqe, force_nonblock);
> > > > > > +       if (ret)
> > > > > > +               return ret;
> > > > > > +       file = kiocb->ki_filp;
> > > > > > +
> > > > > > +       ret = -EBADF;
> > > > > > +       if (unlikely(!(file->f_mode & FMODE_READ)))
> > > > > > +               goto out_fput;
> > > > > > +       ret = -EINVAL;
> > > > > > +       if (unlikely(!file->f_op->read_iter))
> > > > > > +               goto out_fput;
> > > > > > +
> > > > > > +       ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
> > > > > > +       if (ret)
> > > > > > +               goto out_fput;
> > > > > > +
> > > > > > +       ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
> > > > > > +       if (!ret) {
> > > > > > +               ssize_t ret2;
> > > > > > +
> > > > > > +               /* Catch -EAGAIN return for forced non-blocking submission */
> > > > > > +               ret2 = call_read_iter(file, kiocb, &iter);
> > > > > > +               if (!force_nonblock || ret2 != -EAGAIN)
> > > > > > +                       io_rw_done(kiocb, ret2);
> > > > > > +               else
> > > > > > +                       ret = -EAGAIN;
> > > > > > +       }
> > > > > > +       kfree(iovec);
> > > > > > +out_fput:
> > > > > > +       if (unlikely(ret))
> > > > > > +               fput(file);
> > > > > > +       return ret;
> > > > > > +}
> > > > >
> > > > > [...]
> > > > > > +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> > > > > > +                          struct sqe_submit *s, bool force_nonblock)
> > > > > > +{
> > > > > > +       const struct io_uring_sqe *sqe = s->sqe;
> > > > > > +       ssize_t ret;
> > > > > > +
> > > > > > +       if (unlikely(s->index >= ctx->sq_entries))
> > > > > > +               return -EINVAL;
> > > > > > +       req->user_data = sqe->user_data;
> > > > > > +
> > > > > > +       ret = -EINVAL;
> > > > > > +       switch (sqe->opcode) {
> > > > > > +       case IORING_OP_NOP:
> > > > > > +               ret = io_nop(req, sqe);
> > > > > > +               break;
> > > > > > +       case IORING_OP_READV:
> > > > > > +               ret = io_read(req, sqe, force_nonblock);
> > > > > > +               break;
> > > > > > +       case IORING_OP_WRITEV:
> > > > > > +               ret = io_write(req, sqe, force_nonblock);
> > > > > > +               break;
> > > > > > +       default:
> > > > > > +               ret = -EINVAL;
> > > > > > +               break;
> > > > > > +       }
> > > > > > +
> > > > > > +       return ret;
> > > > > > +}
> > > > > > +
> > > > > > +static void io_sq_wq_submit_work(struct work_struct *work)
> > > > > > +{
> > > > > > +       struct io_kiocb *req = container_of(work, struct io_kiocb, work);
> > > > > > +       struct sqe_submit *s = &req->submit;
> > > > > > +       u64 user_data = s->sqe->user_data;
> > > > > > +       struct io_ring_ctx *ctx = req->ctx;
> > > > > > +       mm_segment_t old_fs = get_fs();
> > > > > > +       struct files_struct *old_files;
> > > > > > +       int ret;
> > > > > > +
> > > > > > +        /* Ensure we clear previously set forced non-block flag */
> > > > > > +       req->flags &= ~REQ_F_FORCE_NONBLOCK;
> > > > > > +
> > > > > > +       old_files = current->files;
> > > > > > +       current->files = ctx->sqo_files;
> > > > >
> > > > > I think you're not supposed to twiddle with current->files without
> > > > > holding task_lock(current).
> > > >
> > > > 'current' is the work queue item in this case, do we need to protect
> > > > against anything else? I can add the locking around the assignments
> > > > (both places).
> > >
> > > Stuff like proc_fd_link() uses get_files_struct(), which grabs a
> > > reference to your current files_struct protected only by task_lock();
> > > and it doesn't use anything like READ_ONCE(), so even if the object
> > > lifetime is not a problem, get_files_struct() could potentially crash
> > > due to a double-read (reading task->files twice and assuming that the
> > > result will be the same). As far as I can tell, this procfs code also
> > > works on kernel threads.
> > >
> > > > > > +       if (!mmget_not_zero(ctx->sqo_mm)) {
> > > > > > +               ret = -EFAULT;
> > > > > > +               goto err;
> > > > > > +       }
> > > > > > +
> > > > > > +       use_mm(ctx->sqo_mm);
> > > > > > +       set_fs(USER_DS);
> > > > > > +
> > > > > > +       ret = __io_submit_sqe(ctx, req, s, false);
> > > > > > +
> > > > > > +       set_fs(old_fs);
> > > > > > +       unuse_mm(ctx->sqo_mm);
> > > > > > +       mmput(ctx->sqo_mm);
> > > > > > +err:
> > > > > > +       if (ret) {
> > > > > > +               io_cqring_add_event(ctx, user_data, ret, 0);
> > > > > > +               io_free_req(req);
> > > > > > +       }
> > > > > > +       current->files = old_files;
> > > > > > +}
> > > > >
> > > > > [...]
> > > > > > +static int io_sq_offload_start(struct io_ring_ctx *ctx)
> > > > > > +{
> > > > > > +       int ret;
> > > > > > +
> > > > > > +       ctx->sqo_mm = current->mm;
> > > > >
> > > > > What keeps this thing alive?
> > > >
> > > > I think we're deadling with the same thing as the files below, I'll
> > > > defer to that.
> > > >
> > > > > > +       /*
> > > > > > +        * This is safe since 'current' has the fd installed, and if that gets
> > > > > > +        * closed on exit, then fops->release() is invoked which waits for the
> > > > > > +        * async contexts to flush and exit before exiting.
> > > > > > +        */
> > > > > > +       ret = -EBADF;
> > > > > > +       ctx->sqo_files = current->files;
> > > > > > +       if (!ctx->sqo_files)
> > > > > > +               goto err;
> > > > >
> > > > > That's gnarly. Adding Al Viro to the thread.
> > > > >
> > > > > I think you misunderstand the semantics of f_op->release. The ->flush
> > > > > handler is invoked whenever a file descriptor is closed through
> > > > > filp_close() (via deletion of the files_struct, sys_close(),
> > > > > sys_dup2(), ...), so if you had used that one, _maybe_ this would
> > > > > work. But the ->release handler only runs when the _last_ reference to
> > > > > a struct file has been dropped - so you can, for example, fork() a
> > > > > child, then exit() in the parent, and the ->release handler isn't
> > > > > invoked. So I don't see how this can work.
> > > >
> > > > The anonfd is CLOEXEC. The idea is exactly that it only runs when the
> > > > last reference to the file has been dropped. Not sure why you think I
> > > > need ->flush() here?
> > >
> > > Can't I just use fcntl(fd, F_SETFD, fd, 0) to clear the CLOEXEC flag?
> > > Or send the fd via SCM_RIGHTS?
> > >
> > > > > But even if you had abused ->flush for this instead: close_files()
> > > > > currently has a comment in it that claims that "this is the last
> > > > > reference to the files structure"; this change would make that claim
> > > > > untrue.
> > > >
> > > > Let me see if I can explain my intent better than that comment... We
> > > > know the parent who set up the io_uring instance will be around for as
> > > > long as io_uring instance persists.
> > >
> > > That's the part that I think is wrong: As far as I can tell, the
> > > parent can go away and you won't notice.
> > >
> > > Also, note that "the parent" is different things for ->files and ->mm.
> > > You can have a multithreaded process whose threads don't have the same
> > > ->files, or multiple process that share ->files without sharing ->mm,
> > > ...
> >
> > This had actually been get_files_struct() in early versions, and I had
> > reported to Jens that it allows something like
> >
> > int main() {
> >   struct io_uring_params uring_params = {
> >         .flags = IORING_SETUP_SQPOLL,
> >   };
> >   int uring_fd = syscall(425 /* io_uring_setup */, 16, &uring_params);
> > }
> >
> > to leak both the files_struct and the kthread, as the files_struct and
> > the uring context form a circular reference.  I haven't really come up
> > with a good way to reconcile the requirements here; perhaps we need an
> > exit_uring() akin to exit_aio()?
>
> Oh, yuck. Uuuh... can we make "struct files_struct" doubly-refcounted,
> like "struct mm_struct"? One reference type to keep the contents
> intact (the reference type you normally use, and the type used by
> uring when the thread is running), and one reference type to just keep
> the struct itself existing, but without preserving its contents
> (reference held consistently by the uring thread)?

Something like this (completely untested); and then instead of the
current get_files_struct(), you'd do get_files_struct_weak(), and
while the thread is running, it protects the files_struct from dying
with tryget_weak_files_struct() / put_files_struct().

Al, do you have opinions on this?

===============
diff --git a/fs/file.c b/fs/file.c
index 3209ee271c41..fbf02ef2753d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -281,6 +281,7 @@ struct files_struct *dup_fd(struct files_struct
*oldf, int *errorp)
        if (!newf)
                goto out;

+       kref_init(&newf->weak_refs);
        atomic_set(&newf->count, 1);

        spin_lock_init(&newf->file_lock);
@@ -410,6 +411,26 @@ struct files_struct *get_files_struct(struct
task_struct *task)
        return files;
 }

+static void free_files_struct(struct kref *ref) {
+       struct files_struct *files =
+               container_of(ref, struct files_struct, weak_refs);
+       kmem_cache_free(files_cachep, files);
+}
+
+void put_files_struct_weak(struct files_struct *files) {
+       kref_put(&files->weak_refs, free_files_struct);
+}
+
+struct files_struct *get_files_struct_weak(struct task_struct *task)
+{
+       struct files_struct *files = get_files_struct(task);
+       if (files) {
+               kref_get(&files->weak_refs);
+               put_files_struct(files);
+       }
+       return files;
+}
+
 void put_files_struct(struct files_struct *files)
 {
        if (atomic_dec_and_test(&files->count)) {
@@ -418,10 +439,17 @@ void put_files_struct(struct files_struct *files)
                /* free the arrays if they are not embedded */
                if (fdt != &files->fdtab)
                        __free_fdtable(fdt);
-               kmem_cache_free(files_cachep, files);
+               put_files_struct_weak(files);
        }
 }

+struct files_struct *tryget_weak_files_struct(struct files_struct *fs) {
+       if (atomic_inc_not_zero(&fs->count)) {
+               return fs;
+       }
+       return NULL;
+}
+
 void reset_files_struct(struct files_struct *files)
 {
        struct task_struct *tsk = current;
@@ -448,6 +476,7 @@ void exit_files(struct task_struct *tsk)

 struct files_struct init_files = {
        .count          = ATOMIC_INIT(1),
+       .weak_refs      = KREF_INIT(1),
        .fdt            = &init_files.fdtab,
        .fdtab          = {
                .max_fds        = NR_OPEN_DEFAULT,
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..6ad95a95cc0b 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -14,6 +14,7 @@
 #include <linux/types.h>
 #include <linux/init.h>
 #include <linux/fs.h>
+#include <linux/kref.h>

 #include <linux/atomic.h>

@@ -50,6 +51,7 @@ struct files_struct {
    * read mostly part
    */
        atomic_t count;
+       struct kref weak_refs;
        bool resize_in_progress;
        wait_queue_head_t resize_wait;

@@ -107,6 +109,9 @@ struct task_struct;

 struct files_struct *get_files_struct(struct task_struct *);
 void put_files_struct(struct files_struct *fs);
+void put_files_struct_weak(struct files_struct *files);
+struct files_struct *get_files_struct_weak(struct task_struct *);
+struct files_struct *tryget_weak_files_struct(struct files_struct *);
 void reset_files_struct(struct files_struct *);
 int unshare_files(struct files_struct **);
 struct files_struct *dup_fd(struct files_struct *, int *) __latent_entropy;
===============

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-02-01 16:57         ` Matt Mullins
@ 2019-02-01 17:04           ` Jann Horn
  2019-02-01 17:23             ` Jann Horn
  0 siblings, 1 reply; 49+ messages in thread
From: Jann Horn @ 2019-02-01 17:04 UTC (permalink / raw)
  To: Matt Mullins
  Cc: axboe, linux-aio, linux-block, jmoyer, linux-api, hch, viro,
	linux-man, avi

On Fri, Feb 1, 2019 at 5:57 PM Matt Mullins <mmullins@fb.com> wrote:
> On Tue, 2019-01-29 at 00:59 +0100, Jann Horn wrote:
> > On Tue, Jan 29, 2019 at 12:47 AM Jens Axboe <axboe@kernel.dk> wrote:
> > > On 1/28/19 3:32 PM, Jann Horn wrote:
> > > > On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> > > > > The submission queue (SQ) and completion queue (CQ) rings are shared
> > > > > between the application and the kernel. This eliminates the need to
> > > > > copy data back and forth to submit and complete IO.
> > > > >
> > > > > IO submissions use the io_uring_sqe data structure, and completions
> > > > > are generated in the form of io_uring_sqe data structures. The SQ
> > > > > ring is an index into the io_uring_sqe array, which makes it possible
> > > > > to submit a batch of IOs without them being contiguous in the ring.
> > > > > The CQ ring is always contiguous, as completion events are inherently
> > > > > unordered, and hence any io_uring_cqe entry can point back to an
> > > > > arbitrary submission.
> > > > >
> > > > > Two new system calls are added for this:
> > > > >
> > > > > io_uring_setup(entries, params)
> > > > >         Sets up a context for doing async IO. On success, returns a file
> > > > >         descriptor that the application can mmap to gain access to the
> > > > >         SQ ring, CQ ring, and io_uring_sqes.
> > > > >
> > > > > io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
> > > > >         Initiates IO against the rings mapped to this fd, or waits for
> > > > >         them to complete, or both. The behavior is controlled by the
> > > > >         parameters passed in. If 'to_submit' is non-zero, then we'll
> > > > >         try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
> > > > >         kernel will wait for 'min_complete' events, if they aren't
> > > > >         already available. It's valid to set IORING_ENTER_GETEVENTS
> > > > >         and 'min_complete' == 0 at the same time, this allows the
> > > > >         kernel to return already completed events without waiting
> > > > >         for them. This is useful only for polling, as for IRQ
> > > > >         driven IO, the application can just check the CQ ring
> > > > >         without entering the kernel.
> > > > >
> > > > > With this setup, it's possible to do async IO with a single system
> > > > > call. Future developments will enable polled IO with this interface,
> > > > > and polled submission as well. The latter will enable an application
> > > > > to do IO without doing ANY system calls at all.
> > > > >
> > > > > For IRQ driven IO, an application only needs to enter the kernel for
> > > > > completions if it wants to wait for them to occur.
> > > > >
> > > > > Each io_uring is backed by a workqueue, to support buffered async IO
> > > > > as well. We will only punt to an async context if the command would
> > > > > need to wait for IO on the device side. Any data that can be accessed
> > > > > directly in the page cache is done inline. This avoids the slowness
> > > > > issue of usual threadpools, since cached data is accessed as quickly
> > > > > as a sync interface.
> > > > >
> > > > > Sample application: https://urldefense.proofpoint.com/v2/url?u=http-3A__git.kernel.dk_cgit_fio_plain_t_io-5Furing.c&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=pqM-eO4A2hNFhIFiX-7eGg&m=MGr14pOzNbC7Z-8_dV4GMiH3AbkkH0RSQoQ894Tu0yc&s=mgbcubzOMiCpFpnwW-HA3ey0YDYPkgMIZ7Bmy4w6Chc&e=
> > > >
> > > > [...]
> > > > > +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> > > > > +                     bool force_nonblock)
> > > > > +{
> > > > > +       struct kiocb *kiocb = &req->rw;
> > > > > +       int ret;
> > > > > +
> > > > > +       kiocb->ki_filp = fget(sqe->fd);
> > > > > +       if (unlikely(!kiocb->ki_filp))
> > > > > +               return -EBADF;
> > > > > +       kiocb->ki_pos = sqe->off;
> > > > > +       kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
> > > > > +       kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
> > > > > +       if (sqe->ioprio) {
> > > > > +               ret = ioprio_check_cap(sqe->ioprio);
> > > > > +               if (ret)
> > > > > +                       goto out_fput;
> > > > > +
> > > > > +               kiocb->ki_ioprio = sqe->ioprio;
> > > > > +       } else
> > > > > +               kiocb->ki_ioprio = get_current_ioprio();
> > > > > +
> > > > > +       ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
> > > > > +       if (unlikely(ret))
> > > > > +               goto out_fput;
> > > > > +       if (force_nonblock) {
> > > > > +               kiocb->ki_flags |= IOCB_NOWAIT;
> > > > > +               req->flags |= REQ_F_FORCE_NONBLOCK;
> > > > > +       }
> > > > > +       if (kiocb->ki_flags & IOCB_HIPRI) {
> > > > > +               ret = -EINVAL;
> > > > > +               goto out_fput;
> > > > > +       }
> > > > > +
> > > > > +       kiocb->ki_complete = io_complete_rw;
> > > > > +       return 0;
> > > > > +out_fput:
> > > > > +       fput(kiocb->ki_filp);
> > > > > +       return ret;
> > > > > +}
> > > >
> > > > [...]
> > > > > +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> > > > > +                      bool force_nonblock)
> > > > > +{
> > > > > +       struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
> > > > > +       struct kiocb *kiocb = &req->rw;
> > > > > +       struct iov_iter iter;
> > > > > +       struct file *file;
> > > > > +       ssize_t ret;
> > > > > +
> > > > > +       ret = io_prep_rw(req, sqe, force_nonblock);
> > > > > +       if (ret)
> > > > > +               return ret;
> > > > > +       file = kiocb->ki_filp;
> > > > > +
> > > > > +       ret = -EBADF;
> > > > > +       if (unlikely(!(file->f_mode & FMODE_READ)))
> > > > > +               goto out_fput;
> > > > > +       ret = -EINVAL;
> > > > > +       if (unlikely(!file->f_op->read_iter))
> > > > > +               goto out_fput;
> > > > > +
> > > > > +       ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
> > > > > +       if (ret)
> > > > > +               goto out_fput;
> > > > > +
> > > > > +       ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
> > > > > +       if (!ret) {
> > > > > +               ssize_t ret2;
> > > > > +
> > > > > +               /* Catch -EAGAIN return for forced non-blocking submission */
> > > > > +               ret2 = call_read_iter(file, kiocb, &iter);
> > > > > +               if (!force_nonblock || ret2 != -EAGAIN)
> > > > > +                       io_rw_done(kiocb, ret2);
> > > > > +               else
> > > > > +                       ret = -EAGAIN;
> > > > > +       }
> > > > > +       kfree(iovec);
> > > > > +out_fput:
> > > > > +       if (unlikely(ret))
> > > > > +               fput(file);
> > > > > +       return ret;
> > > > > +}
> > > >
> > > > [...]
> > > > > +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> > > > > +                          struct sqe_submit *s, bool force_nonblock)
> > > > > +{
> > > > > +       const struct io_uring_sqe *sqe = s->sqe;
> > > > > +       ssize_t ret;
> > > > > +
> > > > > +       if (unlikely(s->index >= ctx->sq_entries))
> > > > > +               return -EINVAL;
> > > > > +       req->user_data = sqe->user_data;
> > > > > +
> > > > > +       ret = -EINVAL;
> > > > > +       switch (sqe->opcode) {
> > > > > +       case IORING_OP_NOP:
> > > > > +               ret = io_nop(req, sqe);
> > > > > +               break;
> > > > > +       case IORING_OP_READV:
> > > > > +               ret = io_read(req, sqe, force_nonblock);
> > > > > +               break;
> > > > > +       case IORING_OP_WRITEV:
> > > > > +               ret = io_write(req, sqe, force_nonblock);
> > > > > +               break;
> > > > > +       default:
> > > > > +               ret = -EINVAL;
> > > > > +               break;
> > > > > +       }
> > > > > +
> > > > > +       return ret;
> > > > > +}
> > > > > +
> > > > > +static void io_sq_wq_submit_work(struct work_struct *work)
> > > > > +{
> > > > > +       struct io_kiocb *req = container_of(work, struct io_kiocb, work);
> > > > > +       struct sqe_submit *s = &req->submit;
> > > > > +       u64 user_data = s->sqe->user_data;
> > > > > +       struct io_ring_ctx *ctx = req->ctx;
> > > > > +       mm_segment_t old_fs = get_fs();
> > > > > +       struct files_struct *old_files;
> > > > > +       int ret;
> > > > > +
> > > > > +        /* Ensure we clear previously set forced non-block flag */
> > > > > +       req->flags &= ~REQ_F_FORCE_NONBLOCK;
> > > > > +
> > > > > +       old_files = current->files;
> > > > > +       current->files = ctx->sqo_files;
> > > >
> > > > I think you're not supposed to twiddle with current->files without
> > > > holding task_lock(current).
> > >
> > > 'current' is the work queue item in this case, do we need to protect
> > > against anything else? I can add the locking around the assignments
> > > (both places).
> >
> > Stuff like proc_fd_link() uses get_files_struct(), which grabs a
> > reference to your current files_struct protected only by task_lock();
> > and it doesn't use anything like READ_ONCE(), so even if the object
> > lifetime is not a problem, get_files_struct() could potentially crash
> > due to a double-read (reading task->files twice and assuming that the
> > result will be the same). As far as I can tell, this procfs code also
> > works on kernel threads.
> >
> > > > > +       if (!mmget_not_zero(ctx->sqo_mm)) {
> > > > > +               ret = -EFAULT;
> > > > > +               goto err;
> > > > > +       }
> > > > > +
> > > > > +       use_mm(ctx->sqo_mm);
> > > > > +       set_fs(USER_DS);
> > > > > +
> > > > > +       ret = __io_submit_sqe(ctx, req, s, false);
> > > > > +
> > > > > +       set_fs(old_fs);
> > > > > +       unuse_mm(ctx->sqo_mm);
> > > > > +       mmput(ctx->sqo_mm);
> > > > > +err:
> > > > > +       if (ret) {
> > > > > +               io_cqring_add_event(ctx, user_data, ret, 0);
> > > > > +               io_free_req(req);
> > > > > +       }
> > > > > +       current->files = old_files;
> > > > > +}
> > > >
> > > > [...]
> > > > > +static int io_sq_offload_start(struct io_ring_ctx *ctx)
> > > > > +{
> > > > > +       int ret;
> > > > > +
> > > > > +       ctx->sqo_mm = current->mm;
> > > >
> > > > What keeps this thing alive?
> > >
> > > I think we're deadling with the same thing as the files below, I'll
> > > defer to that.
> > >
> > > > > +       /*
> > > > > +        * This is safe since 'current' has the fd installed, and if that gets
> > > > > +        * closed on exit, then fops->release() is invoked which waits for the
> > > > > +        * async contexts to flush and exit before exiting.
> > > > > +        */
> > > > > +       ret = -EBADF;
> > > > > +       ctx->sqo_files = current->files;
> > > > > +       if (!ctx->sqo_files)
> > > > > +               goto err;
> > > >
> > > > That's gnarly. Adding Al Viro to the thread.
> > > >
> > > > I think you misunderstand the semantics of f_op->release. The ->flush
> > > > handler is invoked whenever a file descriptor is closed through
> > > > filp_close() (via deletion of the files_struct, sys_close(),
> > > > sys_dup2(), ...), so if you had used that one, _maybe_ this would
> > > > work. But the ->release handler only runs when the _last_ reference to
> > > > a struct file has been dropped - so you can, for example, fork() a
> > > > child, then exit() in the parent, and the ->release handler isn't
> > > > invoked. So I don't see how this can work.
> > >
> > > The anonfd is CLOEXEC. The idea is exactly that it only runs when the
> > > last reference to the file has been dropped. Not sure why you think I
> > > need ->flush() here?
> >
> > Can't I just use fcntl(fd, F_SETFD, fd, 0) to clear the CLOEXEC flag?
> > Or send the fd via SCM_RIGHTS?
> >
> > > > But even if you had abused ->flush for this instead: close_files()
> > > > currently has a comment in it that claims that "this is the last
> > > > reference to the files structure"; this change would make that claim
> > > > untrue.
> > >
> > > Let me see if I can explain my intent better than that comment... We
> > > know the parent who set up the io_uring instance will be around for as
> > > long as io_uring instance persists.
> >
> > That's the part that I think is wrong: As far as I can tell, the
> > parent can go away and you won't notice.
> >
> > Also, note that "the parent" is different things for ->files and ->mm.
> > You can have a multithreaded process whose threads don't have the same
> > ->files, or multiple process that share ->files without sharing ->mm,
> > ...
>
> This had actually been get_files_struct() in early versions, and I had
> reported to Jens that it allows something like
>
> int main() {
>   struct io_uring_params uring_params = {
>         .flags = IORING_SETUP_SQPOLL,
>   };
>   int uring_fd = syscall(425 /* io_uring_setup */, 16, &uring_params);
> }
>
> to leak both the files_struct and the kthread, as the files_struct and
> the uring context form a circular reference.  I haven't really come up
> with a good way to reconcile the requirements here; perhaps we need an
> exit_uring() akin to exit_aio()?

Oh, yuck. Uuuh... can we make "struct files_struct" doubly-refcounted,
like "struct mm_struct"? One reference type to keep the contents
intact (the reference type you normally use, and the type used by
uring when the thread is running), and one reference type to just keep
the struct itself existing, but without preserving its contents
(reference held consistently by the uring thread)?

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 23:59       ` Jann Horn
  2019-01-29  0:03         ` Jens Axboe
@ 2019-02-01 16:57         ` Matt Mullins
  2019-02-01 17:04           ` Jann Horn
  1 sibling, 1 reply; 49+ messages in thread
From: Matt Mullins @ 2019-02-01 16:57 UTC (permalink / raw)
  To: jannh, axboe
  Cc: linux-aio, linux-block, jmoyer, linux-api, hch, viro, linux-man, avi

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 13683 bytes --]

On Tue, 2019-01-29 at 00:59 +0100, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 12:47 AM Jens Axboe <axboe@kernel.dk> wrote:
> > On 1/28/19 3:32 PM, Jann Horn wrote:
> > > On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> > > > The submission queue (SQ) and completion queue (CQ) rings are shared
> > > > between the application and the kernel. This eliminates the need to
> > > > copy data back and forth to submit and complete IO.
> > > > 
> > > > IO submissions use the io_uring_sqe data structure, and completions
> > > > are generated in the form of io_uring_sqe data structures. The SQ
> > > > ring is an index into the io_uring_sqe array, which makes it possible
> > > > to submit a batch of IOs without them being contiguous in the ring.
> > > > The CQ ring is always contiguous, as completion events are inherently
> > > > unordered, and hence any io_uring_cqe entry can point back to an
> > > > arbitrary submission.
> > > > 
> > > > Two new system calls are added for this:
> > > > 
> > > > io_uring_setup(entries, params)
> > > >         Sets up a context for doing async IO. On success, returns a file
> > > >         descriptor that the application can mmap to gain access to the
> > > >         SQ ring, CQ ring, and io_uring_sqes.
> > > > 
> > > > io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
> > > >         Initiates IO against the rings mapped to this fd, or waits for
> > > >         them to complete, or both. The behavior is controlled by the
> > > >         parameters passed in. If 'to_submit' is non-zero, then we'll
> > > >         try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
> > > >         kernel will wait for 'min_complete' events, if they aren't
> > > >         already available. It's valid to set IORING_ENTER_GETEVENTS
> > > >         and 'min_complete' == 0 at the same time, this allows the
> > > >         kernel to return already completed events without waiting
> > > >         for them. This is useful only for polling, as for IRQ
> > > >         driven IO, the application can just check the CQ ring
> > > >         without entering the kernel.
> > > > 
> > > > With this setup, it's possible to do async IO with a single system
> > > > call. Future developments will enable polled IO with this interface,
> > > > and polled submission as well. The latter will enable an application
> > > > to do IO without doing ANY system calls at all.
> > > > 
> > > > For IRQ driven IO, an application only needs to enter the kernel for
> > > > completions if it wants to wait for them to occur.
> > > > 
> > > > Each io_uring is backed by a workqueue, to support buffered async IO
> > > > as well. We will only punt to an async context if the command would
> > > > need to wait for IO on the device side. Any data that can be accessed
> > > > directly in the page cache is done inline. This avoids the slowness
> > > > issue of usual threadpools, since cached data is accessed as quickly
> > > > as a sync interface.
> > > > 
> > > > Sample application: https://urldefense.proofpoint.com/v2/url?u=http-3A__git.kernel.dk_cgit_fio_plain_t_io-5Furing.c&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=pqM-eO4A2hNFhIFiX-7eGg&m=MGr14pOzNbC7Z-8_dV4GMiH3AbkkH0RSQoQ894Tu0yc&s=mgbcubzOMiCpFpnwW-HA3ey0YDYPkgMIZ7Bmy4w6Chc&e=
> > > 
> > > [...]
> > > > +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> > > > +                     bool force_nonblock)
> > > > +{
> > > > +       struct kiocb *kiocb = &req->rw;
> > > > +       int ret;
> > > > +
> > > > +       kiocb->ki_filp = fget(sqe->fd);
> > > > +       if (unlikely(!kiocb->ki_filp))
> > > > +               return -EBADF;
> > > > +       kiocb->ki_pos = sqe->off;
> > > > +       kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
> > > > +       kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
> > > > +       if (sqe->ioprio) {
> > > > +               ret = ioprio_check_cap(sqe->ioprio);
> > > > +               if (ret)
> > > > +                       goto out_fput;
> > > > +
> > > > +               kiocb->ki_ioprio = sqe->ioprio;
> > > > +       } else
> > > > +               kiocb->ki_ioprio = get_current_ioprio();
> > > > +
> > > > +       ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
> > > > +       if (unlikely(ret))
> > > > +               goto out_fput;
> > > > +       if (force_nonblock) {
> > > > +               kiocb->ki_flags |= IOCB_NOWAIT;
> > > > +               req->flags |= REQ_F_FORCE_NONBLOCK;
> > > > +       }
> > > > +       if (kiocb->ki_flags & IOCB_HIPRI) {
> > > > +               ret = -EINVAL;
> > > > +               goto out_fput;
> > > > +       }
> > > > +
> > > > +       kiocb->ki_complete = io_complete_rw;
> > > > +       return 0;
> > > > +out_fput:
> > > > +       fput(kiocb->ki_filp);
> > > > +       return ret;
> > > > +}
> > > 
> > > [...]
> > > > +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> > > > +                      bool force_nonblock)
> > > > +{
> > > > +       struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
> > > > +       struct kiocb *kiocb = &req->rw;
> > > > +       struct iov_iter iter;
> > > > +       struct file *file;
> > > > +       ssize_t ret;
> > > > +
> > > > +       ret = io_prep_rw(req, sqe, force_nonblock);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +       file = kiocb->ki_filp;
> > > > +
> > > > +       ret = -EBADF;
> > > > +       if (unlikely(!(file->f_mode & FMODE_READ)))
> > > > +               goto out_fput;
> > > > +       ret = -EINVAL;
> > > > +       if (unlikely(!file->f_op->read_iter))
> > > > +               goto out_fput;
> > > > +
> > > > +       ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
> > > > +       if (ret)
> > > > +               goto out_fput;
> > > > +
> > > > +       ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
> > > > +       if (!ret) {
> > > > +               ssize_t ret2;
> > > > +
> > > > +               /* Catch -EAGAIN return for forced non-blocking submission */
> > > > +               ret2 = call_read_iter(file, kiocb, &iter);
> > > > +               if (!force_nonblock || ret2 != -EAGAIN)
> > > > +                       io_rw_done(kiocb, ret2);
> > > > +               else
> > > > +                       ret = -EAGAIN;
> > > > +       }
> > > > +       kfree(iovec);
> > > > +out_fput:
> > > > +       if (unlikely(ret))
> > > > +               fput(file);
> > > > +       return ret;
> > > > +}
> > > 
> > > [...]
> > > > +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> > > > +                          struct sqe_submit *s, bool force_nonblock)
> > > > +{
> > > > +       const struct io_uring_sqe *sqe = s->sqe;
> > > > +       ssize_t ret;
> > > > +
> > > > +       if (unlikely(s->index >= ctx->sq_entries))
> > > > +               return -EINVAL;
> > > > +       req->user_data = sqe->user_data;
> > > > +
> > > > +       ret = -EINVAL;
> > > > +       switch (sqe->opcode) {
> > > > +       case IORING_OP_NOP:
> > > > +               ret = io_nop(req, sqe);
> > > > +               break;
> > > > +       case IORING_OP_READV:
> > > > +               ret = io_read(req, sqe, force_nonblock);
> > > > +               break;
> > > > +       case IORING_OP_WRITEV:
> > > > +               ret = io_write(req, sqe, force_nonblock);
> > > > +               break;
> > > > +       default:
> > > > +               ret = -EINVAL;
> > > > +               break;
> > > > +       }
> > > > +
> > > > +       return ret;
> > > > +}
> > > > +
> > > > +static void io_sq_wq_submit_work(struct work_struct *work)
> > > > +{
> > > > +       struct io_kiocb *req = container_of(work, struct io_kiocb, work);
> > > > +       struct sqe_submit *s = &req->submit;
> > > > +       u64 user_data = s->sqe->user_data;
> > > > +       struct io_ring_ctx *ctx = req->ctx;
> > > > +       mm_segment_t old_fs = get_fs();
> > > > +       struct files_struct *old_files;
> > > > +       int ret;
> > > > +
> > > > +        /* Ensure we clear previously set forced non-block flag */
> > > > +       req->flags &= ~REQ_F_FORCE_NONBLOCK;
> > > > +
> > > > +       old_files = current->files;
> > > > +       current->files = ctx->sqo_files;
> > > 
> > > I think you're not supposed to twiddle with current->files without
> > > holding task_lock(current).
> > 
> > 'current' is the work queue item in this case, do we need to protect
> > against anything else? I can add the locking around the assignments
> > (both places).
> 
> Stuff like proc_fd_link() uses get_files_struct(), which grabs a
> reference to your current files_struct protected only by task_lock();
> and it doesn't use anything like READ_ONCE(), so even if the object
> lifetime is not a problem, get_files_struct() could potentially crash
> due to a double-read (reading task->files twice and assuming that the
> result will be the same). As far as I can tell, this procfs code also
> works on kernel threads.
> 
> > > > +       if (!mmget_not_zero(ctx->sqo_mm)) {
> > > > +               ret = -EFAULT;
> > > > +               goto err;
> > > > +       }
> > > > +
> > > > +       use_mm(ctx->sqo_mm);
> > > > +       set_fs(USER_DS);
> > > > +
> > > > +       ret = __io_submit_sqe(ctx, req, s, false);
> > > > +
> > > > +       set_fs(old_fs);
> > > > +       unuse_mm(ctx->sqo_mm);
> > > > +       mmput(ctx->sqo_mm);
> > > > +err:
> > > > +       if (ret) {
> > > > +               io_cqring_add_event(ctx, user_data, ret, 0);
> > > > +               io_free_req(req);
> > > > +       }
> > > > +       current->files = old_files;
> > > > +}
> > > 
> > > [...]
> > > > +static int io_sq_offload_start(struct io_ring_ctx *ctx)
> > > > +{
> > > > +       int ret;
> > > > +
> > > > +       ctx->sqo_mm = current->mm;
> > > 
> > > What keeps this thing alive?
> > 
> > I think we're deadling with the same thing as the files below, I'll
> > defer to that.
> > 
> > > > +       /*
> > > > +        * This is safe since 'current' has the fd installed, and if that gets
> > > > +        * closed on exit, then fops->release() is invoked which waits for the
> > > > +        * async contexts to flush and exit before exiting.
> > > > +        */
> > > > +       ret = -EBADF;
> > > > +       ctx->sqo_files = current->files;
> > > > +       if (!ctx->sqo_files)
> > > > +               goto err;
> > > 
> > > That's gnarly. Adding Al Viro to the thread.
> > > 
> > > I think you misunderstand the semantics of f_op->release. The ->flush
> > > handler is invoked whenever a file descriptor is closed through
> > > filp_close() (via deletion of the files_struct, sys_close(),
> > > sys_dup2(), ...), so if you had used that one, _maybe_ this would
> > > work. But the ->release handler only runs when the _last_ reference to
> > > a struct file has been dropped - so you can, for example, fork() a
> > > child, then exit() in the parent, and the ->release handler isn't
> > > invoked. So I don't see how this can work.
> > 
> > The anonfd is CLOEXEC. The idea is exactly that it only runs when the
> > last reference to the file has been dropped. Not sure why you think I
> > need ->flush() here?
> 
> Can't I just use fcntl(fd, F_SETFD, fd, 0) to clear the CLOEXEC flag?
> Or send the fd via SCM_RIGHTS?
> 
> > > But even if you had abused ->flush for this instead: close_files()
> > > currently has a comment in it that claims that "this is the last
> > > reference to the files structure"; this change would make that claim
> > > untrue.
> > 
> > Let me see if I can explain my intent better than that comment... We
> > know the parent who set up the io_uring instance will be around for as
> > long as io_uring instance persists.
> 
> That's the part that I think is wrong: As far as I can tell, the
> parent can go away and you won't notice.
> 
> Also, note that "the parent" is different things for ->files and ->mm.
> You can have a multithreaded process whose threads don't have the same
> ->files, or multiple process that share ->files without sharing ->mm,
> ...

This had actually been get_files_struct() in early versions, and I had
reported to Jens that it allows something like

int main() {
  struct io_uring_params uring_params = {
  	.flags = IORING_SETUP_SQPOLL,
  };
  int uring_fd = syscall(425 /* io_uring_setup */, 16, &uring_params);
}

to leak both the files_struct and the kthread, as the files_struct and
the uring context form a circular reference.  I haven't really come up
with a good way to reconcile the requirements here; perhaps we need an
exit_uring() akin to exit_aio()?

> > When we are tearing down the
> > io_uring, then we wait for any async contexts (like the one above) to
> > exit. Once they are exited, it's safe to proceed with the exit and
> > teardown ->files[].
> 
> But you only do that teardown on ->release, right? And ->release
> doesn't have much to do with the process lifetime.
> 
> > That's the idea... Not trying to be clever, some of this dates back to
> > the aio weirdness where it was impossible to have cross references like
> > this, since it would lead to teardown deadlocks with how exit_aio() is
> > used. I can probably grab a struct files reference above, but currently
> > I don't see why it's needed.
N‹§²æìr¸›zǧu©ž²Æ {\b­†éì¹»\x1c®&Þ–)îŨ¨Š{ayº\x1dÊÚ&j:+v‰¨’öœ’Šà\x16Šæ¢·¢ú(œ¸§»\x10\b:Çž†Ûiÿü0ÂKÚrJ+ƒö¢£ðèž×¦j)Z†·Ÿ

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29 15:56         ` Jann Horn
@ 2019-01-29 16:06           ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2019-01-29 16:06 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On 1/29/19 8:56 AM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 4:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/28/19 7:21 PM, Jann Horn wrote:
>>> Please create a local copy of the request before parsing it to keep
>>> the data from changing under you. Additionally, it might make sense to
>>> annotate every pointer to shared memory with a comment, or something
>>> like that, to ensure that anyone looking at the code can immediately
>>> see for which pointers special caution is required on access.
>>
>> I took a look at the viability of NOT having to local copy the data, and
>> I don't think it's too bad. Local copy has a noticeable impact on the
>> performance, hence I'd really (REALLY) like to avoid it.
>>
>> Here's something on top of the current git branch. I think I even went a
>> bit too far in some areas, but it should hopefully catch the cases where
>> we might end up double evaluating the parts of the sqe that we depend
>> on. For most of the sqe reading we don't really care too much. For
>> instance, the sqe->user_data. If the app changes this field, then it
>> just gets whatever passed back in cqe->user_data. That's not a kernel
>> issue.
>>
>> For cases like addr/len etc validation, it should be sound. I'll double
>> check this in the morning as well, and obviously would need to be folded
>> in along the way.
>>
>> I'd appreciate your opinion on this part, if you see any major issues
>> with it, or if I missed something.
> 
> The io_sqe_needs_user() checks still look racy. If that helper sees a
> IORING_OP_READ_FIXED, but then __io_submit_sqe() sees a
> IORING_OP_READV - especially if this happens in io_sq_wq_submit_work()
> -, I think you could potentially end up in places like
> io_import_iovec() without having done the set_fs(USER_DS) and
> use_mm(), causing the access to potentially occur with KERNEL_DS and a
> lazy mm.

Indeed, for that case I think we should just copy the sqe. It's in the
async offload context anyway, so a copy won't really change anything
in terms of performance. And since the gap is so large between the two
problematic spots, it'd be trickier to fix.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  3:46       ` Jens Axboe
@ 2019-01-29 15:56         ` Jann Horn
  2019-01-29 16:06           ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Jann Horn @ 2019-01-29 15:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 4:46 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/28/19 7:21 PM, Jann Horn wrote:
> > Please create a local copy of the request before parsing it to keep
> > the data from changing under you. Additionally, it might make sense to
> > annotate every pointer to shared memory with a comment, or something
> > like that, to ensure that anyone looking at the code can immediately
> > see for which pointers special caution is required on access.
>
> I took a look at the viability of NOT having to local copy the data, and
> I don't think it's too bad. Local copy has a noticeable impact on the
> performance, hence I'd really (REALLY) like to avoid it.
>
> Here's something on top of the current git branch. I think I even went a
> bit too far in some areas, but it should hopefully catch the cases where
> we might end up double evaluating the parts of the sqe that we depend
> on. For most of the sqe reading we don't really care too much. For
> instance, the sqe->user_data. If the app changes this field, then it
> just gets whatever passed back in cqe->user_data. That's not a kernel
> issue.
>
> For cases like addr/len etc validation, it should be sound. I'll double
> check this in the morning as well, and obviously would need to be folded
> in along the way.
>
> I'd appreciate your opinion on this part, if you see any major issues
> with it, or if I missed something.

The io_sqe_needs_user() checks still look racy. If that helper sees a
IORING_OP_READ_FIXED, but then __io_submit_sqe() sees a
IORING_OP_READV - especially if this happens in io_sq_wq_submit_work()
-, I think you could potentially end up in places like
io_import_iovec() without having done the set_fs(USER_DS) and
use_mm(), causing the access to potentially occur with KERNEL_DS and a
lazy mm.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29 12:12   ` Florian Weimer
@ 2019-01-29 13:35     ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2019-01-29 13:35 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-aio, linux-block, linux-man, linux-api, hch, jmoyer, avi

On 1/29/19 5:12 AM, Florian Weimer wrote:
> * Jens Axboe:
> 
>> +#define IORING_MAX_ENTRIES	4096
> 
> Where does this constant come from?  Should it really be exposed to
> userspace?

Seems pretty handy for the application to know what the limit is?

>> +struct io_uring_params {
>> +	__u32 sq_entries;
>> +	__u32 cq_entries;
>> +	__u32 flags;
>> +	__u16 resv[10];
>> +	struct io_sqring_offsets sq_off;
>> +	struct io_cqring_offsets cq_off;
>> +};
> 
>> +struct io_sqring_offsets {
>> +	__u32 head;
>> +	__u32 tail;
>> +	__u32 ring_mask;
>> +	__u32 ring_entries;
>> +	__u32 flags;
>> +	__u32 dropped;
>> +	__u32 array;
>> +	__u32 resv[3];
>> +};
>> +
>> +struct io_cqring_offsets {
>> +	__u32 head;
>> +	__u32 tail;
>> +	__u32 ring_mask;
>> +	__u32 ring_entries;
>> +	__u32 overflow;
>> +	__u32 cqes;
>> +	__u32 resv[4];
>> +};
> 
> Should the reserved fields include a __u64 member, to increase struct
> alignment on architectures that might need it in the future?

Sure, I can do that.

>> +#define IORING_ENTER_GETEVENTS	(1 << 0)
> 
> Should this be unsigned, to match the u32 flags argument?  (Otherwise
> using 32 flags can be difficult).

Good point, I've changed them all to be unsigned.


-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 21:35 ` [PATCH 05/18] Add " Jens Axboe
                     ` (4 preceding siblings ...)
  2019-01-29  7:12   ` Bert Wesarg
@ 2019-01-29 12:12   ` Florian Weimer
  2019-01-29 13:35     ` Jens Axboe
  5 siblings, 1 reply; 49+ messages in thread
From: Florian Weimer @ 2019-01-29 12:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, linux-man, linux-api, hch, jmoyer, avi

* Jens Axboe:

> +#define IORING_MAX_ENTRIES	4096

Where does this constant come from?  Should it really be exposed to
userspace?

> +struct io_uring_params {
> +	__u32 sq_entries;
> +	__u32 cq_entries;
> +	__u32 flags;
> +	__u16 resv[10];
> +	struct io_sqring_offsets sq_off;
> +	struct io_cqring_offsets cq_off;
> +};

> +struct io_sqring_offsets {
> +	__u32 head;
> +	__u32 tail;
> +	__u32 ring_mask;
> +	__u32 ring_entries;
> +	__u32 flags;
> +	__u32 dropped;
> +	__u32 array;
> +	__u32 resv[3];
> +};
> +
> +struct io_cqring_offsets {
> +	__u32 head;
> +	__u32 tail;
> +	__u32 ring_mask;
> +	__u32 ring_entries;
> +	__u32 overflow;
> +	__u32 cqes;
> +	__u32 resv[4];
> +};

Should the reserved fields include a __u64 member, to increase struct
alignment on architectures that might need it in the future?

> +#define IORING_ENTER_GETEVENTS	(1 << 0)

Should this be unsigned, to match the u32 flags argument?  (Otherwise
using 32 flags can be difficult).

Thanks,
Florian

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 21:35 ` [PATCH 05/18] Add " Jens Axboe
                     ` (3 preceding siblings ...)
  2019-01-29  1:29   ` Jann Horn
@ 2019-01-29  7:12   ` Bert Wesarg
  2019-01-29 12:12   ` Florian Weimer
  5 siblings, 0 replies; 49+ messages in thread
From: Bert Wesarg @ 2019-01-29  7:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, linux-man, linux-api, hch, jmoyer, avi

On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> The submission queue (SQ) and completion queue (CQ) rings are shared
> between the application and the kernel. This eliminates the need to
> copy data back and forth to submit and complete IO.
>
> IO submissions use the io_uring_sqe data structure, and completions
> are generated in the form of io_uring_sqe data structures. The SQ
> ring is an index into the io_uring_sqe array, which makes it possible
> to submit a batch of IOs without them being contiguous in the ring.
> The CQ ring is always contiguous, as completion events are inherently
> unordered, and hence any io_uring_cqe entry can point back to an
> arbitrary submission.
>
> Two new system calls are added for this:
>
> io_uring_setup(entries, params)
>         Sets up a context for doing async IO. On success, returns a file
>         descriptor that the application can mmap to gain access to the
>         SQ ring, CQ ring, and io_uring_sqes.
>
> io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
>         Initiates IO against the rings mapped to this fd, or waits for
>         them to complete, or both. The behavior is controlled by the
>         parameters passed in. If 'to_submit' is non-zero, then we'll
>         try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
>         kernel will wait for 'min_complete' events, if they aren't
>         already available. It's valid to set IORING_ENTER_GETEVENTS
>         and 'min_complete' == 0 at the same time, this allows the
>         kernel to return already completed events without waiting
>         for them. This is useful only for polling, as for IRQ
>         driven IO, the application can just check the CQ ring
>         without entering the kernel.
>
> With this setup, it's possible to do async IO with a single system
> call. Future developments will enable polled IO with this interface,
> and polled submission as well. The latter will enable an application
> to do IO without doing ANY system calls at all.
>
> For IRQ driven IO, an application only needs to enter the kernel for
> completions if it wants to wait for them to occur.
>
> Each io_uring is backed by a workqueue, to support buffered async IO
> as well. We will only punt to an async context if the command would
> need to wait for IO on the device side. Any data that can be accessed
> directly in the page cache is done inline. This avoids the slowness
> issue of usual threadpools, since cached data is accessed as quickly
> as a sync interface.
>
> Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |    2 +
>  arch/x86/entry/syscalls/syscall_64.tbl |    2 +
>  fs/Makefile                            |    1 +
>  fs/io_uring.c                          | 1090 ++++++++++++++++++++++++
>  include/linux/syscalls.h               |    6 +
>  include/uapi/asm-generic/unistd.h      |    6 +-
>  include/uapi/linux/io_uring.h          |   96 +++
>  init/Kconfig                           |    9 +
>  kernel/sys_ni.c                        |    2 +
>  9 files changed, 1213 insertions(+), 1 deletion(-)
>  create mode 100644 fs/io_uring.c
>  create mode 100644 include/uapi/linux/io_uring.h
>
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> new file mode 100644
> index 000000000000..ce65db9269a8
> --- /dev/null
> +++ b/include/uapi/linux/io_uring.h
> @@ -0,0 +1,96 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Header file for the io_uring interface.
> + *
> + * Copyright (C) 2019 Jens Axboe
> + * Copyright (C) 2019 Christoph Hellwig
> + */
> +#ifndef LINUX_IO_URING_H
> +#define LINUX_IO_URING_H
> +
> +#include <linux/fs.h>
> +#include <linux/types.h>
> +
> +#define IORING_MAX_ENTRIES     4096
> +
> +/*
> + * IO submission data structure (Submission Queue Entry)
> + */
> +struct io_uring_sqe {
> +       __u8    opcode;         /* type of operation for this sqe */
> +       __u8    flags;          /* as of now unused */
> +       __u16   ioprio;         /* ioprio for the request */
> +       __s32   fd;             /* file descriptor to do IO on */
> +       __u64   off;            /* offset into file */
> +       __u64   addr;           /* pointer to buffer or iovecs */
> +       __u32   len;            /* buffer size or number of iovecs */
> +       union {
> +               __kernel_rwf_t  rw_flags;
> +               __u32           __resv;
> +       };
> +       __u64   user_data;      /* data to be passed back at completion time */
> +       __u64   __pad2[3];
> +};
> +
> +#define IORING_OP_NOP          0
> +#define IORING_OP_READV                1
> +#define IORING_OP_WRITEV       2
> +
> +/*
> + * IO completion data structure (Completion Queue Entry)
> + */
> +struct io_uring_cqe {
> +       __u64   user_data;      /* sqe->data submission passed back */
> +       __s32   res;            /* result code for this event */
> +       __u32   flags;
> +};
> +
> +/*
> + * Magic offsets for the application to mmap the data it needs
> + */
> +#define IORING_OFF_SQ_RING             0ULL
> +#define IORING_OFF_CQ_RING             0x8000000ULL
> +#define IORING_OFF_SQES                        0x10000000ULL
> +
> +/*
> + * Filled with the offset for mmap(2)
> + */
> +struct io_sqring_offsets {
> +       __u32 head;
> +       __u32 tail;
> +       __u32 ring_mask;
> +       __u32 ring_entries;
> +       __u32 flags;
> +       __u32 dropped;
> +       __u32 array;
> +       __u32 resv[3];
> +};
> +
> +struct io_cqring_offsets {
> +       __u32 head;
> +       __u32 tail;
> +       __u32 ring_mask;
> +       __u32 ring_entries;
> +       __u32 overflow;
> +       __u32 cqes;
> +       __u32 resv[4];
> +};
> +
> +/*
> + * io_uring_enter(2) flags
> + */
> +#define IORING_ENTER_GETEVENTS (1 << 0)
> +
> +/*
> + * Passed in for io_uring_setup(2). Copied back with updated info on success
> + */
> +struct io_uring_params {
> +       __u32 sq_entries;
> +       __u32 cq_entries;
> +       __u32 flags;
> +       __u16 resv[10];
> +       struct io_sqring_offsets sq_off;
> +       struct io_cqring_offsets cq_off;
> +};
> +
> +#endif

from a user perspective, it should always be easier if all exported
symbols and macros have a common prefix. Here it seems particular
worrisome, because of the missing 'u' in in the defines.

Best,
Bert

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  2:21     ` Jann Horn
  2019-01-29  2:54       ` Jens Axboe
@ 2019-01-29  3:46       ` Jens Axboe
  2019-01-29 15:56         ` Jann Horn
  1 sibling, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2019-01-29  3:46 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On 1/28/19 7:21 PM, Jann Horn wrote:
> Please create a local copy of the request before parsing it to keep
> the data from changing under you. Additionally, it might make sense to
> annotate every pointer to shared memory with a comment, or something
> like that, to ensure that anyone looking at the code can immediately
> see for which pointers special caution is required on access.

I took a look at the viability of NOT having to local copy the data, and
I don't think it's too bad. Local copy has a noticeable impact on the
performance, hence I'd really (REALLY) like to avoid it.

Here's something on top of the current git branch. I think I even went a
bit too far in some areas, but it should hopefully catch the cases where
we might end up double evaluating the parts of the sqe that we depend
on. For most of the sqe reading we don't really care too much. For
instance, the sqe->user_data. If the app changes this field, then it
just gets whatever passed back in cqe->user_data. That's not a kernel
issue.

For cases like addr/len etc validation, it should be sound. I'll double
check this in the morning as well, and obviously would need to be folded
in along the way.

I'd appreciate your opinion on this part, if you see any major issues
with it, or if I missed something.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index e8760ad02e82..64d090300990 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -668,31 +668,35 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
-	int ret;
+	unsigned flags, ioprio;
+	int fd, ret;
 
-	if (sqe->flags & IOSQE_FIXED_FILE) {
-		if (unlikely(!ctx->user_files || sqe->fd >= ctx->nr_user_files))
+	flags = READ_ONCE(sqe->flags);
+	fd = READ_ONCE(sqe->fd);
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
 			return -EBADF;
-		kiocb->ki_filp = ctx->user_files[sqe->fd];
+		kiocb->ki_filp = ctx->user_files[fd];
 		req->flags |= REQ_F_FIXED_FILE;
 	} else {
-		kiocb->ki_filp = io_file_get(state, sqe->fd);
+		kiocb->ki_filp = io_file_get(state, fd);
 	}
 	if (unlikely(!kiocb->ki_filp))
 		return -EBADF;
-	kiocb->ki_pos = sqe->off;
+	kiocb->ki_pos = READ_ONCE(sqe->off);
 	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
 	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
-	if (sqe->ioprio) {
-		ret = ioprio_check_cap(sqe->ioprio);
+	ioprio = READ_ONCE(sqe->ioprio);
+	if (ioprio) {
+		ret = ioprio_check_cap(ioprio);
 		if (ret)
 			goto out_fput;
 
-		kiocb->ki_ioprio = sqe->ioprio;
+		kiocb->ki_ioprio = ioprio;
 	} else
 		kiocb->ki_ioprio = get_current_ioprio();
 
-	ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
+	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
 	if (unlikely(ret))
 		goto out_fput;
 	if (force_nonblock) {
@@ -716,7 +720,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	}
 	return 0;
 out_fput:
-	if (!(sqe->flags & IOSQE_FIXED_FILE))
+	if (!(flags & IOSQE_FIXED_FILE))
 		io_file_put(state, kiocb->ki_filp);
 	return ret;
 }
@@ -746,28 +750,31 @@ static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
 			   const struct io_uring_sqe *sqe,
 			   struct iov_iter *iter)
 {
+	size_t len = READ_ONCE(sqe->len);
 	struct io_mapped_ubuf *imu;
-	size_t len = sqe->len;
+	int buf_index, index;
 	size_t offset;
-	int index;
+	u64 buf_addr;
 
 	/* attempt to use fixed buffers without having provided iovecs */
 	if (unlikely(!ctx->user_bufs))
 		return -EFAULT;
-	if (unlikely(sqe->buf_index >= ctx->nr_user_bufs))
+
+	buf_index = READ_ONCE(sqe->buf_index);
+	if (unlikely(buf_index >= ctx->nr_user_bufs))
 		return -EFAULT;
 
-	index = array_index_nospec(sqe->buf_index, ctx->sq_entries);
+	index = array_index_nospec(buf_index, ctx->sq_entries);
 	imu = &ctx->user_bufs[index];
-	if ((unsigned long) sqe->addr < imu->ubuf ||
-	    (unsigned long) sqe->addr + len > imu->ubuf + imu->len)
+	buf_addr = READ_ONCE(sqe->addr);
+	if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
 		return -EFAULT;
 
 	/*
 	 * May not be a start of buffer, set size appropriately
 	 * and advance us to the beginning.
 	 */
-	offset = (unsigned long) sqe->addr - imu->ubuf;
+	offset = buf_addr - imu->ubuf;
 	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
 	if (offset)
 		iov_iter_advance(iter, offset);
@@ -778,10 +785,12 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 			   const struct io_uring_sqe *sqe,
 			   struct iovec **iovec, struct iov_iter *iter)
 {
-	void __user *buf = u64_to_user_ptr(sqe->addr);
+	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	int opcode;
 
-	if (sqe->opcode == IORING_OP_READ_FIXED ||
-	    sqe->opcode == IORING_OP_WRITE_FIXED) {
+	opcode = READ_ONCE(sqe->opcode);
+	if (opcode == IORING_OP_READ_FIXED ||
+	    opcode == IORING_OP_WRITE_FIXED) {
 		ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
 		*iovec = NULL;
 		return ret;
@@ -789,11 +798,12 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 
 #ifdef CONFIG_COMPAT
 	if (in_compat_syscall())
-		return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
-						iovec, iter);
+		return compat_import_iovec(rw, buf, READ_ONCE(sqe->len),
+						UIO_FASTIOV, iovec, iter);
 #endif
 
-	return import_iovec(rw, buf, sqe->len, UIO_FASTIOV, iovec, iter);
+	return import_iovec(rw, buf, READ_ONCE(sqe->len), UIO_FASTIOV, iovec,
+				iter);
 }
 
 static void io_async_list_note(int rw, struct io_kiocb *req, size_t len)
@@ -939,14 +949,14 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 /*
  * IORING_OP_NOP just posts a completion event, nothing else.
  */
-static int io_nop(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+static int io_nop(struct io_kiocb *req, u64 user_data)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
 	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
 
-	io_cqring_add_event(ctx, sqe->user_data, 0, 0);
+	io_cqring_add_event(ctx, user_data, 0, 0);
 	io_free_req(req);
 	return 0;
 }
@@ -955,9 +965,12 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		    bool force_nonblock)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	loff_t end = sqe->off + sqe->len;
+	loff_t sqe_off = READ_ONCE(sqe->off);
+	loff_t sqe_len = READ_ONCE(sqe->len);
+	loff_t end = sqe_off + sqe_len;
 	struct file *file;
-	int ret;
+	unsigned flags;
+	int ret, fd;
 
 	/* fsync always requires a blocking context */
 	if (force_nonblock)
@@ -970,21 +983,23 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	if (unlikely(sqe->fsync_flags & ~IORING_FSYNC_DATASYNC))
 		return -EINVAL;
 
-	if (sqe->flags & IOSQE_FIXED_FILE) {
-		if (unlikely(!ctx->user_files || sqe->fd >= ctx->nr_user_files))
+	fd = READ_ONCE(sqe->fd);
+	flags = READ_ONCE(sqe->flags);
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
 			return -EBADF;
-		file = ctx->user_files[sqe->fd];
+		file = ctx->user_files[fd];
 	} else {
-		file = fget(sqe->fd);
+		file = fget(fd);
 	}
 
 	if (unlikely(!file))
 		return -EBADF;
 
-	ret = vfs_fsync_range(file, sqe->off, end > 0 ? end : LLONG_MAX,
+	ret = vfs_fsync_range(file, sqe_off, end > 0 ? end : LLONG_MAX,
 			sqe->fsync_flags & IORING_FSYNC_DATASYNC);
 
-	if (!(sqe->flags & IOSQE_FIXED_FILE))
+	if (!(flags & IOSQE_FIXED_FILE))
 		fput(file);
 
 	io_cqring_add_event(ctx, sqe->user_data, ret, 0);
@@ -1037,7 +1052,7 @@ static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	spin_lock_irq(&ctx->completion_lock);
 	list_for_each_entry_safe(poll_req, next, &ctx->cancel_list, list) {
-		if (sqe->addr == poll_req->user_data) {
+		if (READ_ONCE(sqe->addr) == poll_req->user_data) {
 			io_poll_remove_one(poll_req);
 			ret = 0;
 			break;
@@ -1145,7 +1160,10 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	struct io_poll_iocb *poll = &req->poll;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_poll_table ipt;
+	unsigned flags;
 	__poll_t mask;
+	u16 events;
+	int fd;
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
@@ -1153,15 +1171,18 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EINVAL;
 
 	INIT_WORK(&req->work, io_poll_complete_work);
-	poll->events = demangle_poll(sqe->poll_events) | EPOLLERR | EPOLLHUP;
+	events = READ_ONCE(sqe->poll_events);
+	poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
 
-	if (sqe->flags & IOSQE_FIXED_FILE) {
-		if (unlikely(!ctx->user_files || sqe->fd >= ctx->nr_user_files))
+	flags = READ_ONCE(sqe->flags);
+	fd = READ_ONCE(sqe->fd);
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
 			return -EBADF;
-		poll->file = ctx->user_files[sqe->fd];
+		poll->file = ctx->user_files[fd];
 		req->flags |= REQ_F_FIXED_FILE;
 	} else {
-		poll->file = fget(sqe->fd);
+		poll->file = fget(fd);
 	}
 	if (unlikely(!poll->file))
 		return -EBADF;
@@ -1207,7 +1228,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 out:
 	if (unlikely(ipt.error)) {
-		if (!(sqe->flags & IOSQE_FIXED_FILE))
+		if (!(flags & IOSQE_FIXED_FILE))
 			fput(poll->file);
 		return ipt.error;
 	}
@@ -1224,15 +1245,17 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 {
 	const struct io_uring_sqe *sqe = s->sqe;
 	ssize_t ret;
+	int opcode;
 
 	if (unlikely(s->index >= ctx->sq_entries))
 		return -EINVAL;
-	req->user_data = sqe->user_data;
+	req->user_data = READ_ONCE(sqe->user_data);
 
 	ret = -EINVAL;
-	switch (sqe->opcode) {
+	opcode = READ_ONCE(sqe->opcode);
+	switch (opcode) {
 	case IORING_OP_NOP:
-		ret = io_nop(req, sqe);
+		ret = io_nop(req, req->user_data);
 		break;
 	case IORING_OP_READV:
 		if (unlikely(sqe->buf_index))
@@ -1317,7 +1340,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 restart:
 	do {
 		struct sqe_submit *s = &req->submit;
-		u64 user_data = s->sqe->user_data;
+		u64 user_data = READ_ONCE(s->sqe->user_data);
 
 		/* Ensure we clear previously set forced non-block flag */
 		req->flags &= ~REQ_F_FORCE_NONBLOCK;

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  2:21     ` Jann Horn
@ 2019-01-29  2:54       ` Jens Axboe
  2019-01-29  3:46       ` Jens Axboe
  1 sibling, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2019-01-29  2:54 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On 1/28/19 7:21 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 2:07 AM Jann Horn <jannh@google.com> wrote:
>> On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> The submission queue (SQ) and completion queue (CQ) rings are shared
>>> between the application and the kernel. This eliminates the need to
>>> copy data back and forth to submit and complete IO.
>> [...]
>>> +static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
>>> +{
>>> +       struct io_sq_ring *ring = ctx->sq_ring;
>>> +       unsigned head;
>>> +
>>> +       /*
>>> +        * The cached sq head (or cq tail) serves two purposes:
>>> +        *
>>> +        * 1) allows us to batch the cost of updating the user visible
>>> +        *    head updates.
>>> +        * 2) allows the kernel side to track the head on its own, even
>>> +        *    though the application is the one updating it.
>>> +        */
>>> +       head = ctx->cached_sq_head;
>>> +       smp_rmb();
>>> +       if (head == READ_ONCE(ring->r.tail))
>>> +               return false;
>>> +
>>> +       head = ring->array[head & ctx->sq_mask];
>>> +       if (head < ctx->sq_entries) {
>>> +               s->index = head;
>>> +               s->sqe = &ctx->sq_sqes[head];
>>
>> ring->array can be mapped writable into userspace, right? If so: This
>> looks like a double-read issue; the compiler might assume that
>> ring->array is not modified concurrently and perform separate memory
>> accesses for the "if (head < ctx->sq_entries)" check and the
>> "&ctx->sq_sqes[head]" computation. Please use READ_ONCE()/WRITE_ONCE()
>> for all accesses to memory that userspace could concurrently modify in
>> a malicious way.
>>
>> There have been some pretty severe security bugs caused by missing
>> READ_ONCE() annotations around accesses to shared memory; see, for
>> example, https://www.blackhat.com/docs/us-16/materials/us-16-Wilhelm-Xenpwn-Breaking-Paravirtualized-Devices.pdf
>> . Slides 35-48 show how the code "switch (op->cmd)", where "op" is a
>> pointer to shared memory, allowed an attacker to break out of a Xen
>> virtual machine because the compiler generated multiple memory
>> accesses.
> 
> Oh, actually, it's even worse (comments with "//" added by me):
> 
> io_sq_thread() does this:
> 
>         do {
>                 // sqes[i].sqe is pointer to shared memory, result of
>                 // io_sqe_needs_user() is unreliable
>                 if (all_fixed && io_sqe_needs_user(sqes[i].sqe))
>                         all_fixed = false;
> 
>                 i++;
>                 if (i == ARRAY_SIZE(sqes))
>                         break;
>         } while (io_get_sqring(ctx, &sqes[i]));
>         // sqes[...].sqe are pointers to shared memory
> 
>         io_commit_sqring(ctx);
> 
>         /* Unless all new commands are FIXED regions, grab mm */
>         if (!all_fixed && !cur_mm) {
>                 mm_fault = !mmget_not_zero(ctx->sqo_mm);
>                 if (!mm_fault) {
>                         use_mm(ctx->sqo_mm);
>                         cur_mm = ctx->sqo_mm;
>                 }
>         }
> 
>         inflight += io_submit_sqes(ctx, sqes, i, mm_fault);
> 
> Then the shared memory pointers go into io_submit_sqes():
> 
> static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
>                           unsigned int nr, bool mm_fault)
> {
>         struct io_submit_state state, *statep = NULL;
>         int ret, i, submitted = 0;
>         // sqes[...].sqe are pointers to shared memory
>         [...]
>         for (i = 0; i < nr; i++) {
>                 if (unlikely(mm_fault))
>                         ret = -EFAULT;
>                 else
>                         ret = io_submit_sqe(ctx, &sqes[i], statep);
>                 [...]
>         }
>         [...]
> }
> 
> And on into io_submit_sqe():
> 
> static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
>                          struct io_submit_state *state)
> {
>         [...]
>         ret = __io_submit_sqe(ctx, req, s, true, state);
>         [...]
> }
> 
> And there it gets interesting:
> 
> static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>                            struct sqe_submit *s, bool force_nonblock,
>                            struct io_submit_state *state)
> {
>         // s->sqe is a pointer to shared memory
>         const struct io_uring_sqe *sqe = s->sqe;
>         // sqe is a pointer to shared memory
>         ssize_t ret;
> 
>         if (unlikely(s->index >= ctx->sq_entries))
>                 return -EINVAL;
>         req->user_data = sqe->user_data;
> 
>         ret = -EINVAL;
>         // switch() on read from shared memory, potential instruction pointer
>         // control
>         switch (sqe->opcode) {
>         [...]
>         case IORING_OP_READV:
>                 if (unlikely(sqe->buf_index))
>                         return -EINVAL;
>                 ret = io_read(req, sqe, force_nonblock, state);
>                 break;
>         [...]
>         case IORING_OP_READ_FIXED:
>                 ret = io_read(req, sqe, force_nonblock, state);
>                 break;
>         [...]
>         }
>         [...]
> }
> 
> On into io_read():
> 
> static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>                        bool force_nonblock, struct io_submit_state *state)
> {
> [...]
>         // sqe is a pointer to shared memory
>         ret = io_prep_rw(req, sqe, force_nonblock, state);
>         [...]
> }
> 
> And then io_prep_rw() does multiple reads even in the source code:
> 
> static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>                       bool force_nonblock, struct io_submit_state *state)
> {
>         struct io_ring_ctx *ctx = req->ctx;
>         struct kiocb *kiocb = &req->rw;
>         int ret;
> 
>         // sqe is a pointer to shared memory
> 
>         // double-read of sqe->flags, see end of function
>         if (sqe->flags & IOSQE_FIXED_FILE) {
>                 // double-read of sqe->fd for the bounds check and the
> array access, potential OOB pointer read
>                 if (unlikely(!ctx->user_files || sqe->fd >= ctx->nr_user_files))
>                         return -EBADF;
>                 kiocb->ki_filp = ctx->user_files[sqe->fd];
>                 req->flags |= REQ_F_FIXED_FILE;
>         } else {
>                 kiocb->ki_filp = io_file_get(state, sqe->fd);
>         }
>         if (unlikely(!kiocb->ki_filp))
>                 return -EBADF;
>         kiocb->ki_pos = sqe->off;
>         kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
>         kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
>         // three reads of sqe->ioprio, bypassable capability check
>         if (sqe->ioprio) {
>                 ret = ioprio_check_cap(sqe->ioprio);
>                 if (ret)
>                         goto out_fput;
> 
>                 kiocb->ki_ioprio = sqe->ioprio;
>         } else
>                 kiocb->ki_ioprio = get_current_ioprio();
>         [...]
>         return 0;
> out_fput:
>         // double-read of sqe->flags, changed value can lead to
> unbalanced refcount
>         if (!(sqe->flags & IOSQE_FIXED_FILE))
>                 io_file_put(state, kiocb->ki_filp);
>         return ret;
> }
> 
> Please create a local copy of the request before parsing it to keep
> the data from changing under you. Additionally, it might make sense to
> annotate every pointer to shared memory with a comment, or something
> like that, to ensure that anyone looking at the code can immediately
> see for which pointers special caution is required on access.

Ugh, that's pretty dire. But good catch, I'll fix that up so the
application changing sqe malicously won't affect the kernel. I hope we
can get away with NOT copying the whole sqe, but we'll do that if we
have to.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  1:32       ` Jann Horn
@ 2019-01-29  2:23         ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2019-01-29  2:23 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On 1/28/19 6:32 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 2:31 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/28/19 6:29 PM, Jann Horn wrote:
>>> On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> The submission queue (SQ) and completion queue (CQ) rings are shared
>>>> between the application and the kernel. This eliminates the need to
>>>> copy data back and forth to submit and complete IO.
>>> [...]
>>>> +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
>>>> +{
>>>> +       struct io_kiocb *req;
>>>> +
>>>> +       /* safe to use the non tryget, as we're inside ring ref already */
>>>> +       percpu_ref_get(&ctx->refs);
>>>
>>> Is that true? In the path io_sq_thread() -> io_submit_sqes() ->
>>> io_submit_sqe() -> io_get_req(), I don't see anything that's already
>>> holding a reference for you. Is the worker thread holding a reference
>>> somewhere that I'm missing?
>>
>> If the thread is alive, then the ctx is alive. Before we drop the last
>> ref to the ctx (and kill it), we wait for the thread to exit.
> 
> Where in __io_uring_register() are you waiting for the thread to exit
> before killing it? As far as I can tell, you come straight in from
> syscall context, take a mutex, and do percpu_ref_kill().

I was just referring to the regular shutdown path. You are right, for
the later io_uring_register() more care needs to be taken. I'll just
switch to the tryget, it's not like the non-try is a big cycle saver
by any stretch.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  1:07   ` Jann Horn
  2019-01-29  2:21     ` Jann Horn
@ 2019-01-29  2:21     ` Jens Axboe
  1 sibling, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2019-01-29  2:21 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On 1/28/19 6:07 PM, Jann Horn wrote:
> On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>> The submission queue (SQ) and completion queue (CQ) rings are shared
>> between the application and the kernel. This eliminates the need to
>> copy data back and forth to submit and complete IO.
> [...]
>> +static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
>> +{
>> +       struct io_sq_ring *ring = ctx->sq_ring;
>> +       unsigned head;
>> +
>> +       /*
>> +        * The cached sq head (or cq tail) serves two purposes:
>> +        *
>> +        * 1) allows us to batch the cost of updating the user visible
>> +        *    head updates.
>> +        * 2) allows the kernel side to track the head on its own, even
>> +        *    though the application is the one updating it.
>> +        */
>> +       head = ctx->cached_sq_head;
>> +       smp_rmb();
>> +       if (head == READ_ONCE(ring->r.tail))
>> +               return false;
>> +
>> +       head = ring->array[head & ctx->sq_mask];
>> +       if (head < ctx->sq_entries) {
>> +               s->index = head;
>> +               s->sqe = &ctx->sq_sqes[head];
> 
> ring->array can be mapped writable into userspace, right? If so: This
> looks like a double-read issue; the compiler might assume that
> ring->array is not modified concurrently and perform separate memory
> accesses for the "if (head < ctx->sq_entries)" check and the
> "&ctx->sq_sqes[head]" computation. Please use READ_ONCE()/WRITE_ONCE()
> for all accesses to memory that userspace could concurrently modify in
> a malicious way.
> 
> There have been some pretty severe security bugs caused by missing
> READ_ONCE() annotations around accesses to shared memory; see, for
> example, https://www.blackhat.com/docs/us-16/materials/us-16-Wilhelm-Xenpwn-Breaking-Paravirtualized-Devices.pdf
> . Slides 35-48 show how the code "switch (op->cmd)", where "op" is a
> pointer to shared memory, allowed an attacker to break out of a Xen
> virtual machine because the compiler generated multiple memory
> accesses.

Thanks, I'll update these to use READ/WRITE_ONCE.

>> +               ctx->cached_sq_head++;
>> +               return true;
>> +       }
>> +
>> +       /* drop invalid entries */
>> +       ctx->cached_sq_head++;
>> +       ring->dropped++;
>> +       smp_wmb();
>> +       return false;
>> +}
> [...]
>> +SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>> +               u32, min_complete, u32, flags, const sigset_t __user *, sig,
>> +               size_t, sigsz)
>> +{
>> +       struct io_ring_ctx *ctx;
>> +       long ret = -EBADF;
>> +       struct fd f;
>> +
>> +       f = fdget(fd);
>> +       if (!f.file)
>> +               return -EBADF;
>> +
>> +       ret = -EOPNOTSUPP;
>> +       if (f.file->f_op != &io_uring_fops)
>> +               goto out_fput;
> 
> Oh, by the way: If you feel like it, maybe you could add a helper
> fdget_typed(int fd, struct file_operations *f_op), or something like
> that, so that there is less boilerplate code for first doing fdget(),
> then checking ->f_op, and then coding an extra bailout path for that?
> But that doesn't really have much to do with your patchset, feel free
> to ignore this comment.

That's not a bad idea, I think this is a fairly common code pattern.
I'll look into it.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  1:07   ` Jann Horn
@ 2019-01-29  2:21     ` Jann Horn
  2019-01-29  2:54       ` Jens Axboe
  2019-01-29  3:46       ` Jens Axboe
  2019-01-29  2:21     ` Jens Axboe
  1 sibling, 2 replies; 49+ messages in thread
From: Jann Horn @ 2019-01-29  2:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 2:07 AM Jann Horn <jannh@google.com> wrote:
> On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> > The submission queue (SQ) and completion queue (CQ) rings are shared
> > between the application and the kernel. This eliminates the need to
> > copy data back and forth to submit and complete IO.
> [...]
> > +static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
> > +{
> > +       struct io_sq_ring *ring = ctx->sq_ring;
> > +       unsigned head;
> > +
> > +       /*
> > +        * The cached sq head (or cq tail) serves two purposes:
> > +        *
> > +        * 1) allows us to batch the cost of updating the user visible
> > +        *    head updates.
> > +        * 2) allows the kernel side to track the head on its own, even
> > +        *    though the application is the one updating it.
> > +        */
> > +       head = ctx->cached_sq_head;
> > +       smp_rmb();
> > +       if (head == READ_ONCE(ring->r.tail))
> > +               return false;
> > +
> > +       head = ring->array[head & ctx->sq_mask];
> > +       if (head < ctx->sq_entries) {
> > +               s->index = head;
> > +               s->sqe = &ctx->sq_sqes[head];
>
> ring->array can be mapped writable into userspace, right? If so: This
> looks like a double-read issue; the compiler might assume that
> ring->array is not modified concurrently and perform separate memory
> accesses for the "if (head < ctx->sq_entries)" check and the
> "&ctx->sq_sqes[head]" computation. Please use READ_ONCE()/WRITE_ONCE()
> for all accesses to memory that userspace could concurrently modify in
> a malicious way.
>
> There have been some pretty severe security bugs caused by missing
> READ_ONCE() annotations around accesses to shared memory; see, for
> example, https://www.blackhat.com/docs/us-16/materials/us-16-Wilhelm-Xenpwn-Breaking-Paravirtualized-Devices.pdf
> . Slides 35-48 show how the code "switch (op->cmd)", where "op" is a
> pointer to shared memory, allowed an attacker to break out of a Xen
> virtual machine because the compiler generated multiple memory
> accesses.

Oh, actually, it's even worse (comments with "//" added by me):

io_sq_thread() does this:

        do {
                // sqes[i].sqe is pointer to shared memory, result of
                // io_sqe_needs_user() is unreliable
                if (all_fixed && io_sqe_needs_user(sqes[i].sqe))
                        all_fixed = false;

                i++;
                if (i == ARRAY_SIZE(sqes))
                        break;
        } while (io_get_sqring(ctx, &sqes[i]));
        // sqes[...].sqe are pointers to shared memory

        io_commit_sqring(ctx);

        /* Unless all new commands are FIXED regions, grab mm */
        if (!all_fixed && !cur_mm) {
                mm_fault = !mmget_not_zero(ctx->sqo_mm);
                if (!mm_fault) {
                        use_mm(ctx->sqo_mm);
                        cur_mm = ctx->sqo_mm;
                }
        }

        inflight += io_submit_sqes(ctx, sqes, i, mm_fault);

Then the shared memory pointers go into io_submit_sqes():

static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
                          unsigned int nr, bool mm_fault)
{
        struct io_submit_state state, *statep = NULL;
        int ret, i, submitted = 0;
        // sqes[...].sqe are pointers to shared memory
        [...]
        for (i = 0; i < nr; i++) {
                if (unlikely(mm_fault))
                        ret = -EFAULT;
                else
                        ret = io_submit_sqe(ctx, &sqes[i], statep);
                [...]
        }
        [...]
}

And on into io_submit_sqe():

static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
                         struct io_submit_state *state)
{
        [...]
        ret = __io_submit_sqe(ctx, req, s, true, state);
        [...]
}

And there it gets interesting:

static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
                           struct sqe_submit *s, bool force_nonblock,
                           struct io_submit_state *state)
{
        // s->sqe is a pointer to shared memory
        const struct io_uring_sqe *sqe = s->sqe;
        // sqe is a pointer to shared memory
        ssize_t ret;

        if (unlikely(s->index >= ctx->sq_entries))
                return -EINVAL;
        req->user_data = sqe->user_data;

        ret = -EINVAL;
        // switch() on read from shared memory, potential instruction pointer
        // control
        switch (sqe->opcode) {
        [...]
        case IORING_OP_READV:
                if (unlikely(sqe->buf_index))
                        return -EINVAL;
                ret = io_read(req, sqe, force_nonblock, state);
                break;
        [...]
        case IORING_OP_READ_FIXED:
                ret = io_read(req, sqe, force_nonblock, state);
                break;
        [...]
        }
        [...]
}

On into io_read():

static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
                       bool force_nonblock, struct io_submit_state *state)
{
[...]
        // sqe is a pointer to shared memory
        ret = io_prep_rw(req, sqe, force_nonblock, state);
        [...]
}

And then io_prep_rw() does multiple reads even in the source code:

static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
                      bool force_nonblock, struct io_submit_state *state)
{
        struct io_ring_ctx *ctx = req->ctx;
        struct kiocb *kiocb = &req->rw;
        int ret;

        // sqe is a pointer to shared memory

        // double-read of sqe->flags, see end of function
        if (sqe->flags & IOSQE_FIXED_FILE) {
                // double-read of sqe->fd for the bounds check and the
array access, potential OOB pointer read
                if (unlikely(!ctx->user_files || sqe->fd >= ctx->nr_user_files))
                        return -EBADF;
                kiocb->ki_filp = ctx->user_files[sqe->fd];
                req->flags |= REQ_F_FIXED_FILE;
        } else {
                kiocb->ki_filp = io_file_get(state, sqe->fd);
        }
        if (unlikely(!kiocb->ki_filp))
                return -EBADF;
        kiocb->ki_pos = sqe->off;
        kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
        kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
        // three reads of sqe->ioprio, bypassable capability check
        if (sqe->ioprio) {
                ret = ioprio_check_cap(sqe->ioprio);
                if (ret)
                        goto out_fput;

                kiocb->ki_ioprio = sqe->ioprio;
        } else
                kiocb->ki_ioprio = get_current_ioprio();
        [...]
        return 0;
out_fput:
        // double-read of sqe->flags, changed value can lead to
unbalanced refcount
        if (!(sqe->flags & IOSQE_FIXED_FILE))
                io_file_put(state, kiocb->ki_filp);
        return ret;
}

Please create a local copy of the request before parsing it to keep
the data from changing under you. Additionally, it might make sense to
annotate every pointer to shared memory with a comment, or something
like that, to ensure that anyone looking at the code can immediately
see for which pointers special caution is required on access.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  1:31     ` Jens Axboe
@ 2019-01-29  1:32       ` Jann Horn
  2019-01-29  2:23         ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Jann Horn @ 2019-01-29  1:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 2:31 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/28/19 6:29 PM, Jann Horn wrote:
> > On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> The submission queue (SQ) and completion queue (CQ) rings are shared
> >> between the application and the kernel. This eliminates the need to
> >> copy data back and forth to submit and complete IO.
> > [...]
> >> +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
> >> +{
> >> +       struct io_kiocb *req;
> >> +
> >> +       /* safe to use the non tryget, as we're inside ring ref already */
> >> +       percpu_ref_get(&ctx->refs);
> >
> > Is that true? In the path io_sq_thread() -> io_submit_sqes() ->
> > io_submit_sqe() -> io_get_req(), I don't see anything that's already
> > holding a reference for you. Is the worker thread holding a reference
> > somewhere that I'm missing?
>
> If the thread is alive, then the ctx is alive. Before we drop the last
> ref to the ctx (and kill it), we wait for the thread to exit.

Where in __io_uring_register() are you waiting for the thread to exit
before killing it? As far as I can tell, you come straight in from
syscall context, take a mutex, and do percpu_ref_kill().

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  1:29   ` Jann Horn
@ 2019-01-29  1:31     ` Jens Axboe
  2019-01-29  1:32       ` Jann Horn
  0 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2019-01-29  1:31 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On 1/28/19 6:29 PM, Jann Horn wrote:
> On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>> The submission queue (SQ) and completion queue (CQ) rings are shared
>> between the application and the kernel. This eliminates the need to
>> copy data back and forth to submit and complete IO.
> [...]
>> +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
>> +{
>> +       struct io_kiocb *req;
>> +
>> +       /* safe to use the non tryget, as we're inside ring ref already */
>> +       percpu_ref_get(&ctx->refs);
> 
> Is that true? In the path io_sq_thread() -> io_submit_sqes() ->
> io_submit_sqe() -> io_get_req(), I don't see anything that's already
> holding a reference for you. Is the worker thread holding a reference
> somewhere that I'm missing?

If the thread is alive, then the ctx is alive. Before we drop the last
ref to the ctx (and kill it), we wait for the thread to exit.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 21:35 ` [PATCH 05/18] Add " Jens Axboe
                     ` (2 preceding siblings ...)
  2019-01-29  1:07   ` Jann Horn
@ 2019-01-29  1:29   ` Jann Horn
  2019-01-29  1:31     ` Jens Axboe
  2019-01-29  7:12   ` Bert Wesarg
  2019-01-29 12:12   ` Florian Weimer
  5 siblings, 1 reply; 49+ messages in thread
From: Jann Horn @ 2019-01-29  1:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> The submission queue (SQ) and completion queue (CQ) rings are shared
> between the application and the kernel. This eliminates the need to
> copy data back and forth to submit and complete IO.
[...]
> +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
> +{
> +       struct io_kiocb *req;
> +
> +       /* safe to use the non tryget, as we're inside ring ref already */
> +       percpu_ref_get(&ctx->refs);

Is that true? In the path io_sq_thread() -> io_submit_sqes() ->
io_submit_sqe() -> io_get_req(), I don't see anything that's already
holding a reference for you. Is the worker thread holding a reference
somewhere that I'm missing?

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 21:35 ` [PATCH 05/18] Add " Jens Axboe
  2019-01-28 21:53   ` Jeff Moyer
  2019-01-28 22:32   ` Jann Horn
@ 2019-01-29  1:07   ` Jann Horn
  2019-01-29  2:21     ` Jann Horn
  2019-01-29  2:21     ` Jens Axboe
  2019-01-29  1:29   ` Jann Horn
                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 49+ messages in thread
From: Jann Horn @ 2019-01-29  1:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> The submission queue (SQ) and completion queue (CQ) rings are shared
> between the application and the kernel. This eliminates the need to
> copy data back and forth to submit and complete IO.
[...]
> +static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
> +{
> +       struct io_sq_ring *ring = ctx->sq_ring;
> +       unsigned head;
> +
> +       /*
> +        * The cached sq head (or cq tail) serves two purposes:
> +        *
> +        * 1) allows us to batch the cost of updating the user visible
> +        *    head updates.
> +        * 2) allows the kernel side to track the head on its own, even
> +        *    though the application is the one updating it.
> +        */
> +       head = ctx->cached_sq_head;
> +       smp_rmb();
> +       if (head == READ_ONCE(ring->r.tail))
> +               return false;
> +
> +       head = ring->array[head & ctx->sq_mask];
> +       if (head < ctx->sq_entries) {
> +               s->index = head;
> +               s->sqe = &ctx->sq_sqes[head];

ring->array can be mapped writable into userspace, right? If so: This
looks like a double-read issue; the compiler might assume that
ring->array is not modified concurrently and perform separate memory
accesses for the "if (head < ctx->sq_entries)" check and the
"&ctx->sq_sqes[head]" computation. Please use READ_ONCE()/WRITE_ONCE()
for all accesses to memory that userspace could concurrently modify in
a malicious way.

There have been some pretty severe security bugs caused by missing
READ_ONCE() annotations around accesses to shared memory; see, for
example, https://www.blackhat.com/docs/us-16/materials/us-16-Wilhelm-Xenpwn-Breaking-Paravirtualized-Devices.pdf
. Slides 35-48 show how the code "switch (op->cmd)", where "op" is a
pointer to shared memory, allowed an attacker to break out of a Xen
virtual machine because the compiler generated multiple memory
accesses.

> +               ctx->cached_sq_head++;
> +               return true;
> +       }
> +
> +       /* drop invalid entries */
> +       ctx->cached_sq_head++;
> +       ring->dropped++;
> +       smp_wmb();
> +       return false;
> +}
[...]
> +SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> +               u32, min_complete, u32, flags, const sigset_t __user *, sig,
> +               size_t, sigsz)
> +{
> +       struct io_ring_ctx *ctx;
> +       long ret = -EBADF;
> +       struct fd f;
> +
> +       f = fdget(fd);
> +       if (!f.file)
> +               return -EBADF;
> +
> +       ret = -EOPNOTSUPP;
> +       if (f.file->f_op != &io_uring_fops)
> +               goto out_fput;

Oh, by the way: If you feel like it, maybe you could add a helper
fdget_typed(int fd, struct file_operations *f_op), or something like
that, so that there is less boilerplate code for first doing fdget(),
then checking ->f_op, and then coding an extra bailout path for that?
But that doesn't really have much to do with your patchset, feel free
to ignore this comment.

[...]
> +out_fput:
> +       fdput(f);
> +       return ret;
> +}

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  0:58                 ` Jann Horn
@ 2019-01-29  1:01                   ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2019-01-29  1:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: Al Viro, linux-aio, linux-block, linux-man, Linux API, hch,
	jmoyer, Avi Kivity

On 1/28/19 5:58 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 1:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/28/19 5:34 PM, Jann Horn wrote:
>>> On Tue, Jan 29, 2019 at 1:32 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 1/28/19 5:03 PM, Jens Axboe wrote:
>>>>>> But you only do that teardown on ->release, right? And ->release
>>>>>> doesn't have much to do with the process lifetime.
>>>>>
>>>>> Yes, only on ->relase().
>>>>
>>>> OK, so I reworked the files struct to just grab it, then we ensure that
>>>> doesn't go away. For mm, it's a bit more tricky. I think the best
>>>> solution here is to add a fops->flush() and check for the process
>>>> exiting its files. If it does, we quiesce the async contexts and prevent
>>>> further use of that mm. We can't just keep holding a reference to the mm
>>>> like we do with the files.
>>>>
>>>> That should solve both cases.
>>>
>>> You still have to hold a reference on the mm though, I think (for
>>> example, because two tasks might be sharing the fd table without
>>> sharing the mm).
>>
>> Yes good point, except we can't hold a reference to it.
> 
> Why not? kvm_create_vm() does it, too:
> 
>     mmgrab(current->mm);
>     kvm->mm = current->mm;

I missed that helper, was only looking at mmget(). But yeah, that'll do it!

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  0:55               ` Jens Axboe
@ 2019-01-29  0:58                 ` Jann Horn
  2019-01-29  1:01                   ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Jann Horn @ 2019-01-29  0:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Al Viro, linux-aio, linux-block, linux-man, Linux API, hch,
	jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 1:55 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/28/19 5:34 PM, Jann Horn wrote:
> > On Tue, Jan 29, 2019 at 1:32 AM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 1/28/19 5:03 PM, Jens Axboe wrote:
> >>>> But you only do that teardown on ->release, right? And ->release
> >>>> doesn't have much to do with the process lifetime.
> >>>
> >>> Yes, only on ->relase().
> >>
> >> OK, so I reworked the files struct to just grab it, then we ensure that
> >> doesn't go away. For mm, it's a bit more tricky. I think the best
> >> solution here is to add a fops->flush() and check for the process
> >> exiting its files. If it does, we quiesce the async contexts and prevent
> >> further use of that mm. We can't just keep holding a reference to the mm
> >> like we do with the files.
> >>
> >> That should solve both cases.
> >
> > You still have to hold a reference on the mm though, I think (for
> > example, because two tasks might be sharing the fd table without
> > sharing the mm).
>
> Yes good point, except we can't hold a reference to it.

Why not? kvm_create_vm() does it, too:

    mmgrab(current->mm);
    kvm->mm = current->mm;

> But I think
> we can get around this by using an mmu notifier instead. That eliminates
> the need for ->flush() as well.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  0:34             ` Jann Horn
@ 2019-01-29  0:55               ` Jens Axboe
  2019-01-29  0:58                 ` Jann Horn
  0 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2019-01-29  0:55 UTC (permalink / raw)
  To: Jann Horn
  Cc: Al Viro, linux-aio, linux-block, linux-man, Linux API, hch,
	jmoyer, Avi Kivity

On 1/28/19 5:34 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 1:32 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/28/19 5:03 PM, Jens Axboe wrote:
>>>> But you only do that teardown on ->release, right? And ->release
>>>> doesn't have much to do with the process lifetime.
>>>
>>> Yes, only on ->relase().
>>
>> OK, so I reworked the files struct to just grab it, then we ensure that
>> doesn't go away. For mm, it's a bit more tricky. I think the best
>> solution here is to add a fops->flush() and check for the process
>> exiting its files. If it does, we quiesce the async contexts and prevent
>> further use of that mm. We can't just keep holding a reference to the mm
>> like we do with the files.
>>
>> That should solve both cases.
> 
> You still have to hold a reference on the mm though, I think (for
> example, because two tasks might be sharing the fd table without
> sharing the mm).

Yes good point, except we can't hold a reference to it. But I think
we can get around this by using an mmu notifier instead. That eliminates
the need for ->flush() as well.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  0:31           ` Jens Axboe
@ 2019-01-29  0:34             ` Jann Horn
  2019-01-29  0:55               ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Jann Horn @ 2019-01-29  0:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Al Viro, linux-aio, linux-block, linux-man, Linux API, hch,
	jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 1:32 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/28/19 5:03 PM, Jens Axboe wrote:
> >> But you only do that teardown on ->release, right? And ->release
> >> doesn't have much to do with the process lifetime.
> >
> > Yes, only on ->relase().
>
> OK, so I reworked the files struct to just grab it, then we ensure that
> doesn't go away. For mm, it's a bit more tricky. I think the best
> solution here is to add a fops->flush() and check for the process
> exiting its files. If it does, we quiesce the async contexts and prevent
> further use of that mm. We can't just keep holding a reference to the mm
> like we do with the files.
>
> That should solve both cases.

You still have to hold a reference on the mm though, I think (for
example, because two tasks might be sharing the fd table without
sharing the mm).

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-29  0:03         ` Jens Axboe
@ 2019-01-29  0:31           ` Jens Axboe
  2019-01-29  0:34             ` Jann Horn
  0 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2019-01-29  0:31 UTC (permalink / raw)
  To: Jann Horn
  Cc: Al Viro, linux-aio, linux-block, linux-man, Linux API, hch,
	jmoyer, Avi Kivity

On 1/28/19 5:03 PM, Jens Axboe wrote:
>> But you only do that teardown on ->release, right? And ->release
>> doesn't have much to do with the process lifetime.
> 
> Yes, only on ->relase().

OK, so I reworked the files struct to just grab it, then we ensure that
doesn't go away. For mm, it's a bit more tricky. I think the best
solution here is to add a fops->flush() and check for the process
exiting its files. If it does, we quiesce the async contexts and prevent
further use of that mm. We can't just keep holding a reference to the mm
like we do with the files.

That should solve both cases.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 23:59       ` Jann Horn
@ 2019-01-29  0:03         ` Jens Axboe
  2019-01-29  0:31           ` Jens Axboe
  2019-02-01 16:57         ` Matt Mullins
  1 sibling, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2019-01-29  0:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Al Viro, linux-aio, linux-block, linux-man, Linux API, hch,
	jmoyer, Avi Kivity

On 1/28/19 4:59 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 12:47 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/28/19 3:32 PM, Jann Horn wrote:
>>> On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> The submission queue (SQ) and completion queue (CQ) rings are shared
>>>> between the application and the kernel. This eliminates the need to
>>>> copy data back and forth to submit and complete IO.
>>>>
>>>> IO submissions use the io_uring_sqe data structure, and completions
>>>> are generated in the form of io_uring_sqe data structures. The SQ
>>>> ring is an index into the io_uring_sqe array, which makes it possible
>>>> to submit a batch of IOs without them being contiguous in the ring.
>>>> The CQ ring is always contiguous, as completion events are inherently
>>>> unordered, and hence any io_uring_cqe entry can point back to an
>>>> arbitrary submission.
>>>>
>>>> Two new system calls are added for this:
>>>>
>>>> io_uring_setup(entries, params)
>>>>         Sets up a context for doing async IO. On success, returns a file
>>>>         descriptor that the application can mmap to gain access to the
>>>>         SQ ring, CQ ring, and io_uring_sqes.
>>>>
>>>> io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
>>>>         Initiates IO against the rings mapped to this fd, or waits for
>>>>         them to complete, or both. The behavior is controlled by the
>>>>         parameters passed in. If 'to_submit' is non-zero, then we'll
>>>>         try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
>>>>         kernel will wait for 'min_complete' events, if they aren't
>>>>         already available. It's valid to set IORING_ENTER_GETEVENTS
>>>>         and 'min_complete' == 0 at the same time, this allows the
>>>>         kernel to return already completed events without waiting
>>>>         for them. This is useful only for polling, as for IRQ
>>>>         driven IO, the application can just check the CQ ring
>>>>         without entering the kernel.
>>>>
>>>> With this setup, it's possible to do async IO with a single system
>>>> call. Future developments will enable polled IO with this interface,
>>>> and polled submission as well. The latter will enable an application
>>>> to do IO without doing ANY system calls at all.
>>>>
>>>> For IRQ driven IO, an application only needs to enter the kernel for
>>>> completions if it wants to wait for them to occur.
>>>>
>>>> Each io_uring is backed by a workqueue, to support buffered async IO
>>>> as well. We will only punt to an async context if the command would
>>>> need to wait for IO on the device side. Any data that can be accessed
>>>> directly in the page cache is done inline. This avoids the slowness
>>>> issue of usual threadpools, since cached data is accessed as quickly
>>>> as a sync interface.
>>>>
>>>> Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
>>> [...]
>>>> +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>>>> +                     bool force_nonblock)
>>>> +{
>>>> +       struct kiocb *kiocb = &req->rw;
>>>> +       int ret;
>>>> +
>>>> +       kiocb->ki_filp = fget(sqe->fd);
>>>> +       if (unlikely(!kiocb->ki_filp))
>>>> +               return -EBADF;
>>>> +       kiocb->ki_pos = sqe->off;
>>>> +       kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
>>>> +       kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
>>>> +       if (sqe->ioprio) {
>>>> +               ret = ioprio_check_cap(sqe->ioprio);
>>>> +               if (ret)
>>>> +                       goto out_fput;
>>>> +
>>>> +               kiocb->ki_ioprio = sqe->ioprio;
>>>> +       } else
>>>> +               kiocb->ki_ioprio = get_current_ioprio();
>>>> +
>>>> +       ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
>>>> +       if (unlikely(ret))
>>>> +               goto out_fput;
>>>> +       if (force_nonblock) {
>>>> +               kiocb->ki_flags |= IOCB_NOWAIT;
>>>> +               req->flags |= REQ_F_FORCE_NONBLOCK;
>>>> +       }
>>>> +       if (kiocb->ki_flags & IOCB_HIPRI) {
>>>> +               ret = -EINVAL;
>>>> +               goto out_fput;
>>>> +       }
>>>> +
>>>> +       kiocb->ki_complete = io_complete_rw;
>>>> +       return 0;
>>>> +out_fput:
>>>> +       fput(kiocb->ki_filp);
>>>> +       return ret;
>>>> +}
>>> [...]
>>>> +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>>>> +                      bool force_nonblock)
>>>> +{
>>>> +       struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>>>> +       struct kiocb *kiocb = &req->rw;
>>>> +       struct iov_iter iter;
>>>> +       struct file *file;
>>>> +       ssize_t ret;
>>>> +
>>>> +       ret = io_prep_rw(req, sqe, force_nonblock);
>>>> +       if (ret)
>>>> +               return ret;
>>>> +       file = kiocb->ki_filp;
>>>> +
>>>> +       ret = -EBADF;
>>>> +       if (unlikely(!(file->f_mode & FMODE_READ)))
>>>> +               goto out_fput;
>>>> +       ret = -EINVAL;
>>>> +       if (unlikely(!file->f_op->read_iter))
>>>> +               goto out_fput;
>>>> +
>>>> +       ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
>>>> +       if (ret)
>>>> +               goto out_fput;
>>>> +
>>>> +       ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
>>>> +       if (!ret) {
>>>> +               ssize_t ret2;
>>>> +
>>>> +               /* Catch -EAGAIN return for forced non-blocking submission */
>>>> +               ret2 = call_read_iter(file, kiocb, &iter);
>>>> +               if (!force_nonblock || ret2 != -EAGAIN)
>>>> +                       io_rw_done(kiocb, ret2);
>>>> +               else
>>>> +                       ret = -EAGAIN;
>>>> +       }
>>>> +       kfree(iovec);
>>>> +out_fput:
>>>> +       if (unlikely(ret))
>>>> +               fput(file);
>>>> +       return ret;
>>>> +}
>>> [...]
>>>> +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>>> +                          struct sqe_submit *s, bool force_nonblock)
>>>> +{
>>>> +       const struct io_uring_sqe *sqe = s->sqe;
>>>> +       ssize_t ret;
>>>> +
>>>> +       if (unlikely(s->index >= ctx->sq_entries))
>>>> +               return -EINVAL;
>>>> +       req->user_data = sqe->user_data;
>>>> +
>>>> +       ret = -EINVAL;
>>>> +       switch (sqe->opcode) {
>>>> +       case IORING_OP_NOP:
>>>> +               ret = io_nop(req, sqe);
>>>> +               break;
>>>> +       case IORING_OP_READV:
>>>> +               ret = io_read(req, sqe, force_nonblock);
>>>> +               break;
>>>> +       case IORING_OP_WRITEV:
>>>> +               ret = io_write(req, sqe, force_nonblock);
>>>> +               break;
>>>> +       default:
>>>> +               ret = -EINVAL;
>>>> +               break;
>>>> +       }
>>>> +
>>>> +       return ret;
>>>> +}
>>>> +
>>>> +static void io_sq_wq_submit_work(struct work_struct *work)
>>>> +{
>>>> +       struct io_kiocb *req = container_of(work, struct io_kiocb, work);
>>>> +       struct sqe_submit *s = &req->submit;
>>>> +       u64 user_data = s->sqe->user_data;
>>>> +       struct io_ring_ctx *ctx = req->ctx;
>>>> +       mm_segment_t old_fs = get_fs();
>>>> +       struct files_struct *old_files;
>>>> +       int ret;
>>>> +
>>>> +        /* Ensure we clear previously set forced non-block flag */
>>>> +       req->flags &= ~REQ_F_FORCE_NONBLOCK;
>>>> +
>>>> +       old_files = current->files;
>>>> +       current->files = ctx->sqo_files;
>>>
>>> I think you're not supposed to twiddle with current->files without
>>> holding task_lock(current).
>>
>> 'current' is the work queue item in this case, do we need to protect
>> against anything else? I can add the locking around the assignments
>> (both places).
> 
> Stuff like proc_fd_link() uses get_files_struct(), which grabs a
> reference to your current files_struct protected only by task_lock();
> and it doesn't use anything like READ_ONCE(), so even if the object
> lifetime is not a problem, get_files_struct() could potentially crash
> due to a double-read (reading task->files twice and assuming that the
> result will be the same). As far as I can tell, this procfs code also
> works on kernel threads.

OK, that does make sense. I've added the locking.

>>>> +       if (!mmget_not_zero(ctx->sqo_mm)) {
>>>> +               ret = -EFAULT;
>>>> +               goto err;
>>>> +       }
>>>> +
>>>> +       use_mm(ctx->sqo_mm);
>>>> +       set_fs(USER_DS);
>>>> +
>>>> +       ret = __io_submit_sqe(ctx, req, s, false);
>>>> +
>>>> +       set_fs(old_fs);
>>>> +       unuse_mm(ctx->sqo_mm);
>>>> +       mmput(ctx->sqo_mm);
>>>> +err:
>>>> +       if (ret) {
>>>> +               io_cqring_add_event(ctx, user_data, ret, 0);
>>>> +               io_free_req(req);
>>>> +       }
>>>> +       current->files = old_files;
>>>> +}
>>> [...]
>>>> +static int io_sq_offload_start(struct io_ring_ctx *ctx)
>>>> +{
>>>> +       int ret;
>>>> +
>>>> +       ctx->sqo_mm = current->mm;
>>>
>>> What keeps this thing alive?
>>
>> I think we're deadling with the same thing as the files below, I'll
>> defer to that.
>>
>>>> +       /*
>>>> +        * This is safe since 'current' has the fd installed, and if that gets
>>>> +        * closed on exit, then fops->release() is invoked which waits for the
>>>> +        * async contexts to flush and exit before exiting.
>>>> +        */
>>>> +       ret = -EBADF;
>>>> +       ctx->sqo_files = current->files;
>>>> +       if (!ctx->sqo_files)
>>>> +               goto err;
>>>
>>> That's gnarly. Adding Al Viro to the thread.
>>>
>>> I think you misunderstand the semantics of f_op->release. The ->flush
>>> handler is invoked whenever a file descriptor is closed through
>>> filp_close() (via deletion of the files_struct, sys_close(),
>>> sys_dup2(), ...), so if you had used that one, _maybe_ this would
>>> work. But the ->release handler only runs when the _last_ reference to
>>> a struct file has been dropped - so you can, for example, fork() a
>>> child, then exit() in the parent, and the ->release handler isn't
>>> invoked. So I don't see how this can work.
>>
>> The anonfd is CLOEXEC. The idea is exactly that it only runs when the
>> last reference to the file has been dropped. Not sure why you think I
>> need ->flush() here?
> 
> Can't I just use fcntl(fd, F_SETFD, fd, 0) to clear the CLOEXEC flag?
> Or send the fd via SCM_RIGHTS?

That would obviously be a problem...

>>> But even if you had abused ->flush for this instead: close_files()
>>> currently has a comment in it that claims that "this is the last
>>> reference to the files structure"; this change would make that claim
>>> untrue.
>>
>> Let me see if I can explain my intent better than that comment... We
>> know the parent who set up the io_uring instance will be around for as
>> long as io_uring instance persists.
> 
> That's the part that I think is wrong: As far as I can tell, the
> parent can go away and you won't notice.

If that's the case, then the mm/files referencing needs to be looked
over for sure. It's currently relying on the fact that the parent stays
alive. If it can go away without ->release() being called, then we have
issues.

> Also, note that "the parent" is different things for ->files and ->mm.
> You can have a multithreaded process whose threads don't have the same
> ->files, or multiple process that share ->files without sharing ->mm,
> ...

Of course, I do realize that.

>> When we are tearing down the
>> io_uring, then we wait for any async contexts (like the one above) to
>> exit. Once they are exited, it's safe to proceed with the exit and
>> teardown ->files[].
> 
> But you only do that teardown on ->release, right? And ->release
> doesn't have much to do with the process lifetime.

Yes, only on ->relase().

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 23:46     ` Jens Axboe
@ 2019-01-28 23:59       ` Jann Horn
  2019-01-29  0:03         ` Jens Axboe
  2019-02-01 16:57         ` Matt Mullins
  0 siblings, 2 replies; 49+ messages in thread
From: Jann Horn @ 2019-01-28 23:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Al Viro, linux-aio, linux-block, linux-man, Linux API, hch,
	jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 12:47 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/28/19 3:32 PM, Jann Horn wrote:
> > On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> The submission queue (SQ) and completion queue (CQ) rings are shared
> >> between the application and the kernel. This eliminates the need to
> >> copy data back and forth to submit and complete IO.
> >>
> >> IO submissions use the io_uring_sqe data structure, and completions
> >> are generated in the form of io_uring_sqe data structures. The SQ
> >> ring is an index into the io_uring_sqe array, which makes it possible
> >> to submit a batch of IOs without them being contiguous in the ring.
> >> The CQ ring is always contiguous, as completion events are inherently
> >> unordered, and hence any io_uring_cqe entry can point back to an
> >> arbitrary submission.
> >>
> >> Two new system calls are added for this:
> >>
> >> io_uring_setup(entries, params)
> >>         Sets up a context for doing async IO. On success, returns a file
> >>         descriptor that the application can mmap to gain access to the
> >>         SQ ring, CQ ring, and io_uring_sqes.
> >>
> >> io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
> >>         Initiates IO against the rings mapped to this fd, or waits for
> >>         them to complete, or both. The behavior is controlled by the
> >>         parameters passed in. If 'to_submit' is non-zero, then we'll
> >>         try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
> >>         kernel will wait for 'min_complete' events, if they aren't
> >>         already available. It's valid to set IORING_ENTER_GETEVENTS
> >>         and 'min_complete' == 0 at the same time, this allows the
> >>         kernel to return already completed events without waiting
> >>         for them. This is useful only for polling, as for IRQ
> >>         driven IO, the application can just check the CQ ring
> >>         without entering the kernel.
> >>
> >> With this setup, it's possible to do async IO with a single system
> >> call. Future developments will enable polled IO with this interface,
> >> and polled submission as well. The latter will enable an application
> >> to do IO without doing ANY system calls at all.
> >>
> >> For IRQ driven IO, an application only needs to enter the kernel for
> >> completions if it wants to wait for them to occur.
> >>
> >> Each io_uring is backed by a workqueue, to support buffered async IO
> >> as well. We will only punt to an async context if the command would
> >> need to wait for IO on the device side. Any data that can be accessed
> >> directly in the page cache is done inline. This avoids the slowness
> >> issue of usual threadpools, since cached data is accessed as quickly
> >> as a sync interface.
> >>
> >> Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
> > [...]
> >> +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> >> +                     bool force_nonblock)
> >> +{
> >> +       struct kiocb *kiocb = &req->rw;
> >> +       int ret;
> >> +
> >> +       kiocb->ki_filp = fget(sqe->fd);
> >> +       if (unlikely(!kiocb->ki_filp))
> >> +               return -EBADF;
> >> +       kiocb->ki_pos = sqe->off;
> >> +       kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
> >> +       kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
> >> +       if (sqe->ioprio) {
> >> +               ret = ioprio_check_cap(sqe->ioprio);
> >> +               if (ret)
> >> +                       goto out_fput;
> >> +
> >> +               kiocb->ki_ioprio = sqe->ioprio;
> >> +       } else
> >> +               kiocb->ki_ioprio = get_current_ioprio();
> >> +
> >> +       ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
> >> +       if (unlikely(ret))
> >> +               goto out_fput;
> >> +       if (force_nonblock) {
> >> +               kiocb->ki_flags |= IOCB_NOWAIT;
> >> +               req->flags |= REQ_F_FORCE_NONBLOCK;
> >> +       }
> >> +       if (kiocb->ki_flags & IOCB_HIPRI) {
> >> +               ret = -EINVAL;
> >> +               goto out_fput;
> >> +       }
> >> +
> >> +       kiocb->ki_complete = io_complete_rw;
> >> +       return 0;
> >> +out_fput:
> >> +       fput(kiocb->ki_filp);
> >> +       return ret;
> >> +}
> > [...]
> >> +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> >> +                      bool force_nonblock)
> >> +{
> >> +       struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
> >> +       struct kiocb *kiocb = &req->rw;
> >> +       struct iov_iter iter;
> >> +       struct file *file;
> >> +       ssize_t ret;
> >> +
> >> +       ret = io_prep_rw(req, sqe, force_nonblock);
> >> +       if (ret)
> >> +               return ret;
> >> +       file = kiocb->ki_filp;
> >> +
> >> +       ret = -EBADF;
> >> +       if (unlikely(!(file->f_mode & FMODE_READ)))
> >> +               goto out_fput;
> >> +       ret = -EINVAL;
> >> +       if (unlikely(!file->f_op->read_iter))
> >> +               goto out_fput;
> >> +
> >> +       ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
> >> +       if (ret)
> >> +               goto out_fput;
> >> +
> >> +       ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
> >> +       if (!ret) {
> >> +               ssize_t ret2;
> >> +
> >> +               /* Catch -EAGAIN return for forced non-blocking submission */
> >> +               ret2 = call_read_iter(file, kiocb, &iter);
> >> +               if (!force_nonblock || ret2 != -EAGAIN)
> >> +                       io_rw_done(kiocb, ret2);
> >> +               else
> >> +                       ret = -EAGAIN;
> >> +       }
> >> +       kfree(iovec);
> >> +out_fput:
> >> +       if (unlikely(ret))
> >> +               fput(file);
> >> +       return ret;
> >> +}
> > [...]
> >> +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> >> +                          struct sqe_submit *s, bool force_nonblock)
> >> +{
> >> +       const struct io_uring_sqe *sqe = s->sqe;
> >> +       ssize_t ret;
> >> +
> >> +       if (unlikely(s->index >= ctx->sq_entries))
> >> +               return -EINVAL;
> >> +       req->user_data = sqe->user_data;
> >> +
> >> +       ret = -EINVAL;
> >> +       switch (sqe->opcode) {
> >> +       case IORING_OP_NOP:
> >> +               ret = io_nop(req, sqe);
> >> +               break;
> >> +       case IORING_OP_READV:
> >> +               ret = io_read(req, sqe, force_nonblock);
> >> +               break;
> >> +       case IORING_OP_WRITEV:
> >> +               ret = io_write(req, sqe, force_nonblock);
> >> +               break;
> >> +       default:
> >> +               ret = -EINVAL;
> >> +               break;
> >> +       }
> >> +
> >> +       return ret;
> >> +}
> >> +
> >> +static void io_sq_wq_submit_work(struct work_struct *work)
> >> +{
> >> +       struct io_kiocb *req = container_of(work, struct io_kiocb, work);
> >> +       struct sqe_submit *s = &req->submit;
> >> +       u64 user_data = s->sqe->user_data;
> >> +       struct io_ring_ctx *ctx = req->ctx;
> >> +       mm_segment_t old_fs = get_fs();
> >> +       struct files_struct *old_files;
> >> +       int ret;
> >> +
> >> +        /* Ensure we clear previously set forced non-block flag */
> >> +       req->flags &= ~REQ_F_FORCE_NONBLOCK;
> >> +
> >> +       old_files = current->files;
> >> +       current->files = ctx->sqo_files;
> >
> > I think you're not supposed to twiddle with current->files without
> > holding task_lock(current).
>
> 'current' is the work queue item in this case, do we need to protect
> against anything else? I can add the locking around the assignments
> (both places).

Stuff like proc_fd_link() uses get_files_struct(), which grabs a
reference to your current files_struct protected only by task_lock();
and it doesn't use anything like READ_ONCE(), so even if the object
lifetime is not a problem, get_files_struct() could potentially crash
due to a double-read (reading task->files twice and assuming that the
result will be the same). As far as I can tell, this procfs code also
works on kernel threads.

> >> +       if (!mmget_not_zero(ctx->sqo_mm)) {
> >> +               ret = -EFAULT;
> >> +               goto err;
> >> +       }
> >> +
> >> +       use_mm(ctx->sqo_mm);
> >> +       set_fs(USER_DS);
> >> +
> >> +       ret = __io_submit_sqe(ctx, req, s, false);
> >> +
> >> +       set_fs(old_fs);
> >> +       unuse_mm(ctx->sqo_mm);
> >> +       mmput(ctx->sqo_mm);
> >> +err:
> >> +       if (ret) {
> >> +               io_cqring_add_event(ctx, user_data, ret, 0);
> >> +               io_free_req(req);
> >> +       }
> >> +       current->files = old_files;
> >> +}
> > [...]
> >> +static int io_sq_offload_start(struct io_ring_ctx *ctx)
> >> +{
> >> +       int ret;
> >> +
> >> +       ctx->sqo_mm = current->mm;
> >
> > What keeps this thing alive?
>
> I think we're deadling with the same thing as the files below, I'll
> defer to that.
>
> >> +       /*
> >> +        * This is safe since 'current' has the fd installed, and if that gets
> >> +        * closed on exit, then fops->release() is invoked which waits for the
> >> +        * async contexts to flush and exit before exiting.
> >> +        */
> >> +       ret = -EBADF;
> >> +       ctx->sqo_files = current->files;
> >> +       if (!ctx->sqo_files)
> >> +               goto err;
> >
> > That's gnarly. Adding Al Viro to the thread.
> >
> > I think you misunderstand the semantics of f_op->release. The ->flush
> > handler is invoked whenever a file descriptor is closed through
> > filp_close() (via deletion of the files_struct, sys_close(),
> > sys_dup2(), ...), so if you had used that one, _maybe_ this would
> > work. But the ->release handler only runs when the _last_ reference to
> > a struct file has been dropped - so you can, for example, fork() a
> > child, then exit() in the parent, and the ->release handler isn't
> > invoked. So I don't see how this can work.
>
> The anonfd is CLOEXEC. The idea is exactly that it only runs when the
> last reference to the file has been dropped. Not sure why you think I
> need ->flush() here?

Can't I just use fcntl(fd, F_SETFD, fd, 0) to clear the CLOEXEC flag?
Or send the fd via SCM_RIGHTS?

> > But even if you had abused ->flush for this instead: close_files()
> > currently has a comment in it that claims that "this is the last
> > reference to the files structure"; this change would make that claim
> > untrue.
>
> Let me see if I can explain my intent better than that comment... We
> know the parent who set up the io_uring instance will be around for as
> long as io_uring instance persists.

That's the part that I think is wrong: As far as I can tell, the
parent can go away and you won't notice.

Also, note that "the parent" is different things for ->files and ->mm.
You can have a multithreaded process whose threads don't have the same
->files, or multiple process that share ->files without sharing ->mm,
...

> When we are tearing down the
> io_uring, then we wait for any async contexts (like the one above) to
> exit. Once they are exited, it's safe to proceed with the exit and
> teardown ->files[].

But you only do that teardown on ->release, right? And ->release
doesn't have much to do with the process lifetime.

> That's the idea... Not trying to be clever, some of this dates back to
> the aio weirdness where it was impossible to have cross references like
> this, since it would lead to teardown deadlocks with how exit_aio() is
> used. I can probably grab a struct files reference above, but currently
> I don't see why it's needed.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 22:32   ` Jann Horn
@ 2019-01-28 23:46     ` Jens Axboe
  2019-01-28 23:59       ` Jann Horn
  0 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2019-01-28 23:46 UTC (permalink / raw)
  To: Jann Horn, Al Viro
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On 1/28/19 3:32 PM, Jann Horn wrote:
> On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>> The submission queue (SQ) and completion queue (CQ) rings are shared
>> between the application and the kernel. This eliminates the need to
>> copy data back and forth to submit and complete IO.
>>
>> IO submissions use the io_uring_sqe data structure, and completions
>> are generated in the form of io_uring_sqe data structures. The SQ
>> ring is an index into the io_uring_sqe array, which makes it possible
>> to submit a batch of IOs without them being contiguous in the ring.
>> The CQ ring is always contiguous, as completion events are inherently
>> unordered, and hence any io_uring_cqe entry can point back to an
>> arbitrary submission.
>>
>> Two new system calls are added for this:
>>
>> io_uring_setup(entries, params)
>>         Sets up a context for doing async IO. On success, returns a file
>>         descriptor that the application can mmap to gain access to the
>>         SQ ring, CQ ring, and io_uring_sqes.
>>
>> io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
>>         Initiates IO against the rings mapped to this fd, or waits for
>>         them to complete, or both. The behavior is controlled by the
>>         parameters passed in. If 'to_submit' is non-zero, then we'll
>>         try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
>>         kernel will wait for 'min_complete' events, if they aren't
>>         already available. It's valid to set IORING_ENTER_GETEVENTS
>>         and 'min_complete' == 0 at the same time, this allows the
>>         kernel to return already completed events without waiting
>>         for them. This is useful only for polling, as for IRQ
>>         driven IO, the application can just check the CQ ring
>>         without entering the kernel.
>>
>> With this setup, it's possible to do async IO with a single system
>> call. Future developments will enable polled IO with this interface,
>> and polled submission as well. The latter will enable an application
>> to do IO without doing ANY system calls at all.
>>
>> For IRQ driven IO, an application only needs to enter the kernel for
>> completions if it wants to wait for them to occur.
>>
>> Each io_uring is backed by a workqueue, to support buffered async IO
>> as well. We will only punt to an async context if the command would
>> need to wait for IO on the device side. Any data that can be accessed
>> directly in the page cache is done inline. This avoids the slowness
>> issue of usual threadpools, since cached data is accessed as quickly
>> as a sync interface.
>>
>> Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
> [...]
>> +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>> +                     bool force_nonblock)
>> +{
>> +       struct kiocb *kiocb = &req->rw;
>> +       int ret;
>> +
>> +       kiocb->ki_filp = fget(sqe->fd);
>> +       if (unlikely(!kiocb->ki_filp))
>> +               return -EBADF;
>> +       kiocb->ki_pos = sqe->off;
>> +       kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
>> +       kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
>> +       if (sqe->ioprio) {
>> +               ret = ioprio_check_cap(sqe->ioprio);
>> +               if (ret)
>> +                       goto out_fput;
>> +
>> +               kiocb->ki_ioprio = sqe->ioprio;
>> +       } else
>> +               kiocb->ki_ioprio = get_current_ioprio();
>> +
>> +       ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
>> +       if (unlikely(ret))
>> +               goto out_fput;
>> +       if (force_nonblock) {
>> +               kiocb->ki_flags |= IOCB_NOWAIT;
>> +               req->flags |= REQ_F_FORCE_NONBLOCK;
>> +       }
>> +       if (kiocb->ki_flags & IOCB_HIPRI) {
>> +               ret = -EINVAL;
>> +               goto out_fput;
>> +       }
>> +
>> +       kiocb->ki_complete = io_complete_rw;
>> +       return 0;
>> +out_fput:
>> +       fput(kiocb->ki_filp);
>> +       return ret;
>> +}
> [...]
>> +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>> +                      bool force_nonblock)
>> +{
>> +       struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>> +       struct kiocb *kiocb = &req->rw;
>> +       struct iov_iter iter;
>> +       struct file *file;
>> +       ssize_t ret;
>> +
>> +       ret = io_prep_rw(req, sqe, force_nonblock);
>> +       if (ret)
>> +               return ret;
>> +       file = kiocb->ki_filp;
>> +
>> +       ret = -EBADF;
>> +       if (unlikely(!(file->f_mode & FMODE_READ)))
>> +               goto out_fput;
>> +       ret = -EINVAL;
>> +       if (unlikely(!file->f_op->read_iter))
>> +               goto out_fput;
>> +
>> +       ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
>> +       if (ret)
>> +               goto out_fput;
>> +
>> +       ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
>> +       if (!ret) {
>> +               ssize_t ret2;
>> +
>> +               /* Catch -EAGAIN return for forced non-blocking submission */
>> +               ret2 = call_read_iter(file, kiocb, &iter);
>> +               if (!force_nonblock || ret2 != -EAGAIN)
>> +                       io_rw_done(kiocb, ret2);
>> +               else
>> +                       ret = -EAGAIN;
>> +       }
>> +       kfree(iovec);
>> +out_fput:
>> +       if (unlikely(ret))
>> +               fput(file);
>> +       return ret;
>> +}
> [...]
>> +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>> +                          struct sqe_submit *s, bool force_nonblock)
>> +{
>> +       const struct io_uring_sqe *sqe = s->sqe;
>> +       ssize_t ret;
>> +
>> +       if (unlikely(s->index >= ctx->sq_entries))
>> +               return -EINVAL;
>> +       req->user_data = sqe->user_data;
>> +
>> +       ret = -EINVAL;
>> +       switch (sqe->opcode) {
>> +       case IORING_OP_NOP:
>> +               ret = io_nop(req, sqe);
>> +               break;
>> +       case IORING_OP_READV:
>> +               ret = io_read(req, sqe, force_nonblock);
>> +               break;
>> +       case IORING_OP_WRITEV:
>> +               ret = io_write(req, sqe, force_nonblock);
>> +               break;
>> +       default:
>> +               ret = -EINVAL;
>> +               break;
>> +       }
>> +
>> +       return ret;
>> +}
>> +
>> +static void io_sq_wq_submit_work(struct work_struct *work)
>> +{
>> +       struct io_kiocb *req = container_of(work, struct io_kiocb, work);
>> +       struct sqe_submit *s = &req->submit;
>> +       u64 user_data = s->sqe->user_data;
>> +       struct io_ring_ctx *ctx = req->ctx;
>> +       mm_segment_t old_fs = get_fs();
>> +       struct files_struct *old_files;
>> +       int ret;
>> +
>> +        /* Ensure we clear previously set forced non-block flag */
>> +       req->flags &= ~REQ_F_FORCE_NONBLOCK;
>> +
>> +       old_files = current->files;
>> +       current->files = ctx->sqo_files;
> 
> I think you're not supposed to twiddle with current->files without
> holding task_lock(current).

'current' is the work queue item in this case, do we need to protect
against anything else? I can add the locking around the assignments
(both places).

>> +       if (!mmget_not_zero(ctx->sqo_mm)) {
>> +               ret = -EFAULT;
>> +               goto err;
>> +       }
>> +
>> +       use_mm(ctx->sqo_mm);
>> +       set_fs(USER_DS);
>> +
>> +       ret = __io_submit_sqe(ctx, req, s, false);
>> +
>> +       set_fs(old_fs);
>> +       unuse_mm(ctx->sqo_mm);
>> +       mmput(ctx->sqo_mm);
>> +err:
>> +       if (ret) {
>> +               io_cqring_add_event(ctx, user_data, ret, 0);
>> +               io_free_req(req);
>> +       }
>> +       current->files = old_files;
>> +}
> [...]
>> +static int io_sq_offload_start(struct io_ring_ctx *ctx)
>> +{
>> +       int ret;
>> +
>> +       ctx->sqo_mm = current->mm;
> 
> What keeps this thing alive?

I think we're deadling with the same thing as the files below, I'll
defer to that.

>> +       /*
>> +        * This is safe since 'current' has the fd installed, and if that gets
>> +        * closed on exit, then fops->release() is invoked which waits for the
>> +        * async contexts to flush and exit before exiting.
>> +        */
>> +       ret = -EBADF;
>> +       ctx->sqo_files = current->files;
>> +       if (!ctx->sqo_files)
>> +               goto err;
> 
> That's gnarly. Adding Al Viro to the thread.
> 
> I think you misunderstand the semantics of f_op->release. The ->flush
> handler is invoked whenever a file descriptor is closed through
> filp_close() (via deletion of the files_struct, sys_close(),
> sys_dup2(), ...), so if you had used that one, _maybe_ this would
> work. But the ->release handler only runs when the _last_ reference to
> a struct file has been dropped - so you can, for example, fork() a
> child, then exit() in the parent, and the ->release handler isn't
> invoked. So I don't see how this can work.

The anonfd is CLOEXEC. The idea is exactly that it only runs when the
last reference to the file has been dropped. Not sure why you think I
need ->flush() here?

> But even if you had abused ->flush for this instead: close_files()
> currently has a comment in it that claims that "this is the last
> reference to the files structure"; this change would make that claim
> untrue.

Let me see if I can explain my intent better than that comment... We
know the parent who set up the io_uring instance will be around for as
long as io_uring instance persists. When we are tearing down the
io_uring, then we wait for any async contexts (like the one above) to
exit. Once they are exited, it's safe to proceed with the exit and
teardown ->files[].

That's the idea... Not trying to be clever, some of this dates back to
the aio weirdness where it was impossible to have cross references like
this, since it would lead to teardown deadlocks with how exit_aio() is
used. I can probably grab a struct files reference above, but currently
I don't see why it's needed.

>> +       /* Do QD, or 2 * CPUS, whatever is smallest */
>> +       ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
>> +                       min(ctx->sq_entries - 1, 2 * num_online_cpus()));
>> +       if (!ctx->sqo_wq) {
>> +               ret = -ENOMEM;
>> +               goto err;
>> +       }
>> +
>> +       return 0;
>> +err:
>> +       if (ctx->sqo_files)
>> +               ctx->sqo_files = NULL;
>> +       ctx->sqo_mm = NULL;
>> +       return ret;
>> +}
> [...]
>> +static const struct file_operations io_uring_fops = {
>> +       .release        = io_uring_release,
>> +       .mmap           = io_uring_mmap,
>> +       .poll           = io_uring_poll,
>> +       .fasync         = io_uring_fasync,
>> +};
> [...]
> 


-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 21:35 ` [PATCH 05/18] Add " Jens Axboe
  2019-01-28 21:53   ` Jeff Moyer
@ 2019-01-28 22:32   ` Jann Horn
  2019-01-28 23:46     ` Jens Axboe
  2019-01-29  1:07   ` Jann Horn
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 49+ messages in thread
From: Jann Horn @ 2019-01-28 22:32 UTC (permalink / raw)
  To: Jens Axboe, Al Viro
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> The submission queue (SQ) and completion queue (CQ) rings are shared
> between the application and the kernel. This eliminates the need to
> copy data back and forth to submit and complete IO.
>
> IO submissions use the io_uring_sqe data structure, and completions
> are generated in the form of io_uring_sqe data structures. The SQ
> ring is an index into the io_uring_sqe array, which makes it possible
> to submit a batch of IOs without them being contiguous in the ring.
> The CQ ring is always contiguous, as completion events are inherently
> unordered, and hence any io_uring_cqe entry can point back to an
> arbitrary submission.
>
> Two new system calls are added for this:
>
> io_uring_setup(entries, params)
>         Sets up a context for doing async IO. On success, returns a file
>         descriptor that the application can mmap to gain access to the
>         SQ ring, CQ ring, and io_uring_sqes.
>
> io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
>         Initiates IO against the rings mapped to this fd, or waits for
>         them to complete, or both. The behavior is controlled by the
>         parameters passed in. If 'to_submit' is non-zero, then we'll
>         try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
>         kernel will wait for 'min_complete' events, if they aren't
>         already available. It's valid to set IORING_ENTER_GETEVENTS
>         and 'min_complete' == 0 at the same time, this allows the
>         kernel to return already completed events without waiting
>         for them. This is useful only for polling, as for IRQ
>         driven IO, the application can just check the CQ ring
>         without entering the kernel.
>
> With this setup, it's possible to do async IO with a single system
> call. Future developments will enable polled IO with this interface,
> and polled submission as well. The latter will enable an application
> to do IO without doing ANY system calls at all.
>
> For IRQ driven IO, an application only needs to enter the kernel for
> completions if it wants to wait for them to occur.
>
> Each io_uring is backed by a workqueue, to support buffered async IO
> as well. We will only punt to an async context if the command would
> need to wait for IO on the device side. Any data that can be accessed
> directly in the page cache is done inline. This avoids the slowness
> issue of usual threadpools, since cached data is accessed as quickly
> as a sync interface.
>
> Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
[...]
> +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> +                     bool force_nonblock)
> +{
> +       struct kiocb *kiocb = &req->rw;
> +       int ret;
> +
> +       kiocb->ki_filp = fget(sqe->fd);
> +       if (unlikely(!kiocb->ki_filp))
> +               return -EBADF;
> +       kiocb->ki_pos = sqe->off;
> +       kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
> +       kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
> +       if (sqe->ioprio) {
> +               ret = ioprio_check_cap(sqe->ioprio);
> +               if (ret)
> +                       goto out_fput;
> +
> +               kiocb->ki_ioprio = sqe->ioprio;
> +       } else
> +               kiocb->ki_ioprio = get_current_ioprio();
> +
> +       ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
> +       if (unlikely(ret))
> +               goto out_fput;
> +       if (force_nonblock) {
> +               kiocb->ki_flags |= IOCB_NOWAIT;
> +               req->flags |= REQ_F_FORCE_NONBLOCK;
> +       }
> +       if (kiocb->ki_flags & IOCB_HIPRI) {
> +               ret = -EINVAL;
> +               goto out_fput;
> +       }
> +
> +       kiocb->ki_complete = io_complete_rw;
> +       return 0;
> +out_fput:
> +       fput(kiocb->ki_filp);
> +       return ret;
> +}
[...]
> +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> +                      bool force_nonblock)
> +{
> +       struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
> +       struct kiocb *kiocb = &req->rw;
> +       struct iov_iter iter;
> +       struct file *file;
> +       ssize_t ret;
> +
> +       ret = io_prep_rw(req, sqe, force_nonblock);
> +       if (ret)
> +               return ret;
> +       file = kiocb->ki_filp;
> +
> +       ret = -EBADF;
> +       if (unlikely(!(file->f_mode & FMODE_READ)))
> +               goto out_fput;
> +       ret = -EINVAL;
> +       if (unlikely(!file->f_op->read_iter))
> +               goto out_fput;
> +
> +       ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
> +       if (ret)
> +               goto out_fput;
> +
> +       ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
> +       if (!ret) {
> +               ssize_t ret2;
> +
> +               /* Catch -EAGAIN return for forced non-blocking submission */
> +               ret2 = call_read_iter(file, kiocb, &iter);
> +               if (!force_nonblock || ret2 != -EAGAIN)
> +                       io_rw_done(kiocb, ret2);
> +               else
> +                       ret = -EAGAIN;
> +       }
> +       kfree(iovec);
> +out_fput:
> +       if (unlikely(ret))
> +               fput(file);
> +       return ret;
> +}
[...]
> +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> +                          struct sqe_submit *s, bool force_nonblock)
> +{
> +       const struct io_uring_sqe *sqe = s->sqe;
> +       ssize_t ret;
> +
> +       if (unlikely(s->index >= ctx->sq_entries))
> +               return -EINVAL;
> +       req->user_data = sqe->user_data;
> +
> +       ret = -EINVAL;
> +       switch (sqe->opcode) {
> +       case IORING_OP_NOP:
> +               ret = io_nop(req, sqe);
> +               break;
> +       case IORING_OP_READV:
> +               ret = io_read(req, sqe, force_nonblock);
> +               break;
> +       case IORING_OP_WRITEV:
> +               ret = io_write(req, sqe, force_nonblock);
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
> +static void io_sq_wq_submit_work(struct work_struct *work)
> +{
> +       struct io_kiocb *req = container_of(work, struct io_kiocb, work);
> +       struct sqe_submit *s = &req->submit;
> +       u64 user_data = s->sqe->user_data;
> +       struct io_ring_ctx *ctx = req->ctx;
> +       mm_segment_t old_fs = get_fs();
> +       struct files_struct *old_files;
> +       int ret;
> +
> +        /* Ensure we clear previously set forced non-block flag */
> +       req->flags &= ~REQ_F_FORCE_NONBLOCK;
> +
> +       old_files = current->files;
> +       current->files = ctx->sqo_files;

I think you're not supposed to twiddle with current->files without
holding task_lock(current).

> +       if (!mmget_not_zero(ctx->sqo_mm)) {
> +               ret = -EFAULT;
> +               goto err;
> +       }
> +
> +       use_mm(ctx->sqo_mm);
> +       set_fs(USER_DS);
> +
> +       ret = __io_submit_sqe(ctx, req, s, false);
> +
> +       set_fs(old_fs);
> +       unuse_mm(ctx->sqo_mm);
> +       mmput(ctx->sqo_mm);
> +err:
> +       if (ret) {
> +               io_cqring_add_event(ctx, user_data, ret, 0);
> +               io_free_req(req);
> +       }
> +       current->files = old_files;
> +}
[...]
> +static int io_sq_offload_start(struct io_ring_ctx *ctx)
> +{
> +       int ret;
> +
> +       ctx->sqo_mm = current->mm;

What keeps this thing alive?

> +       /*
> +        * This is safe since 'current' has the fd installed, and if that gets
> +        * closed on exit, then fops->release() is invoked which waits for the
> +        * async contexts to flush and exit before exiting.
> +        */
> +       ret = -EBADF;
> +       ctx->sqo_files = current->files;
> +       if (!ctx->sqo_files)
> +               goto err;

That's gnarly. Adding Al Viro to the thread.

I think you misunderstand the semantics of f_op->release. The ->flush
handler is invoked whenever a file descriptor is closed through
filp_close() (via deletion of the files_struct, sys_close(),
sys_dup2(), ...), so if you had used that one, _maybe_ this would
work. But the ->release handler only runs when the _last_ reference to
a struct file has been dropped - so you can, for example, fork() a
child, then exit() in the parent, and the ->release handler isn't
invoked. So I don't see how this can work.

But even if you had abused ->flush for this instead: close_files()
currently has a comment in it that claims that "this is the last
reference to the files structure"; this change would make that claim
untrue.

> +       /* Do QD, or 2 * CPUS, whatever is smallest */
> +       ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
> +                       min(ctx->sq_entries - 1, 2 * num_online_cpus()));
> +       if (!ctx->sqo_wq) {
> +               ret = -ENOMEM;
> +               goto err;
> +       }
> +
> +       return 0;
> +err:
> +       if (ctx->sqo_files)
> +               ctx->sqo_files = NULL;
> +       ctx->sqo_mm = NULL;
> +       return ret;
> +}
[...]
> +static const struct file_operations io_uring_fops = {
> +       .release        = io_uring_release,
> +       .mmap           = io_uring_mmap,
> +       .poll           = io_uring_poll,
> +       .fasync         = io_uring_fasync,
> +};
[...]

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 21:53   ` Jeff Moyer
@ 2019-01-28 21:56     ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2019-01-28 21:56 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-aio, linux-block, linux-man, linux-api, hch, avi

On 1/28/19 2:53 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
>> +			    unsigned min_complete, unsigned flags,
>> +			    const sigset_t __user *sig, size_t sigsz)
>> +{
>> +	int submitted, ret;
>> +
>> +	submitted = ret = 0;
>> +	if (to_submit) {
>> +		submitted = io_ring_submit(ctx, to_submit);
>> +		if (submitted < 0)
>> +			return submitted;
>> +	}
>> +	if (flags & IORING_ENTER_GETEVENTS) {
> 
> Do we want to disallow any unsupported flags?

Yes good point, we do that in other spots. Fixed.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] Add io_uring IO interface
  2019-01-28 21:35 ` [PATCH 05/18] Add " Jens Axboe
@ 2019-01-28 21:53   ` Jeff Moyer
  2019-01-28 21:56     ` Jens Axboe
  2019-01-28 22:32   ` Jann Horn
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 49+ messages in thread
From: Jeff Moyer @ 2019-01-28 21:53 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, linux-man, linux-api, hch, avi

Jens Axboe <axboe@kernel.dk> writes:

> +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
> +			    unsigned min_complete, unsigned flags,
> +			    const sigset_t __user *sig, size_t sigsz)
> +{
> +	int submitted, ret;
> +
> +	submitted = ret = 0;
> +	if (to_submit) {
> +		submitted = io_ring_submit(ctx, to_submit);
> +		if (submitted < 0)
> +			return submitted;
> +	}
> +	if (flags & IORING_ENTER_GETEVENTS) {

Do we want to disallow any unsupported flags?

-Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 05/18] Add io_uring IO interface
  2019-01-28 21:35 [PATCHSET v8] " Jens Axboe
@ 2019-01-28 21:35 ` Jens Axboe
  2019-01-28 21:53   ` Jeff Moyer
                     ` (5 more replies)
  0 siblings, 6 replies; 49+ messages in thread
From: Jens Axboe @ 2019-01-28 21:35 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-man, linux-api; +Cc: hch, jmoyer, avi, Jens Axboe

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_sqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.

Two new system calls are added for this:

io_uring_setup(entries, params)
	Sets up a context for doing async IO. On success, returns a file
	descriptor that the application can mmap to gain access to the
	SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available. It's valid to set IORING_ENTER_GETEVENTS
	and 'min_complete' == 0 at the same time, this allows the
	kernel to return already completed events without waiting
	for them. This is useful only for polling, as for IRQ
	driven IO, the application can just check the CQ ring
	without entering the kernel.

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.

Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    2 +
 arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 fs/Makefile                            |    1 +
 fs/io_uring.c                          | 1090 ++++++++++++++++++++++++
 include/linux/syscalls.h               |    6 +
 include/uapi/asm-generic/unistd.h      |    6 +-
 include/uapi/linux/io_uring.h          |   96 +++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    2 +
 9 files changed, 1213 insertions(+), 1 deletion(-)
 create mode 100644 fs/io_uring.c
 create mode 100644 include/uapi/linux/io_uring.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..481c126259e9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,5 @@
 384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
+425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
+426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..6a32a430c8e0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,8 @@
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+425	common	io_uring_setup		__x64_sys_io_uring_setup
+426	common	io_uring_enter		__x64_sys_io_uring_enter
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 293733f61594..8e15d6fc4340 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
+obj-$(CONFIG_IO_URING)		+= io_uring.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
diff --git a/fs/io_uring.c b/fs/io_uring.c
new file mode 100644
index 000000000000..309eed3629d2
--- /dev/null
+++ b/fs/io_uring.c
@@ -0,0 +1,1090 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Shared application/kernel submission and completion ring pairs, for
+ * supporting fast/efficient IO.
+ *
+ * Copyright (C) 2018-2019 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/syscalls.h>
+#include <linux/compat.h>
+#include <linux/refcount.h>
+#include <linux/uio.h>
+
+#include <linux/sched/signal.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/mmu_context.h>
+#include <linux/percpu.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/blkdev.h>
+#include <linux/anon_inodes.h>
+#include <linux/sched/mm.h>
+
+#include <linux/uaccess.h>
+#include <linux/nospec.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "internal.h"
+
+struct io_uring {
+	u32 head ____cacheline_aligned_in_smp;
+	u32 tail ____cacheline_aligned_in_smp;
+};
+
+struct io_sq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			dropped;
+	u32			flags;
+	u32			array[];
+};
+
+struct io_cq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			overflow;
+	struct io_uring_cqe	cqes[];
+};
+
+struct io_ring_ctx {
+	struct {
+		struct percpu_ref	refs;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		unsigned int		flags;
+
+		/* SQ ring */
+		struct io_sq_ring	*sq_ring;
+		unsigned		cached_sq_head;
+		unsigned		sq_entries;
+		unsigned		sq_mask;
+		unsigned		sq_thread_cpu;
+		struct io_uring_sqe	*sq_sqes;
+	} ____cacheline_aligned_in_smp;
+
+	/* IO offload */
+	struct workqueue_struct	*sqo_wq;
+	struct mm_struct	*sqo_mm;
+	struct files_struct	*sqo_files;
+
+	struct {
+		/* CQ ring */
+		struct io_cq_ring	*cq_ring;
+		unsigned		cached_cq_tail;
+		unsigned		cq_entries;
+		unsigned		cq_mask;
+		struct wait_queue_head	cq_wait;
+		struct fasync_struct	*cq_fasync;
+	} ____cacheline_aligned_in_smp;
+
+	struct user_struct	*user;
+
+	struct completion	ctx_done;
+
+	struct {
+		struct mutex		uring_lock;
+		wait_queue_head_t	wait;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		spinlock_t		completion_lock;
+	} ____cacheline_aligned_in_smp;
+};
+
+struct sqe_submit {
+	const struct io_uring_sqe	*sqe;
+	unsigned			index;
+};
+
+struct io_kiocb {
+	union {
+		struct kiocb		rw;
+		struct sqe_submit	submit;
+	};
+
+	struct io_ring_ctx	*ctx;
+	struct list_head	list;
+	unsigned int		flags;
+#define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
+	u64			user_data;
+
+	struct work_struct	work;
+};
+
+#define IO_PLUG_THRESHOLD		2
+
+static struct kmem_cache *req_cachep;
+
+static const struct file_operations io_uring_fops;
+
+static void io_ring_ctx_ref_free(struct percpu_ref *ref)
+{
+	struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
+
+	complete(&ctx->ctx_done);
+}
+
+static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
+{
+	struct io_ring_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) {
+		kfree(ctx);
+		return NULL;
+	}
+
+	ctx->flags = p->flags;
+	init_waitqueue_head(&ctx->cq_wait);
+	init_completion(&ctx->ctx_done);
+	mutex_init(&ctx->uring_lock);
+	init_waitqueue_head(&ctx->wait);
+	spin_lock_init(&ctx->completion_lock);
+	return ctx;
+}
+
+static void io_commit_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+
+	if (ctx->cached_cq_tail != ring->r.tail) {
+		/* order cqe stores with ring update */
+		smp_wmb();
+		ring->r.tail = ctx->cached_cq_tail;
+		/* write side barrier of tail update, app has read side */
+		smp_wmb();
+
+		if (wq_has_sleeper(&ctx->cq_wait)) {
+			wake_up_interruptible(&ctx->cq_wait);
+			kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
+		}
+	}
+}
+
+static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	unsigned tail;
+
+	tail = ctx->cached_cq_tail;
+	smp_rmb();
+	if (tail + 1 == READ_ONCE(ring->r.head))
+		return NULL;
+
+	ctx->cached_cq_tail++;
+	return &ring->cqes[tail & ctx->cq_mask];
+}
+
+static void __io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				  long res, unsigned ev_flags)
+{
+	struct io_uring_cqe *cqe;
+
+	/*
+	 * If we can't get a cq entry, userspace overflowed the
+	 * submission (by quite a lot). Increment the overflow count in
+	 * the ring.
+	 */
+	cqe = io_get_cqring(ctx);
+	if (cqe) {
+		cqe->user_data = ki_user_data;
+		cqe->res = res;
+		cqe->flags = ev_flags;
+		io_commit_cqring(ctx);
+	} else
+		ctx->cq_ring->overflow++;
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				long res, unsigned ev_flags)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->completion_lock, flags);
+	__io_cqring_add_event(ctx, ki_user_data, res, ev_flags);
+	spin_unlock_irqrestore(&ctx->completion_lock, flags);
+}
+
+static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
+{
+	percpu_ref_put_many(&ctx->refs, refs);
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	/* safe to use the non tryget, as we're inside ring ref already */
+	percpu_ref_get(&ctx->refs);
+
+	req = kmem_cache_alloc(req_cachep, GFP_ATOMIC | __GFP_NOWARN);
+	if (req) {
+		req->ctx = ctx;
+		req->flags = 0;
+		return req;
+	}
+
+	io_ring_drop_ctx_refs(ctx, 1);
+	return NULL;
+}
+
+static void io_free_req(struct io_kiocb *req)
+{
+	io_ring_drop_ctx_refs(req->ctx, 1);
+	kmem_cache_free(req_cachep, req);
+}
+
+static void kiocb_end_write(struct kiocb *kiocb)
+{
+	if (kiocb->ki_flags & IOCB_WRITE) {
+		struct inode *inode = file_inode(kiocb->ki_filp);
+
+		/*
+		 * Tell lockdep we inherited freeze protection from submission
+		 * thread.
+		 */
+		if (S_ISREG(inode->i_mode))
+			__sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE);
+		file_end_write(kiocb->ki_filp);
+	}
+}
+
+static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
+{
+	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+
+	kiocb_end_write(kiocb);
+
+	fput(kiocb->ki_filp);
+	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+	io_free_req(req);
+}
+
+static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		      bool force_nonblock)
+{
+	struct kiocb *kiocb = &req->rw;
+	int ret;
+
+	kiocb->ki_filp = fget(sqe->fd);
+	if (unlikely(!kiocb->ki_filp))
+		return -EBADF;
+	kiocb->ki_pos = sqe->off;
+	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
+	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
+	if (sqe->ioprio) {
+		ret = ioprio_check_cap(sqe->ioprio);
+		if (ret)
+			goto out_fput;
+
+		kiocb->ki_ioprio = sqe->ioprio;
+	} else
+		kiocb->ki_ioprio = get_current_ioprio();
+
+	ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
+	if (unlikely(ret))
+		goto out_fput;
+	if (force_nonblock) {
+		kiocb->ki_flags |= IOCB_NOWAIT;
+		req->flags |= REQ_F_FORCE_NONBLOCK;
+	}
+	if (kiocb->ki_flags & IOCB_HIPRI) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	kiocb->ki_complete = io_complete_rw;
+	return 0;
+out_fput:
+	fput(kiocb->ki_filp);
+	return ret;
+}
+
+static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
+{
+	switch (ret) {
+	case -EIOCBQUEUED:
+		break;
+	case -ERESTARTSYS:
+	case -ERESTARTNOINTR:
+	case -ERESTARTNOHAND:
+	case -ERESTART_RESTARTBLOCK:
+		/*
+		 * We can't just restart the syscall, since previously
+		 * submitted sqes may already be in progress. Just fail this
+		 * IO with EINTR.
+		 */
+		ret = -EINTR;
+		/* fall through */
+	default:
+		kiocb->ki_complete(kiocb, ret, 0);
+	}
+}
+
+static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
+			   const struct io_uring_sqe *sqe,
+			   struct iovec **iovec, struct iov_iter *iter)
+{
+	void __user *buf = u64_to_user_ptr(sqe->addr);
+
+#ifdef CONFIG_COMPAT
+	if (in_compat_syscall())
+		return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
+						iovec, iter);
+#endif
+
+	return import_iovec(rw, buf, sqe->len, UIO_FASTIOV, iovec, iter);
+}
+
+static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		       bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, sqe, force_nonblock);
+	if (ret)
+		return ret;
+	file = kiocb->ki_filp;
+
+	ret = -EBADF;
+	if (unlikely(!(file->f_mode & FMODE_READ)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->read_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
+	if (!ret) {
+		ssize_t ret2;
+
+		/* Catch -EAGAIN return for forced non-blocking submission */
+		ret2 = call_read_iter(file, kiocb, &iter);
+		if (!force_nonblock || ret2 != -EAGAIN)
+			io_rw_done(kiocb, ret2);
+		else
+			ret = -EAGAIN;
+	}
+	kfree(iovec);
+out_fput:
+	if (unlikely(ret))
+		fput(file);
+	return ret;
+}
+
+static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, sqe, force_nonblock);
+	if (ret)
+		return ret;
+	file = kiocb->ki_filp;
+
+	ret = -EAGAIN;
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT))
+		goto out_fput;
+
+	ret = -EBADF;
+	if (unlikely(!(file->f_mode & FMODE_WRITE)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->write_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, WRITE, sqe, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos,
+				iov_iter_count(&iter));
+	if (!ret) {
+		/*
+		 * Open-code file_start_write here to grab freeze protection,
+		 * which will be released by another thread in
+		 * io_complete_rw().  Fool lockdep by telling it the lock got
+		 * released so that it doesn't complain about the held lock when
+		 * we return to userspace.
+		 */
+		if (S_ISREG(file_inode(file)->i_mode)) {
+			__sb_start_write(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE, true);
+			__sb_writers_release(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE);
+		}
+		kiocb->ki_flags |= IOCB_WRITE;
+		io_rw_done(kiocb, call_write_iter(file, kiocb, &iter));
+	}
+	kfree(iovec);
+out_fput:
+	if (unlikely(ret))
+		fput(file);
+	return ret;
+}
+
+/*
+ * IORING_OP_NOP just posts a completion event, nothing else.
+ */
+static int io_nop(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	io_cqring_add_event(ctx, sqe->user_data, 0, 0);
+	io_free_req(req);
+	return 0;
+}
+
+static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
+			   struct sqe_submit *s, bool force_nonblock)
+{
+	const struct io_uring_sqe *sqe = s->sqe;
+	ssize_t ret;
+
+	if (unlikely(s->index >= ctx->sq_entries))
+		return -EINVAL;
+	req->user_data = sqe->user_data;
+
+	ret = -EINVAL;
+	switch (sqe->opcode) {
+	case IORING_OP_NOP:
+		ret = io_nop(req, sqe);
+		break;
+	case IORING_OP_READV:
+		ret = io_read(req, sqe, force_nonblock);
+		break;
+	case IORING_OP_WRITEV:
+		ret = io_write(req, sqe, force_nonblock);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static void io_sq_wq_submit_work(struct work_struct *work)
+{
+	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct sqe_submit *s = &req->submit;
+	u64 user_data = s->sqe->user_data;
+	struct io_ring_ctx *ctx = req->ctx;
+	mm_segment_t old_fs = get_fs();
+	struct files_struct *old_files;
+	int ret;
+
+	 /* Ensure we clear previously set forced non-block flag */
+	req->flags &= ~REQ_F_FORCE_NONBLOCK;
+
+	old_files = current->files;
+	current->files = ctx->sqo_files;
+
+	if (!mmget_not_zero(ctx->sqo_mm)) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	use_mm(ctx->sqo_mm);
+	set_fs(USER_DS);
+
+	ret = __io_submit_sqe(ctx, req, s, false);
+
+	set_fs(old_fs);
+	unuse_mm(ctx->sqo_mm);
+	mmput(ctx->sqo_mm);
+err:
+	if (ret) {
+		io_cqring_add_event(ctx, user_data, ret, 0);
+		io_free_req(req);
+	}
+	current->files = old_files;
+}
+
+static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s)
+{
+	struct io_kiocb *req;
+	ssize_t ret;
+
+	/* enforce forwards compatibility on users */
+	if (unlikely(s->sqe->flags))
+		return -EINVAL;
+
+	req = io_get_req(ctx);
+	if (unlikely(!req))
+		return -EAGAIN;
+
+	ret = __io_submit_sqe(ctx, req, s, true);
+	if (ret == -EAGAIN) {
+		memcpy(&req->submit, s, sizeof(*s));
+		INIT_WORK(&req->work, io_sq_wq_submit_work);
+		queue_work(ctx->sqo_wq, &req->work);
+		ret = 0;
+	}
+	if (ret)
+		io_free_req(req);
+
+	return ret;
+}
+
+static void io_commit_sqring(struct io_ring_ctx *ctx)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+
+	if (ctx->cached_sq_head != ring->r.head) {
+		ring->r.head = ctx->cached_sq_head;
+		/* write side barrier of head update, app has read side */
+		smp_wmb();
+	}
+}
+
+/*
+ * Undo last io_get_sqring()
+ */
+static void io_drop_sqring(struct io_ring_ctx *ctx)
+{
+	ctx->cached_sq_head--;
+}
+
+static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+	unsigned head;
+
+	/*
+	 * The cached sq head (or cq tail) serves two purposes:
+	 *
+	 * 1) allows us to batch the cost of updating the user visible
+	 *    head updates.
+	 * 2) allows the kernel side to track the head on its own, even
+	 *    though the application is the one updating it.
+	 */
+	head = ctx->cached_sq_head;
+	smp_rmb();
+	if (head == READ_ONCE(ring->r.tail))
+		return false;
+
+	head = ring->array[head & ctx->sq_mask];
+	if (head < ctx->sq_entries) {
+		s->index = head;
+		s->sqe = &ctx->sq_sqes[head];
+		ctx->cached_sq_head++;
+		return true;
+	}
+
+	/* drop invalid entries */
+	ctx->cached_sq_head++;
+	ring->dropped++;
+	smp_wmb();
+	return false;
+}
+
+static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
+{
+	int i, ret = 0, submit = 0;
+	struct blk_plug plug;
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_start_plug(&plug);
+
+	for (i = 0; i < to_submit; i++) {
+		struct sqe_submit s;
+
+		if (!io_get_sqring(ctx, &s))
+			break;
+
+		ret = io_submit_sqe(ctx, &s);
+		if (ret) {
+			io_drop_sqring(ctx);
+			break;
+		}
+
+		submit++;
+	}
+	io_commit_sqring(ctx);
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_finish_plug(&plug);
+
+	return submit ? submit : ret;
+}
+
+/*
+ * Wait until events become available, if we don't already have some. The
+ * application must reap them itself, as they reside on the shared cq ring.
+ */
+static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
+			  const sigset_t __user *sig, size_t sigsz)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	sigset_t ksigmask, sigsaved;
+	DEFINE_WAIT(wait);
+	int ret = 0;
+
+	smp_rmb();
+	if (ring->r.head != ring->r.tail)
+		return 0;
+	if (!min_events)
+		return 0;
+
+	if (sig) {
+		ret = set_user_sigmask(sig, &ksigmask, &sigsaved, sigsz);
+		if (ret)
+			return ret;
+	}
+
+	do {
+		prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE);
+
+		ret = 0;
+		smp_rmb();
+		if (ring->r.head != ring->r.tail)
+			break;
+
+		schedule();
+
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+	} while (1);
+
+	finish_wait(&ctx->wait, &wait);
+
+	if (sig)
+		restore_user_sigmask(sig, &sigsaved);
+
+	return ring->r.head == ring->r.tail ? ret : 0;
+}
+
+static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
+			    unsigned min_complete, unsigned flags,
+			    const sigset_t __user *sig, size_t sigsz)
+{
+	int submitted, ret;
+
+	submitted = ret = 0;
+	if (to_submit) {
+		submitted = io_ring_submit(ctx, to_submit);
+		if (submitted < 0)
+			return submitted;
+	}
+	if (flags & IORING_ENTER_GETEVENTS) {
+		/*
+		 * The application could have included the 'to_submit' count
+		 * in how many events it wanted to wait for. If we failed to
+		 * submit the desired count, we may need to adjust the number
+		 * of events to poll/wait for.
+		 */
+		if (submitted < to_submit)
+			min_complete = min_t(unsigned, submitted, min_complete);
+
+		ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
+	}
+
+	return submitted ? submitted : ret;
+}
+
+static int io_sq_offload_start(struct io_ring_ctx *ctx)
+{
+	int ret;
+
+	ctx->sqo_mm = current->mm;
+
+	/*
+	 * This is safe since 'current' has the fd installed, and if that gets
+	 * closed on exit, then fops->release() is invoked which waits for the
+	 * async contexts to flush and exit before exiting.
+	 */
+	ret = -EBADF;
+	ctx->sqo_files = current->files;
+	if (!ctx->sqo_files)
+		goto err;
+
+	/* Do QD, or 2 * CPUS, whatever is smallest */
+	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
+			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
+	if (!ctx->sqo_wq) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	return 0;
+err:
+	if (ctx->sqo_files)
+		ctx->sqo_files = NULL;
+	ctx->sqo_mm = NULL;
+	return ret;
+}
+
+static void __io_unaccount_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	atomic_long_sub(nr_pages, &user->locked_vm);
+}
+
+static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
+{
+	if (ctx->user)
+		__io_unaccount_mem(ctx->user, nr_pages);
+}
+
+static int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	unsigned long page_limit, cur_pages, new_pages;
+
+	/* Don't allow more pages than we can safely lock */
+	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	do {
+		cur_pages = atomic_long_read(&user->locked_vm);
+		new_pages = cur_pages + nr_pages;
+		if (new_pages > page_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
+					new_pages) != cur_pages);
+
+	return 0;
+}
+
+static void io_mem_free(void *ptr)
+{
+	struct page *page = virt_to_head_page(ptr);
+
+	if (put_page_testzero(page))
+		free_compound_page(page);
+}
+
+static void *io_mem_alloc(size_t size)
+{
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP |
+				__GFP_NORETRY;
+
+	return (void *) __get_free_pages(gfp_flags, get_order(size));
+}
+
+static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t bytes;
+
+	bytes = struct_size(sq_ring, array, sq_entries);
+	bytes += array_size(sizeof(struct io_uring_sqe), sq_entries);
+	bytes += struct_size(cq_ring, cqes, cq_entries);
+
+	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
+}
+
+static void io_ring_ctx_free(struct io_ring_ctx *ctx)
+{
+	destroy_workqueue(ctx->sqo_wq);
+
+	io_mem_free(ctx->sq_ring);
+	io_mem_free(ctx->sq_sqes);
+	io_mem_free(ctx->cq_ring);
+
+	percpu_ref_exit(&ctx->refs);
+	io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries));
+	kfree(ctx);
+}
+
+static __poll_t io_uring_poll(struct file *file, poll_table *wait)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &ctx->cq_wait, wait);
+	smp_rmb();
+	if (ctx->sq_ring->r.tail + 1 != ctx->cached_sq_head)
+		mask |= EPOLLOUT | EPOLLWRNORM;
+	if (ctx->cq_ring->r.head != ctx->cached_cq_tail)
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static int io_uring_fasync(int fd, struct file *file, int on)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	return fasync_helper(fd, file, on, &ctx->cq_fasync);
+}
+
+static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
+{
+	mutex_lock(&ctx->uring_lock);
+	percpu_ref_kill(&ctx->refs);
+	mutex_unlock(&ctx->uring_lock);
+
+	wait_for_completion(&ctx->ctx_done);
+	io_ring_ctx_free(ctx);
+}
+
+static int io_uring_release(struct inode *inode, struct file *file)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	file->private_data = NULL;
+	io_ring_ctx_wait_and_kill(ctx);
+	return 0;
+}
+
+static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long sz = vma->vm_end - vma->vm_start;
+	struct io_ring_ctx *ctx = file->private_data;
+	unsigned long pfn;
+	struct page *page;
+	void *ptr;
+
+	switch (offset) {
+	case IORING_OFF_SQ_RING:
+		ptr = ctx->sq_ring;
+		break;
+	case IORING_OFF_SQES:
+		ptr = ctx->sq_sqes;
+		break;
+	case IORING_OFF_CQ_RING:
+		ptr = ctx->cq_ring;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	page = virt_to_head_page(ptr);
+	if (sz > (PAGE_SIZE << compound_order(page)))
+		return -EINVAL;
+
+	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
+}
+
+SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
+		u32, min_complete, u32, flags, const sigset_t __user *, sig,
+		size_t, sigsz)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	struct fd f;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (!percpu_ref_tryget(&ctx->refs))
+		goto out_fput;
+
+	ret = -EBUSY;
+	if (!mutex_trylock(&ctx->uring_lock))
+		goto out_ctx;
+
+	ret = __io_uring_enter(ctx, to_submit, min_complete, flags, sig, sigsz);
+	mutex_unlock(&ctx->uring_lock);
+out_ctx:
+	io_ring_drop_ctx_refs(ctx, 1);
+out_fput:
+	fdput(f);
+	return ret;
+}
+
+static const struct file_operations io_uring_fops = {
+	.release	= io_uring_release,
+	.mmap		= io_uring_mmap,
+	.poll		= io_uring_poll,
+	.fasync		= io_uring_fasync,
+};
+
+static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
+				  struct io_uring_params *p)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t size;
+
+	sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries));
+	if (!sq_ring)
+		return -ENOMEM;
+
+	ctx->sq_ring = sq_ring;
+	sq_ring->ring_mask = p->sq_entries - 1;
+	sq_ring->ring_entries = p->sq_entries;
+	ctx->sq_mask = sq_ring->ring_mask;
+	ctx->sq_entries = sq_ring->ring_entries;
+
+	size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
+	if (size == SIZE_MAX)
+		return -EOVERFLOW;
+
+	ctx->sq_sqes = io_mem_alloc(size);
+	if (!ctx->sq_sqes) {
+		io_mem_free(ctx->sq_ring);
+		return -ENOMEM;
+	}
+
+	cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries));
+	if (!cq_ring) {
+		io_mem_free(ctx->sq_ring);
+		io_mem_free(ctx->sq_sqes);
+		return -ENOMEM;
+	}
+
+	ctx->cq_ring = cq_ring;
+	cq_ring->ring_mask = p->cq_entries - 1;
+	cq_ring->ring_entries = p->cq_entries;
+	ctx->cq_mask = cq_ring->ring_mask;
+	ctx->cq_entries = cq_ring->ring_entries;
+	return 0;
+}
+
+static int io_uring_create(unsigned entries, struct io_uring_params *p)
+{
+	struct user_struct *user = NULL;
+	struct io_ring_ctx *ctx;
+	int ret;
+
+	if (!entries || entries > IORING_MAX_ENTRIES)
+		return -EINVAL;
+
+	/*
+	 * Use twice as many entries for the CQ ring. It's possible for the
+	 * application to drive a higher depth than the size of the SQ ring,
+	 * since the sqes are only used at submission time. This allows for
+	 * some flexibility in overcommitting a bit.
+	 */
+	p->sq_entries = roundup_pow_of_two(entries);
+	p->cq_entries = 2 * p->sq_entries;
+
+	if (!capable(CAP_IPC_LOCK)) {
+		user = get_uid(current_user());
+		ret = __io_account_mem(user, ring_pages(p->sq_entries,
+							p->cq_entries));
+		if (ret) {
+			free_uid(user);
+			return ret;
+		}
+	}
+
+	ctx = io_ring_ctx_alloc(p);
+	if (!ctx) {
+		__io_unaccount_mem(user, ring_pages(p->sq_entries,
+							p->cq_entries));
+		free_uid(user);
+		return -ENOMEM;
+	}
+	ctx->user = user;
+
+	ret = io_allocate_scq_urings(ctx, p);
+	if (ret)
+		goto err;
+
+	ret = io_sq_offload_start(ctx);
+	if (ret)
+		goto err;
+
+	ret = anon_inode_getfd("[io_uring]", &io_uring_fops, ctx,
+				O_RDWR | O_CLOEXEC);
+	if (ret < 0)
+		goto err;
+
+	memset(&p->sq_off, 0, sizeof(p->sq_off));
+	p->sq_off.head = offsetof(struct io_sq_ring, r.head);
+	p->sq_off.tail = offsetof(struct io_sq_ring, r.tail);
+	p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask);
+	p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries);
+	p->sq_off.flags = offsetof(struct io_sq_ring, flags);
+	p->sq_off.dropped = offsetof(struct io_sq_ring, dropped);
+	p->sq_off.array = offsetof(struct io_sq_ring, array);
+
+	memset(&p->cq_off, 0, sizeof(p->cq_off));
+	p->cq_off.head = offsetof(struct io_cq_ring, r.head);
+	p->cq_off.tail = offsetof(struct io_cq_ring, r.tail);
+	p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask);
+	p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries);
+	p->cq_off.overflow = offsetof(struct io_cq_ring, overflow);
+	p->cq_off.cqes = offsetof(struct io_cq_ring, cqes);
+	return ret;
+err:
+	io_ring_ctx_wait_and_kill(ctx);
+	return ret;
+}
+
+/*
+ * Sets up an aio uring context, and returns the fd. Applications asks for a
+ * ring size, we return the actual sq/cq ring sizes (among other things) in the
+ * params structure passed in.
+ */
+static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
+{
+	struct io_uring_params p;
+	long ret;
+	int i;
+
+	if (copy_from_user(&p, params, sizeof(p)))
+		return -EFAULT;
+	for (i = 0; i < ARRAY_SIZE(p.resv); i++) {
+		if (p.resv[i])
+			return -EINVAL;
+	}
+
+	if (p.flags)
+		return -EINVAL;
+
+	ret = io_uring_create(entries, &p);
+	if (ret < 0)
+		return ret;
+
+	if (copy_to_user(params, &p, sizeof(p)))
+		return -EFAULT;
+
+	return ret;
+}
+
+SYSCALL_DEFINE2(io_uring_setup, u32, entries,
+		struct io_uring_params __user *, params)
+{
+	return io_uring_setup(entries, params);
+}
+
+static int __init io_uring_init(void)
+{
+	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
+	return 0;
+};
+__initcall(io_uring_init);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 257cccba3062..3072dbaa7869 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -69,6 +69,7 @@ struct file_handle;
 struct sigaltstack;
 struct rseq;
 union bpf_attr;
+struct io_uring_params;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -309,6 +310,11 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id,
 				struct io_event __user *events,
 				struct old_timespec32 __user *timeout,
 				const struct __aio_sigset *sig);
+asmlinkage long sys_io_uring_setup(u32 entries,
+				struct io_uring_params __user *p);
+asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
+				u32 min_complete, u32 flags,
+				const sigset_t __user *sig, size_t sigsz);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index d90127298f12..87871e7b7ea7 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -740,9 +740,13 @@ __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents)
 __SYSCALL(__NR_rseq, sys_rseq)
 #define __NR_kexec_file_load 294
 __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
+#define __NR_io_uring_setup 425
+__SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
+#define __NR_io_uring_enter 426
+__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
 
 #undef __NR_syscalls
-#define __NR_syscalls 295
+#define __NR_syscalls 427
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
new file mode 100644
index 000000000000..ce65db9269a8
--- /dev/null
+++ b/include/uapi/linux/io_uring.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Header file for the io_uring interface.
+ *
+ * Copyright (C) 2019 Jens Axboe
+ * Copyright (C) 2019 Christoph Hellwig
+ */
+#ifndef LINUX_IO_URING_H
+#define LINUX_IO_URING_H
+
+#include <linux/fs.h>
+#include <linux/types.h>
+
+#define IORING_MAX_ENTRIES	4096
+
+/*
+ * IO submission data structure (Submission Queue Entry)
+ */
+struct io_uring_sqe {
+	__u8	opcode;		/* type of operation for this sqe */
+	__u8	flags;		/* as of now unused */
+	__u16	ioprio;		/* ioprio for the request */
+	__s32	fd;		/* file descriptor to do IO on */
+	__u64	off;		/* offset into file */
+	__u64	addr;		/* pointer to buffer or iovecs */
+	__u32	len;		/* buffer size or number of iovecs */
+	union {
+		__kernel_rwf_t	rw_flags;
+		__u32		__resv;
+	};
+	__u64	user_data;	/* data to be passed back at completion time */
+	__u64	__pad2[3];
+};
+
+#define IORING_OP_NOP		0
+#define IORING_OP_READV		1
+#define IORING_OP_WRITEV	2
+
+/*
+ * IO completion data structure (Completion Queue Entry)
+ */
+struct io_uring_cqe {
+	__u64	user_data;	/* sqe->data submission passed back */
+	__s32	res;		/* result code for this event */
+	__u32	flags;
+};
+
+/*
+ * Magic offsets for the application to mmap the data it needs
+ */
+#define IORING_OFF_SQ_RING		0ULL
+#define IORING_OFF_CQ_RING		0x8000000ULL
+#define IORING_OFF_SQES			0x10000000ULL
+
+/*
+ * Filled with the offset for mmap(2)
+ */
+struct io_sqring_offsets {
+	__u32 head;
+	__u32 tail;
+	__u32 ring_mask;
+	__u32 ring_entries;
+	__u32 flags;
+	__u32 dropped;
+	__u32 array;
+	__u32 resv[3];
+};
+
+struct io_cqring_offsets {
+	__u32 head;
+	__u32 tail;
+	__u32 ring_mask;
+	__u32 ring_entries;
+	__u32 overflow;
+	__u32 cqes;
+	__u32 resv[4];
+};
+
+/*
+ * io_uring_enter(2) flags
+ */
+#define IORING_ENTER_GETEVENTS	(1 << 0)
+
+/*
+ * Passed in for io_uring_setup(2). Copied back with updated info on success
+ */
+struct io_uring_params {
+	__u32 sq_entries;
+	__u32 cq_entries;
+	__u32 flags;
+	__u16 resv[10];
+	struct io_sqring_offsets sq_off;
+	struct io_cqring_offsets cq_off;
+};
+
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 513fa544a134..0cf723867e69 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1403,6 +1403,15 @@ config AIO
 	  by some high performance threaded applications. Disabling
 	  this option saves about 7k.
 
+config IO_URING
+	bool "Enable IO uring support" if EXPERT
+	select ANON_INODES
+	default y
+	help
+	  This option enables support for the io_uring interface, enabling
+	  applications to submit and completion IO through submission and
+	  completion rings that are shared between the kernel and application.
+
 config ADVISE_SYSCALLS
 	bool "Enable madvise/fadvise syscalls" if EXPERT
 	default y
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ab9d0e3c6d50..ee5e523564bb 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents);
 COND_SYSCALL(io_pgetevents);
 COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
+COND_SYSCALL(io_uring_setup);
+COND_SYSCALL(io_uring_enter);
 
 /* fs/xattr.c */
 
-- 
2.17.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2019-02-01 18:05 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190123153536.7081-1-axboe@kernel.dk>
     [not found] ` <20190123153536.7081-6-axboe@kernel.dk>
2019-01-28 14:57   ` [PATCH 05/18] Add io_uring IO interface Christoph Hellwig
2019-01-28 16:26     ` Jens Axboe
2019-01-28 16:34       ` Christoph Hellwig
2019-01-28 19:32         ` Jens Axboe
2019-01-28 18:25     ` Jens Axboe
2019-01-29  6:30       ` Christoph Hellwig
2019-01-29 11:58         ` Arnd Bergmann
2019-01-29 15:20           ` Jens Axboe
2019-01-29 16:18             ` Arnd Bergmann
2019-01-29 16:19               ` Jens Axboe
2019-01-29 16:26                 ` Arnd Bergmann
2019-01-29 16:28                   ` Jens Axboe
2019-01-29 16:46                     ` Arnd Bergmann
2019-01-29  0:47     ` Andy Lutomirski
2019-01-29  1:20       ` Jens Axboe
2019-01-29  6:45         ` Christoph Hellwig
2019-01-29 12:05           ` Arnd Bergmann
2019-01-31  5:11         ` Andy Lutomirski
2019-01-31 16:37           ` Jens Axboe
2019-01-28 21:35 [PATCHSET v8] " Jens Axboe
2019-01-28 21:35 ` [PATCH 05/18] Add " Jens Axboe
2019-01-28 21:53   ` Jeff Moyer
2019-01-28 21:56     ` Jens Axboe
2019-01-28 22:32   ` Jann Horn
2019-01-28 23:46     ` Jens Axboe
2019-01-28 23:59       ` Jann Horn
2019-01-29  0:03         ` Jens Axboe
2019-01-29  0:31           ` Jens Axboe
2019-01-29  0:34             ` Jann Horn
2019-01-29  0:55               ` Jens Axboe
2019-01-29  0:58                 ` Jann Horn
2019-01-29  1:01                   ` Jens Axboe
2019-02-01 16:57         ` Matt Mullins
2019-02-01 17:04           ` Jann Horn
2019-02-01 17:23             ` Jann Horn
2019-02-01 18:05               ` Al Viro
2019-01-29  1:07   ` Jann Horn
2019-01-29  2:21     ` Jann Horn
2019-01-29  2:54       ` Jens Axboe
2019-01-29  3:46       ` Jens Axboe
2019-01-29 15:56         ` Jann Horn
2019-01-29 16:06           ` Jens Axboe
2019-01-29  2:21     ` Jens Axboe
2019-01-29  1:29   ` Jann Horn
2019-01-29  1:31     ` Jens Axboe
2019-01-29  1:32       ` Jann Horn
2019-01-29  2:23         ` Jens Axboe
2019-01-29  7:12   ` Bert Wesarg
2019-01-29 12:12   ` Florian Weimer
2019-01-29 13:35     ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).