io-uring.vger.kernel.org archive mirror
* [PATCHSET] io_uring fixes for 5.5
@ 2019-12-17 22:54 Jens Axboe
  2019-12-17 22:54 ` [PATCH 1/7] io_uring: fix stale comment and a few typos Jens Axboe
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Jens Axboe @ 2019-12-17 22:54 UTC (permalink / raw)
  To: io-uring

Mostly minor stuff; the exception is the deferred prep handling, which
ensures we have a stable SQE at issue time. Most of that is just churn
that moves code into prep/finish handlers. I think it makes the code
easier to follow, too, and it fixes a real problem that was reported.

 fs/io-wq.c    |   2 +-
 fs/io_uring.c | 513 +++++++++++++++++++++++++++++++++-----------------
 2 files changed, 339 insertions(+), 176 deletions(-)

-- 
Jens Axboe




* [PATCH 1/7] io_uring: fix stale comment and a few typos
  2019-12-17 22:54 [PATCHSET] io_uring fixes for 5.5 Jens Axboe
@ 2019-12-17 22:54 ` Jens Axboe
  2019-12-17 22:54 ` [PATCH 2/7] io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG Jens Axboe
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2019-12-17 22:54 UTC (permalink / raw)
  To: io-uring; +Cc: Brian Gianforcaro, Jens Axboe

From: Brian Gianforcaro <b.gianfo@gmail.com>

- Fix a few typos found while reading the code.

- Fix the stale io_get_sqring comment referencing s->sqe; the 's' parameter
  was renamed to 'req', but the rest of the comment still holds.

Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c    | 2 +-
 fs/io_uring.c | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 90c4978781fb..11e80b7252a8 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -948,7 +948,7 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe,
 	/*
 	 * Now check if a free (going busy) or busy worker has the work
 	 * currently running. If we find it there, we'll return CANCEL_RUNNING
-	 * as an indication that we attempte to signal cancellation. The
+	 * as an indication that we attempt to signal cancellation. The
 	 * completion will run normally in this case.
 	 */
 	rcu_read_lock();
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9b1833fedc5c..04cff3870b3b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1178,7 +1178,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
 }
 
 /*
- * Poll for a mininum of 'min' events. Note that if min == 0 we consider that a
+ * Poll for a minimum of 'min' events. Note that if min == 0 we consider that a
  * non-spinning poll check - we'll still enter the driver poll loop, but only
  * as a non-spinning completion check.
  */
@@ -2573,7 +2573,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
 
 		/*
 		 * Adjust the reqs sequence before the current one because it
-		 * will consume a slot in the cq_ring and the the cq_tail
+		 * will consume a slot in the cq_ring and the cq_tail
 		 * pointer will be increased, otherwise other timeout reqs may
 		 * return in advance without waiting for enough wait_nr.
 		 */
@@ -3430,7 +3430,7 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
 }
 
 /*
- * Fetch an sqe, if one is available. Note that s->sqe will point to memory
+ * Fetch an sqe, if one is available. Note that req->sqe will point to memory
  * that is mapped by userspace. This means that care needs to be taken to
  * ensure that reads are stable, as we cannot rely on userspace always
  * being a good citizen. If members of the sqe are validated and then later
@@ -3694,7 +3694,7 @@ static inline bool io_should_wake(struct io_wait_queue *iowq, bool noflush)
 	struct io_ring_ctx *ctx = iowq->ctx;
 
 	/*
-	 * Wake up if we have enough events, or if a timeout occured since we
+	 * Wake up if we have enough events, or if a timeout occurred since we
 	 * started waiting. For timeouts, we always want to return to userspace,
 	 * regardless of event count.
 	 */
-- 
2.24.1



* [PATCH 2/7] io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG
  2019-12-17 22:54 [PATCHSET] io_uring fixes for 5.5 Jens Axboe
  2019-12-17 22:54 ` [PATCH 1/7] io_uring: fix stale comment and a few typos Jens Axboe
@ 2019-12-17 22:54 ` Jens Axboe
  2019-12-17 22:54 ` [PATCH 3/7] io_uring: don't wait when under-submitting Jens Axboe
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2019-12-17 22:54 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, 李通洲

If we have to punt the recvmsg to async context, we copy all the
context. But the iovec used can be either on-stack (if small) or
dynamically allocated; if it's on-stack, we need to ensure we reset
the iov pointer to the copied storage. If we don't, we're reusing old
stack data, and that can lead to -EFAULT if it gets overwritten.

Ensure we retain the right pointers for the iov, and also free the iov
if we end up having to go beyond UIO_FASTIOV vectors.
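
As a condensed sketch of the fix-up applied in both handlers below (the
helper name is made up here; the fields are those of struct
io_async_msghdr used in this patch):

/* re-issue path: make the copied msghdr point at its own iovec storage */
static void io_msg_fixup_iov(struct io_async_msghdr *kmsg)
{
	/* if iov is unset, the first attempt used the inline fast_iov */
	if (!kmsg->iov)
		kmsg->iov = kmsg->fast_iov;
	/* never leave the iterator pointing at the old on-stack iovec */
	kmsg->msg.msg_iter.iov = kmsg->iov;
}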

Fixes: 03b1230ca12a ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 40 ++++++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 04cff3870b3b..0e01cdc8a120 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2041,6 +2041,7 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		      struct io_kiocb **nxt, bool force_nonblock)
 {
 #if defined(CONFIG_NET)
+	struct io_async_msghdr *kmsg = NULL;
 	struct socket *sock;
 	int ret;
 
@@ -2051,7 +2052,6 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	if (sock) {
 		struct io_async_ctx io, *copy;
 		struct sockaddr_storage addr;
-		struct msghdr *kmsg;
 		unsigned flags;
 
 		flags = READ_ONCE(sqe->msg_flags);
@@ -2061,17 +2061,21 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 			flags |= MSG_DONTWAIT;
 
 		if (req->io) {
-			kmsg = &req->io->msg.msg;
-			kmsg->msg_name = &addr;
+			kmsg = &req->io->msg;
+			kmsg->msg.msg_name = &addr;
+			/* if iov is set, it's allocated already */
+			if (!kmsg->iov)
+				kmsg->iov = kmsg->fast_iov;
+			kmsg->msg.msg_iter.iov = kmsg->iov;
 		} else {
-			kmsg = &io.msg.msg;
-			kmsg->msg_name = &addr;
+			kmsg = &io.msg;
+			kmsg->msg.msg_name = &addr;
 			ret = io_sendmsg_prep(req, &io);
 			if (ret)
 				goto out;
 		}
 
-		ret = __sys_sendmsg_sock(sock, kmsg, flags);
+		ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags);
 		if (force_nonblock && ret == -EAGAIN) {
 			copy = kmalloc(sizeof(*copy), GFP_KERNEL);
 			if (!copy) {
@@ -2082,13 +2086,15 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 			req->io = copy;
 			memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe));
 			req->sqe = &req->io->sqe;
-			return ret;
+			return -EAGAIN;
 		}
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
 	}
 
 out:
+	if (kmsg && kmsg->iov != kmsg->fast_iov)
+		kfree(kmsg->iov);
 	io_cqring_add_event(req, ret);
 	if (ret < 0)
 		req_set_fail_links(req);
@@ -2120,6 +2126,7 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		      struct io_kiocb **nxt, bool force_nonblock)
 {
 #if defined(CONFIG_NET)
+	struct io_async_msghdr *kmsg = NULL;
 	struct socket *sock;
 	int ret;
 
@@ -2131,7 +2138,6 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		struct user_msghdr __user *msg;
 		struct io_async_ctx io, *copy;
 		struct sockaddr_storage addr;
-		struct msghdr *kmsg;
 		unsigned flags;
 
 		flags = READ_ONCE(sqe->msg_flags);
@@ -2143,17 +2149,21 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		msg = (struct user_msghdr __user *) (unsigned long)
 			READ_ONCE(sqe->addr);
 		if (req->io) {
-			kmsg = &req->io->msg.msg;
-			kmsg->msg_name = &addr;
+			kmsg = &req->io->msg;
+			kmsg->msg.msg_name = &addr;
+			/* if iov is set, it's allocated already */
+			if (!kmsg->iov)
+				kmsg->iov = kmsg->fast_iov;
+			kmsg->msg.msg_iter.iov = kmsg->iov;
 		} else {
-			kmsg = &io.msg.msg;
-			kmsg->msg_name = &addr;
+			kmsg = &io.msg;
+			kmsg->msg.msg_name = &addr;
 			ret = io_recvmsg_prep(req, &io);
 			if (ret)
 				goto out;
 		}
 
-		ret = __sys_recvmsg_sock(sock, kmsg, msg, io.msg.uaddr, flags);
+		ret = __sys_recvmsg_sock(sock, &kmsg->msg, msg, kmsg->uaddr, flags);
 		if (force_nonblock && ret == -EAGAIN) {
 			copy = kmalloc(sizeof(*copy), GFP_KERNEL);
 			if (!copy) {
@@ -2164,13 +2174,15 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 			req->io = copy;
 			memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe));
 			req->sqe = &req->io->sqe;
-			return ret;
+			return -EAGAIN;
 		}
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
 	}
 
 out:
+	if (kmsg && kmsg->iov != kmsg->fast_iov)
+		kfree(kmsg->iov);
 	io_cqring_add_event(req, ret);
 	if (ret < 0)
 		req_set_fail_links(req);
-- 
2.24.1



* [PATCH 3/7] io_uring: don't wait when under-submitting
  2019-12-17 22:54 [PATCHSET] io_uring fixes for 5.5 Jens Axboe
  2019-12-17 22:54 ` [PATCH 1/7] io_uring: fix stale comment and a few typos Jens Axboe
  2019-12-17 22:54 ` [PATCH 2/7] io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG Jens Axboe
@ 2019-12-17 22:54 ` Jens Axboe
  2019-12-17 23:55   ` Jann Horn
  2019-12-17 22:54 ` [PATCH 4/7] io_uring: fix pre-prepped issue with force_nonblock == true Jens Axboe
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2019-12-17 22:54 UTC (permalink / raw)
  To: io-uring; +Cc: Pavel Begunkov, Jens Axboe

From: Pavel Begunkov <asml.silence@gmail.com>

There is no reliable way to submit and wait in a single syscall, as
io_submit_sqes() may under-consume sqes (in case of an early error).
Then it will wait for not-yet-submitted requests, deadlocking the user
in most cases.

In such cases adjust min_complete, so it won't wait for more than
what has been submitted in the current io_uring_enter() call. That
may be less than the total in flight, but it's up to the user to handle.
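
For illustration, a hedged userspace sketch (raw io_uring_enter() syscall;
the helper name is made up) of what a submit-and-wait caller has to cope
with after this change:

#include <errno.h>
#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

/* submit 'to_submit' sqes, wait for up to 'want' completions */
static long submit_and_wait(int ring_fd, unsigned to_submit, unsigned want)
{
	long submitted = syscall(__NR_io_uring_enter, ring_fd, to_submit,
				 want, IORING_ENTER_GETEVENTS, NULL, 0);
	if (submitted < 0)
		return -errno;	/* e.g. -EAGAIN if the first request can't be allocated */
	if ((unsigned long)submitted < to_submit) {
		/*
		 * Early error: only 'submitted' requests are in flight, and
		 * the kernel won't have waited for more than that. The caller
		 * must re-queue or drop the remaining sqes itself.
		 */
	}
	return submitted;
}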

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0e01cdc8a120..0085cf86dbd8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4896,11 +4896,15 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		submitted = io_submit_sqes(ctx, to_submit, f.file, fd,
 					   &cur_mm, false);
 		mutex_unlock(&ctx->uring_lock);
+		if (submitted <= 0)
+			goto done;
 	}
 	if (flags & IORING_ENTER_GETEVENTS) {
 		unsigned nr_events = 0;
 
 		min_complete = min(min_complete, ctx->cq_entries);
+		if (submitted != to_submit)
+			min_complete = min(min_complete, (u32)submitted);
 
 		if (ctx->flags & IORING_SETUP_IOPOLL) {
 			ret = io_iopoll_check(ctx, &nr_events, min_complete);
@@ -4908,7 +4912,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
 		}
 	}
-
+done:
 	percpu_ref_put(&ctx->refs);
 out_fput:
 	fdput(f);
-- 
2.24.1



* [PATCH 4/7] io_uring: fix pre-prepped issue with force_nonblock == true
  2019-12-17 22:54 [PATCHSET] io_uring fixes for 5.5 Jens Axboe
                   ` (2 preceding siblings ...)
  2019-12-17 22:54 ` [PATCH 3/7] io_uring: don't wait when under-submitting Jens Axboe
@ 2019-12-17 22:54 ` Jens Axboe
  2019-12-17 22:54 ` [PATCH 5/7] io_uring: remove 'sqe' parameter to the OP helpers that take it Jens Axboe
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2019-12-17 22:54 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Some of these code paths assume that any force_nonblock == true issue
is not prepped, but that's not true if we did prep as part of link setup
earlier. Check if we already have an async context allocated before
setting up a new one.

Clean up the async context setup in general; we have a lot of duplicated
code there.

Finally, add an async context ->free_req() handler for the case where a
request is cancelled, since we still need to free private data outside of
req->io then. To avoid the indirect free call on the fast path, do the
free manually there and clear ->free_req().
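
Condensed sketch of the new hook, gathered from the hunks below:

/* set as ->free_req when the async context owns a kmalloc'ed iovec */
static void io_rw_free_req(struct io_kiocb *req)
{
	if (req->io->rw.iov != req->io->rw.fast_iov)
		kfree(req->io->rw.iov);
}

/* in __io_free_req(): only cancelled requests still have the hook set,
 * the normal completion path frees its data inline and clears ->free_req */
if (req->io) {
	if (req->io->free_req)
		req->io->free_req(req);
	kfree(req->io);
}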

Fixes: 03b1230ca12a ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Fixes: f67676d160c6 ("io_uring: ensure async punted read/write requests copy iovec")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 168 ++++++++++++++++++++++++++++----------------------
 1 file changed, 94 insertions(+), 74 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0085cf86dbd8..ff89fde0c606 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -324,6 +324,7 @@ struct io_async_rw {
 
 struct io_async_ctx {
 	struct io_uring_sqe		sqe;
+	void (*free_req)(struct io_kiocb *);
 	union {
 		struct io_async_rw	rw;
 		struct io_async_msghdr	msg;
@@ -879,8 +880,11 @@ static void __io_free_req(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
-	if (req->io)
+	if (req->io) {
+		if (req->io->free_req)
+			req->io->free_req(req);
 		kfree(req->io);
+	}
 	if (req->file && !(req->flags & REQ_F_FIXED_FILE))
 		fput(req->file);
 	if (req->flags & REQ_F_INFLIGHT) {
@@ -1701,7 +1705,7 @@ static ssize_t loop_rw_iter(int rw, struct file *file, struct kiocb *kiocb,
 	return ret;
 }
 
-static void io_req_map_io(struct io_kiocb *req, ssize_t io_size,
+static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size,
 			  struct iovec *iovec, struct iovec *fast_iov,
 			  struct iov_iter *iter)
 {
@@ -1715,19 +1719,35 @@ static void io_req_map_io(struct io_kiocb *req, ssize_t io_size,
 	}
 }
 
-static int io_setup_async_io(struct io_kiocb *req, ssize_t io_size,
-			     struct iovec *iovec, struct iovec *fast_iov,
-			     struct iov_iter *iter)
+static int io_alloc_async_ctx(struct io_kiocb *req)
 {
 	req->io = kmalloc(sizeof(*req->io), GFP_KERNEL);
 	if (req->io) {
-		io_req_map_io(req, io_size, iovec, fast_iov, iter);
 		memcpy(&req->io->sqe, req->sqe, sizeof(req->io->sqe));
 		req->sqe = &req->io->sqe;
+		req->io->free_req = NULL;
 		return 0;
 	}
 
-	return -ENOMEM;
+	return 1;
+}
+
+static void io_rw_free_req(struct io_kiocb *req)
+{
+	if (req->io->rw.iov != req->io->rw.fast_iov)
+		kfree(req->io->rw.iov);
+}
+
+static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size,
+			     struct iovec *iovec, struct iovec *fast_iov,
+			     struct iov_iter *iter)
+{
+	if (!req->io && io_alloc_async_ctx(req))
+		return -ENOMEM;
+
+	io_req_map_rw(req, io_size, iovec, fast_iov, iter);
+	req->io->free_req = io_rw_free_req;
+	return 0;
 }
 
 static int io_read_prep(struct io_kiocb *req, struct iovec **iovec,
@@ -1806,7 +1826,7 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
 			kiocb_done(kiocb, ret2, nxt, req->in_async);
 		} else {
 copy_iov:
-			ret = io_setup_async_io(req, io_size, iovec,
+			ret = io_setup_async_rw(req, io_size, iovec,
 						inline_vecs, &iter);
 			if (ret)
 				goto out_free;
@@ -1815,6 +1835,8 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
 	}
 out_free:
 	kfree(iovec);
+	if (req->io)
+		req->io->free_req = NULL;
 	return ret;
 }
 
@@ -1900,7 +1922,7 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt,
 			kiocb_done(kiocb, ret2, nxt, req->in_async);
 		} else {
 copy_iov:
-			ret = io_setup_async_io(req, io_size, iovec,
+			ret = io_setup_async_rw(req, io_size, iovec,
 						inline_vecs, &iter);
 			if (ret)
 				goto out_free;
@@ -1909,6 +1931,8 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt,
 	}
 out_free:
 	kfree(iovec);
+	if (req->io)
+		req->io->free_req = NULL;
 	return ret;
 }
 
@@ -2021,6 +2045,12 @@ static int io_sync_file_range(struct io_kiocb *req,
 	return 0;
 }
 
+static void io_sendrecv_msg_free(struct io_kiocb *req)
+{
+	if (req->io->msg.iov != req->io->msg.fast_iov)
+		kfree(req->io->msg.iov);
+}
+
 static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io)
 {
 #if defined(CONFIG_NET)
@@ -2050,7 +2080,7 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	sock = sock_from_file(req->file, &ret);
 	if (sock) {
-		struct io_async_ctx io, *copy;
+		struct io_async_ctx io;
 		struct sockaddr_storage addr;
 		unsigned flags;
 
@@ -2077,15 +2107,12 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 		ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags);
 		if (force_nonblock && ret == -EAGAIN) {
-			copy = kmalloc(sizeof(*copy), GFP_KERNEL);
-			if (!copy) {
-				ret = -ENOMEM;
-				goto out;
-			}
-			memcpy(&copy->msg, &io.msg, sizeof(copy->msg));
-			req->io = copy;
-			memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe));
-			req->sqe = &req->io->sqe;
+			if (req->io)
+				return -EAGAIN;
+			if (io_alloc_async_ctx(req))
+				return -ENOMEM;
+			memcpy(&req->io->msg, &io.msg, sizeof(io.msg));
+			req->io->free_req = io_sendrecv_msg_free;
 			return -EAGAIN;
 		}
 		if (ret == -ERESTARTSYS)
@@ -2095,6 +2122,8 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 out:
 	if (kmsg && kmsg->iov != kmsg->fast_iov)
 		kfree(kmsg->iov);
+	if (req->io)
+		req->io->free_req = NULL;
 	io_cqring_add_event(req, ret);
 	if (ret < 0)
 		req_set_fail_links(req);
@@ -2136,7 +2165,7 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	sock = sock_from_file(req->file, &ret);
 	if (sock) {
 		struct user_msghdr __user *msg;
-		struct io_async_ctx io, *copy;
+		struct io_async_ctx io;
 		struct sockaddr_storage addr;
 		unsigned flags;
 
@@ -2165,15 +2194,12 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 		ret = __sys_recvmsg_sock(sock, &kmsg->msg, msg, kmsg->uaddr, flags);
 		if (force_nonblock && ret == -EAGAIN) {
-			copy = kmalloc(sizeof(*copy), GFP_KERNEL);
-			if (!copy) {
-				ret = -ENOMEM;
-				goto out;
-			}
-			memcpy(copy, &io, sizeof(*copy));
-			req->io = copy;
-			memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe));
-			req->sqe = &req->io->sqe;
+			if (req->io)
+				return -EAGAIN;
+			if (io_alloc_async_ctx(req))
+				return -ENOMEM;
+			memcpy(&req->io->msg, &io.msg, sizeof(io.msg));
+			req->io->free_req = io_sendrecv_msg_free;
 			return -EAGAIN;
 		}
 		if (ret == -ERESTARTSYS)
@@ -2183,6 +2209,8 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 out:
 	if (kmsg && kmsg->iov != kmsg->fast_iov)
 		kfree(kmsg->iov);
+	if (req->io)
+		req->io->free_req = NULL;
 	io_cqring_add_event(req, ret);
 	if (ret < 0)
 		req_set_fail_links(req);
@@ -2272,15 +2300,13 @@ static int io_connect(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	ret = __sys_connect_file(req->file, &io->connect.address, addr_len,
 					file_flags);
 	if ((ret == -EAGAIN || ret == -EINPROGRESS) && force_nonblock) {
-		io = kmalloc(sizeof(*io), GFP_KERNEL);
-		if (!io) {
+		if (req->io)
+			return -EAGAIN;
+		if (io_alloc_async_ctx(req)) {
 			ret = -ENOMEM;
 			goto out;
 		}
-		memcpy(&io->connect, &__io.connect, sizeof(io->connect));
-		req->io = io;
-		memcpy(&io->sqe, req->sqe, sizeof(*req->sqe));
-		req->sqe = &io->sqe;
+		memcpy(&req->io->connect, &__io.connect, sizeof(__io.connect));
 		return -EAGAIN;
 	}
 	if (ret == -ERESTARTSYS)
@@ -2511,7 +2537,6 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	if (!poll->file)
 		return -EBADF;
 
-	req->io = NULL;
 	INIT_IO_WORK(&req->work, io_poll_complete_work);
 	events = READ_ONCE(sqe->poll_events);
 	poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
@@ -2692,7 +2717,6 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io,
 		data->mode = HRTIMER_MODE_REL;
 
 	hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode);
-	req->io = io;
 	return 0;
 }
 
@@ -2701,22 +2725,16 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	unsigned count;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_timeout_data *data;
-	struct io_async_ctx *io;
 	struct list_head *entry;
 	unsigned span = 0;
+	int ret;
 
-	io = req->io;
-	if (!io) {
-		int ret;
-
-		io = kmalloc(sizeof(*io), GFP_KERNEL);
-		if (!io)
+	if (!req->io) {
+		if (io_alloc_async_ctx(req))
 			return -ENOMEM;
-		ret = io_timeout_prep(req, io, false);
-		if (ret) {
-			kfree(io);
+		ret = io_timeout_prep(req, req->io, false);
+		if (ret)
 			return ret;
-		}
 	}
 	data = &req->io->timeout;
 
@@ -2858,23 +2876,35 @@ static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	return 0;
 }
 
-static int io_req_defer_prep(struct io_kiocb *req, struct io_async_ctx *io)
+static int io_req_defer_prep(struct io_kiocb *req)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct io_async_ctx *io = req->io;
 	struct iov_iter iter;
 	ssize_t ret;
 
-	memcpy(&io->sqe, req->sqe, sizeof(io->sqe));
-	req->sqe = &io->sqe;
-
 	switch (io->sqe.opcode) {
 	case IORING_OP_READV:
 	case IORING_OP_READ_FIXED:
+		/* ensure prep does right import */
+		req->io = NULL;
 		ret = io_read_prep(req, &iovec, &iter, true);
+		req->io = io;
+		if (ret < 0)
+			break;
+		io_req_map_rw(req, ret, iovec, inline_vecs, &iter);
+		ret = 0;
 		break;
 	case IORING_OP_WRITEV:
 	case IORING_OP_WRITE_FIXED:
+		/* ensure prep does right import */
+		req->io = NULL;
 		ret = io_write_prep(req, &iovec, &iter, true);
+		req->io = io;
+		if (ret < 0)
+			break;
+		io_req_map_rw(req, ret, iovec, inline_vecs, &iter);
+		ret = 0;
 		break;
 	case IORING_OP_SENDMSG:
 		ret = io_sendmsg_prep(req, io);
@@ -2886,41 +2916,34 @@ static int io_req_defer_prep(struct io_kiocb *req, struct io_async_ctx *io)
 		ret = io_connect_prep(req, io);
 		break;
 	case IORING_OP_TIMEOUT:
-		return io_timeout_prep(req, io, false);
+		ret = io_timeout_prep(req, io, false);
+		break;
 	case IORING_OP_LINK_TIMEOUT:
-		return io_timeout_prep(req, io, true);
+		ret = io_timeout_prep(req, io, true);
+		break;
 	default:
-		req->io = io;
-		return 0;
+		ret = 0;
+		break;
 	}
 
-	if (ret < 0)
-		return ret;
-
-	req->io = io;
-	io_req_map_io(req, ret, iovec, inline_vecs, &iter);
-	return 0;
+	return ret;
 }
 
 static int io_req_defer(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	struct io_async_ctx *io;
 	int ret;
 
 	/* Still need defer if there is pending req in defer list. */
 	if (!req_need_defer(req) && list_empty(&ctx->defer_list))
 		return 0;
 
-	io = kmalloc(sizeof(*io), GFP_KERNEL);
-	if (!io)
+	if (io_alloc_async_ctx(req))
 		return -EAGAIN;
 
-	ret = io_req_defer_prep(req, io);
-	if (ret < 0) {
-		kfree(io);
+	ret = io_req_defer_prep(req);
+	if (ret < 0)
 		return ret;
-	}
 
 	spin_lock_irq(&ctx->completion_lock);
 	if (!req_need_defer(req) && list_empty(&ctx->defer_list)) {
@@ -3366,7 +3389,6 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state,
 	 */
 	if (*link) {
 		struct io_kiocb *prev = *link;
-		struct io_async_ctx *io;
 
 		if (req->sqe->flags & IOSQE_IO_DRAIN)
 			(*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN;
@@ -3374,15 +3396,13 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state,
 		if (req->sqe->flags & IOSQE_IO_HARDLINK)
 			req->flags |= REQ_F_HARDLINK;
 
-		io = kmalloc(sizeof(*io), GFP_KERNEL);
-		if (!io) {
+		if (io_alloc_async_ctx(req)) {
 			ret = -EAGAIN;
 			goto err_req;
 		}
 
-		ret = io_req_defer_prep(req, io);
+		ret = io_req_defer_prep(req);
 		if (ret) {
-			kfree(io);
 			/* fail even hard links since we don't submit */
 			prev->flags |= REQ_F_FAIL_LINK;
 			goto err_req;
-- 
2.24.1



* [PATCH 5/7] io_uring: remove 'sqe' parameter to the OP helpers that take it
  2019-12-17 22:54 [PATCHSET] io_uring fixes for 5.5 Jens Axboe
                   ` (3 preceding siblings ...)
  2019-12-17 22:54 ` [PATCH 4/7] io_uring: fix pre-prepped issue with force_nonblock == true Jens Axboe
@ 2019-12-17 22:54 ` Jens Axboe
  2019-12-17 22:54 ` [PATCH 6/7] io_uring: any deferred command must have stable sqe data Jens Axboe
  2019-12-17 22:54 ` [PATCH 7/7] io_uring: make HARDLINK imply LINK Jens Axboe
  6 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2019-12-17 22:54 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

We pass in req->sqe for all of them, no need to pass it in as the
request is always passed in. This is a necessary prep patch to be
able to cleanup/fix the request prep path.
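
The shape of the change, as a quick before/after on one helper (taken
from the diff below):

/* before: the sqe was passed alongside the request that already holds it */
static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
		    struct io_kiocb **nxt, bool force_nonblock);

/* after: helpers read it from the request themselves */
static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt,
		    bool force_nonblock)
{
	const struct io_uring_sqe *sqe = req->sqe;
	...
}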

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 80 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 44 insertions(+), 36 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ff89fde0c606..5f6dd4d3215e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1951,8 +1951,9 @@ static int io_nop(struct io_kiocb *req)
 	return 0;
 }
 
-static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+static int io_prep_fsync(struct io_kiocb *req)
 {
+	const struct io_uring_sqe *sqe = req->sqe;
 	struct io_ring_ctx *ctx = req->ctx;
 
 	if (!req->file)
@@ -1966,9 +1967,10 @@ static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	return 0;
 }
 
-static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-		    struct io_kiocb **nxt, bool force_nonblock)
+static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt,
+		    bool force_nonblock)
 {
+	const struct io_uring_sqe *sqe = req->sqe;
 	loff_t sqe_off = READ_ONCE(sqe->off);
 	loff_t sqe_len = READ_ONCE(sqe->len);
 	loff_t end = sqe_off + sqe_len;
@@ -1979,7 +1981,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	if (unlikely(fsync_flags & ~IORING_FSYNC_DATASYNC))
 		return -EINVAL;
 
-	ret = io_prep_fsync(req, sqe);
+	ret = io_prep_fsync(req);
 	if (ret)
 		return ret;
 
@@ -1998,8 +2000,9 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	return 0;
 }
 
-static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+static int io_prep_sfr(struct io_kiocb *req)
 {
+	const struct io_uring_sqe *sqe = req->sqe;
 	struct io_ring_ctx *ctx = req->ctx;
 	int ret = 0;
 
@@ -2014,17 +2017,16 @@ static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	return ret;
 }
 
-static int io_sync_file_range(struct io_kiocb *req,
-			      const struct io_uring_sqe *sqe,
-			      struct io_kiocb **nxt,
+static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt,
 			      bool force_nonblock)
 {
+	const struct io_uring_sqe *sqe = req->sqe;
 	loff_t sqe_off;
 	loff_t sqe_len;
 	unsigned flags;
 	int ret;
 
-	ret = io_prep_sfr(req, sqe);
+	ret = io_prep_sfr(req);
 	if (ret)
 		return ret;
 
@@ -2067,10 +2069,11 @@ static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io)
 #endif
 }
 
-static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-		      struct io_kiocb **nxt, bool force_nonblock)
+static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt,
+		      bool force_nonblock)
 {
 #if defined(CONFIG_NET)
+	const struct io_uring_sqe *sqe = req->sqe;
 	struct io_async_msghdr *kmsg = NULL;
 	struct socket *sock;
 	int ret;
@@ -2151,10 +2154,11 @@ static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io)
 #endif
 }
 
-static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-		      struct io_kiocb **nxt, bool force_nonblock)
+static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
+		      bool force_nonblock)
 {
 #if defined(CONFIG_NET)
+	const struct io_uring_sqe *sqe = req->sqe;
 	struct io_async_msghdr *kmsg = NULL;
 	struct socket *sock;
 	int ret;
@@ -2221,10 +2225,11 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 #endif
 }
 
-static int io_accept(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-		     struct io_kiocb **nxt, bool force_nonblock)
+static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt,
+		     bool force_nonblock)
 {
 #if defined(CONFIG_NET)
+	const struct io_uring_sqe *sqe = req->sqe;
 	struct sockaddr __user *addr;
 	int __user *addr_len;
 	unsigned file_flags;
@@ -2272,10 +2277,11 @@ static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io)
 #endif
 }
 
-static int io_connect(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-		      struct io_kiocb **nxt, bool force_nonblock)
+static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt,
+		      bool force_nonblock)
 {
 #if defined(CONFIG_NET)
+	const struct io_uring_sqe *sqe = req->sqe;
 	struct io_async_ctx __io, *io;
 	unsigned file_flags;
 	int addr_len, ret;
@@ -2373,8 +2379,9 @@ static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr)
  * Find a running poll command that matches one specified in sqe->addr,
  * and remove it if found.
  */
-static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+static int io_poll_remove(struct io_kiocb *req)
 {
+	const struct io_uring_sqe *sqe = req->sqe;
 	struct io_ring_ctx *ctx = req->ctx;
 	int ret;
 
@@ -2520,9 +2527,9 @@ static void io_poll_req_insert(struct io_kiocb *req)
 	hlist_add_head(&req->hash_node, list);
 }
 
-static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-		       struct io_kiocb **nxt)
+static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
 {
+	const struct io_uring_sqe *sqe = req->sqe;
 	struct io_poll_iocb *poll = &req->poll;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_poll_table ipt;
@@ -2659,9 +2666,9 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data)
 /*
  * Remove or update an existing timeout command
  */
-static int io_timeout_remove(struct io_kiocb *req,
-			     const struct io_uring_sqe *sqe)
+static int io_timeout_remove(struct io_kiocb *req)
 {
+	const struct io_uring_sqe *sqe = req->sqe;
 	struct io_ring_ctx *ctx = req->ctx;
 	unsigned flags;
 	int ret;
@@ -2720,8 +2727,9 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io,
 	return 0;
 }
 
-static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+static int io_timeout(struct io_kiocb *req)
 {
+	const struct io_uring_sqe *sqe = req->sqe;
 	unsigned count;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_timeout_data *data;
@@ -2861,9 +2869,9 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx,
 	io_put_req_find_next(req, nxt);
 }
 
-static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-			   struct io_kiocb **nxt)
+static int io_async_cancel(struct io_kiocb *req, struct io_kiocb **nxt)
 {
+	const struct io_uring_sqe *sqe = req->sqe;
 	struct io_ring_ctx *ctx = req->ctx;
 
 	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
@@ -2986,37 +2994,37 @@ static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt,
 		ret = io_write(req, nxt, force_nonblock);
 		break;
 	case IORING_OP_FSYNC:
-		ret = io_fsync(req, req->sqe, nxt, force_nonblock);
+		ret = io_fsync(req, nxt, force_nonblock);
 		break;
 	case IORING_OP_POLL_ADD:
-		ret = io_poll_add(req, req->sqe, nxt);
+		ret = io_poll_add(req, nxt);
 		break;
 	case IORING_OP_POLL_REMOVE:
-		ret = io_poll_remove(req, req->sqe);
+		ret = io_poll_remove(req);
 		break;
 	case IORING_OP_SYNC_FILE_RANGE:
-		ret = io_sync_file_range(req, req->sqe, nxt, force_nonblock);
+		ret = io_sync_file_range(req, nxt, force_nonblock);
 		break;
 	case IORING_OP_SENDMSG:
-		ret = io_sendmsg(req, req->sqe, nxt, force_nonblock);
+		ret = io_sendmsg(req, nxt, force_nonblock);
 		break;
 	case IORING_OP_RECVMSG:
-		ret = io_recvmsg(req, req->sqe, nxt, force_nonblock);
+		ret = io_recvmsg(req, nxt, force_nonblock);
 		break;
 	case IORING_OP_TIMEOUT:
-		ret = io_timeout(req, req->sqe);
+		ret = io_timeout(req);
 		break;
 	case IORING_OP_TIMEOUT_REMOVE:
-		ret = io_timeout_remove(req, req->sqe);
+		ret = io_timeout_remove(req);
 		break;
 	case IORING_OP_ACCEPT:
-		ret = io_accept(req, req->sqe, nxt, force_nonblock);
+		ret = io_accept(req, nxt, force_nonblock);
 		break;
 	case IORING_OP_CONNECT:
-		ret = io_connect(req, req->sqe, nxt, force_nonblock);
+		ret = io_connect(req, nxt, force_nonblock);
 		break;
 	case IORING_OP_ASYNC_CANCEL:
-		ret = io_async_cancel(req, req->sqe, nxt);
+		ret = io_async_cancel(req, nxt);
 		break;
 	default:
 		ret = -EINVAL;
-- 
2.24.1



* [PATCH 6/7] io_uring: any deferred command must have stable sqe data
  2019-12-17 22:54 [PATCHSET] io_uring fixes for 5.5 Jens Axboe
                   ` (4 preceding siblings ...)
  2019-12-17 22:54 ` [PATCH 5/7] io_uring: remove 'sqe' parameter to the OP helpers that take it Jens Axboe
@ 2019-12-17 22:54 ` Jens Axboe
  2019-12-17 22:54 ` [PATCH 7/7] io_uring: make HARDLINK imply LINK Jens Axboe
  6 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2019-12-17 22:54 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

We're currently not retaining sqe data for accept, fsync, and
sync_file_range. None of these commands need data outside of what
the sqe directly provides, hence it can't go stale when the request is
deferred. However, the sqe data can get reused if an application
recycles SQE entries before a deferred request is issued.

Ensure that we retain the information we need and only read the sqe
contents once, off the submission path. Most of this is just moving
code into prep and finish functions.
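
The pattern, condensed from the fsync hunks below: read the sqe exactly
once in a prep handler, stash the values in the request, and flag it so a
later deferred or async issue never touches the user-mapped sqe again:

static int io_prep_fsync(struct io_kiocb *req)
{
	if (req->flags & REQ_F_PREPPED)
		return 0;		/* values already captured */

	req->sync.flags = READ_ONCE(req->sqe->fsync_flags);
	req->sync.off = READ_ONCE(req->sqe->off);
	req->sync.len = READ_ONCE(req->sqe->len);
	req->flags |= REQ_F_PREPPED;
	return 0;
}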

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 219 ++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 169 insertions(+), 50 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5f6dd4d3215e..667785634848 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -304,6 +304,20 @@ struct io_timeout_data {
 	u32				seq_offset;
 };
 
+struct io_accept {
+	struct file			*file;
+	struct sockaddr __user		*addr;
+	int __user			*addr_len;
+	int				flags;
+};
+
+struct io_sync {
+	struct file			*file;
+	loff_t				len;
+	loff_t				off;
+	int				flags;
+};
+
 struct io_async_connect {
 	struct sockaddr_storage		address;
 };
@@ -344,6 +358,8 @@ struct io_kiocb {
 		struct file		*file;
 		struct kiocb		rw;
 		struct io_poll_iocb	poll;
+		struct io_accept	accept;
+		struct io_sync		sync;
 	};
 
 	const struct io_uring_sqe	*sqe;
@@ -379,6 +395,7 @@ struct io_kiocb {
 #define REQ_F_INFLIGHT		16384	/* on inflight list */
 #define REQ_F_COMP_LOCKED	32768	/* completion under lock */
 #define REQ_F_HARDLINK		65536	/* doesn't sever on completion < 0 */
+#define REQ_F_PREPPED		131072	/* request already opcode prepared */
 	u64			user_data;
 	u32			result;
 	u32			sequence;
@@ -1956,6 +1973,8 @@ static int io_prep_fsync(struct io_kiocb *req)
 	const struct io_uring_sqe *sqe = req->sqe;
 	struct io_ring_ctx *ctx = req->ctx;
 
+	if (req->flags & REQ_F_PREPPED)
+		return 0;
 	if (!req->file)
 		return -EBADF;
 
@@ -1964,39 +1983,70 @@ static int io_prep_fsync(struct io_kiocb *req)
 	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
+	req->sync.flags = READ_ONCE(sqe->fsync_flags);
+	if (unlikely(req->sync.flags & ~IORING_FSYNC_DATASYNC))
+		return -EINVAL;
+
+	req->sync.off = READ_ONCE(sqe->off);
+	req->sync.len = READ_ONCE(sqe->len);
+	req->flags |= REQ_F_PREPPED;
 	return 0;
 }
 
+static bool io_req_cancelled(struct io_kiocb *req)
+{
+	if (req->work.flags & IO_WQ_WORK_CANCEL) {
+		req_set_fail_links(req);
+		io_cqring_add_event(req, -ECANCELED);
+		io_put_req(req);
+		return true;
+	}
+
+	return false;
+}
+
+static void io_fsync_finish(struct io_wq_work **workptr)
+{
+	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
+	loff_t end = req->sync.off + req->sync.len;
+	struct io_kiocb *nxt = NULL;
+	int ret;
+
+	if (io_req_cancelled(req))
+		return;
+
+	ret = vfs_fsync_range(req->rw.ki_filp, req->sync.off,
+				end > 0 ? end : LLONG_MAX,
+				req->sync.flags & IORING_FSYNC_DATASYNC);
+	if (ret < 0)
+		req_set_fail_links(req);
+	io_cqring_add_event(req, ret);
+	io_put_req_find_next(req, &nxt);
+	if (nxt)
+		*workptr = &nxt->work;
+}
+
 static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt,
 		    bool force_nonblock)
 {
-	const struct io_uring_sqe *sqe = req->sqe;
-	loff_t sqe_off = READ_ONCE(sqe->off);
-	loff_t sqe_len = READ_ONCE(sqe->len);
-	loff_t end = sqe_off + sqe_len;
-	unsigned fsync_flags;
+	struct io_wq_work *work, *old_work;
 	int ret;
 
-	fsync_flags = READ_ONCE(sqe->fsync_flags);
-	if (unlikely(fsync_flags & ~IORING_FSYNC_DATASYNC))
-		return -EINVAL;
-
 	ret = io_prep_fsync(req);
 	if (ret)
 		return ret;
 
 	/* fsync always requires a blocking context */
-	if (force_nonblock)
+	if (force_nonblock) {
+		io_put_req(req);
+		req->work.func = io_fsync_finish;
 		return -EAGAIN;
+	}
 
-	ret = vfs_fsync_range(req->rw.ki_filp, sqe_off,
-				end > 0 ? end : LLONG_MAX,
-				fsync_flags & IORING_FSYNC_DATASYNC);
-
-	if (ret < 0)
-		req_set_fail_links(req);
-	io_cqring_add_event(req, ret);
-	io_put_req_find_next(req, nxt);
+	work = old_work = &req->work;
+	io_fsync_finish(&work);
+	if (work && work != old_work)
+		*nxt = container_of(work, struct io_kiocb, work);
 	return 0;
 }
 
@@ -2004,8 +2054,9 @@ static int io_prep_sfr(struct io_kiocb *req)
 {
 	const struct io_uring_sqe *sqe = req->sqe;
 	struct io_ring_ctx *ctx = req->ctx;
-	int ret = 0;
 
+	if (req->flags & REQ_F_PREPPED)
+		return 0;
 	if (!req->file)
 		return -EBADF;
 
@@ -2014,16 +2065,36 @@ static int io_prep_sfr(struct io_kiocb *req)
 	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
-	return ret;
+	req->sync.off = READ_ONCE(sqe->off);
+	req->sync.len = READ_ONCE(sqe->len);
+	req->sync.flags = READ_ONCE(sqe->sync_range_flags);
+	req->flags |= REQ_F_PREPPED;
+	return 0;
+}
+
+static void io_sync_file_range_finish(struct io_wq_work **workptr)
+{
+	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
+	struct io_kiocb *nxt = NULL;
+	int ret;
+
+	if (io_req_cancelled(req))
+		return;
+
+	ret = sync_file_range(req->rw.ki_filp, req->sync.off, req->sync.len,
+				req->sync.flags);
+	if (ret < 0)
+		req_set_fail_links(req);
+	io_cqring_add_event(req, ret);
+	io_put_req_find_next(req, &nxt);
+	if (nxt)
+		*workptr = &nxt->work;
 }
 
 static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt,
 			      bool force_nonblock)
 {
-	const struct io_uring_sqe *sqe = req->sqe;
-	loff_t sqe_off;
-	loff_t sqe_len;
-	unsigned flags;
+	struct io_wq_work *work, *old_work;
 	int ret;
 
 	ret = io_prep_sfr(req);
@@ -2031,19 +2102,16 @@ static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt,
 		return ret;
 
 	/* sync_file_range always requires a blocking context */
-	if (force_nonblock)
+	if (force_nonblock) {
+		io_put_req(req);
+		req->work.func = io_sync_file_range_finish;
 		return -EAGAIN;
+	}
 
-	sqe_off = READ_ONCE(sqe->off);
-	sqe_len = READ_ONCE(sqe->len);
-	flags = READ_ONCE(sqe->sync_range_flags);
-
-	ret = sync_file_range(req->rw.ki_filp, sqe_off, sqe_len, flags);
-
-	if (ret < 0)
-		req_set_fail_links(req);
-	io_cqring_add_event(req, ret);
-	io_put_req_find_next(req, nxt);
+	work = old_work = &req->work;
+	io_sync_file_range_finish(&work);
+	if (work && work != old_work)
+		*nxt = container_of(work, struct io_kiocb, work);
 	return 0;
 }
 
@@ -2225,31 +2293,40 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
 #endif
 }
 
-static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt,
-		     bool force_nonblock)
-{
 #if defined(CONFIG_NET)
+static int io_accept_prep(struct io_kiocb *req)
+{
 	const struct io_uring_sqe *sqe = req->sqe;
-	struct sockaddr __user *addr;
-	int __user *addr_len;
-	unsigned file_flags;
-	int flags, ret;
+	struct io_accept *accept = &req->accept;
+
+	if (req->flags & REQ_F_PREPPED)
+		return 0;
 
 	if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL)))
 		return -EINVAL;
 	if (sqe->ioprio || sqe->len || sqe->buf_index)
 		return -EINVAL;
 
-	addr = (struct sockaddr __user *) (unsigned long) READ_ONCE(sqe->addr);
-	addr_len = (int __user *) (unsigned long) READ_ONCE(sqe->addr2);
-	flags = READ_ONCE(sqe->accept_flags);
-	file_flags = force_nonblock ? O_NONBLOCK : 0;
+	accept->addr = (struct sockaddr __user *)
+				(unsigned long) READ_ONCE(sqe->addr);
+	accept->addr_len = (int __user *) (unsigned long) READ_ONCE(sqe->addr2);
+	accept->flags = READ_ONCE(sqe->accept_flags);
+	req->flags |= REQ_F_PREPPED;
+	return 0;
+}
 
-	ret = __sys_accept4_file(req->file, file_flags, addr, addr_len, flags);
-	if (ret == -EAGAIN && force_nonblock) {
-		req->work.flags |= IO_WQ_WORK_NEEDS_FILES;
+static int __io_accept(struct io_kiocb *req, struct io_kiocb **nxt,
+		       bool force_nonblock)
+{
+	struct io_accept *accept = &req->accept;
+	unsigned file_flags;
+	int ret;
+
+	file_flags = force_nonblock ? O_NONBLOCK : 0;
+	ret = __sys_accept4_file(req->file, file_flags, accept->addr,
+					accept->addr_len, accept->flags);
+	if (ret == -EAGAIN && force_nonblock)
 		return -EAGAIN;
-	}
 	if (ret == -ERESTARTSYS)
 		ret = -EINTR;
 	if (ret < 0)
@@ -2257,6 +2334,39 @@ static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt,
 	io_cqring_add_event(req, ret);
 	io_put_req_find_next(req, nxt);
 	return 0;
+}
+
+static void io_accept_finish(struct io_wq_work **workptr)
+{
+	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
+	struct io_kiocb *nxt = NULL;
+
+	if (io_req_cancelled(req))
+		return;
+	__io_accept(req, &nxt, false);
+	if (nxt)
+		*workptr = &nxt->work;
+}
+#endif
+
+static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt,
+		     bool force_nonblock)
+{
+#if defined(CONFIG_NET)
+	int ret;
+
+	ret = io_accept_prep(req);
+	if (ret)
+		return ret;
+
+	ret = __io_accept(req, nxt, force_nonblock);
+	if (ret == -EAGAIN && force_nonblock) {
+		req->work.func = io_accept_finish;
+		req->work.flags |= IO_WQ_WORK_NEEDS_FILES;
+		io_put_req(req);
+		return -EAGAIN;
+	}
+	return 0;
 #else
 	return -EOPNOTSUPP;
 #endif
@@ -2914,6 +3024,12 @@ static int io_req_defer_prep(struct io_kiocb *req)
 		io_req_map_rw(req, ret, iovec, inline_vecs, &iter);
 		ret = 0;
 		break;
+	case IORING_OP_FSYNC:
+		ret = io_prep_fsync(req);
+		break;
+	case IORING_OP_SYNC_FILE_RANGE:
+		ret = io_prep_sfr(req);
+		break;
 	case IORING_OP_SENDMSG:
 		ret = io_sendmsg_prep(req, io);
 		break;
@@ -2929,6 +3045,9 @@ static int io_req_defer_prep(struct io_kiocb *req)
 	case IORING_OP_LINK_TIMEOUT:
 		ret = io_timeout_prep(req, io, true);
 		break;
+	case IORING_OP_ACCEPT:
+		ret = io_accept_prep(req);
+		break;
 	default:
 		ret = 0;
 		break;
-- 
2.24.1



* [PATCH 7/7] io_uring: make HARDLINK imply LINK
  2019-12-17 22:54 [PATCHSET] io_uring fixes for 5.5 Jens Axboe
                   ` (5 preceding siblings ...)
  2019-12-17 22:54 ` [PATCH 6/7] io_uring: any deferred command must have stable sqe data Jens Axboe
@ 2019-12-17 22:54 ` Jens Axboe
  6 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2019-12-17 22:54 UTC (permalink / raw)
  To: io-uring; +Cc: Pavel Begunkov, Jens Axboe

From: Pavel Begunkov <asml.silence@gmail.com>

The rules are as follows: if IOSQE_IO_HARDLINK is specified, then it's a
link and there is no need to set IOSQE_IO_LINK separately, though it
could be there. Add a proper check and ensure that IOSQE_IO_HARDLINK
implies IOSQE_IO_LINK.
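
From userspace, a hedged liburing-style fragment (assumes an initialized
ring; fd and iovs are placeholders) of what this permits:

struct io_uring_sqe *sqe;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, iovs, 1, 0);
sqe->flags |= IOSQE_IO_HARDLINK;	/* chains to the next sqe, even on error */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_fsync(sqe, fd, 0);	/* end of chain, no link flag needed */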

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 667785634848..5643701d70d6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3693,7 +3693,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 		 * If previous wasn't linked and we have a linked command,
 		 * that's the end of the chain. Submit the previous link.
 		 */
-		if (!(sqe_flags & IOSQE_IO_LINK) && link) {
+		if (!(sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) && link) {
 			io_queue_link_head(link);
 			link = NULL;
 		}
-- 
2.24.1



* Re: [PATCH 3/7] io_uring: don't wait when under-submitting
  2019-12-17 22:54 ` [PATCH 3/7] io_uring: don't wait when under-submitting Jens Axboe
@ 2019-12-17 23:55   ` Jann Horn
  2019-12-18  0:06     ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Jann Horn @ 2019-12-17 23:55 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring

On Tue, Dec 17, 2019 at 11:54 PM Jens Axboe <axboe@kernel.dk> wrote:
> There is no reliable way to submit and wait in a single syscall, as
> io_submit_sqes() may under-consume sqes (in case of an early error).
> Then it will wait for not-yet-submitted requests, deadlocking the user
> in most cases.
>
> In such cases adjust min_complete, so it won't wait for more than
> what have been submitted in the current io_uring_enter() call. It
> may be less than total in-flight, but that up to a user to handle.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
[...]
>         if (flags & IORING_ENTER_GETEVENTS) {
>                 unsigned nr_events = 0;
>
>                 min_complete = min(min_complete, ctx->cq_entries);
> +               if (submitted != to_submit)
> +                       min_complete = min(min_complete, (u32)submitted);

Hm. Let's say someone submits two requests, first an ACCEPT request
that might stall indefinitely and then a WRITE to a file on disk that
is expected to complete quickly; and the caller uses min_complete=1
because they want to wait for the WRITE op. But now the submission of
the WRITE fails, io_uring_enter() computes min_complete=min(1, 1)=1,
and it blocks on the ACCEPT op. That would be bad, right?

If the usecase I described is valid, I think it might make more sense
to do something like this:

u32 missing_submissions = to_submit - submitted;
min_complete = min(min_complete, ctx->cq_entries);
if ((flags & IORING_ENTER_GETEVENTS) && missing_submissions < min_complete) {
  min_complete -= missing_submissions;
  [...]
}

In other words: If we do a partially successful submission, only wait
as long as we know that userspace definitely wants us to wait for one
of the pending requests; and once we can't tell whether userspace
intended to wait longer, return to userspace and let the user decide.

Or it might make sense to just ignore IORING_ENTER_GETEVENTS
completely in the partial submission case, in case userspace wants to
immediately react to the failed request by writing out an error
message to a socket or whatever. This case probably isn't
performance-critical, right? And it would simplify things a bit.
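
A hedged liburing-style fragment of that scenario (assumes an initialized
ring, a listening socket and an open file):

struct io_uring_sqe *sqe;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);	/* may stall forever */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_writev(sqe, file_fd, &iov, 1, 0);		/* should finish fast */

/* caller wants to wait for the WRITE; if its submission fails and the
 * kernel still honours min_complete=1, we end up blocked on the ACCEPT */
ret = io_uring_submit_and_wait(&ring, 1);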


* Re: [PATCH 3/7] io_uring: don't wait when under-submitting
  2019-12-17 23:55   ` Jann Horn
@ 2019-12-18  0:06     ` Jens Axboe
  2019-12-18  9:38       ` Pavel Begunkov
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2019-12-18  0:06 UTC (permalink / raw)
  To: Jann Horn, Pavel Begunkov; +Cc: io-uring

On 12/17/19 4:55 PM, Jann Horn wrote:
> On Tue, Dec 17, 2019 at 11:54 PM Jens Axboe <axboe@kernel.dk> wrote:
>> There is no reliable way to submit and wait in a single syscall, as
>> io_submit_sqes() may under-consume sqes (in case of an early error).
>> Then it will wait for not-yet-submitted requests, deadlocking the user
>> in most cases.
>>
>> In such cases adjust min_complete, so it won't wait for more than
>> what have been submitted in the current io_uring_enter() call. It
>> may be less than total in-flight, but that up to a user to handle.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> [...]
>>         if (flags & IORING_ENTER_GETEVENTS) {
>>                 unsigned nr_events = 0;
>>
>>                 min_complete = min(min_complete, ctx->cq_entries);
>> +               if (submitted != to_submit)
>> +                       min_complete = min(min_complete, (u32)submitted);
> 
> Hm. Let's say someone submits two requests, first an ACCEPT request
> that might stall indefinitely and then a WRITE to a file on disk that
> is expected to complete quickly; and the caller uses min_complete=1
> because they want to wait for the WRITE op. But now the submission of
> the WRITE fails, io_uring_enter() computes min_complete=min(1, 1)=1,
> and it blocks on the ACCEPT op. That would be bad, right?
> 
> If the usecase I described is valid, I think it might make more sense
> to do something like this:
> 
> u32 missing_submissions = to_submit - submitted;
> min_complete = min(min_complete, ctx->cq_entries);
> if ((flags & IORING_ENTER_GETEVENTS) && missing_submissions < min_complete) {
>   min_complete -= missing_submissions;
>   [...]
> }
> 
> In other words: If we do a partially successful submission, only wait
> as long as we know that userspace definitely wants us to wait for one
> of the pending requests; and once we can't tell whether userspace
> intended to wait longer, return to userspace and let the user decide.
> 
> Or it might make sense to just ignore IORING_ENTER_GETEVENTS
> completely in the partial submission case, in case userspace wants to
> immediately react to the failed request by writing out an error
> message to a socket or whatever. This case probably isn't
> performance-critical, right? And it would simplify things a bit.

That's a good point, and Pavel's first patch actually did that. I
didn't consider the different request type case, which might be
uncommon but definitely valid.

Probably the safest bet here is just to not wait at all if we fail
submitting all of them. This isn't a fast path, there was an error
somehow which meant we didn't submit it all. So just return the
submit count (including 0, not -EAGAIN) if we fail submitting,
and ignore IORING_ENTER_GETEVENTS for that case.

Pavel, care to submit a new one? I'll drop this one now.

-- 
Jens Axboe



* Re: [PATCH 3/7] io_uring: don't wait when under-submitting
  2019-12-18  0:06     ` Jens Axboe
@ 2019-12-18  9:38       ` Pavel Begunkov
  2019-12-18 13:02         ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Pavel Begunkov @ 2019-12-18  9:38 UTC (permalink / raw)
  To: Jens Axboe, Jann Horn; +Cc: io-uring

On 12/18/2019 3:06 AM, Jens Axboe wrote:
> On 12/17/19 4:55 PM, Jann Horn wrote:
>> On Tue, Dec 17, 2019 at 11:54 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> There is no reliable way to submit and wait in a single syscall, as
>>> io_submit_sqes() may under-consume sqes (in case of an early error).
>>> Then it will wait for not-yet-submitted requests, deadlocking the user
>>> in most cases.
>>>
>>> In such cases adjust min_complete, so it won't wait for more than
>>> what have been submitted in the current io_uring_enter() call. It
>>> may be less than total in-flight, but that up to a user to handle.
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> [...]
>>>         if (flags & IORING_ENTER_GETEVENTS) {
>>>                 unsigned nr_events = 0;
>>>
>>>                 min_complete = min(min_complete, ctx->cq_entries);
>>> +               if (submitted != to_submit)
>>> +                       min_complete = min(min_complete, (u32)submitted);
>>
>> Hm. Let's say someone submits two requests, first an ACCEPT request
>> that might stall indefinitely and then a WRITE to a file on disk that
>> is expected to complete quickly; and the caller uses min_complete=1
>> because they want to wait for the WRITE op. But now the submission of
>> the WRITE fails, io_uring_enter() computes min_complete=min(1, 1)=1,
>> and it blocks on the ACCEPT op. That would be bad, right?
>>
>> If the usecase I described is valid, I think it might make more sense
>> to do something like this:
>>
>> u32 missing_submissions = to_submit - submitted;
>> min_complete = min(min_complete, ctx->cq_entries);
>> if ((flags & IORING_ENTER_GETEVENTS) && missing_submissions < min_complete) {
>>   min_complete -= missing_submissions;
>>   [...]
>> }
>>
>> In other words: If we do a partially successful submission, only wait
>> as long as we know that userspace definitely wants us to wait for one
>> of the pending requests; and once we can't tell whether userspace
>> intended to wait longer, return to userspace and let the user decide.
>>
>> Or it might make sense to just ignore IORING_ENTER_GETEVENTS
>> completely in the partial submission case, in case userspace wants to
>> immediately react to the failed request by writing out an error
>> message to a socket or whatever. This case probably isn't
>> performance-critical, right? And it would simplify things a bit.
> 
> That's a good point, and Pavel's first patch actually did that. I
> didn't consider the different request type case, which might be
> uncommon but definitely valid.
> 
> Probably the safest bet here is just to not wait at all if we fail
> submitting all of them. This isn't a fast path, there was an error
> somehow which meant we didn't submit it all. So just return the
> submit count (including 0, not -EAGAIN) if we fail submitting,
> and ignore IORING_ENTER_GETEVENTS for that case.
> 
I see nothing wrong with -EAGAIN, it's returned only if it can't
allocate memory for the first request. If so, can you then just take the
v1? It will probably be applied cleanly.

> Pavel, care to submit a new one? I'll drop this one now.
> 

-- 
Pavel Begunkov


* Re: [PATCH 3/7] io_uring: don't wait when under-submitting
  2019-12-18  9:38       ` Pavel Begunkov
@ 2019-12-18 13:02         ` Jens Axboe
  2019-12-18 13:09           ` Pavel Begunkov
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2019-12-18 13:02 UTC (permalink / raw)
  To: Pavel Begunkov, Jann Horn; +Cc: io-uring

On 12/18/19 2:38 AM, Pavel Begunkov wrote:
> On 12/18/2019 3:06 AM, Jens Axboe wrote:
>> On 12/17/19 4:55 PM, Jann Horn wrote:
>>> On Tue, Dec 17, 2019 at 11:54 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> There is no reliable way to submit and wait in a single syscall, as
>>>> io_submit_sqes() may under-consume sqes (in case of an early error).
>>>> Then it will wait for not-yet-submitted requests, deadlocking the user
>>>> in most cases.
>>>>
>>>> In such cases adjust min_complete, so it won't wait for more than
>>>> what have been submitted in the current io_uring_enter() call. It
>>>> may be less than total in-flight, but that up to a user to handle.
>>>>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> [...]
>>>>         if (flags & IORING_ENTER_GETEVENTS) {
>>>>                 unsigned nr_events = 0;
>>>>
>>>>                 min_complete = min(min_complete, ctx->cq_entries);
>>>> +               if (submitted != to_submit)
>>>> +                       min_complete = min(min_complete, (u32)submitted);
>>>
>>> Hm. Let's say someone submits two requests, first an ACCEPT request
>>> that might stall indefinitely and then a WRITE to a file on disk that
>>> is expected to complete quickly; and the caller uses min_complete=1
>>> because they want to wait for the WRITE op. But now the submission of
>>> the WRITE fails, io_uring_enter() computes min_complete=min(1, 1)=1,
>>> and it blocks on the ACCEPT op. That would be bad, right?
>>>
>>> If the usecase I described is valid, I think it might make more sense
>>> to do something like this:
>>>
>>> u32 missing_submissions = to_submit - submitted;
>>> min_complete = min(min_complete, ctx->cq_entries);
>>> if ((flags & IORING_ENTER_GETEVENTS) && missing_submissions < min_complete) {
>>>   min_complete -= missing_submissions;
>>>   [...]
>>> }
>>>
>>> In other words: If we do a partially successful submission, only wait
>>> as long as we know that userspace definitely wants us to wait for one
>>> of the pending requests; and once we can't tell whether userspace
>>> intended to wait longer, return to userspace and let the user decide.
>>>
>>> Or it might make sense to just ignore IORING_ENTER_GETEVENTS
>>> completely in the partial submission case, in case userspace wants to
>>> immediately react to the failed request by writing out an error
>>> message to a socket or whatever. This case probably isn't
>>> performance-critical, right? And it would simplify things a bit.
>>
>> That's a good point, and Pavel's first patch actually did that. I
>> didn't consider the different request type case, which might be
>> uncommon but definitely valid.
>>
>> Probably the safest bet here is just to not wait at all if we fail
>> submitting all of them. This isn't a fast path, there was an error
>> somehow which meant we didn't submit it all. So just return the
>> submit count (including 0, not -EAGAIN) if we fail submitting,
>> and ignore IORING_ENTER_GETEVENTS for that case.
>>
> I see nothing wrong with -EAGAIN, it's returned only if it can't
> allocate memory for the first request. If so, can you then just take the
> v1? It will probably be applied cleanly.

-EAGAIN for request alloc is fine, but your v1 also returned -EAGAIN if
someone asked to submit 0, which is a bug. We must return zero for that
case.

So your v1 without that would work, something ala:

if (submitted != to_submit)
	goto out;

without turning 0 into -EAGAIN unconditionally; io_submit_sqes() does
the right thing there as it is.

-- 
Jens Axboe



* Re: [PATCH 3/7] io_uring: don't wait when under-submitting
  2019-12-18 13:02         ` Jens Axboe
@ 2019-12-18 13:09           ` Pavel Begunkov
  0 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2019-12-18 13:09 UTC (permalink / raw)
  To: Jens Axboe, Jann Horn; +Cc: io-uring

On 12/18/2019 4:02 PM, Jens Axboe wrote:
> On 12/18/19 2:38 AM, Pavel Begunkov wrote:
>> On 12/18/2019 3:06 AM, Jens Axboe wrote:
>>> On 12/17/19 4:55 PM, Jann Horn wrote:
>>>> On Tue, Dec 17, 2019 at 11:54 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>> There is no reliable way to submit and wait in a single syscall, as
>>>>> io_submit_sqes() may under-consume sqes (in case of an early error).
>>>>> Then it will wait for not-yet-submitted requests, deadlocking the user
>>>>> in most cases.
>>>>>
>>>>> In such cases adjust min_complete, so it won't wait for more than
>>>>> what have been submitted in the current io_uring_enter() call. It
>>>>> may be less than total in-flight, but that up to a user to handle.
>>>>>
>>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> [...]
>>>>>         if (flags & IORING_ENTER_GETEVENTS) {
>>>>>                 unsigned nr_events = 0;
>>>>>
>>>>>                 min_complete = min(min_complete, ctx->cq_entries);
>>>>> +               if (submitted != to_submit)
>>>>> +                       min_complete = min(min_complete, (u32)submitted);
>>>>
>>>> Hm. Let's say someone submits two requests, first an ACCEPT request
>>>> that might stall indefinitely and then a WRITE to a file on disk that
>>>> is expected to complete quickly; and the caller uses min_complete=1
>>>> because they want to wait for the WRITE op. But now the submission of
>>>> the WRITE fails, io_uring_enter() computes min_complete=min(1, 1)=1,
>>>> and it blocks on the ACCEPT op. That would be bad, right?
>>>>
>>>> If the usecase I described is valid, I think it might make more sense
>>>> to do something like this:
>>>>
>>>> u32 missing_submissions = to_submit - submitted;
>>>> min_complete = min(min_complete, ctx->cq_entries);
>>>> if ((flags & IORING_ENTER_GETEVENTS) && missing_submissions < min_complete) {
>>>>   min_complete -= missing_submissions;
>>>>   [...]
>>>> }
>>>>
>>>> In other words: If we do a partially successful submission, only wait
>>>> as long as we know that userspace definitely wants us to wait for one
>>>> of the pending requests; and once we can't tell whether userspace
>>>> intended to wait longer, return to userspace and let the user decide.
>>>>
>>>> Or it might make sense to just ignore IORING_ENTER_GETEVENTS
>>>> completely in the partial submission case, in case userspace wants to
>>>> immediately react to the failed request by writing out an error
>>>> message to a socket or whatever. This case probably isn't
>>>> performance-critical, right? And it would simplify things a bit.
>>>
>>> That's a good point, and Pavel's first patch actually did that. I
>>> didn't consider the different request type case, which might be
>>> uncommon but definitely valid.
>>>
>>> Probably the safest bet here is just to not wait at all if we fail
>>> submitting all of them. This isn't a fast path, there was an error
>>> somehow which meant we didn't submit it all. So just return the
>>> submit count (including 0, not -EAGAIN) if we fail submitting,
>>> and ignore IORING_ENTER_GETEVENTS for that case.
>>>
>> I see nothing wrong with -EAGAIN, it's returned only if it can't
>> allocate memory for the first request. If so, can you then just take the
>> v1? It will probably be applied cleanly.
> 
> -EAGAIN for request alloc is fine, but your v1 also returned -EAGAIN if
> someone asked to submit 0, which is a bug. We must return zero for that
> case.
> 
Now I see which -EAGAIN you meant. You're right, I'll resend

> So your v1 without that would work, something ala:
> 
> if (submitted != to_submit)
> 	goto out;
> 
> without the turning 0 into -EAGAIN unconditionally, io_submit_sqes() does
> the right thing there as it is.
> 

-- 
Pavel Begunkov

