* [PATCH for-next 0/7] CQ locking optimisation
@ 2022-12-05  2:44 Pavel Begunkov
  2022-12-05  2:44 ` [PATCH for-next 1/7] io_uring: skip overflow CQE posting for dying ring Pavel Begunkov
                   ` (8 more replies)
  0 siblings, 9 replies; 17+ messages in thread
From: Pavel Begunkov @ 2022-12-05  2:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Optimise CQ locking for event posting depending on a number of ring setup flags.
A QD1 nop benchmark showed a 12.067 -> 12.565 MIOPS increase, which is more than
8.5% of the io_uring kernel overhead (taking into account that the syscall
overhead is over 50%), or 4.12% of the total performance. Naturally, it's not
only about QD1: applications can submit a bunch of requests, but their
completions may arrive at random times, hurting batching and therefore
performance (or latency).

The downside is that we have to punt all io-wq completions to the
original task. The performance win should diminish with better
completion batching, but it should be worth it as it also helps tw,
which in reality often doesn't complete too many requests.

The feature currently depends on DEFER_TASKRUN, but the requirement can be
relaxed to SINGLE_ISSUER.
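
For illustration, a minimal liburing sketch (not part of the patchset) of a ring
set up so that the optimisation applies, i.e. DEFER_TASKRUN (which also requires
SINGLE_ISSUER) and neither IOPOLL nor SQPOLL:

#include <liburing.h>

/* Set up a ring eligible for the single-task CQ posting optimisation. */
static int setup_ring(struct io_uring *ring, unsigned entries)
{
	unsigned flags = IORING_SETUP_SINGLE_ISSUER |
			 IORING_SETUP_DEFER_TASKRUN;

	/* returns 0 on success, -errno on failure */
	return io_uring_queue_init(entries, ring, flags);
}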

Pavel Begunkov (7):
  io_uring: skip overflow CQE posting for dying ring
  io_uring: don't check overflow flush failures
  io_uring: complete all requests in task context
  io_uring: force multishot CQEs into task context
  io_uring: post msg_ring CQE in task context
  io_uring: use tw for putting rsrc
  io_uring: skip spinlocking for ->task_complete

 include/linux/io_uring.h       |   2 +
 include/linux/io_uring_types.h |   3 +
 io_uring/io_uring.c            | 163 ++++++++++++++++++++++-----------
 io_uring/io_uring.h            |  14 ++-
 io_uring/msg_ring.c            |  14 ++-
 io_uring/net.c                 |  21 +++++
 io_uring/rsrc.c                |  19 +++-
 io_uring/rsrc.h                |   1 +
 8 files changed, 179 insertions(+), 58 deletions(-)

-- 
2.38.1



* [PATCH for-next 1/7] io_uring: skip overflow CQE posting for dying ring
  2022-12-05  2:44 [PATCH for-next 0/7] CQ locking optimisation Pavel Begunkov
@ 2022-12-05  2:44 ` Pavel Begunkov
  2022-12-05  2:44 ` [PATCH for-next 2/7] io_uring: don't check overflow flush failures Pavel Begunkov
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2022-12-05  2:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

After io_ring_ctx_wait_and_kill() is called there should be no users
poking into the rings, and so there is no need to post CQEs. So, instead of
trying to post overflowed CQEs into the CQ, drop them. Also, do it
in io_ring_exit_work() in a loop to reduce the number of contexts it
can be executed from, and so that even when it struggles to quiesce the
ring we won't be leaving memory allocated for longer than needed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 45 +++++++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 436b1ac8f6d0..4721ff6cafaa 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -611,12 +611,30 @@ void io_cq_unlock_post(struct io_ring_ctx *ctx)
 }
 
 /* Returns true if there are no backlogged entries after the flush */
-static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
+static void io_cqring_overflow_kill(struct io_ring_ctx *ctx)
+{
+	struct io_overflow_cqe *ocqe;
+	LIST_HEAD(list);
+
+	io_cq_lock(ctx);
+	list_splice_init(&ctx->cq_overflow_list, &list);
+	clear_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq);
+	io_cq_unlock(ctx);
+
+	while (!list_empty(&list)) {
+		ocqe = list_first_entry(&list, struct io_overflow_cqe, list);
+		list_del(&ocqe->list);
+		kfree(ocqe);
+	}
+}
+
+/* Returns true if there are no backlogged entries after the flush */
+static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 {
 	bool all_flushed;
 	size_t cqe_size = sizeof(struct io_uring_cqe);
 
-	if (!force && __io_cqring_events(ctx) == ctx->cq_entries)
+	if (__io_cqring_events(ctx) == ctx->cq_entries)
 		return false;
 
 	if (ctx->flags & IORING_SETUP_CQE32)
@@ -627,15 +645,11 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 		struct io_uring_cqe *cqe = io_get_cqe_overflow(ctx, true);
 		struct io_overflow_cqe *ocqe;
 
-		if (!cqe && !force)
+		if (!cqe)
 			break;
 		ocqe = list_first_entry(&ctx->cq_overflow_list,
 					struct io_overflow_cqe, list);
-		if (cqe)
-			memcpy(cqe, &ocqe->cqe, cqe_size);
-		else
-			io_account_cq_overflow(ctx);
-
+		memcpy(cqe, &ocqe->cqe, cqe_size);
 		list_del(&ocqe->list);
 		kfree(ocqe);
 	}
@@ -658,7 +672,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 		/* iopoll syncs against uring_lock, not completion_lock */
 		if (ctx->flags & IORING_SETUP_IOPOLL)
 			mutex_lock(&ctx->uring_lock);
-		ret = __io_cqring_overflow_flush(ctx, false);
+		ret = __io_cqring_overflow_flush(ctx);
 		if (ctx->flags & IORING_SETUP_IOPOLL)
 			mutex_unlock(&ctx->uring_lock);
 	}
@@ -1478,7 +1492,7 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
 	check_cq = READ_ONCE(ctx->check_cq);
 	if (unlikely(check_cq)) {
 		if (check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT))
-			__io_cqring_overflow_flush(ctx, false);
+			__io_cqring_overflow_flush(ctx);
 		/*
 		 * Similarly do not spin if we have not informed the user of any
 		 * dropped CQE.
@@ -2646,8 +2660,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		__io_sqe_buffers_unregister(ctx);
 	if (ctx->file_data)
 		__io_sqe_files_unregister(ctx);
-	if (ctx->rings)
-		__io_cqring_overflow_flush(ctx, true);
+	io_cqring_overflow_kill(ctx);
 	io_eventfd_unregister(ctx);
 	io_alloc_cache_free(&ctx->apoll_cache, io_apoll_cache_free);
 	io_alloc_cache_free(&ctx->netmsg_cache, io_netmsg_cache_free);
@@ -2788,6 +2801,12 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 	 * as nobody else will be looking for them.
 	 */
 	do {
+		if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq)) {
+			mutex_lock(&ctx->uring_lock);
+			io_cqring_overflow_kill(ctx);
+			mutex_unlock(&ctx->uring_lock);
+		}
+
 		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
 			io_move_task_work_from_local(ctx);
 
@@ -2853,8 +2872,6 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 
 	mutex_lock(&ctx->uring_lock);
 	percpu_ref_kill(&ctx->refs);
-	if (ctx->rings)
-		__io_cqring_overflow_flush(ctx, true);
 	xa_for_each(&ctx->personalities, index, creds)
 		io_unregister_personality(ctx, index);
 	if (ctx->rings)
-- 
2.38.1



* [PATCH for-next 2/7] io_uring: don't check overflow flush failures
  2022-12-05  2:44 [PATCH for-next 0/7] CQ locking optimisation Pavel Begunkov
  2022-12-05  2:44 ` [PATCH for-next 1/7] io_uring: skip overflow CQE posting for dying ring Pavel Begunkov
@ 2022-12-05  2:44 ` Pavel Begunkov
  2022-12-05  2:44 ` [PATCH for-next 3/7] io_uring: complete all requests in task context Pavel Begunkov
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2022-12-05  2:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

The only way flushing overflowed CQEs can fail is if the CQ is completely
full. There is one place checking for flush failures, i.e. io_cqring_wait(),
but we limit the number of events to wait for by the CQ size, so getting a
failure automatically means that we're done waiting.

Don't check for failures; although they are rare, they might spuriously
fail CQ waiting with -EBUSY.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 24 ++++++------------------
 1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4721ff6cafaa..7239776a9d4b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -629,13 +629,12 @@ static void io_cqring_overflow_kill(struct io_ring_ctx *ctx)
 }
 
 /* Returns true if there are no backlogged entries after the flush */
-static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx)
+static void __io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 {
-	bool all_flushed;
 	size_t cqe_size = sizeof(struct io_uring_cqe);
 
 	if (__io_cqring_events(ctx) == ctx->cq_entries)
-		return false;
+		return;
 
 	if (ctx->flags & IORING_SETUP_CQE32)
 		cqe_size <<= 1;
@@ -654,30 +653,23 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 		kfree(ocqe);
 	}
 
-	all_flushed = list_empty(&ctx->cq_overflow_list);
-	if (all_flushed) {
+	if (list_empty(&ctx->cq_overflow_list)) {
 		clear_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq);
 		atomic_andnot(IORING_SQ_CQ_OVERFLOW, &ctx->rings->sq_flags);
 	}
-
 	io_cq_unlock_post(ctx);
-	return all_flushed;
 }
 
-static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx)
+static void io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 {
-	bool ret = true;
-
 	if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq)) {
 		/* iopoll syncs against uring_lock, not completion_lock */
 		if (ctx->flags & IORING_SETUP_IOPOLL)
 			mutex_lock(&ctx->uring_lock);
-		ret = __io_cqring_overflow_flush(ctx);
+		__io_cqring_overflow_flush(ctx);
 		if (ctx->flags & IORING_SETUP_IOPOLL)
 			mutex_unlock(&ctx->uring_lock);
 	}
-
-	return ret;
 }
 
 void __io_put_task(struct task_struct *task, int nr)
@@ -2505,11 +2497,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 	trace_io_uring_cqring_wait(ctx, min_events);
 	do {
-		/* if we can't even flush overflow, don't wait for more */
-		if (!io_cqring_overflow_flush(ctx)) {
-			ret = -EBUSY;
-			break;
-		}
+		io_cqring_overflow_flush(ctx);
 		prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
 						TASK_INTERRUPTIBLE);
 		ret = io_cqring_wait_schedule(ctx, &iowq, timeout);
-- 
2.38.1



* [PATCH for-next 3/7] io_uring: complete all requests in task context
  2022-12-05  2:44 [PATCH for-next 0/7] CQ locking optimisation Pavel Begunkov
  2022-12-05  2:44 ` [PATCH for-next 1/7] io_uring: skip overflow CQE posting for dying ring Pavel Begunkov
  2022-12-05  2:44 ` [PATCH for-next 2/7] io_uring: don't check overflow flush failures Pavel Begunkov
@ 2022-12-05  2:44 ` Pavel Begunkov
  2022-12-05  2:44 ` [PATCH for-next 4/7] io_uring: force multishot CQEs into " Pavel Begunkov
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2022-12-05  2:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

This patch adds a ctx->task_complete flag. If set, we'll complete all
requests in the context of the original task. Note, this extends to
completion CQE posting only and not to io_kiocb cleanup / free, e.g. io-wq
may free the requests in the free callback. This flag will be used
later for optimisation purposes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/io_uring.h       |  2 ++
 include/linux/io_uring_types.h |  2 ++
 io_uring/io_uring.c            | 14 +++++++++++---
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 29e519752da4..934e5dd4ccc0 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -11,6 +11,8 @@ enum io_uring_cmd_flags {
 	IO_URING_F_UNLOCKED		= 2,
 	/* the request is executed from poll, it should not be freed */
 	IO_URING_F_MULTISHOT		= 4,
+	/* executed by io-wq */
+	IO_URING_F_IOWQ			= 8,
 	/* int's last bit, sign checks are usually faster than a bit test */
 	IO_URING_F_NONBLOCK		= INT_MIN,
 
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index accdfecee953..6be1e1359c89 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -208,6 +208,8 @@ struct io_ring_ctx {
 		unsigned int		drain_disabled: 1;
 		unsigned int		has_evfd: 1;
 		unsigned int		syscall_iopoll: 1;
+		/* all CQEs should be posted only by the submitter task */
+		unsigned int		task_complete: 1;
 	} ____cacheline_aligned_in_smp;
 
 	/* submission data */
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7239776a9d4b..0c86df7112fb 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -932,8 +932,11 @@ static void __io_req_complete_post(struct io_kiocb *req)
 
 void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
 {
-	if (!(issue_flags & IO_URING_F_UNLOCKED) ||
-	    !(req->ctx->flags & IORING_SETUP_IOPOLL)) {
+	if (req->ctx->task_complete && (issue_flags & IO_URING_F_IOWQ)) {
+		req->io_task_work.func = io_req_task_complete;
+		io_req_task_work_add(req);
+	} else if (!(issue_flags & IO_URING_F_UNLOCKED) ||
+		   !(req->ctx->flags & IORING_SETUP_IOPOLL)) {
 		__io_req_complete_post(req);
 	} else {
 		struct io_ring_ctx *ctx = req->ctx;
@@ -1841,7 +1844,7 @@ void io_wq_submit_work(struct io_wq_work *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
 	const struct io_op_def *def = &io_op_defs[req->opcode];
-	unsigned int issue_flags = IO_URING_F_UNLOCKED;
+	unsigned int issue_flags = IO_URING_F_UNLOCKED | IO_URING_F_IOWQ;
 	bool needs_poll = false;
 	int ret = 0, err = -ECANCELED;
 
@@ -3501,6 +3504,11 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (!ctx)
 		return -ENOMEM;
 
+	if ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
+	    !(ctx->flags & IORING_SETUP_IOPOLL) &&
+	    !(ctx->flags & IORING_SETUP_SQPOLL))
+		ctx->task_complete = true;
+
 	/*
 	 * When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, user
 	 * space applications don't need to do io completion events
-- 
2.38.1



* [PATCH for-next 4/7] io_uring: force multishot CQEs into task context
  2022-12-05  2:44 [PATCH for-next 0/7] CQ locking optimisation Pavel Begunkov
                   ` (2 preceding siblings ...)
  2022-12-05  2:44 ` [PATCH for-next 3/7] io_uring: complete all requests in task context Pavel Begunkov
@ 2022-12-05  2:44 ` Pavel Begunkov
  2022-12-05  2:44 ` [PATCH for-next 5/7] io_uring: post msg_ring CQE in " Pavel Begunkov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2022-12-05  2:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Multishot requests post CQEs outside of the normal request completion
path, which is usually done from within a task work handler. However, that
might not be the case when the request has yet to be polled but has been
punted to io-wq. Make it abide by ->task_complete and push it to the
polling path when executed by io-wq.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/net.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/io_uring/net.c b/io_uring/net.c
index 90342dcb6b1d..f276f6dd5b09 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -67,6 +67,19 @@ struct io_sr_msg {
 	struct io_kiocb 		*notif;
 };
 
+static inline bool io_check_multishot(struct io_kiocb *req,
+				      unsigned int issue_flags)
+{
+	/*
+	 * When ->locked_cq is set we only allow to post CQEs from the original
+	 * task context. Usual request completions will be handled in other
+	 * generic paths but multipoll may decide to post extra cqes.
+	 */
+	return !(issue_flags & IO_URING_F_IOWQ) ||
+		!(issue_flags & IO_URING_F_MULTISHOT) ||
+		!req->ctx->task_complete;
+}
+
 int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_shutdown *shutdown = io_kiocb_to_cmd(req, struct io_shutdown);
@@ -730,6 +743,9 @@ int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags)
 	    (sr->flags & IORING_RECVSEND_POLL_FIRST))
 		return io_setup_async_msg(req, kmsg, issue_flags);
 
+	if (!io_check_multishot(req, issue_flags))
+		return io_setup_async_msg(req, kmsg, issue_flags);
+
 retry_multishot:
 	if (io_do_buffer_select(req)) {
 		void __user *buf;
@@ -829,6 +845,9 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	    (sr->flags & IORING_RECVSEND_POLL_FIRST))
 		return -EAGAIN;
 
+	if (!io_check_multishot(req, issue_flags))
+		return -EAGAIN;
+
 	sock = sock_from_file(req->file);
 	if (unlikely(!sock))
 		return -ENOTSOCK;
@@ -1280,6 +1299,8 @@ int io_accept(struct io_kiocb *req, unsigned int issue_flags)
 	struct file *file;
 	int ret, fd;
 
+	if (!io_check_multishot(req, issue_flags))
+		return -EAGAIN;
 retry:
 	if (!fixed) {
 		fd = __get_unused_fd_flags(accept->flags, accept->nofile);
-- 
2.38.1



* [PATCH for-next 5/7] io_uring: post msg_ring CQE in task context
  2022-12-05  2:44 [PATCH for-next 0/7] CQ locking optimisation Pavel Begunkov
                   ` (3 preceding siblings ...)
  2022-12-05  2:44 ` [PATCH for-next 4/7] io_uring: force multishot CQEs into " Pavel Begunkov
@ 2022-12-05  2:44 ` Pavel Begunkov
  2022-12-05 11:53   ` Jens Axboe
  2022-12-05  2:44 ` [PATCH for-next 6/7] io_uring: use tw for putting rsrc Pavel Begunkov
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2022-12-05  2:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

We want to limit post_aux_cqe() to the task context when ->task_complete
is set, and so we can't just deliver an IORING_OP_MSG_RING CQE to another
thread. Instead of trying to invent a new delayed CQE posting mechanism,
push them into the overflow list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 12 ++++++++++++
 io_uring/io_uring.h |  2 ++
 io_uring/msg_ring.c | 14 ++++++++++++--
 3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0c86df7112fb..7fda57dc0e8c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -860,6 +860,18 @@ bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags
 	return __io_post_aux_cqe(ctx, user_data, res, cflags, true);
 }
 
+bool io_post_aux_cqe_overflow(struct io_ring_ctx *ctx,
+			      u64 user_data, s32 res, u32 cflags)
+{
+	bool filled;
+
+	io_cq_lock(ctx);
+	ctx->cq_extra++;
+	filled = io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0);
+	io_cq_unlock_post(ctx);
+	return filled;
+}
+
 bool io_aux_cqe(struct io_ring_ctx *ctx, bool defer, u64 user_data, s32 res, u32 cflags,
 		bool allow_overflow)
 {
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 62227ec3260c..a0b11a631e29 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -36,6 +36,8 @@ bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags
 bool io_aux_cqe(struct io_ring_ctx *ctx, bool defer, u64 user_data, s32 res, u32 cflags,
 		bool allow_overflow);
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
+bool io_post_aux_cqe_overflow(struct io_ring_ctx *ctx,
+			      u64 user_data, s32 res, u32 cflags);
 
 struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
 
diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index afb543aab9f6..7717fe519b07 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -23,6 +23,16 @@ struct io_msg {
 	u32 flags;
 };
 
+/* post cqes to another ring */
+static int io_msg_post_cqe(struct io_ring_ctx *ctx,
+			   u64 user_data, s32 res, u32 cflags)
+{
+	if (!ctx->task_complete || current == ctx->submitter_task)
+		return io_post_aux_cqe(ctx, user_data, res, cflags);
+	else
+		return io_post_aux_cqe_overflow(ctx, user_data, res, cflags);
+}
+
 static int io_msg_ring_data(struct io_kiocb *req)
 {
 	struct io_ring_ctx *target_ctx = req->file->private_data;
@@ -31,7 +41,7 @@ static int io_msg_ring_data(struct io_kiocb *req)
 	if (msg->src_fd || msg->dst_fd || msg->flags)
 		return -EINVAL;
 
-	if (io_post_aux_cqe(target_ctx, msg->user_data, msg->len, 0))
+	if (io_msg_post_cqe(target_ctx, msg->user_data, msg->len, 0))
 		return 0;
 
 	return -EOVERFLOW;
@@ -116,7 +126,7 @@ static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
 	 * completes with -EOVERFLOW, then the sender must ensure that a
 	 * later IORING_OP_MSG_RING delivers the message.
 	 */
-	if (!io_post_aux_cqe(target_ctx, msg->user_data, msg->len, 0))
+	if (!io_msg_post_cqe(target_ctx, msg->user_data, msg->len, 0))
 		ret = -EOVERFLOW;
 out_unlock:
 	io_double_unlock_ctx(ctx, target_ctx, issue_flags);
-- 
2.38.1



* [PATCH for-next 6/7] io_uring: use tw for putting rsrc
  2022-12-05  2:44 [PATCH for-next 0/7] CQ locking optimisation Pavel Begunkov
                   ` (4 preceding siblings ...)
  2022-12-05  2:44 ` [PATCH for-next 5/7] io_uring: post msg_ring CQE in " Pavel Begunkov
@ 2022-12-05  2:44 ` Pavel Begunkov
  2022-12-05  2:44 ` [PATCH for-next 7/7] io_uring: skip spinlocking for ->task_complete Pavel Begunkov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2022-12-05  2:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Use task_work for completing rsrc removals; it'll be needed later for
spinlock optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/io_uring_types.h |  1 +
 io_uring/io_uring.c            |  1 +
 io_uring/rsrc.c                | 19 +++++++++++++++++--
 io_uring/rsrc.h                |  1 +
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 6be1e1359c89..dcd8a563ab52 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -328,6 +328,7 @@ struct io_ring_ctx {
 	struct io_rsrc_data		*buf_data;
 
 	struct delayed_work		rsrc_put_work;
+	struct callback_head		rsrc_put_tw;
 	struct llist_head		rsrc_put_llist;
 	struct list_head		rsrc_ref_list;
 	spinlock_t			rsrc_ref_lock;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7fda57dc0e8c..9eb771a4c912 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -326,6 +326,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	spin_lock_init(&ctx->rsrc_ref_lock);
 	INIT_LIST_HEAD(&ctx->rsrc_ref_list);
 	INIT_DELAYED_WORK(&ctx->rsrc_put_work, io_rsrc_put_work);
+	init_task_work(&ctx->rsrc_put_tw, io_rsrc_put_tw);
 	init_llist_head(&ctx->rsrc_put_llist);
 	init_llist_head(&ctx->work_llist);
 	INIT_LIST_HEAD(&ctx->tctx_list);
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index d25309400a45..18de10c68a15 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -204,6 +204,14 @@ void io_rsrc_put_work(struct work_struct *work)
 	}
 }
 
+void io_rsrc_put_tw(struct callback_head *cb)
+{
+	struct io_ring_ctx *ctx = container_of(cb, struct io_ring_ctx,
+					       rsrc_put_tw);
+
+	io_rsrc_put_work(&ctx->rsrc_put_work.work);
+}
+
 void io_wait_rsrc_data(struct io_rsrc_data *data)
 {
 	if (data && !atomic_dec_and_test(&data->refs))
@@ -242,8 +250,15 @@ static __cold void io_rsrc_node_ref_zero(struct percpu_ref *ref)
 	}
 	spin_unlock_irqrestore(&ctx->rsrc_ref_lock, flags);
 
-	if (first_add)
-		mod_delayed_work(system_wq, &ctx->rsrc_put_work, delay);
+	if (!first_add)
+		return;
+
+	if (ctx->submitter_task) {
+		if (!task_work_add(ctx->submitter_task, &ctx->rsrc_put_tw,
+				   ctx->notify_method))
+			return;
+	}
+	mod_delayed_work(system_wq, &ctx->rsrc_put_work, delay);
 }
 
 static struct io_rsrc_node *io_rsrc_node_alloc(void)
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 81445a477622..2b8743645efc 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -53,6 +53,7 @@ struct io_mapped_ubuf {
 	struct bio_vec	bvec[];
 };
 
+void io_rsrc_put_tw(struct callback_head *cb);
 void io_rsrc_put_work(struct work_struct *work);
 void io_rsrc_refs_refill(struct io_ring_ctx *ctx);
 void io_wait_rsrc_data(struct io_rsrc_data *data);
-- 
2.38.1



* [PATCH for-next 7/7] io_uring: skip spinlocking for ->task_complete
  2022-12-05  2:44 [PATCH for-next 0/7] CQ locking optimisation Pavel Begunkov
                   ` (5 preceding siblings ...)
  2022-12-05  2:44 ` [PATCH for-next 6/7] io_uring: use tw for putting rsrc Pavel Begunkov
@ 2022-12-05  2:44 ` Pavel Begunkov
  2022-12-06 16:53 ` [PATCH for-next 0/7] CQ locking optimisation Jens Axboe
  2022-12-06 17:17 ` Jens Axboe
  8 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2022-12-05  2:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

->task_complete was added to serialise CQE posting by doing it from
the task context only (or from the fallback wq when the task is dead), and
now we can use that to avoid taking ->completion_lock while filling CQ
entries. The patch skips spinlocking only in two spots,
__io_submit_flush_completions() and flushing in io_aux_cqe; that's safer
and covers all the cases we care about. Extra care is taken to force taking
the lock while queueing overflow entries.

It fundamentally relies on SINGLE_ISSUER to have only one task posting
events. It also needs to take into account overflowed CQEs, flushing of
which happens in the CQ wait path, and so this implementation also needs
DEFER_TASKRUN to limit waiters. For the same reason we disable it for
SQPOLL, and for IOPOLL as it won't benefit from it in any case. The
DEFER_TASKRUN, SQPOLL and IOPOLL requirements may be relaxed in the
future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 71 +++++++++++++++++++++++++++++++++------------
 io_uring/io_uring.h | 12 ++++++--
 2 files changed, 62 insertions(+), 21 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 9eb771a4c912..36cb63e4174f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -594,13 +594,25 @@ static inline void io_cq_unlock(struct io_ring_ctx *ctx)
 	spin_unlock(&ctx->completion_lock);
 }
 
+static inline void __io_cq_lock(struct io_ring_ctx *ctx)
+	__acquires(ctx->completion_lock)
+{
+	if (!ctx->task_complete)
+		spin_lock(&ctx->completion_lock);
+}
+
+static inline void __io_cq_unlock(struct io_ring_ctx *ctx)
+{
+	if (!ctx->task_complete)
+		spin_unlock(&ctx->completion_lock);
+}
+
 /* keep it inlined for io_submit_flush_completions() */
-static inline void io_cq_unlock_post_inline(struct io_ring_ctx *ctx)
+static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
 	__releases(ctx->completion_lock)
 {
 	io_commit_cqring(ctx);
-	spin_unlock(&ctx->completion_lock);
-
+	__io_cq_unlock(ctx);
 	io_commit_cqring_flush(ctx);
 	io_cqring_wake(ctx);
 }
@@ -608,7 +620,10 @@ static inline void io_cq_unlock_post_inline(struct io_ring_ctx *ctx)
 void io_cq_unlock_post(struct io_ring_ctx *ctx)
 	__releases(ctx->completion_lock)
 {
-	io_cq_unlock_post_inline(ctx);
+	io_commit_cqring(ctx);
+	spin_unlock(&ctx->completion_lock);
+	io_commit_cqring_flush(ctx);
+	io_cqring_wake(ctx);
 }
 
 /* Returns true if there are no backlogged entries after the flush */
@@ -795,12 +810,13 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 	return &rings->cqes[off];
 }
 
-static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
-			    bool allow_overflow)
+static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
+			      u32 cflags)
 {
 	struct io_uring_cqe *cqe;
 
-	lockdep_assert_held(&ctx->completion_lock);
+	if (!ctx->task_complete)
+		lockdep_assert_held(&ctx->completion_lock);
 
 	ctx->cq_extra++;
 
@@ -823,10 +839,6 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32
 		}
 		return true;
 	}
-
-	if (allow_overflow)
-		return io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0);
-
 	return false;
 }
 
@@ -840,7 +852,17 @@ static void __io_flush_post_cqes(struct io_ring_ctx *ctx)
 	for (i = 0; i < state->cqes_count; i++) {
 		struct io_uring_cqe *cqe = &state->cqes[i];
 
-		io_fill_cqe_aux(ctx, cqe->user_data, cqe->res, cqe->flags, true);
+		if (!io_fill_cqe_aux(ctx, cqe->user_data, cqe->res, cqe->flags)) {
+			if (ctx->task_complete) {
+				spin_lock(&ctx->completion_lock);
+				io_cqring_event_overflow(ctx, cqe->user_data,
+							cqe->res, cqe->flags, 0, 0);
+				spin_unlock(&ctx->completion_lock);
+			} else {
+				io_cqring_event_overflow(ctx, cqe->user_data,
+							cqe->res, cqe->flags, 0, 0);
+			}
+		}
 	}
 	state->cqes_count = 0;
 }
@@ -851,7 +873,10 @@ static bool __io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u
 	bool filled;
 
 	io_cq_lock(ctx);
-	filled = io_fill_cqe_aux(ctx, user_data, res, cflags, allow_overflow);
+	filled = io_fill_cqe_aux(ctx, user_data, res, cflags);
+	if (!filled && allow_overflow)
+		filled = io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0);
+
 	io_cq_unlock_post(ctx);
 	return filled;
 }
@@ -887,10 +912,10 @@ bool io_aux_cqe(struct io_ring_ctx *ctx, bool defer, u64 user_data, s32 res, u32
 	lockdep_assert_held(&ctx->uring_lock);
 
 	if (ctx->submit_state.cqes_count == length) {
-		io_cq_lock(ctx);
+		__io_cq_lock(ctx);
 		__io_flush_post_cqes(ctx);
 		/* no need to flush - flush is deferred */
-		io_cq_unlock(ctx);
+		__io_cq_unlock_post(ctx);
 	}
 
 	/* For defered completions this is not as strict as it is otherwise,
@@ -1418,7 +1443,7 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 	struct io_wq_work_node *node, *prev;
 	struct io_submit_state *state = &ctx->submit_state;
 
-	io_cq_lock(ctx);
+	__io_cq_lock(ctx);
 	/* must come first to preserve CQE ordering in failure cases */
 	if (state->cqes_count)
 		__io_flush_post_cqes(ctx);
@@ -1426,10 +1451,18 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
 					    comp_list);
 
-		if (!(req->flags & REQ_F_CQE_SKIP))
-			io_fill_cqe_req(ctx, req);
+		if (!(req->flags & REQ_F_CQE_SKIP) &&
+		    unlikely(!__io_fill_cqe_req(ctx, req))) {
+			if (ctx->task_complete) {
+				spin_lock(&ctx->completion_lock);
+				io_req_cqe_overflow(req);
+				spin_unlock(&ctx->completion_lock);
+			} else {
+				io_req_cqe_overflow(req);
+			}
+		}
 	}
-	io_cq_unlock_post_inline(ctx);
+	__io_cq_unlock_post(ctx);
 
 	if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
 		io_free_batch_list(ctx, state->compl_reqs.first);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index a0b11a631e29..c20f15f5024d 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -112,7 +112,7 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
 	return io_get_cqe_overflow(ctx, false);
 }
 
-static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
+static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx,
 				     struct io_kiocb *req)
 {
 	struct io_uring_cqe *cqe;
@@ -124,7 +124,7 @@ static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
 	 */
 	cqe = io_get_cqe(ctx);
 	if (unlikely(!cqe))
-		return io_req_cqe_overflow(req);
+		return false;
 
 	trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
 				req->cqe.res, req->cqe.flags,
@@ -147,6 +147,14 @@ static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
 	return true;
 }
 
+static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
+				   struct io_kiocb *req)
+{
+	if (likely(__io_fill_cqe_req(ctx, req)))
+		return true;
+	return io_req_cqe_overflow(req);
+}
+
 static inline void req_set_fail(struct io_kiocb *req)
 {
 	req->flags |= REQ_F_FAIL;
-- 
2.38.1



* Re: [PATCH for-next 5/7] io_uring: post msg_ring CQE in task context
  2022-12-05  2:44 ` [PATCH for-next 5/7] io_uring: post msg_ring CQE in " Pavel Begunkov
@ 2022-12-05 11:53   ` Jens Axboe
  2022-12-05 15:12     ` Dylan Yudaken
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2022-12-05 11:53 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 12/4/22 7:44 PM, Pavel Begunkov wrote:
> We want to limit post_aux_cqe() to the task context when ->task_complete
> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to another
> thread. Instead of trying to invent a new delayed CQE posting mechanism
> push them into the overflow list.

This is really the only one out of the series that I'm not a big fan of.
If we always rely on overflow for msg_ring, then that basically removes
it from being usable in a higher performance setting.

The natural way to do this would be to post the cqe via task_work for
the target ring, but we also don't have any storage available for that.
Might still be better to alloc something ala

struct tw_cqe_post {
	struct task_work work;
	s32 res;
	u32 flags;
	u64 user_data;
}

and post it with that?
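
Rough, untested sketch of what I mean (same idea, just with a ctx pointer added
and a callback_head since that's what task_work_add() actually takes; uses the
target ctx's submitter_task/notify_method, so it assumes a SINGLE_ISSUER target):

struct tw_cqe_post {
	struct callback_head work;
	struct io_ring_ctx *ctx;
	s32 res;
	u32 flags;
	u64 user_data;
};

static void io_tw_cqe_post(struct callback_head *cb)
{
	struct tw_cqe_post *p = container_of(cb, struct tw_cqe_post, work);

	/* runs in the target ring's submitter task context */
	io_post_aux_cqe(p->ctx, p->user_data, p->res, p->flags);
	kfree(p);
}

static int io_msg_post_tw(struct io_ring_ctx *target_ctx,
			  u64 user_data, s32 res, u32 cflags)
{
	struct tw_cqe_post *p = kmalloc(sizeof(*p), GFP_KERNEL);
	int ret;

	if (!p)
		return -ENOMEM;
	init_task_work(&p->work, io_tw_cqe_post);
	p->ctx = target_ctx;
	p->user_data = user_data;
	p->res = res;
	p->flags = cflags;
	/* assumes ->submitter_task is set, i.e. a SINGLE_ISSUER target ring */
	ret = task_work_add(target_ctx->submitter_task, &p->work,
			    target_ctx->notify_method);
	if (ret)
		kfree(p);
	return ret;
}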

-- 
Jens Axboe


* Re: [PATCH for-next 5/7] io_uring: post msg_ring CQE in task context
  2022-12-05 11:53   ` Jens Axboe
@ 2022-12-05 15:12     ` Dylan Yudaken
  2022-12-05 15:18       ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Dylan Yudaken @ 2022-12-05 15:12 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
> > We want to limit post_aux_cqe() to the task context when -
> > >task_complete
> > is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
> > another
> > thread. Instead of trying to invent a new delayed CQE posting
> > mechanism
> > push them into the overflow list.
> 
> This is really the only one out of the series that I'm not a big fan
> of.
> If we always rely on overflow for msg_ring, then that basically
> removes
> it from being usable in a higher performance setting.
> 
> The natural way to do this would be to post the cqe via task_work for
> the target, ring, but we also don't any storage available for that.
> Might still be better to alloc something ala
> 
> struct tw_cqe_post {
>         struct task_work work;
>         s32 res;
>         u32 flags;
>         u64 user_data;
> }
> 
> and post it with that?
> 

It might work to post the whole request to the target, post the cqe,
and then return the request back to the originating ring via tw for the
msg_ring CQE and cleanup.



* Re: [PATCH for-next 5/7] io_uring: post msg_ring CQE in task context
  2022-12-05 15:12     ` Dylan Yudaken
@ 2022-12-05 15:18       ` Jens Axboe
  2022-12-06 10:42         ` Pavel Begunkov
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2022-12-05 15:18 UTC (permalink / raw)
  To: Dylan Yudaken, asml.silence, io-uring

On 12/5/22 8:12 AM, Dylan Yudaken wrote:
> On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
>> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
>>> We want to limit post_aux_cqe() to the task context when -
>>>> task_complete
>>> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
>>> another
>>> thread. Instead of trying to invent a new delayed CQE posting
>>> mechanism
>>> push them into the overflow list.
>>
>> This is really the only one out of the series that I'm not a big fan
>> of.
>> If we always rely on overflow for msg_ring, then that basically
>> removes
>> it from being usable in a higher performance setting.
>>
>> The natural way to do this would be to post the cqe via task_work for
>> the target, ring, but we also don't any storage available for that.
>> Might still be better to alloc something ala
>>
>> struct tw_cqe_post {
>> 	struct task_work work;
>> 	s32 res;
>> 	u32 flags;
>> 	u64 user_data;
>> }
>>
>> and post it with that?
>>
> 
> It might work to post the whole request to the target, post the cqe,
> and then return the request back to the originating ring via tw for the
> msg_ring CQE and cleanup.

I did consider that, but then you need to ref that request as well as
bounce it twice via task_work. Probably easier to just alloc at that
point? Though if you do that, then the target cqe would post later than
the original. And potentially lose -EOVERFLOW if the target ring is
overflown...

-- 
Jens Axboe


* Re: [PATCH for-next 5/7] io_uring: post msg_ring CQE in task context
  2022-12-05 15:18       ` Jens Axboe
@ 2022-12-06 10:42         ` Pavel Begunkov
  2022-12-06 16:06           ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2022-12-06 10:42 UTC (permalink / raw)
  To: Jens Axboe, Dylan Yudaken, io-uring

On 12/5/22 15:18, Jens Axboe wrote:
> On 12/5/22 8:12 AM, Dylan Yudaken wrote:
>> On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
>>> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
>>>> We want to limit post_aux_cqe() to the task context when -
>>>>> task_complete
>>>> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
>>>> another
>>>> thread. Instead of trying to invent a new delayed CQE posting
>>>> mechanism
>>>> push them into the overflow list.
>>>
>>> This is really the only one out of the series that I'm not a big fan
>>> of.
>>> If we always rely on overflow for msg_ring, then that basically
>>> removes
>>> it from being usable in a higher performance setting.
>>>
>>> The natural way to do this would be to post the cqe via task_work for
>>> the target, ring, but we also don't any storage available for that.
>>> Might still be better to alloc something ala
>>>
>>> struct tw_cqe_post {
>>> 	struct task_work work;
>>> 	s32 res;
>>> 	u32 flags;
>>> 	u64 user_data;
>>> }
>>>
>>> and post it with that?

What does it change performance-wise? I need to add a patch to
"try to flush before overflowing", but apart from that it's one
additional allocation in both cases, plus additional
raw / non-batched task_work.

>> It might work to post the whole request to the target, post the cqe,
>> and then return the request back to the originating ring via tw for the
>> msg_ring CQE and cleanup.
> 
> I did consider that, but then you need to ref that request as well as
> bounce it twice via task_work. Probably easier to just alloc at that
> point? Though if you do that, then the target cqe would post later than
> the original. And potentially lose -EOVERFLOW if the target ring is
> overflown...

Double tw is interesting for future plans, but yeah, I don't think
it's so much of a difference in context of this series.

-- 
Pavel Begunkov


* Re: [PATCH for-next 5/7] io_uring: post msg_ring CQE in task context
  2022-12-06 10:42         ` Pavel Begunkov
@ 2022-12-06 16:06           ` Jens Axboe
  2022-12-07  3:59             ` Pavel Begunkov
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2022-12-06 16:06 UTC (permalink / raw)
  To: Pavel Begunkov, Dylan Yudaken, io-uring

On 12/6/22 3:42 AM, Pavel Begunkov wrote:
> On 12/5/22 15:18, Jens Axboe wrote:
>> On 12/5/22 8:12 AM, Dylan Yudaken wrote:
>>> On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
>>>> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
>>>>> We want to limit post_aux_cqe() to the task context when -
>>>>>> task_complete
>>>>> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
>>>>> another
>>>>> thread. Instead of trying to invent a new delayed CQE posting
>>>>> mechanism
>>>>> push them into the overflow list.
>>>>
>>>> This is really the only one out of the series that I'm not a big fan
>>>> of.
>>>> If we always rely on overflow for msg_ring, then that basically
>>>> removes
>>>> it from being usable in a higher performance setting.
>>>>
>>>> The natural way to do this would be to post the cqe via task_work for
>>>> the target, ring, but we also don't any storage available for that.
>>>> Might still be better to alloc something ala
>>>>
>>>> struct tw_cqe_post {
>>>> 	struct task_work work;
>>>> 	s32 res;
>>>> 	u32 flags;
>>>> 	u64 user_data;
>>>> }
>>>>
>>>> and post it with that?
> 
> What does it change performance wise? I need to add a patch to
> "try to flush before overflowing", but apart from that it's one
> additional allocation in both cases but adds additional
> raw / not-batch task_work.

It adds an alloc+free for each one, and an overflow flush is needed on the
recipient side. It also adds a cq lock/unlock, though I don't think that
part will be a big deal.

>>> It might work to post the whole request to the target, post the cqe,
>>> and then return the request back to the originating ring via tw for the
>>> msg_ring CQE and cleanup.
>>
>> I did consider that, but then you need to ref that request as well as
>> bounce it twice via task_work. Probably easier to just alloc at that
>> point? Though if you do that, then the target cqe would post later than
>> the original. And potentially lose -EOVERFLOW if the target ring is
>> overflown...
> 
> Double tw is interesting for future plans, but yeah, I don't think
> it's so much of a difference in context of this series.

I did a half-assed patch for that... Below.


diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 36cb63e4174f..974eeaac313f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1254,10 +1254,10 @@ static void io_req_local_work_add(struct io_kiocb *req)
 	__io_cqring_wake(ctx);
 }
 
-void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
+void __io_req_task_work_add_ctx(struct io_ring_ctx *ctx, struct io_kiocb *req,
+				struct task_struct *task, bool allow_local)
 {
-	struct io_uring_task *tctx = req->task->io_uring;
-	struct io_ring_ctx *ctx = req->ctx;
+	struct io_uring_task *tctx = task->io_uring;
 
 	if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
 		io_req_local_work_add(req);
@@ -1277,6 +1277,11 @@ void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
 	io_fallback_tw(tctx);
 }
 
+void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
+{
+	__io_req_task_work_add_ctx(req->ctx, req, req->task, allow_local);
+}
+
 static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 {
 	struct llist_node *node;
@@ -1865,7 +1870,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 		return ret;
 
 	/* If the op doesn't have a file, we're not polling for it */
-	if ((req->ctx->flags & IORING_SETUP_IOPOLL) && req->file)
+	if ((req->ctx->flags & IORING_SETUP_IOPOLL) && !def->noiopoll && req->file)
 		io_iopoll_req_issued(req, issue_flags);
 
 	return 0;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index c20f15f5024d..3d24cba17504 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -51,6 +51,8 @@ static inline bool io_req_ffs_set(struct io_kiocb *req)
 }
 
 void __io_req_task_work_add(struct io_kiocb *req, bool allow_local);
+void __io_req_task_work_add_ctx(struct io_ring_ctx *ctx, struct io_kiocb *req,
+				struct task_struct *task, bool allow_local);
 bool io_is_uring_fops(struct file *file);
 bool io_alloc_async_data(struct io_kiocb *req);
 void io_req_task_queue(struct io_kiocb *req);
diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 7717fe519b07..fdc189b04d30 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -23,14 +23,41 @@ struct io_msg {
 	u32 flags;
 };
 
+static void io_msg_cqe_post(struct io_kiocb *req, bool *locked)
+{
+	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
+	struct io_ring_ctx *ctx = req->file->private_data;
+	unsigned issue_flags = 0;
+	int ret = 0;
+
+	if (!io_post_aux_cqe(ctx, msg->user_data, msg->len, msg->flags))
+		ret = -EOVERFLOW;
+
+	io_req_set_res(req, ret, 0);
+	if (!*locked)
+		issue_flags = IO_URING_F_UNLOCKED;
+	io_req_complete_post(req, issue_flags);
+}
+
+static int io_msg_post_remote(struct io_ring_ctx *ctx, struct io_kiocb *req)
+{
+	req->io_task_work.func = io_msg_cqe_post;
+	__io_req_task_work_add_ctx(ctx, req, ctx->submitter_task, true);
+	return IOU_ISSUE_SKIP_COMPLETE;
+}
+
 /* post cqes to another ring */
-static int io_msg_post_cqe(struct io_ring_ctx *ctx,
-			   u64 user_data, s32 res, u32 cflags)
+static int io_msg_post_cqe(struct io_ring_ctx *ctx, struct io_kiocb *req)
 {
-	if (!ctx->task_complete || current == ctx->submitter_task)
-		return io_post_aux_cqe(ctx, user_data, res, cflags);
-	else
-		return io_post_aux_cqe_overflow(ctx, user_data, res, cflags);
+	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
+
+	if (!ctx->task_complete || current == ctx->submitter_task) {
+		if (io_post_aux_cqe(ctx, msg->user_data, msg->len, msg->flags))
+			return 0;
+		return -EOVERFLOW;
+	}
+
+	return io_msg_post_remote(ctx, req);
 }
 
 static int io_msg_ring_data(struct io_kiocb *req)
@@ -41,10 +68,7 @@ static int io_msg_ring_data(struct io_kiocb *req)
 	if (msg->src_fd || msg->dst_fd || msg->flags)
 		return -EINVAL;
 
-	if (io_msg_post_cqe(target_ctx, msg->user_data, msg->len, 0))
-		return 0;
-
-	return -EOVERFLOW;
+	return io_msg_post_cqe(target_ctx, req);
 }
 
 static void io_double_unlock_ctx(struct io_ring_ctx *ctx,
@@ -126,8 +150,7 @@ static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
 	 * completes with -EOVERFLOW, then the sender must ensure that a
 	 * later IORING_OP_MSG_RING delivers the message.
 	 */
-	if (!io_msg_post_cqe(target_ctx, msg->user_data, msg->len, 0))
-		ret = -EOVERFLOW;
+	ret = io_msg_post_cqe(target_ctx, req);
 out_unlock:
 	io_double_unlock_ctx(ctx, target_ctx, issue_flags);
 	return ret;
@@ -173,13 +196,11 @@ int io_msg_ring(struct io_kiocb *req, unsigned int issue_flags)
 		break;
 	}
 
+	if (ret == IOU_ISSUE_SKIP_COMPLETE)
+		return IOU_ISSUE_SKIP_COMPLETE;
 done:
 	if (ret < 0)
 		req_set_fail(req);
 	io_req_set_res(req, ret, 0);
-	/* put file to avoid an attempt to IOPOLL the req */
-	if (!(req->flags & REQ_F_FIXED_FILE))
-		io_put_file(req->file);
-	req->file = NULL;
 	return IOU_OK;
 }
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 83dc0f9ad3b2..638df83895fb 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -436,6 +436,7 @@ const struct io_op_def io_op_defs[] = {
 	[IORING_OP_MSG_RING] = {
 		.needs_file		= 1,
 		.iopoll			= 1,
+		.noiopoll		= 1,
 		.name			= "MSG_RING",
 		.prep			= io_msg_ring_prep,
 		.issue			= io_msg_ring,
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index 3efe06d25473..e378eb240538 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -25,6 +25,8 @@ struct io_op_def {
 	unsigned		ioprio : 1;
 	/* supports iopoll */
 	unsigned		iopoll : 1;
+	/* don't iopoll for this request */
+	unsigned		noiopoll : 1;
 	/* opcode specific path will handle ->async_data allocation if needed */
 	unsigned		manual_alloc : 1;
 	/* size of async data needed, if any */

-- 
Jens Axboe


* Re: [PATCH for-next 0/7] CQ locking optimisation
  2022-12-05  2:44 [PATCH for-next 0/7] CQ locking optimisation Pavel Begunkov
                   ` (6 preceding siblings ...)
  2022-12-05  2:44 ` [PATCH for-next 7/7] io_uring: skip spinlocking for ->task_complete Pavel Begunkov
@ 2022-12-06 16:53 ` Jens Axboe
  2022-12-06 17:17 ` Jens Axboe
  8 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2022-12-06 16:53 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 12/4/22 7:44 PM, Pavel Begunkov wrote:
> Optimise CQ locking for event posting depending on a number of ring setup flags.
> QD1 nop benchmark showed 12.067 -> 12.565 MIOPS increase, which more than 8.5%
> of the io_uring kernel overhead (taking into account that the syscall overhead
> is over 50%) or 4.12% of the total performance. Naturally, it's not only about
> QD1, applications can submit a bunch of requests but their completions will may
> arrive randomly hurting batching and so performance (or latency).
> 
> The downside is that we have to punt all io-wq completions to the
> original task. The performance win should diminish with better
> completion batching, but it should be worth it for as it also helps tw,
> which in reality often don't complete too many requests.
> 
> The feature depends on DEFER_TASKRUN but can be relaxed to SINGLE_ISSUER

Let's hash out the details for MSG_RING later, if we have to.

-- 
Jens Axboe




* Re: [PATCH for-next 0/7] CQ locking optimisation
  2022-12-05  2:44 [PATCH for-next 0/7] CQ locking optimisation Pavel Begunkov
                   ` (7 preceding siblings ...)
  2022-12-06 16:53 ` [PATCH for-next 0/7] CQ locking optimisation Jens Axboe
@ 2022-12-06 17:17 ` Jens Axboe
  8 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2022-12-06 17:17 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov


On Mon, 05 Dec 2022 02:44:24 +0000, Pavel Begunkov wrote:
> Optimise CQ locking for event posting depending on a number of ring setup flags.
> QD1 nop benchmark showed 12.067 -> 12.565 MIOPS increase, which more than 8.5%
> of the io_uring kernel overhead (taking into account that the syscall overhead
> is over 50%) or 4.12% of the total performance. Naturally, it's not only about
> QD1, applications can submit a bunch of requests but their completions will may
> arrive randomly hurting batching and so performance (or latency).
> 
> [...]

Applied, thanks!

[1/7] io_uring: skip overflow CQE posting for dying ring
      commit: 3dac93b1fae0b90211ed50fac8c2b48df1fc01dc
[2/7] io_uring: don't check overflow flush failures
      commit: a3f63209455a1d453ee8d9b87d0e07971b3c356e
[3/7] io_uring: complete all requests in task context
      commit: ab857514be26e0050e29696f363a96d238d8817e
[4/7] io_uring: force multishot CQEs into task context
      commit: 6db5fe86590f68c69747e8d5a3190b710e36ffb2
[5/7] io_uring: post msg_ring CQE in task context
      commit: d9143438fdccc62eb31a0985caa00c2876f8aa75
[6/7] io_uring: use tw for putting rsrc
      commit: 3a65f4413a2ccd362227c7d121ef549aa5a92b46
[7/7] io_uring: skip spinlocking for ->task_complete
      commit: 65a52cc3de9d7a93aa4c52a4a03e4a91ad7d1943

Best regards,
-- 
Jens Axboe




* Re: [PATCH for-next 5/7] io_uring: post msg_ring CQE in task context
  2022-12-06 16:06           ` Jens Axboe
@ 2022-12-07  3:59             ` Pavel Begunkov
  2022-12-07 15:42               ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2022-12-07  3:59 UTC (permalink / raw)
  To: Jens Axboe, Dylan Yudaken, io-uring

On 12/6/22 16:06, Jens Axboe wrote:
> On 12/6/22 3:42?AM, Pavel Begunkov wrote:
>> On 12/5/22 15:18, Jens Axboe wrote:
>>> On 12/5/22 8:12 AM, Dylan Yudaken wrote:
>>>> On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
>>>>> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
>>>>>> We want to limit post_aux_cqe() to the task context when -
>>>>>>> task_complete
>>>>>> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
>>>>>> another
>>>>>> thread. Instead of trying to invent a new delayed CQE posting
>>>>>> mechanism
>>>>>> push them into the overflow list.
>>>>>
>>>>> This is really the only one out of the series that I'm not a big fan
>>>>> of.
>>>>> If we always rely on overflow for msg_ring, then that basically
>>>>> removes
>>>>> it from being usable in a higher performance setting.
>>>>>
>>>>> The natural way to do this would be to post the cqe via task_work for
>>>>> the target, ring, but we also don't any storage available for that.
>>>>> Might still be better to alloc something ala
>>>>>
>>>>> struct tw_cqe_post {
>>>>> 	struct task_work work;
>>>>> 	s32 res;
>>>>> 	u32 flags;
>>>>> 	u64 user_data;
>>>>> }
>>>>>
>>>>> and post it with that?
>>
>> What does it change performance wise? I need to add a patch to
>> "try to flush before overflowing", but apart from that it's one
>> additional allocation in both cases but adds additional
>> raw / not-batch task_work.
> 
> It adds alloc+free for each one, and overflow flush needed on the
> recipient side. It also adds a cq lock/unlock, though I don't think that
> part will be a big deal.

And that approach below does 2 tw swings; neither is ideal, but
it feels like a bearable price for poking into another ring.

I sent a series with the double tw approach, which should be better for
CQ ordering; can you pick it up instead? I don't use the io_uring tw
infra of a ring the request doesn't belong to, as that seems to me
like shooting yourself in the leg.
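
Roughly, the CQE-posting hop looks like this (simplified sketch, not the actual
patch; assumes a callback_head ->tw stashed in struct io_msg and that this was
queued on target_ctx->submitter_task via task_work_add()):

/* executed via raw task_work on the target ring's submitter task */
static void io_msg_tw_post(struct callback_head *head)
{
	struct io_msg *msg = container_of(head, struct io_msg, tw);
	struct io_kiocb *req = cmd_to_io_kiocb(msg);
	struct io_ring_ctx *target_ctx = req->file->private_data;
	int ret = 0;

	if (!io_post_aux_cqe(target_ctx, msg->user_data, msg->len, 0))
		ret = -EOVERFLOW;

	/* second hop: bounce back to the source task's tw, which posts the
	 * msg_ring completion from its own ring */
	io_req_set_res(req, ret, 0);
	req->io_task_work.func = io_req_task_complete;
	io_req_task_work_add(req);
}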

-- 
Pavel Begunkov


* Re: [PATCH for-next 5/7] io_uring: post msg_ring CQE in task context
  2022-12-07  3:59             ` Pavel Begunkov
@ 2022-12-07 15:42               ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2022-12-07 15:42 UTC (permalink / raw)
  To: Pavel Begunkov, Dylan Yudaken, io-uring

On 12/6/22 8:59 PM, Pavel Begunkov wrote:
> On 12/6/22 16:06, Jens Axboe wrote:
>> On 12/6/22 3:42 AM, Pavel Begunkov wrote:
>>> On 12/5/22 15:18, Jens Axboe wrote:
>>>> On 12/5/22 8:12 AM, Dylan Yudaken wrote:
>>>>> On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
>>>>>> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
>>>>>>> We want to limit post_aux_cqe() to the task context when -
>>>>>>>> task_complete
>>>>>>> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
>>>>>>> another
>>>>>>> thread. Instead of trying to invent a new delayed CQE posting
>>>>>>> mechanism
>>>>>>> push them into the overflow list.
>>>>>>
>>>>>> This is really the only one out of the series that I'm not a big fan
>>>>>> of.
>>>>>> If we always rely on overflow for msg_ring, then that basically
>>>>>> removes
>>>>>> it from being usable in a higher performance setting.
>>>>>>
>>>>>> The natural way to do this would be to post the cqe via task_work for
>>>>>> the target, ring, but we also don't any storage available for that.
>>>>>> Might still be better to alloc something ala
>>>>>>
>>>>>> struct tw_cqe_post {
>>>>>> 	struct task_work work;
>>>>>> 	s32 res;
>>>>>> 	u32 flags;
>>>>>> 	u64 user_data;
>>>>>> }
>>>>>>
>>>>>> and post it with that?
>>>
>>> What does it change performance wise? I need to add a patch to
>>> "try to flush before overflowing", but apart from that it's one
>>> additional allocation in both cases but adds additional
>>> raw / not-batch task_work.
>>
>> It adds alloc+free for each one, and overflow flush needed on the
>> recipient side. It also adds a cq lock/unlock, though I don't think that
>> part will be a big deal.
> 
> And that approach below does 2 tw swings, neither is ideal but
> it feels like a bearable price for poking into another ring.
> 
> I sent a series with the double tw approach, should be better for
> CQ ordering, can you pick it up instead? I don't use io_uring tw
> infra of a ring the request doesn't belong to as it seems to me
> like shooting yourself in the leg.

Yeah, I think that's the right choice; it was just a quick hack on
my end to see if it was feasible. But it's not a good fit to use
our general tw infra for this.

-- 
Jens Axboe



