io-uring.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH for-next 0/7] io_uring: defer task work to when it is needed
@ 2022-08-15 13:09 Dylan Yudaken
  2022-08-15 13:09 ` [PATCH for-next 1/7] io_uring: use local ctx variable Dylan Yudaken
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Dylan Yudaken @ 2022-08-15 13:09 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring; +Cc: Kernel-team, Dylan Yudaken

We have seen workloads which suffer due to the way task work is currently
scheduled. This scheduling can cause non-trivial tasks to run interrupting
useful work on the workload. For example in network servers, a large async
recv may run, calling memcpy on a large packet, interrupting a send. Which
would add latency.

This series adds an option to defer async work until user space calls
io_uring_enter with the GETEVENTS flag. This allows the workload to choose
when to schedule async work and have finer control (at the expense of
complexity of managing this) of scheduling.

Patches 1/2/3 are prep patches
Patch 4 changes io_uring_enter to not always pre-run task work. It is not
obvious that this is useful regardless of this series
Patch 5/6/7 adds the new flag and functionality

Dylan Yudaken (7):
  io_uring: use local ctx variable
  io_uring: remove unnecessary variable
  io_uring: introduce io_has_work
  io_uring: do not always run task work at the start of io_uring_enter
  io_uring: add IORING_SETUP_DEFER_TASKRUN
  io_uring: move io_eventfd_put
  io_uring: signal registered eventfd to process deferred task work

 include/linux/io_uring_types.h |   3 +
 include/uapi/linux/io_uring.h  |   7 +
 io_uring/cancel.c              |   2 +-
 io_uring/io_uring.c            | 232 +++++++++++++++++++++++++--------
 io_uring/io_uring.h            |  31 ++++-
 io_uring/rsrc.c                |   2 +-
 6 files changed, 221 insertions(+), 56 deletions(-)


base-commit: ff34d8d06a1f16b6a58fb41bfbaa475cc6c02497
-- 
2.30.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH for-next 1/7] io_uring: use local ctx variable
  2022-08-15 13:09 [PATCH for-next 0/7] io_uring: defer task work to when it is needed Dylan Yudaken
@ 2022-08-15 13:09 ` Dylan Yudaken
  2022-08-15 13:46   ` Pavel Begunkov
  2022-08-15 13:09 ` [PATCH for-next 2/7] io_uring: remove unnecessary variable Dylan Yudaken
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Dylan Yudaken @ 2022-08-15 13:09 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring; +Cc: Kernel-team, Dylan Yudaken

small change to use the local ctx

Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 io_uring/io_uring.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ebfdb2212ec2..ab3e3d9e9fcd 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1072,8 +1072,8 @@ void io_req_task_work_add(struct io_kiocb *req)
 		req = container_of(node, struct io_kiocb, io_task_work.node);
 		node = node->next;
 		if (llist_add(&req->io_task_work.node,
-			      &req->ctx->fallback_llist))
-			schedule_delayed_work(&req->ctx->fallback_work, 1);
+			      &ctx->fallback_llist))
+			schedule_delayed_work(&ctx->fallback_work, 1);
 	}
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH for-next 2/7] io_uring: remove unnecessary variable
  2022-08-15 13:09 [PATCH for-next 0/7] io_uring: defer task work to when it is needed Dylan Yudaken
  2022-08-15 13:09 ` [PATCH for-next 1/7] io_uring: use local ctx variable Dylan Yudaken
@ 2022-08-15 13:09 ` Dylan Yudaken
  2022-08-15 13:09 ` [PATCH for-next 3/7] io_uring: introduce io_has_work Dylan Yudaken
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Dylan Yudaken @ 2022-08-15 13:09 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring; +Cc: Kernel-team, Dylan Yudaken

'running' is set once and read once, so can easily just remove it

Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 io_uring/io_uring.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ab3e3d9e9fcd..1d33b2422f2c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1052,12 +1052,9 @@ void io_req_task_work_add(struct io_kiocb *req)
 	struct io_uring_task *tctx = req->task->io_uring;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct llist_node *node;
-	bool running;
-
-	running = !llist_add(&req->io_task_work.node, &tctx->task_list);
 
 	/* task_work already pending, we're done */
-	if (running)
+	if (!llist_add(&req->io_task_work.node, &tctx->task_list))
 		return;
 
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH for-next 3/7] io_uring: introduce io_has_work
  2022-08-15 13:09 [PATCH for-next 0/7] io_uring: defer task work to when it is needed Dylan Yudaken
  2022-08-15 13:09 ` [PATCH for-next 1/7] io_uring: use local ctx variable Dylan Yudaken
  2022-08-15 13:09 ` [PATCH for-next 2/7] io_uring: remove unnecessary variable Dylan Yudaken
@ 2022-08-15 13:09 ` Dylan Yudaken
  2022-08-15 13:09 ` [PATCH for-next 4/7] io_uring: do not always run task work at the start of io_uring_enter Dylan Yudaken
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Dylan Yudaken @ 2022-08-15 13:09 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring; +Cc: Kernel-team, Dylan Yudaken

This will be used later to know if the ring has outstanding work. Right
now just if there is overflow CQEs to copy to the main CQE ring, but later
will include deferred tasks

Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 io_uring/io_uring.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1d33b2422f2c..8cc4b28b1725 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2144,6 +2144,11 @@ struct io_wait_queue {
 	unsigned nr_timeouts;
 };
 
+static inline bool io_has_work(struct io_ring_ctx *ctx)
+{
+	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq);
+}
+
 static inline bool io_should_wake(struct io_wait_queue *iowq)
 {
 	struct io_ring_ctx *ctx = iowq->ctx;
@@ -2162,13 +2167,13 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 {
 	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
 							wq);
+	struct io_ring_ctx *ctx = iowq->ctx;
 
 	/*
 	 * Cannot safely flush overflowed CQEs from here, ensure we wake up
 	 * the task, and the next invocation will do it.
 	 */
-	if (io_should_wake(iowq) ||
-	    test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &iowq->ctx->check_cq))
+	if (io_should_wake(iowq) || io_has_work(ctx))
 		return autoremove_wake_function(curr, mode, wake_flags, key);
 	return -1;
 }
@@ -2504,8 +2509,8 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
 	 * Users may get EPOLLIN meanwhile seeing nothing in cqring, this
 	 * pushs them to do the flush.
 	 */
-	if (io_cqring_events(ctx) ||
-	    test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq))
+
+	if (io_cqring_events(ctx) || io_has_work(ctx))
 		mask |= EPOLLIN | EPOLLRDNORM;
 
 	return mask;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH for-next 4/7] io_uring: do not always run task work at the start of io_uring_enter
  2022-08-15 13:09 [PATCH for-next 0/7] io_uring: defer task work to when it is needed Dylan Yudaken
                   ` (2 preceding siblings ...)
  2022-08-15 13:09 ` [PATCH for-next 3/7] io_uring: introduce io_has_work Dylan Yudaken
@ 2022-08-15 13:09 ` Dylan Yudaken
  2022-08-15 13:50   ` Pavel Begunkov
  2022-08-15 13:09 ` [PATCH for-next 5/7] io_uring: add IORING_SETUP_DEFER_TASKRUN Dylan Yudaken
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Dylan Yudaken @ 2022-08-15 13:09 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring; +Cc: Kernel-team, Dylan Yudaken

It is normally better to wait for task work until after submissions. This
will allow greater batching if either work arrives in the meanwhile, or if
the submissions cause task work to be queued up.

For SQPOLL this also no longer runs task work, but this is handled inside
the SQPOLL loop anyway.

For IOPOLL io_iopoll_check will run task work anyway

Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 io_uring/io_uring.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8cc4b28b1725..3b08369c3c60 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2990,8 +2990,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	struct fd f;
 	long ret;
 
-	io_run_task_work();
-
 	if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP |
 			       IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG |
 			       IORING_ENTER_REGISTERED_RING)))
@@ -3060,7 +3058,11 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
 			goto iopoll_locked;
 		mutex_unlock(&ctx->uring_lock);
+		io_run_task_work();
+	} else {
+		io_run_task_work();
 	}
+
 	if (flags & IORING_ENTER_GETEVENTS) {
 		int ret2;
 		if (ctx->syscall_iopoll) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH for-next 5/7] io_uring: add IORING_SETUP_DEFER_TASKRUN
  2022-08-15 13:09 [PATCH for-next 0/7] io_uring: defer task work to when it is needed Dylan Yudaken
                   ` (3 preceding siblings ...)
  2022-08-15 13:09 ` [PATCH for-next 4/7] io_uring: do not always run task work at the start of io_uring_enter Dylan Yudaken
@ 2022-08-15 13:09 ` Dylan Yudaken
  2022-08-15 14:02   ` Pavel Begunkov
  2022-08-15 13:09 ` [PATCH for-next 6/7] io_uring: move io_eventfd_put Dylan Yudaken
  2022-08-15 13:09 ` [PATCH for-next 7/7] io_uring: signal registered eventfd to process deferred task work Dylan Yudaken
  6 siblings, 1 reply; 15+ messages in thread
From: Dylan Yudaken @ 2022-08-15 13:09 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring; +Cc: Kernel-team, Dylan Yudaken

Allow deferring async tasks until the user calls io_uring_enter(2) with
the IORING_ENTER_GETEVENTS flag.

Being able to hand pick when tasks are run prevents the problem where
there is current work to be done, however task work runs anyway.

For example, a common workload would obtain a batch of CQEs, and process
each one. Interrupting this to additional taskwork would add latency but
not gain anything. If instead task work is deferred to just before more
CQEs are obtained then no additional latency is added.

The way this is implemented is by trying to keep task work local to a
io_ring_ctx, rather than to the submission task. This is required, as the
application will want to wake up only a single io_ring_ctx at a time to
process work, and so the lists of work have to be kept separate.

This has some other benefits like not having to check the task continually
in handle_tw_list (and potentially unlocking/locking those), and reducing
locks in the submit & process completions path.

There are networking cases where using this option can reduce request
latency by 50%. For example a contrived example using [1] where the client
sends 2k data and receives the same data back while doing some system
calls (to trigger task work) shows this reduction. The reason ends up
being that if sending responses is delayed by processing task work, then
the client side sits idle. Whereas reordering the sends first means that
the client runs it's workload in parallel with the local task work.

[1]:
Using https://github.com/DylanZA/netbench/tree/defer_run
Client:
./netbench  --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
Server:
./netbench  --server_only 1 --control_port 10000  --rx "io_uring --defer_taskrun 0 --workload 100"   --rx "io_uring  --defer_taskrun 1 --workload 100"

Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 include/linux/io_uring_types.h |   2 +
 include/uapi/linux/io_uring.h  |   7 ++
 io_uring/cancel.c              |   2 +-
 io_uring/io_uring.c            | 125 ++++++++++++++++++++++++++++-----
 io_uring/io_uring.h            |  31 +++++++-
 io_uring/rsrc.c                |   2 +-
 6 files changed, 148 insertions(+), 21 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 677a25d44d7f..d56ff2185168 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -301,6 +301,8 @@ struct io_ring_ctx {
 		struct io_hash_table	cancel_table;
 		bool			poll_multi_queue;
 
+		struct llist_head	work_llist;
+
 		struct list_head	io_buffers_comp;
 	} ____cacheline_aligned_in_smp;
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1463cfecb56b..be8d1801bf4a 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -153,6 +153,13 @@ enum {
  */
 #define IORING_SETUP_SINGLE_ISSUER	(1U << 12)
 
+/*
+ * Defer running task work to get events.
+ * Rather than running bits of task work whenever the task transitions
+ * try to do it just before it is needed.
+ */
+#define IORING_SETUP_DEFER_TASKRUN	(1U << 13)
+
 enum io_uring_op {
 	IORING_OP_NOP,
 	IORING_OP_READV,
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index e4e1dc0325f0..db6180b62e41 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -292,7 +292,7 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg)
 			break;
 
 		mutex_unlock(&ctx->uring_lock);
-		ret = io_run_task_work_sig();
+		ret = io_run_task_work_sig(ctx);
 		if (ret < 0) {
 			mutex_lock(&ctx->uring_lock);
 			break;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3b08369c3c60..35edeed1bcc1 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -316,6 +316,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->rsrc_ref_list);
 	INIT_DELAYED_WORK(&ctx->rsrc_put_work, io_rsrc_put_work);
 	init_llist_head(&ctx->rsrc_put_llist);
+	init_llist_head(&ctx->work_llist);
 	INIT_LIST_HEAD(&ctx->tctx_list);
 	ctx->submit_state.free_list.next = NULL;
 	INIT_WQ_LIST(&ctx->locked_free_list);
@@ -1047,12 +1048,30 @@ void tctx_task_work(struct callback_head *cb)
 	trace_io_uring_task_work_run(tctx, count, loops);
 }
 
-void io_req_task_work_add(struct io_kiocb *req)
+static void io_req_local_work_add(struct io_kiocb *req)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	if (!llist_add(&req->io_task_work.node, &ctx->work_llist))
+		return;
+
+	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
+		atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
+
+	io_cqring_wake(ctx);
+}
+
+static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
 {
 	struct io_uring_task *tctx = req->task->io_uring;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct llist_node *node;
 
+	if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+		io_req_local_work_add(req);
+		return;
+	}
+
 	/* task_work already pending, we're done */
 	if (!llist_add(&req->io_task_work.node, &tctx->task_list))
 		return;
@@ -1074,6 +1093,69 @@ void io_req_task_work_add(struct io_kiocb *req)
 	}
 }
 
+void io_req_task_work_add(struct io_kiocb *req)
+{
+	__io_req_task_work_add(req, true);
+}
+
+static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
+{
+	struct llist_node *node;
+
+	node = llist_del_all(&ctx->work_llist);
+	while (node) {
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		node = node->next;
+		__io_req_task_work_add(req, false);
+	}
+}
+
+bool io_run_local_work(struct io_ring_ctx *ctx, bool locked)
+{
+	struct llist_node *node;
+	struct llist_node fake;
+	struct llist_node *current_final = NULL;
+	unsigned int count;
+
+	if (!locked)
+		locked = mutex_trylock(&ctx->uring_lock);
+
+	node = io_llist_xchg(&ctx->work_llist, &fake);
+	count = 0;
+again:
+	while (node != current_final) {
+		struct llist_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+		prefetch(container_of(next, struct io_kiocb, io_task_work.node));
+		if (unlikely(!same_thread_group(req->task, current))) {
+			__io_req_task_work_add(req, false);
+		} else {
+			req->io_task_work.func(req, &locked);
+			count++;
+		}
+		node = next;
+	}
+
+	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
+		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
+
+	node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
+	if (node != &fake) {
+		current_final = &fake;
+		node = io_llist_xchg(&ctx->work_llist, &fake);
+		goto again;
+	}
+
+	if (locked) {
+		io_submit_flush_completions(ctx);
+		mutex_unlock(&ctx->uring_lock);
+	}
+	return count > 0;
+}
+
 static void io_req_tw_post(struct io_kiocb *req, bool *locked)
 {
 	io_req_complete_post(req);
@@ -1284,8 +1366,7 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
 		if (wq_list_empty(&ctx->iopoll_list)) {
 			u32 tail = ctx->cached_cq_tail;
 
-			mutex_unlock(&ctx->uring_lock);
-			io_run_task_work();
+			io_run_task_work_unlock_ctx(ctx);
 			mutex_lock(&ctx->uring_lock);
 
 			/* some requests don't go through iopoll_list */
@@ -2146,7 +2227,9 @@ struct io_wait_queue {
 
 static inline bool io_has_work(struct io_ring_ctx *ctx)
 {
-	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq);
+	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
+	       ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
+		!llist_empty(&ctx->work_llist));
 }
 
 static inline bool io_should_wake(struct io_wait_queue *iowq)
@@ -2178,9 +2261,9 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 	return -1;
 }
 
-int io_run_task_work_sig(void)
+int io_run_task_work_sig(struct io_ring_ctx *ctx)
 {
-	if (io_run_task_work())
+	if (io_run_task_work_ctx(ctx, true))
 		return 1;
 	if (task_sigpending(current))
 		return -EINTR;
@@ -2196,7 +2279,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 	unsigned long check_cq;
 
 	/* make sure we run task_work before checking for signals */
-	ret = io_run_task_work_sig();
+	ret = io_run_task_work_sig(ctx);
 	if (ret || io_should_wake(iowq))
 		return ret;
 
@@ -2230,7 +2313,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		io_cqring_overflow_flush(ctx);
 		if (io_cqring_events(ctx) >= min_events)
 			return 0;
-		if (!io_run_task_work())
+		if (!io_run_task_work_ctx(ctx, false))
 			break;
 	} while (1);
 
@@ -2768,13 +2851,14 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 		}
 	}
 
+	io_move_task_work_from_local(ctx);
 	ret |= io_cancel_defer_files(ctx, task, cancel_all);
 	mutex_lock(&ctx->uring_lock);
 	ret |= io_poll_remove_all(ctx, task, cancel_all);
 	mutex_unlock(&ctx->uring_lock);
 	ret |= io_kill_timeouts(ctx, task, cancel_all);
 	if (task)
-		ret |= io_run_task_work();
+		ret |= io_run_task_work_ctx(ctx, true);
 	return ret;
 }
 
@@ -2837,7 +2921,7 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
 		}
 
 		prepare_to_wait(&tctx->wait, &wait, TASK_INTERRUPTIBLE);
-		io_run_task_work();
+		io_run_task_work_ctx(ctx, true);
 		io_uring_drop_tctx_refs(current);
 
 		/*
@@ -3055,12 +3139,15 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			mutex_unlock(&ctx->uring_lock);
 			goto out;
 		}
-		if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
+
+		if (!(flags & IORING_ENTER_GETEVENTS))
+			mutex_unlock(&ctx->uring_lock);
+		else if (ctx->syscall_iopoll)
 			goto iopoll_locked;
-		mutex_unlock(&ctx->uring_lock);
-		io_run_task_work();
+		else
+			io_run_task_work_unlock_ctx(ctx);
 	} else {
-		io_run_task_work();
+		io_run_task_work_ctx(ctx, false);
 	}
 
 	if (flags & IORING_ENTER_GETEVENTS) {
@@ -3293,13 +3380,15 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
 		/* IPI related flags don't make sense with SQPOLL */
 		if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
-				  IORING_SETUP_TASKRUN_FLAG))
+				  IORING_SETUP_TASKRUN_FLAG |
+				  IORING_SETUP_DEFER_TASKRUN))
 			goto err;
 		ctx->notify_method = TWA_SIGNAL_NO_IPI;
 	} else if (ctx->flags & IORING_SETUP_COOP_TASKRUN) {
 		ctx->notify_method = TWA_SIGNAL_NO_IPI;
 	} else {
-		if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
+		if (ctx->flags & IORING_SETUP_TASKRUN_FLAG &&
+		    !(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
 			goto err;
 		ctx->notify_method = TWA_SIGNAL;
 	}
@@ -3404,7 +3493,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
 			IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
-			IORING_SETUP_SINGLE_ISSUER))
+			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
@@ -3876,7 +3965,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
 
 	ctx = f.file->private_data;
 
-	io_run_task_work();
+	io_run_task_work_ctx(ctx, true);
 
 	mutex_lock(&ctx->uring_lock);
 	ret = __io_uring_register(ctx, opcode, arg, nr_args);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2f73f83af960..2a317a0a1eea 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -26,7 +26,8 @@ enum {
 
 struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx);
 bool io_req_cqe_overflow(struct io_kiocb *req);
-int io_run_task_work_sig(void);
+int io_run_task_work_sig(struct io_ring_ctx *ctx);
+bool io_run_local_work(struct io_ring_ctx *ctx, bool locked);
 void io_req_complete_failed(struct io_kiocb *req, s32 res);
 void __io_req_complete(struct io_kiocb *req, unsigned issue_flags);
 void io_req_complete_post(struct io_kiocb *req);
@@ -234,6 +235,34 @@ static inline bool io_run_task_work(void)
 	return false;
 }
 
+static inline bool io_run_task_work_ctx(struct io_ring_ctx *ctx, bool all)
+{
+	bool ret = false;
+
+	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+		ret = io_run_local_work(ctx, false);
+		if (!all)
+			return ret;
+	}
+	ret  |= io_run_task_work();
+	return ret;
+}
+
+static inline bool io_run_task_work_unlock_ctx(struct io_ring_ctx *ctx)
+{
+	bool ret;
+
+	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+		ret = io_run_local_work(ctx, true);
+		mutex_unlock(&ctx->uring_lock);
+	} else {
+		mutex_unlock(&ctx->uring_lock);
+		ret = io_run_task_work();
+	}
+
+	return ret;
+}
+
 static inline void io_tw_lock(struct io_ring_ctx *ctx, bool *locked)
 {
 	if (!*locked) {
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 71359a4d0bd4..80cda6e2067f 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -343,7 +343,7 @@ __cold static int io_rsrc_ref_quiesce(struct io_rsrc_data *data,
 		flush_delayed_work(&ctx->rsrc_put_work);
 		reinit_completion(&data->done);
 
-		ret = io_run_task_work_sig();
+		ret = io_run_task_work_sig(ctx);
 		mutex_lock(&ctx->uring_lock);
 	} while (ret >= 0);
 	data->quiesce = false;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH for-next 6/7] io_uring: move io_eventfd_put
  2022-08-15 13:09 [PATCH for-next 0/7] io_uring: defer task work to when it is needed Dylan Yudaken
                   ` (4 preceding siblings ...)
  2022-08-15 13:09 ` [PATCH for-next 5/7] io_uring: add IORING_SETUP_DEFER_TASKRUN Dylan Yudaken
@ 2022-08-15 13:09 ` Dylan Yudaken
  2022-08-15 13:09 ` [PATCH for-next 7/7] io_uring: signal registered eventfd to process deferred task work Dylan Yudaken
  6 siblings, 0 replies; 15+ messages in thread
From: Dylan Yudaken @ 2022-08-15 13:09 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring; +Cc: Kernel-team, Dylan Yudaken

Non functional change: move this function above io_eventfd_signal so it
can be used from there

Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 io_uring/io_uring.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 35edeed1bcc1..70fa075c6887 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -478,6 +478,14 @@ static __cold void io_queue_deferred(struct io_ring_ctx *ctx)
 	}
 }
 
+static void io_eventfd_put(struct rcu_head *rcu)
+{
+	struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);
+
+	eventfd_ctx_put(ev_fd->cq_ev_fd);
+	kfree(ev_fd);
+}
+
 static void io_eventfd_signal(struct io_ring_ctx *ctx)
 {
 	struct io_ev_fd *ev_fd;
@@ -2452,14 +2460,6 @@ static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg,
 	return 0;
 }
 
-static void io_eventfd_put(struct rcu_head *rcu)
-{
-	struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);
-
-	eventfd_ctx_put(ev_fd->cq_ev_fd);
-	kfree(ev_fd);
-}
-
 static int io_eventfd_unregister(struct io_ring_ctx *ctx)
 {
 	struct io_ev_fd *ev_fd;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH for-next 7/7] io_uring: signal registered eventfd to process deferred task work
  2022-08-15 13:09 [PATCH for-next 0/7] io_uring: defer task work to when it is needed Dylan Yudaken
                   ` (5 preceding siblings ...)
  2022-08-15 13:09 ` [PATCH for-next 6/7] io_uring: move io_eventfd_put Dylan Yudaken
@ 2022-08-15 13:09 ` Dylan Yudaken
  6 siblings, 0 replies; 15+ messages in thread
From: Dylan Yudaken @ 2022-08-15 13:09 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring; +Cc: Kernel-team, Dylan Yudaken

Some workloads rely on a registered eventfd (via
io_uring_register_eventfd(3)) in order to wake up and process the
io_uring.

In the case of a ring setup with IORING_SETUP_DEFER_TASKRUN, that eventfd
also needs to be signalled when there are tasks to run.

This changes an old behaviour which assumed 1 eventfd signal implied at
least 1 CQE, however only when this new flag is set (and so old users will
not notice). This should be expected with the IORING_SETUP_DEFER_TASKRUN
flag as it is not guaranteed that every task will result in a CQE.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 include/linux/io_uring_types.h |  1 +
 io_uring/io_uring.c            | 75 ++++++++++++++++++++++++----------
 2 files changed, 55 insertions(+), 21 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index d56ff2185168..42494176434a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -184,6 +184,7 @@ struct io_ev_fd {
 	struct eventfd_ctx	*cq_ev_fd;
 	unsigned int		eventfd_async: 1;
 	struct rcu_head		rcu;
+	atomic_t		refs;
 };
 
 struct io_alloc_cache {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 70fa075c6887..d54b0f667a59 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -478,33 +478,33 @@ static __cold void io_queue_deferred(struct io_ring_ctx *ctx)
 	}
 }
 
+
+static inline void __io_eventfd_put(struct io_ev_fd *ev_fd)
+{
+	if (atomic_dec_and_test(&ev_fd->refs)) {
+		eventfd_ctx_put(ev_fd->cq_ev_fd);
+		kfree(ev_fd);
+	}
+}
+
+static void io_eventfd_signal_put(struct rcu_head *rcu)
+{
+	struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);
+
+	eventfd_signal(ev_fd->cq_ev_fd, 1);
+	__io_eventfd_put(ev_fd);
+}
+
 static void io_eventfd_put(struct rcu_head *rcu)
 {
 	struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);
 
-	eventfd_ctx_put(ev_fd->cq_ev_fd);
-	kfree(ev_fd);
+	__io_eventfd_put(ev_fd);
 }
 
 static void io_eventfd_signal(struct io_ring_ctx *ctx)
 {
-	struct io_ev_fd *ev_fd;
-	bool skip;
-
-	spin_lock(&ctx->completion_lock);
-	/*
-	 * Eventfd should only get triggered when at least one event has been
-	 * posted. Some applications rely on the eventfd notification count only
-	 * changing IFF a new CQE has been added to the CQ ring. There's no
-	 * depedency on 1:1 relationship between how many times this function is
-	 * called (and hence the eventfd count) and number of CQEs posted to the
-	 * CQ ring.
-	 */
-	skip = ctx->cached_cq_tail == ctx->evfd_last_cq_tail;
-	ctx->evfd_last_cq_tail = ctx->cached_cq_tail;
-	spin_unlock(&ctx->completion_lock);
-	if (skip)
-		return;
+	struct io_ev_fd *ev_fd = NULL;
 
 	rcu_read_lock();
 	/*
@@ -522,13 +522,43 @@ static void io_eventfd_signal(struct io_ring_ctx *ctx)
 		goto out;
 	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
 		goto out;
+	if (ev_fd->eventfd_async && !io_wq_current_is_worker())
+		goto out;
 
-	if (!ev_fd->eventfd_async || io_wq_current_is_worker())
+	if (likely(eventfd_signal_allowed())) {
 		eventfd_signal(ev_fd->cq_ev_fd, 1);
+	} else {
+		atomic_inc(&ev_fd->refs);
+		call_rcu(&ev_fd->rcu, io_eventfd_signal_put);
+	}
+
 out:
 	rcu_read_unlock();
 }
 
+static void io_eventfd_flush_signal(struct io_ring_ctx *ctx)
+{
+	bool skip;
+
+	spin_lock(&ctx->completion_lock);
+
+	/*
+	 * Eventfd should only get triggered when at least one event has been
+	 * posted. Some applications rely on the eventfd notification count
+	 * only changing IFF a new CQE has been added to the CQ ring. There's
+	 * no depedency on 1:1 relationship between how many times this
+	 * function is called (and hence the eventfd count) and number of CQEs
+	 * posted to the CQ ring.
+	 */
+	skip = ctx->cached_cq_tail == ctx->evfd_last_cq_tail;
+	ctx->evfd_last_cq_tail = ctx->cached_cq_tail;
+	spin_unlock(&ctx->completion_lock);
+	if (skip)
+		return;
+
+	io_eventfd_signal(ctx);
+}
+
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
 {
 	if (ctx->off_timeout_used || ctx->drain_active) {
@@ -540,7 +570,7 @@ void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
 		spin_unlock(&ctx->completion_lock);
 	}
 	if (ctx->has_evfd)
-		io_eventfd_signal(ctx);
+		io_eventfd_flush_signal(ctx);
 }
 
 static inline void io_cqring_ev_posted(struct io_ring_ctx *ctx)
@@ -1066,6 +1096,8 @@ static void io_req_local_work_add(struct io_kiocb *req)
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 		atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 
+	if (ctx->has_evfd)
+		io_eventfd_signal(ctx);
 	io_cqring_wake(ctx);
 }
 
@@ -2457,6 +2489,7 @@ static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg,
 	ev_fd->eventfd_async = eventfd_async;
 	ctx->has_evfd = true;
 	rcu_assign_pointer(ctx->io_ev_fd, ev_fd);
+	atomic_set(&ev_fd->refs, 1);
 	return 0;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH for-next 1/7] io_uring: use local ctx variable
  2022-08-15 13:09 ` [PATCH for-next 1/7] io_uring: use local ctx variable Dylan Yudaken
@ 2022-08-15 13:46   ` Pavel Begunkov
  2022-08-15 15:16     ` Dylan Yudaken
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2022-08-15 13:46 UTC (permalink / raw)
  To: Dylan Yudaken, Jens Axboe, io-uring; +Cc: Kernel-team

On 8/15/22 14:09, Dylan Yudaken wrote:
> small change to use the local ctx
> 
> Signed-off-by: Dylan Yudaken <dylany@fb.com>
> ---
>   io_uring/io_uring.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index ebfdb2212ec2..ab3e3d9e9fcd 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -1072,8 +1072,8 @@ void io_req_task_work_add(struct io_kiocb *req)
>   		req = container_of(node, struct io_kiocb, io_task_work.node);
>   		node = node->next;
>   		if (llist_add(&req->io_task_work.node,
> -			      &req->ctx->fallback_llist))
> -			schedule_delayed_work(&req->ctx->fallback_work, 1);
> +			      &ctx->fallback_llist))
> +			schedule_delayed_work(&ctx->fallback_work, 1);

Requests here can be from different rings, you can't use @ctx
from above

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH for-next 4/7] io_uring: do not always run task work at the start of io_uring_enter
  2022-08-15 13:09 ` [PATCH for-next 4/7] io_uring: do not always run task work at the start of io_uring_enter Dylan Yudaken
@ 2022-08-15 13:50   ` Pavel Begunkov
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-08-15 13:50 UTC (permalink / raw)
  To: Dylan Yudaken, Jens Axboe, io-uring; +Cc: Kernel-team

On 8/15/22 14:09, Dylan Yudaken wrote:
> It is normally better to wait for task work until after submissions. This
> will allow greater batching if either work arrives in the meanwhile, or if
> the submissions cause task work to be queued up.
> 
> For SQPOLL this also no longer runs task work, but this is handled inside
> the SQPOLL loop anyway.
> 
> For IOPOLL io_iopoll_check will run task work anyway

It's here to free resources (e.g. io_kiocb so can be reused in the
submission) and so, but we don't care much. Running them after
submission doesn't make much difference as either we go to
cq_wait, which will run it for us, or exit and again they'll be
executed.

In short, instead of moving we can just kill it.

> Signed-off-by: Dylan Yudaken <dylany@fb.com>
> ---
>   io_uring/io_uring.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 8cc4b28b1725..3b08369c3c60 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2990,8 +2990,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>   	struct fd f;
>   	long ret;
>   
> -	io_run_task_work();
> -
>   	if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP |
>   			       IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG |
>   			       IORING_ENTER_REGISTERED_RING)))
> @@ -3060,7 +3058,11 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>   		if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
>   			goto iopoll_locked;
>   		mutex_unlock(&ctx->uring_lock);
> +		io_run_task_work();
> +	} else {
> +		io_run_task_work();
>   	}
> +
>   	if (flags & IORING_ENTER_GETEVENTS) {
>   		int ret2;
>   		if (ctx->syscall_iopoll) {

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH for-next 5/7] io_uring: add IORING_SETUP_DEFER_TASKRUN
  2022-08-15 13:09 ` [PATCH for-next 5/7] io_uring: add IORING_SETUP_DEFER_TASKRUN Dylan Yudaken
@ 2022-08-15 14:02   ` Pavel Begunkov
  2022-08-15 15:25     ` Dylan Yudaken
  2022-08-16 15:28     ` Dylan Yudaken
  0 siblings, 2 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-08-15 14:02 UTC (permalink / raw)
  To: Dylan Yudaken, Jens Axboe, io-uring; +Cc: Kernel-team

On 8/15/22 14:09, Dylan Yudaken wrote:
> Allow deferring async tasks until the user calls io_uring_enter(2) with
> the IORING_ENTER_GETEVENTS flag.
> 
> Being able to hand pick when tasks are run prevents the problem where
> there is current work to be done, however task work runs anyway.
> 
> For example, a common workload would obtain a batch of CQEs, and process
> each one. Interrupting this to additional taskwork would add latency but
> not gain anything. If instead task work is deferred to just before more
> CQEs are obtained then no additional latency is added.
> 
> The way this is implemented is by trying to keep task work local to a
> io_ring_ctx, rather than to the submission task. This is required, as the
> application will want to wake up only a single io_ring_ctx at a time to
> process work, and so the lists of work have to be kept separate.
> 
> This has some other benefits like not having to check the task continually
> in handle_tw_list (and potentially unlocking/locking those), and reducing
> locks in the submit & process completions path.
> 
> There are networking cases where using this option can reduce request
> latency by 50%. For example a contrived example using [1] where the client
> sends 2k data and receives the same data back while doing some system
> calls (to trigger task work) shows this reduction. The reason ends up
> being that if sending responses is delayed by processing task work, then
> the client side sits idle. Whereas reordering the sends first means that
> the client runs it's workload in parallel with the local task work.
> 
> [1]:
> Using https://github.com/DylanZA/netbench/tree/defer_run
> Client:
> ./netbench  --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
> Server:
> ./netbench  --server_only 1 --control_port 10000  --rx "io_uring --defer_taskrun 0 --workload 100"   --rx "io_uring  --defer_taskrun 1 --workload 100"
> 
> Signed-off-by: Dylan Yudaken <dylany@fb.com>
> ---
>   include/linux/io_uring_types.h |   2 +
>   include/uapi/linux/io_uring.h  |   7 ++
>   io_uring/cancel.c              |   2 +-
>   io_uring/io_uring.c            | 125 ++++++++++++++++++++++++++++-----
>   io_uring/io_uring.h            |  31 +++++++-
>   io_uring/rsrc.c                |   2 +-
>   6 files changed, 148 insertions(+), 21 deletions(-)
> 
[...]
> -void io_req_task_work_add(struct io_kiocb *req)
> +static void io_req_local_work_add(struct io_kiocb *req)
> +{
> +	struct io_ring_ctx *ctx = req->ctx;
> +
> +	if (!llist_add(&req->io_task_work.node, &ctx->work_llist))
> +		return;
> +
> +	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
> +		atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
> +
> +	io_cqring_wake(ctx);
> +}
> +
> +static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
>   {
>   	struct io_uring_task *tctx = req->task->io_uring;
>   	struct io_ring_ctx *ctx = req->ctx;
>   	struct llist_node *node;
>   
> +	if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> +		io_req_local_work_add(req);
> +		return;
> +	}
> +
>   	/* task_work already pending, we're done */
>   	if (!llist_add(&req->io_task_work.node, &tctx->task_list))
>   		return;
> @@ -1074,6 +1093,69 @@ void io_req_task_work_add(struct io_kiocb *req)
>   	}
>   }
>   
> +void io_req_task_work_add(struct io_kiocb *req)
> +{
> +	__io_req_task_work_add(req, true);
> +}
> +
> +static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
> +{
> +	struct llist_node *node;
> +
> +	node = llist_del_all(&ctx->work_llist);
> +	while (node) {
> +		struct io_kiocb *req = container_of(node, struct io_kiocb,
> +						    io_task_work.node);
> +
> +		node = node->next;
> +		__io_req_task_work_add(req, false);
> +	}
> +}
> +
> +bool io_run_local_work(struct io_ring_ctx *ctx, bool locked)
> +{
> +	struct llist_node *node;
> +	struct llist_node fake;
> +	struct llist_node *current_final = NULL;
> +	unsigned int count;
> +
> +	if (!locked)
> +		locked = mutex_trylock(&ctx->uring_lock);
> +
> +	node = io_llist_xchg(&ctx->work_llist, &fake);
> +	count = 0;
> +again:
> +	while (node != current_final) {
> +		struct llist_node *next = node->next;
> +		struct io_kiocb *req = container_of(node, struct io_kiocb,
> +						    io_task_work.node);
> +		prefetch(container_of(next, struct io_kiocb, io_task_work.node));
> +		if (unlikely(!same_thread_group(req->task, current))) {

Not same thread group, they have to be executed by the same thread.
One of the assumptions is that current->io_uring is the same
as the request was initialised with.

> +			__io_req_task_work_add(req, false);

Why do we mix local and normal tw in the same ring? I think we
need to do either one or another. What is blocking it?

> +		} else {
> +			req->io_task_work.func(req, &locked);
> +			count++;
> +		}
> +		node = next;
> +	}
> +
> +	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
> +		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
> +
> +	node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
> +	if (node != &fake) {
> +		current_final = &fake;
> +		node = io_llist_xchg(&ctx->work_llist, &fake);
> +		goto again;
> +	}
> +
> +	if (locked) {
> +		io_submit_flush_completions(ctx);
> +		mutex_unlock(&ctx->uring_lock);
> +	}
> +	return count > 0;
> +}
> +
>   static void io_req_tw_post(struct io_kiocb *req, bool *locked)
>   {
>   	io_req_complete_post(req);
> @@ -1284,8 +1366,7 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
>   		if (wq_list_empty(&ctx->iopoll_list)) {
>   			u32 tail = ctx->cached_cq_tail;
>   
> -			mutex_unlock(&ctx->uring_lock);
> -			io_run_task_work();
> +			io_run_task_work_unlock_ctx(ctx);
>   			mutex_lock(&ctx->uring_lock);
>   
>   			/* some requests don't go through iopoll_list */
> @@ -2146,7 +2227,9 @@ struct io_wait_queue {
>   
>   static inline bool io_has_work(struct io_ring_ctx *ctx)
>   {
> -	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq);
> +	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
> +	       ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
> +		!llist_empty(&ctx->work_llist));
>   }
>   
>   static inline bool io_should_wake(struct io_wait_queue *iowq)
> @@ -2178,9 +2261,9 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>   	return -1;
>   }
>   
> -int io_run_task_work_sig(void)
> +int io_run_task_work_sig(struct io_ring_ctx *ctx)
>   {
> -	if (io_run_task_work())
> +	if (io_run_task_work_ctx(ctx, true))
>   		return 1;
>   	if (task_sigpending(current))
>   		return -EINTR;
> @@ -2196,7 +2279,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>   	unsigned long check_cq;
>   
>   	/* make sure we run task_work before checking for signals */
> -	ret = io_run_task_work_sig();
> +	ret = io_run_task_work_sig(ctx);
>   	if (ret || io_should_wake(iowq))
>   		return ret;
>   
> @@ -2230,7 +2313,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>   		io_cqring_overflow_flush(ctx);
>   		if (io_cqring_events(ctx) >= min_events)
>   			return 0;
> -		if (!io_run_task_work())
> +		if (!io_run_task_work_ctx(ctx, false))
>   			break;
>   	} while (1);
>   
> @@ -2768,13 +2851,14 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>   		}
>   	}
>   
> +	io_move_task_work_from_local(ctx);

Why do we even need to move them? Isn't it easier to just
execute them here?

>   	ret |= io_cancel_defer_files(ctx, task, cancel_all);
>   	mutex_lock(&ctx->uring_lock);
>   	ret |= io_poll_remove_all(ctx, task, cancel_all);
>   	mutex_unlock(&ctx->uring_lock);
>   	ret |= io_kill_timeouts(ctx, task, cancel_all);
>   	if (task)
> -		ret |= io_run_task_work();
> +		ret |= io_run_task_work_ctx(ctx, true);
>   	return ret;
>   }
>   
> @@ -2837,7 +2921,7 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
>   		}
>   
>   		prepare_to_wait(&tctx->wait, &wait, TASK_INTERRUPTIBLE);
> -		io_run_task_work();
> +		io_run_task_work_ctx(ctx, true);
>   		io_uring_drop_tctx_refs(current);
>   
>   		/*
> @@ -3055,12 +3139,15 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>   			mutex_unlock(&ctx->uring_lock);
>   			goto out;
>   		}
> -		if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
> +
> +		if (!(flags & IORING_ENTER_GETEVENTS))
> +			mutex_unlock(&ctx->uring_lock);
> +		else if (ctx->syscall_iopoll)
>   			goto iopoll_locked;
> -		mutex_unlock(&ctx->uring_lock);
> -		io_run_task_work();
> +		else
> +			io_run_task_work_unlock_ctx(ctx);

Let's unroll this function and get rid of conditional
locking, especially since you don't need the io_run_task_work()
part here.

>   	} else {
> -		io_run_task_work();
> +		io_run_task_work_ctx(ctx, false);
>   	}
>   
...
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -26,7 +26,8 @@ enum {
>   
...
>   
> +static inline bool io_run_task_work_ctx(struct io_ring_ctx *ctx, bool all)
> +{
> +	bool ret = false;
> +
> +	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> +		ret = io_run_local_work(ctx, false);
> +		if (!all)
> +			return ret;
> +	}
> +	ret  |= io_run_task_work();
> +	return ret;
> +}
> +
> +static inline bool io_run_task_work_unlock_ctx(struct io_ring_ctx *ctx)
> +{
> +	bool ret;
> +
> +	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> +		ret = io_run_local_work(ctx, true);
> +		mutex_unlock(&ctx->uring_lock);

If I read it right, io_run_local_work(locked=true) will
release the lock, and then we unlock it a second time
here. I'd suggest moving conditional locking out
of io_run_local_work().

> +	} else {
> +		mutex_unlock(&ctx->uring_lock);
> +		ret = io_run_task_work();
> +	}
> +
> +	return ret;
> +}
> +

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH for-next 1/7] io_uring: use local ctx variable
  2022-08-15 13:46   ` Pavel Begunkov
@ 2022-08-15 15:16     ` Dylan Yudaken
  0 siblings, 0 replies; 15+ messages in thread
From: Dylan Yudaken @ 2022-08-15 15:16 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring; +Cc: Kernel Team

On Mon, 2022-08-15 at 14:46 +0100, Pavel Begunkov wrote:
> On 8/15/22 14:09, Dylan Yudaken wrote:
> > small change to use the local ctx
> > 
> > Signed-off-by: Dylan Yudaken <dylany@fb.com>
> > ---
> >   io_uring/io_uring.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > index ebfdb2212ec2..ab3e3d9e9fcd 100644
> > --- a/io_uring/io_uring.c
> > +++ b/io_uring/io_uring.c
> > @@ -1072,8 +1072,8 @@ void io_req_task_work_add(struct io_kiocb
> > *req)
> >                 req = container_of(node, struct io_kiocb,
> > io_task_work.node);
> >                 node = node->next;
> >                 if (llist_add(&req->io_task_work.node,
> > -                             &req->ctx->fallback_llist))
> > -                       schedule_delayed_work(&req->ctx-
> > >fallback_work, 1);
> > +                             &ctx->fallback_llist))
> > +                       schedule_delayed_work(&ctx->fallback_work,
> > 1);
> 
> Requests here can be from different rings, you can't use @ctx
> from above
> 

Ah - thats a completely miss by me. I'll remove this patch from the
series.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH for-next 5/7] io_uring: add IORING_SETUP_DEFER_TASKRUN
  2022-08-15 14:02   ` Pavel Begunkov
@ 2022-08-15 15:25     ` Dylan Yudaken
  2022-08-15 15:36       ` Pavel Begunkov
  2022-08-16 15:28     ` Dylan Yudaken
  1 sibling, 1 reply; 15+ messages in thread
From: Dylan Yudaken @ 2022-08-15 15:25 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring; +Cc: Kernel Team

On Mon, 2022-08-15 at 15:02 +0100, Pavel Begunkov wrote:
> On 8/15/22 14:09, Dylan Yudaken wrote:
> > Allow deferring async tasks until the user calls io_uring_enter(2)
> > with
> > the IORING_ENTER_GETEVENTS flag.
> > 
> > Being able to hand pick when tasks are run prevents the problem
> > where
> > there is current work to be done, however task work runs anyway.
> > 
> > For example, a common workload would obtain a batch of CQEs, and
> > process
> > each one. Interrupting this to additional taskwork would add
> > latency but
> > not gain anything. If instead task work is deferred to just before
> > more
> > CQEs are obtained then no additional latency is added.
> > 
> > The way this is implemented is by trying to keep task work local to
> > a
> > io_ring_ctx, rather than to the submission task. This is required,
> > as the
> > application will want to wake up only a single io_ring_ctx at a
> > time to
> > process work, and so the lists of work have to be kept separate.
> > 
> > This has some other benefits like not having to check the task
> > continually
> > in handle_tw_list (and potentially unlocking/locking those), and
> > reducing
> > locks in the submit & process completions path.
> > 
> > There are networking cases where using this option can reduce
> > request
> > latency by 50%. For example a contrived example using [1] where the
> > client
> > sends 2k data and receives the same data back while doing some
> > system
> > calls (to trigger task work) shows this reduction. The reason ends
> > up
> > being that if sending responses is delayed by processing task work,
> > then
> > the client side sits idle. Whereas reordering the sends first means
> > that
> > the client runs it's workload in parallel with the local task work.
> > 
> > [1]:
> > Using https://github.com/DylanZA/netbench/tree/defer_run
> > Client:
> > ./netbench  --client_only 1 --control_port 10000 --host <host> --tx
> > "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --
> > workload 1000"
> > Server:
> > ./netbench  --server_only 1 --control_port 10000  --rx "io_uring --
> > defer_taskrun 0 --workload 100"   --rx "io_uring  --defer_taskrun 1
> > --workload 100"
> > 
> > Signed-off-by: Dylan Yudaken <dylany@fb.com>
> > ---
> >   include/linux/io_uring_types.h |   2 +
> >   include/uapi/linux/io_uring.h  |   7 ++
> >   io_uring/cancel.c              |   2 +-
> >   io_uring/io_uring.c            | 125
> > ++++++++++++++++++++++++++++-----
> >   io_uring/io_uring.h            |  31 +++++++-
> >   io_uring/rsrc.c                |   2 +-
> >   6 files changed, 148 insertions(+), 21 deletions(-)
> > 
> [...]
> > 
> > +
> > +bool io_run_local_work(struct io_ring_ctx *ctx, bool locked)
> > +{
> > +       struct llist_node *node;
> > +       struct llist_node fake;
> > +       struct llist_node *current_final = NULL;
> > +       unsigned int count;
> > +
> > +       if (!locked)
> > +               locked = mutex_trylock(&ctx->uring_lock);
> > +
> > +       node = io_llist_xchg(&ctx->work_llist, &fake);
> > +       count = 0;
> > +again:
> > +       while (node != current_final) {
> > +               struct llist_node *next = node->next;
> > +               struct io_kiocb *req = container_of(node, struct
> > io_kiocb,
> > +                                                  
> > io_task_work.node);
> > +               prefetch(container_of(next, struct io_kiocb,
> > io_task_work.node));
> > +               if (unlikely(!same_thread_group(req->task,
> > current))) {
> 
> Not same thread group, they have to be executed by the same thread.
> One of the assumptions is that current->io_uring is the same
> as the request was initialised with.

How do the wq paths work in that case? I can see in io_queue_iowq that
we only check for same_thread_group.

If required to be the same task we'd probably want to enforce
IORING_SETUP_SINGLE_ISSUER for this flag (not a big problem).

> 
> > +                       __io_req_task_work_add(req, false);
> 
> Why do we mix local and normal tw in the same ring? I think we
> need to do either one or another. What is blocking it?

This is for some slow paths such as cancelling requests or the ring
shutting down. We need to make sure things are run and so need to have
both ways.

> 
> > +               } else {
> > +                       req->io_task_work.func(req, &locked);
> > +                       count++;
> > +               }
> > +               node = next;
> > +       }
> > +
> > +       if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
> > +               atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings-
> > >sq_flags);
> > +
> > +       node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
> > +       if (node != &fake) {
> > +               current_final = &fake;
> > +               node = io_llist_xchg(&ctx->work_llist, &fake);
> > +               goto again;
> > +       }
> > +
> > +       if (locked) {
> > +               io_submit_flush_completions(ctx);
> > +               mutex_unlock(&ctx->uring_lock);
> > +       }
> > +       return count > 0;
> > +}
> > +
> >   static void io_req_tw_post(struct io_kiocb *req, bool *locked)
> >   {
> >         io_req_complete_post(req);
> > @@ -1284,8 +1366,7 @@ static int io_iopoll_check(struct io_ring_ctx
> > *ctx, long min)
> >                 if (wq_list_empty(&ctx->iopoll_list)) {
> >                         u32 tail = ctx->cached_cq_tail;
> >   
> > -                       mutex_unlock(&ctx->uring_lock);
> > -                       io_run_task_work();
> > +                       io_run_task_work_unlock_ctx(ctx);
> >                         mutex_lock(&ctx->uring_lock);
> >   
> >                         /* some requests don't go through
> > iopoll_list */
> > @@ -2146,7 +2227,9 @@ struct io_wait_queue {
> >   
> >   static inline bool io_has_work(struct io_ring_ctx *ctx)
> >   {
> > -       return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq);
> > +       return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq)
> > ||
> > +              ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
> > +               !llist_empty(&ctx->work_llist));
> >   }
> >   
> >   static inline bool io_should_wake(struct io_wait_queue *iowq)
> > @@ -2178,9 +2261,9 @@ static int io_wake_function(struct
> > wait_queue_entry *curr, unsigned int mode,
> >         return -1;
> >   }
> >   
> > -int io_run_task_work_sig(void)
> > +int io_run_task_work_sig(struct io_ring_ctx *ctx)
> >   {
> > -       if (io_run_task_work())
> > +       if (io_run_task_work_ctx(ctx, true))
> >                 return 1;
> >         if (task_sigpending(current))
> >                 return -EINTR;
> > @@ -2196,7 +2279,7 @@ static inline int
> > io_cqring_wait_schedule(struct io_ring_ctx *ctx,
> >         unsigned long check_cq;
> >   
> >         /* make sure we run task_work before checking for signals
> > */
> > -       ret = io_run_task_work_sig();
> > +       ret = io_run_task_work_sig(ctx);
> >         if (ret || io_should_wake(iowq))
> >                 return ret;
> >   
> > @@ -2230,7 +2313,7 @@ static int io_cqring_wait(struct io_ring_ctx
> > *ctx, int min_events,
> >                 io_cqring_overflow_flush(ctx);
> >                 if (io_cqring_events(ctx) >= min_events)
> >                         return 0;
> > -               if (!io_run_task_work())
> > +               if (!io_run_task_work_ctx(ctx, false))
> >                         break;
> >         } while (1);
> >   
> > @@ -2768,13 +2851,14 @@ static __cold bool
> > io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
> >                 }
> >         }
> >   
> > +       io_move_task_work_from_local(ctx);
> 
> Why do we even need to move them? Isn't it easier to just
> execute them here?

I think I was being too conservative. Can execute them here, and if it
is the wrong task can move them to that task's task_work.
> 
> >         ret |= io_cancel_defer_files(ctx, task, cancel_all);
> >         mutex_lock(&ctx->uring_lock);
> >         ret |= io_poll_remove_all(ctx, task, cancel_all);
> >         mutex_unlock(&ctx->uring_lock);
> >         ret |= io_kill_timeouts(ctx, task, cancel_all);
> >         if (task)
> > -               ret |= io_run_task_work();
> > +               ret |= io_run_task_work_ctx(ctx, true);
> >         return ret;
> >   }
> >   
> > @@ -2837,7 +2921,7 @@ __cold void io_uring_cancel_generic(bool
> > cancel_all, struct io_sq_data *sqd)
> >                 }
> >   
> >                 prepare_to_wait(&tctx->wait, &wait,
> > TASK_INTERRUPTIBLE);
> > -               io_run_task_work();
> > +               io_run_task_work_ctx(ctx, true);
> >                 io_uring_drop_tctx_refs(current);
> >   
> >                 /*
> > @@ -3055,12 +3139,15 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
> > int, fd, u32, to_submit,
> >                         mutex_unlock(&ctx->uring_lock);
> >                         goto out;
> >                 }
> > -               if ((flags & IORING_ENTER_GETEVENTS) && ctx-
> > >syscall_iopoll)
> > +
> > +               if (!(flags & IORING_ENTER_GETEVENTS))
> > +                       mutex_unlock(&ctx->uring_lock);
> > +               else if (ctx->syscall_iopoll)
> >                         goto iopoll_locked;
> > -               mutex_unlock(&ctx->uring_lock);
> > -               io_run_task_work();
> > +               else
> > +                       io_run_task_work_unlock_ctx(ctx);
> 
> Let's unroll this function and get rid of conditional
> locking, especially since you don't need the io_run_task_work()
> part here.
> 
> >         } else {
> > -               io_run_task_work();
> > +               io_run_task_work_ctx(ctx, false);
> >         }
> >   
> ...
> > --- a/io_uring/io_uring.h
> > +++ b/io_uring/io_uring.h
> > @@ -26,7 +26,8 @@ enum {
> >   
> ...
> >   
> > +static inline bool io_run_task_work_ctx(struct io_ring_ctx *ctx,
> > bool all)
> > +{
> > +       bool ret = false;
> > +
> > +       if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> > +               ret = io_run_local_work(ctx, false);
> > +               if (!all)
> > +                       return ret;
> > +       }
> > +       ret  |= io_run_task_work();
> > +       return ret;
> > +}
> > +
> > +static inline bool io_run_task_work_unlock_ctx(struct io_ring_ctx
> > *ctx)
> > +{
> > +       bool ret;
> > +
> > +       if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> > +               ret = io_run_local_work(ctx, true);
> > +               mutex_unlock(&ctx->uring_lock);
> 
> If I read it right, io_run_local_work(locked=true) will
> release the lock, and then we unlock it a second time
> here. I'd suggest moving conditional locking out
> of io_run_local_work().

Ah - thanks for spotting that. I'll clean it up

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH for-next 5/7] io_uring: add IORING_SETUP_DEFER_TASKRUN
  2022-08-15 15:25     ` Dylan Yudaken
@ 2022-08-15 15:36       ` Pavel Begunkov
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-08-15 15:36 UTC (permalink / raw)
  To: Dylan Yudaken, axboe, io-uring; +Cc: Kernel Team

On 8/15/22 16:25, Dylan Yudaken wrote:
> On Mon, 2022-08-15 at 15:02 +0100, Pavel Begunkov wrote:
>> On 8/15/22 14:09, Dylan Yudaken wrote:
[...]
>> Not same thread group, they have to be executed by the same thread.
>> One of the assumptions is that current->io_uring is the same
>> as the request was initialised with.
> 
> How do the wq paths work in that case? I can see in io_queue_iowq that
> we only check for same_thread_group.

It's under WARN_ON_ONCE() and should never happen, otherwise
there is a bug. iowq would io_req_complete_post() to complete.

> If required to be the same task we'd probably want to enforce
> IORING_SETUP_SINGLE_ISSUER for this flag (not a big problem).

And I guess also make sure it's not run by a different task
on the waiting side or so.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH for-next 5/7] io_uring: add IORING_SETUP_DEFER_TASKRUN
  2022-08-15 14:02   ` Pavel Begunkov
  2022-08-15 15:25     ` Dylan Yudaken
@ 2022-08-16 15:28     ` Dylan Yudaken
  1 sibling, 0 replies; 15+ messages in thread
From: Dylan Yudaken @ 2022-08-16 15:28 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring; +Cc: Kernel Team

On Mon, 2022-08-15 at 15:02 +0100, Pavel Begunkov wrote:
> > -               if ((flags & IORING_ENTER_GETEVENTS) && ctx-
> > >syscall_iopoll)
> > +
> > +               if (!(flags & IORING_ENTER_GETEVENTS))
> > +                       mutex_unlock(&ctx->uring_lock);
> > +               else if (ctx->syscall_iopoll)
> >                         goto iopoll_locked;
> > -               mutex_unlock(&ctx->uring_lock);
> > -               io_run_task_work();
> > +               else
> > +                       io_run_task_work_unlock_ctx(ctx);
> 
> Let's unroll this function and get rid of conditional
> locking, especially since you don't need the io_run_task_work()
> part here.

I am struggling to figure out how to do this cleanly in v2 as I don't
want to break the existing codepath that skips the unlock/lock for
iopoll.

I'll post v2 anyway - but please let me know if my solution isn't what
you were thinking

Dylan

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-08-16 15:29 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-15 13:09 [PATCH for-next 0/7] io_uring: defer task work to when it is needed Dylan Yudaken
2022-08-15 13:09 ` [PATCH for-next 1/7] io_uring: use local ctx variable Dylan Yudaken
2022-08-15 13:46   ` Pavel Begunkov
2022-08-15 15:16     ` Dylan Yudaken
2022-08-15 13:09 ` [PATCH for-next 2/7] io_uring: remove unnecessary variable Dylan Yudaken
2022-08-15 13:09 ` [PATCH for-next 3/7] io_uring: introduce io_has_work Dylan Yudaken
2022-08-15 13:09 ` [PATCH for-next 4/7] io_uring: do not always run task work at the start of io_uring_enter Dylan Yudaken
2022-08-15 13:50   ` Pavel Begunkov
2022-08-15 13:09 ` [PATCH for-next 5/7] io_uring: add IORING_SETUP_DEFER_TASKRUN Dylan Yudaken
2022-08-15 14:02   ` Pavel Begunkov
2022-08-15 15:25     ` Dylan Yudaken
2022-08-15 15:36       ` Pavel Begunkov
2022-08-16 15:28     ` Dylan Yudaken
2022-08-15 13:09 ` [PATCH for-next 6/7] io_uring: move io_eventfd_put Dylan Yudaken
2022-08-15 13:09 ` [PATCH for-next 7/7] io_uring: signal registered eventfd to process deferred task work Dylan Yudaken

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).