* [PATCHSET for-next 0/8] io_uring SQPOLL improvements
@ 2020-09-03  2:20 Jens Axboe
  2020-09-03  2:20 ` [PATCH 1/8] io_uring: io_sq_thread() doesn't need to flush signals Jens Axboe
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Jens Axboe @ 2020-09-03  2:20 UTC (permalink / raw)
  To: io-uring

The SQPOLL support is useful as it stands, but for various use cases
it's less useful than it could be. In no particular order:

- We currently require root. I'd like to "relax" that to CAP_SYS_NICE
  instead, which I think fits quite nicely with it.

- You get one poll thread per ring. For cases that use multiple rings,
  this isn't necessary and is also quite wasteful.

Patch 1 is a cleanup, patch 2 allows CAP_SYS_NICE for SQPOLL, and the
remaining patches gradually work towards supporting shared SQPOLL
threads. The latter works exactly like IORING_SETUP_ATTACH_WQ, which
currently just shares the io-wq backend between rings; with this added,
we also share the SQPOLL thread.

I'm sure there are some quirks and issues to iron out in the SQPOLL
series, but some initial testing shows it working for me and it passes
the test suite. I'll run some more testing; reviews would be more than
welcome.

 fs/io_uring.c | 385 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 268 insertions(+), 117 deletions(-)

-- 
Jens Axboe




* [PATCH 1/8] io_uring: io_sq_thread() doesn't need to flush signals
  2020-09-03  2:20 [PATCHSET for-next 0/8] io_uring SQPOLL improvements Jens Axboe
@ 2020-09-03  2:20 ` Jens Axboe
  2020-09-03  2:20 ` [PATCH 2/8] io_uring: allow SQPOLL with CAP_SYS_NICE privileges Jens Axboe
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-09-03  2:20 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Kernel threads don't handle signals by default, and we never use
TWA_SIGNAL for the SQPOLL thread internally. Hence we can never have a
signal pending, and we don't need to check for it (nor flush it).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index d97601cacd7f..4766cc54144d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6700,8 +6700,6 @@ static int io_sq_thread(void *data)
 					io_ring_clear_wakeup_flag(ctx);
 					continue;
 				}
-				if (signal_pending(current))
-					flush_signals(current);
 				schedule();
 				finish_wait(&ctx->sqo_wait, &wait);
 
-- 
2.28.0



* [PATCH 2/8] io_uring: allow SQPOLL with CAP_SYS_NICE privileges
  2020-09-03  2:20 [PATCHSET for-next 0/8] io_uring SQPOLL improvements Jens Axboe
  2020-09-03  2:20 ` [PATCH 1/8] io_uring: io_sq_thread() doesn't need to flush signals Jens Axboe
@ 2020-09-03  2:20 ` Jens Axboe
  2020-09-03  2:20 ` [PATCH 3/8] io_uring: use private ctx wait queue entries for SQPOLL Jens Axboe
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-09-03  2:20 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

CAP_SYS_ADMIN is too restrictive for a lot of use cases; allow
CAP_SYS_NICE instead, based on the premise that such users are already
allowed to raise the priority of tasks.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4766cc54144d..e2e62dbc4b93 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7552,7 +7552,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
 		ret = -EPERM;
-		if (!capable(CAP_SYS_ADMIN))
+		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_NICE))
 			goto err;
 
 		/*
-- 
2.28.0



* [PATCH 3/8] io_uring: use private ctx wait queue entries for SQPOLL
  2020-09-03  2:20 [PATCHSET for-next 0/8] io_uring SQPOLL improvements Jens Axboe
  2020-09-03  2:20 ` [PATCH 1/8] io_uring: io_sq_thread() doesn't need to flush signals Jens Axboe
  2020-09-03  2:20 ` [PATCH 2/8] io_uring: allow SQPOLL with CAP_SYS_NICE privileges Jens Axboe
@ 2020-09-03  2:20 ` Jens Axboe
  2020-09-03  2:20 ` [PATCH 4/8] io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler Jens Axboe
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-09-03  2:20 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

This is in preparation for sharing the poller thread between rings. For
that we need per-ring wait_queue_entry storage, and we can't easily put
that on the stack if one thread is managing multiple rings.

We'll also be sharing the wait_queue_head across rings for the purposes
of wakeups. Provide the usual private per-ring wait_queue_head for now,
but make it a pointer so we can easily override it when sharing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e2e62dbc4b93..76f02db37ffc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -277,7 +277,10 @@ struct io_ring_ctx {
 	struct io_wq		*io_wq;
 	struct task_struct	*sqo_thread;	/* if using sq thread polling */
 	struct mm_struct	*sqo_mm;
-	wait_queue_head_t	sqo_wait;
+	struct wait_queue_head	*sqo_wait;
+	struct wait_queue_head	__sqo_wait;
+	struct wait_queue_entry	sqo_wait_entry;
+
 
 	/*
 	 * For SQPOLL usage - no reference is held to this file table, we
@@ -1083,7 +1086,8 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		goto err;
 
 	ctx->flags = p->flags;
-	init_waitqueue_head(&ctx->sqo_wait);
+	init_waitqueue_head(&ctx->__sqo_wait);
+	ctx->sqo_wait = &ctx->__sqo_wait;
 	init_waitqueue_head(&ctx->cq_wait);
 	INIT_LIST_HEAD(&ctx->cq_overflow_list);
 	init_completion(&ctx->ref_comp);
@@ -1346,8 +1350,8 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
 {
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
-	if (waitqueue_active(&ctx->sqo_wait))
-		wake_up(&ctx->sqo_wait);
+	if (waitqueue_active(ctx->sqo_wait))
+		wake_up(ctx->sqo_wait);
 	if (io_should_trigger_evfd(ctx))
 		eventfd_signal(ctx->cq_ev_fd, 1);
 }
@@ -2411,9 +2415,8 @@ static void io_iopoll_req_issued(struct io_kiocb *req)
 	else
 		list_add_tail(&req->inflight_entry, &ctx->iopoll_list);
 
-	if ((ctx->flags & IORING_SETUP_SQPOLL) &&
-	    wq_has_sleeper(&ctx->sqo_wait))
-		wake_up(&ctx->sqo_wait);
+	if ((ctx->flags & IORING_SETUP_SQPOLL) && wq_has_sleeper(ctx->sqo_wait))
+		wake_up(ctx->sqo_wait);
 }
 
 static void __io_state_file_put(struct io_submit_state *state)
@@ -6614,10 +6617,11 @@ static int io_sq_thread(void *data)
 {
 	struct io_ring_ctx *ctx = data;
 	const struct cred *old_cred;
-	DEFINE_WAIT(wait);
 	unsigned long timeout;
 	int ret = 0;
 
+	init_wait(&ctx->sqo_wait_entry);
+
 	complete(&ctx->sq_thread_comp);
 
 	old_cred = override_creds(ctx->creds);
@@ -6671,7 +6675,7 @@ static int io_sq_thread(void *data)
 				continue;
 			}
 
-			prepare_to_wait(&ctx->sqo_wait, &wait,
+			prepare_to_wait(ctx->sqo_wait, &ctx->sqo_wait_entry,
 						TASK_INTERRUPTIBLE);
 
 			/*
@@ -6683,7 +6687,7 @@ static int io_sq_thread(void *data)
 			 */
 			if ((ctx->flags & IORING_SETUP_IOPOLL) &&
 			    !list_empty_careful(&ctx->iopoll_list)) {
-				finish_wait(&ctx->sqo_wait, &wait);
+				finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
 				continue;
 			}
 
@@ -6692,22 +6696,22 @@ static int io_sq_thread(void *data)
 			to_submit = io_sqring_entries(ctx);
 			if (!to_submit || ret == -EBUSY) {
 				if (kthread_should_park()) {
-					finish_wait(&ctx->sqo_wait, &wait);
+					finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
 					break;
 				}
 				if (io_run_task_work()) {
-					finish_wait(&ctx->sqo_wait, &wait);
+					finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
 					io_ring_clear_wakeup_flag(ctx);
 					continue;
 				}
 				schedule();
-				finish_wait(&ctx->sqo_wait, &wait);
+				finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
 
 				io_ring_clear_wakeup_flag(ctx);
 				ret = 0;
 				continue;
 			}
-			finish_wait(&ctx->sqo_wait, &wait);
+			finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
 
 			io_ring_clear_wakeup_flag(ctx);
 		}
@@ -8371,7 +8375,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		if (!list_empty_careful(&ctx->cq_overflow_list))
 			io_cqring_overflow_flush(ctx, false);
 		if (flags & IORING_ENTER_SQ_WAKEUP)
-			wake_up(&ctx->sqo_wait);
+			wake_up(ctx->sqo_wait);
 		submitted = to_submit;
 	} else if (to_submit) {
 		mutex_lock(&ctx->uring_lock);
-- 
2.28.0



* [PATCH 4/8] io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler
  2020-09-03  2:20 [PATCHSET for-next 0/8] io_uring SQPOLL improvements Jens Axboe
                   ` (2 preceding siblings ...)
  2020-09-03  2:20 ` [PATCH 3/8] io_uring: use private ctx wait queue entries for SQPOLL Jens Axboe
@ 2020-09-03  2:20 ` Jens Axboe
  2020-09-03  2:20 ` [PATCH 5/8] io_uring: split work handling part of SQPOLL into helper Jens Axboe
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-09-03  2:20 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

We need to decouple the clearing on wakeup from the inline schedule,
as that is going to be required for handling multiple rings in one
thread.

Wrap our wakeup handler so we can clear the flag when we get the wakeup;
by definition, that is when we no longer need it set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 76f02db37ffc..95c81e0395d9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6613,6 +6613,18 @@ static inline void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx)
 	spin_unlock_irq(&ctx->completion_lock);
 }
 
+static int io_sq_wake_function(struct wait_queue_entry *wqe, unsigned mode,
+			       int sync, void *key)
+{
+	struct io_ring_ctx *ctx = container_of(wqe, struct io_ring_ctx, sqo_wait_entry);
+	int ret;
+
+	ret = autoremove_wake_function(wqe, mode, sync, key);
+	if (ret)
+		io_ring_clear_wakeup_flag(ctx);
+	return ret;
+}
+
 static int io_sq_thread(void *data)
 {
 	struct io_ring_ctx *ctx = data;
@@ -6621,6 +6633,7 @@ static int io_sq_thread(void *data)
 	int ret = 0;
 
 	init_wait(&ctx->sqo_wait_entry);
+	ctx->sqo_wait_entry.func = io_sq_wake_function;
 
 	complete(&ctx->sq_thread_comp);
 
@@ -6707,7 +6720,6 @@ static int io_sq_thread(void *data)
 				schedule();
 				finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
 
-				io_ring_clear_wakeup_flag(ctx);
 				ret = 0;
 				continue;
 			}
-- 
2.28.0



* [PATCH 5/8] io_uring: split work handling part of SQPOLL into helper
  2020-09-03  2:20 [PATCHSET for-next 0/8] io_uring SQPOLL improvements Jens Axboe
                   ` (3 preceding siblings ...)
  2020-09-03  2:20 ` [PATCH 4/8] io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler Jens Axboe
@ 2020-09-03  2:20 ` Jens Axboe
  2020-09-03  2:20 ` [PATCH 6/8] io_uring: split SQPOLL data into separate structure Jens Axboe
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-09-03  2:20 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

This is done in preparation for handling more than one ctx, but it also
cleans up the code a bit, since io_sq_thread() was a bit too unwieldy to
get a good overview of.

__io_sq_thread() is now the main handler, and it returns an enum sq_ret
that tells io_sq_thread() what it ended up doing. The parent then makes
a decision on idle, spinning, or work handling based on that.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 183 ++++++++++++++++++++++++++------------------------
 1 file changed, 96 insertions(+), 87 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 95c81e0395d9..8ce1b4247120 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6625,114 +6625,123 @@ static int io_sq_wake_function(struct wait_queue_entry *wqe, unsigned mode,
 	return ret;
 }
 
-static int io_sq_thread(void *data)
+enum sq_ret {
+	SQT_IDLE	= 1,
+	SQT_SPIN	= 2,
+	SQT_DID_WORK	= 4,
+};
+
+static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
+				  unsigned long start_jiffies)
 {
-	struct io_ring_ctx *ctx = data;
-	const struct cred *old_cred;
-	unsigned long timeout;
+	unsigned long timeout = start_jiffies + ctx->sq_thread_idle;
+	unsigned int to_submit;
 	int ret = 0;
 
-	init_wait(&ctx->sqo_wait_entry);
-	ctx->sqo_wait_entry.func = io_sq_wake_function;
+again:
+	if (!list_empty(&ctx->iopoll_list)) {
+		unsigned nr_events = 0;
 
-	complete(&ctx->sq_thread_comp);
+		mutex_lock(&ctx->uring_lock);
+		if (!list_empty(&ctx->iopoll_list) && !need_resched())
+			io_do_iopoll(ctx, &nr_events, 0);
+		mutex_unlock(&ctx->uring_lock);
+	}
 
-	old_cred = override_creds(ctx->creds);
+	to_submit = io_sqring_entries(ctx);
 
-	task_lock(current);
-	current->files = ctx->sqo_files;
-	task_unlock(current);
+	/*
+	 * If submit got -EBUSY, flag us as needing the application
+	 * to enter the kernel to reap and flush events.
+	 */
+	if (!to_submit || ret == -EBUSY || need_resched()) {
+		/*
+		 * Drop cur_mm before scheduling, we can't hold it for
+		 * long periods (or over schedule()). Do this before
+		 * adding ourselves to the waitqueue, as the unuse/drop
+		 * may sleep.
+		 */
+		io_sq_thread_drop_mm();
 
-	timeout = jiffies + ctx->sq_thread_idle;
-	while (!kthread_should_park()) {
-		unsigned int to_submit;
+		/*
+		 * We're polling. If we're within the defined idle
+		 * period, then let us spin without work before going
+		 * to sleep. The exception is if we got EBUSY doing
+		 * more IO, we should wait for the application to
+		 * reap events and wake us up.
+		 */
+		if (!list_empty(&ctx->iopoll_list) || need_resched() ||
+		    (!time_after(start_jiffies, timeout) && ret != -EBUSY &&
+		    !percpu_ref_is_dying(&ctx->refs)))
+			return SQT_SPIN;
 
-		if (!list_empty(&ctx->iopoll_list)) {
-			unsigned nr_events = 0;
+		prepare_to_wait(ctx->sqo_wait, &ctx->sqo_wait_entry,
+					TASK_INTERRUPTIBLE);
 
-			mutex_lock(&ctx->uring_lock);
-			if (!list_empty(&ctx->iopoll_list) && !need_resched())
-				io_do_iopoll(ctx, &nr_events, 0);
-			else
-				timeout = jiffies + ctx->sq_thread_idle;
-			mutex_unlock(&ctx->uring_lock);
+		/*
+		 * While doing polled IO, before going to sleep, we need
+		 * to check if there are new reqs added to iopoll_list,
+		 * it is because reqs may have been punted to io worker
+		 * and will be added to iopoll_list later, hence check
+		 * the iopoll_list again.
+		 */
+		if ((ctx->flags & IORING_SETUP_IOPOLL) &&
+		    !list_empty_careful(&ctx->iopoll_list)) {
+			finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
+			goto again;
 		}
 
+		io_ring_set_wakeup_flag(ctx);
+
 		to_submit = io_sqring_entries(ctx);
+		if (!to_submit || ret == -EBUSY)
+			return SQT_IDLE;
 
-		/*
-		 * If submit got -EBUSY, flag us as needing the application
-		 * to enter the kernel to reap and flush events.
-		 */
-		if (!to_submit || ret == -EBUSY || need_resched()) {
-			/*
-			 * Drop cur_mm before scheduling, we can't hold it for
-			 * long periods (or over schedule()). Do this before
-			 * adding ourselves to the waitqueue, as the unuse/drop
-			 * may sleep.
-			 */
-			io_sq_thread_drop_mm();
+		finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
+		io_ring_clear_wakeup_flag(ctx);
+	}
 
-			/*
-			 * We're polling. If we're within the defined idle
-			 * period, then let us spin without work before going
-			 * to sleep. The exception is if we got EBUSY doing
-			 * more IO, we should wait for the application to
-			 * reap events and wake us up.
-			 */
-			if (!list_empty(&ctx->iopoll_list) || need_resched() ||
-			    (!time_after(jiffies, timeout) && ret != -EBUSY &&
-			    !percpu_ref_is_dying(&ctx->refs))) {
-				io_run_task_work();
-				cond_resched();
-				continue;
-			}
+	mutex_lock(&ctx->uring_lock);
+	if (likely(!percpu_ref_is_dying(&ctx->refs)))
+		ret = io_submit_sqes(ctx, to_submit, NULL, -1);
+	mutex_unlock(&ctx->uring_lock);
+	return SQT_DID_WORK;
+}
 
-			prepare_to_wait(ctx->sqo_wait, &ctx->sqo_wait_entry,
-						TASK_INTERRUPTIBLE);
+static int io_sq_thread(void *data)
+{
+	struct io_ring_ctx *ctx = data;
+	const struct cred *old_cred;
+	unsigned long start_jiffies;
 
-			/*
-			 * While doing polled IO, before going to sleep, we need
-			 * to check if there are new reqs added to iopoll_list,
-			 * it is because reqs may have been punted to io worker
-			 * and will be added to iopoll_list later, hence check
-			 * the iopoll_list again.
-			 */
-			if ((ctx->flags & IORING_SETUP_IOPOLL) &&
-			    !list_empty_careful(&ctx->iopoll_list)) {
-				finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
-				continue;
-			}
+	init_wait(&ctx->sqo_wait_entry);
+	ctx->sqo_wait_entry.func = io_sq_wake_function;
 
-			io_ring_set_wakeup_flag(ctx);
+	complete(&ctx->sq_thread_comp);
 
-			to_submit = io_sqring_entries(ctx);
-			if (!to_submit || ret == -EBUSY) {
-				if (kthread_should_park()) {
-					finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
-					break;
-				}
-				if (io_run_task_work()) {
-					finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
-					io_ring_clear_wakeup_flag(ctx);
-					continue;
-				}
-				schedule();
-				finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
+	old_cred = override_creds(ctx->creds);
 
-				ret = 0;
-				continue;
-			}
-			finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
+	task_lock(current);
+	current->files = ctx->sqo_files;
+	task_unlock(current);
 
-			io_ring_clear_wakeup_flag(ctx);
-		}
+	start_jiffies = jiffies;
+	while (!kthread_should_park()) {
+		enum sq_ret ret;
 
-		mutex_lock(&ctx->uring_lock);
-		if (likely(!percpu_ref_is_dying(&ctx->refs)))
-			ret = io_submit_sqes(ctx, to_submit, NULL, -1);
-		mutex_unlock(&ctx->uring_lock);
-		timeout = jiffies + ctx->sq_thread_idle;
+		ret = __io_sq_thread(ctx, start_jiffies);
+		switch (ret) {
+		case SQT_IDLE:
+			schedule();
+			start_jiffies = jiffies;
+			continue;
+		case SQT_SPIN:
+			io_run_task_work();
+			cond_resched();
+			fallthrough;
+		case SQT_DID_WORK:
+			continue;
+		}
 	}
 
 	io_run_task_work();
-- 
2.28.0



* [PATCH 6/8] io_uring: split SQPOLL data into separate structure
  2020-09-03  2:20 [PATCHSET for-next 0/8] io_uring SQPOLL improvements Jens Axboe
                   ` (4 preceding siblings ...)
  2020-09-03  2:20 ` [PATCH 5/8] io_uring: split work handling part of SQPOLL into helper Jens Axboe
@ 2020-09-03  2:20 ` Jens Axboe
  2020-09-03  2:20 ` [PATCH 7/8] io_uring: base SQPOLL handling off io_sq_data Jens Axboe
  2020-09-03  2:20 ` [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too Jens Axboe
  7 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-09-03  2:20 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Move all the necessary state out of io_ring_ctx and into a new
structure, io_sq_data. The latter now holds any state or variables
associated with the SQPOLL thread itself.

This is in preparation for supporting more than one io_ring_ctx per
SQPOLL thread.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 104 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 73 insertions(+), 31 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8ce1b4247120..35ea69aad9c0 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -229,6 +229,12 @@ struct io_restriction {
 	bool registered;
 };
 
+struct io_sq_data {
+	refcount_t		refs;
+	struct task_struct	*thread;
+	struct wait_queue_head	wait;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -275,13 +281,7 @@ struct io_ring_ctx {
 
 	/* IO offload */
 	struct io_wq		*io_wq;
-	struct task_struct	*sqo_thread;	/* if using sq thread polling */
 	struct mm_struct	*sqo_mm;
-	struct wait_queue_head	*sqo_wait;
-	struct wait_queue_head	__sqo_wait;
-	struct wait_queue_entry	sqo_wait_entry;
-
-
 	/*
 	 * For SQPOLL usage - no reference is held to this file table, we
 	 * rely on fops->flush() and our callback there waiting for the users
@@ -289,6 +289,10 @@ struct io_ring_ctx {
 	 */
 	struct files_struct	*sqo_files;
 
+	struct wait_queue_entry	sqo_wait_entry;
+
+	struct io_sq_data	*sq_data;	/* if using sq thread polling */
+
 	/*
 	 * If used, fixed file set. Writers must ensure that ->refs is dead,
 	 * readers must ensure that ->refs is alive as long as the file* is
@@ -1086,8 +1090,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		goto err;
 
 	ctx->flags = p->flags;
-	init_waitqueue_head(&ctx->__sqo_wait);
-	ctx->sqo_wait = &ctx->__sqo_wait;
 	init_waitqueue_head(&ctx->cq_wait);
 	INIT_LIST_HEAD(&ctx->cq_overflow_list);
 	init_completion(&ctx->ref_comp);
@@ -1350,8 +1352,8 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
 {
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
-	if (waitqueue_active(ctx->sqo_wait))
-		wake_up(ctx->sqo_wait);
+	if (ctx->sq_data && waitqueue_active(&ctx->sq_data->wait))
+		wake_up(&ctx->sq_data->wait);
 	if (io_should_trigger_evfd(ctx))
 		eventfd_signal(ctx->cq_ev_fd, 1);
 }
@@ -2415,8 +2417,9 @@ static void io_iopoll_req_issued(struct io_kiocb *req)
 	else
 		list_add_tail(&req->inflight_entry, &ctx->iopoll_list);
 
-	if ((ctx->flags & IORING_SETUP_SQPOLL) && wq_has_sleeper(ctx->sqo_wait))
-		wake_up(ctx->sqo_wait);
+	if ((ctx->flags & IORING_SETUP_SQPOLL) &&
+	    wq_has_sleeper(&ctx->sq_data->wait))
+		wake_up(&ctx->sq_data->wait);
 }
 
 static void __io_state_file_put(struct io_submit_state *state)
@@ -6635,6 +6638,7 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 				  unsigned long start_jiffies)
 {
 	unsigned long timeout = start_jiffies + ctx->sq_thread_idle;
+	struct io_sq_data *sqd = ctx->sq_data;
 	unsigned int to_submit;
 	int ret = 0;
 
@@ -6675,7 +6679,7 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 		    !percpu_ref_is_dying(&ctx->refs)))
 			return SQT_SPIN;
 
-		prepare_to_wait(ctx->sqo_wait, &ctx->sqo_wait_entry,
+		prepare_to_wait(&sqd->wait, &ctx->sqo_wait_entry,
 					TASK_INTERRUPTIBLE);
 
 		/*
@@ -6687,7 +6691,7 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 		 */
 		if ((ctx->flags & IORING_SETUP_IOPOLL) &&
 		    !list_empty_careful(&ctx->iopoll_list)) {
-			finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
+			finish_wait(&sqd->wait, &ctx->sqo_wait_entry);
 			goto again;
 		}
 
@@ -6697,7 +6701,7 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 		if (!to_submit || ret == -EBUSY)
 			return SQT_IDLE;
 
-		finish_wait(ctx->sqo_wait, &ctx->sqo_wait_entry);
+		finish_wait(&sqd->wait, &ctx->sqo_wait_entry);
 		io_ring_clear_wakeup_flag(ctx);
 	}
 
@@ -6925,18 +6929,46 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	return 0;
 }
 
-static void io_sq_thread_stop(struct io_ring_ctx *ctx)
+static void io_put_sq_data(struct io_sq_data *sqd)
 {
-	if (ctx->sqo_thread) {
-		wait_for_completion(&ctx->sq_thread_comp);
+	if (refcount_dec_and_test(&sqd->refs)) {
 		/*
 		 * The park is a bit of a work-around, without it we get
 		 * warning spews on shutdown with SQPOLL set and affinity
 		 * set to a single CPU.
 		 */
-		kthread_park(ctx->sqo_thread);
-		kthread_stop(ctx->sqo_thread);
-		ctx->sqo_thread = NULL;
+		if (sqd->thread) {
+			kthread_park(sqd->thread);
+			kthread_stop(sqd->thread);
+		}
+
+		kfree(sqd);
+	}
+}
+
+static struct io_sq_data *io_get_sq_data(struct io_uring_params *p)
+{
+	struct io_sq_data *sqd;
+
+	sqd = kzalloc(sizeof(*sqd), GFP_KERNEL);
+	if (!sqd)
+		return ERR_PTR(-ENOMEM);
+
+	refcount_set(&sqd->refs, 1);
+	init_waitqueue_head(&sqd->wait);
+	return sqd;
+}
+
+static void io_sq_thread_stop(struct io_ring_ctx *ctx)
+{
+	struct io_sq_data *sqd = ctx->sq_data;
+
+	if (sqd) {
+		if (sqd->thread)
+			wait_for_completion(&ctx->sq_thread_comp);
+
+		io_put_sq_data(sqd);
+		ctx->sq_data = NULL;
 	}
 }
 
@@ -7576,10 +7608,19 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 	int ret;
 
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		struct io_sq_data *sqd;
+
 		ret = -EPERM;
 		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_NICE))
 			goto err;
 
+		sqd = io_get_sq_data(p);
+		if (IS_ERR(sqd)) {
+			ret = PTR_ERR(sqd);
+			goto err;
+		}
+		ctx->sq_data = sqd;
+
 		/*
 		 * We will exit the sqthread before current exits, so we can
 		 * avoid taking a reference here and introducing weird
@@ -7600,16 +7641,15 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 			if (!cpu_online(cpu))
 				goto err;
 
-			ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
-							ctx, cpu,
-							"io_uring-sq");
+			sqd->thread = kthread_create_on_cpu(io_sq_thread, ctx,
+							cpu, "io_uring-sq");
 		} else {
-			ctx->sqo_thread = kthread_create(io_sq_thread, ctx,
+			sqd->thread = kthread_create(io_sq_thread, ctx,
 							"io_uring-sq");
 		}
-		if (IS_ERR(ctx->sqo_thread)) {
-			ret = PTR_ERR(ctx->sqo_thread);
-			ctx->sqo_thread = NULL;
+		if (IS_ERR(sqd->thread)) {
+			ret = PTR_ERR(sqd->thread);
+			sqd->thread = NULL;
 			goto err;
 		}
 	} else if (p->flags & IORING_SETUP_SQ_AFF) {
@@ -7631,8 +7671,10 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 static void io_sq_offload_start(struct io_ring_ctx *ctx)
 {
-	if ((ctx->flags & IORING_SETUP_SQPOLL) && ctx->sqo_thread)
-		wake_up_process(ctx->sqo_thread);
+	struct io_sq_data *sqd = ctx->sq_data;
+
+	if ((ctx->flags & IORING_SETUP_SQPOLL) && sqd->thread)
+		wake_up_process(sqd->thread);
 }
 
 static inline void __io_unaccount_mem(struct user_struct *user,
@@ -8396,7 +8438,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		if (!list_empty_careful(&ctx->cq_overflow_list))
 			io_cqring_overflow_flush(ctx, false);
 		if (flags & IORING_ENTER_SQ_WAKEUP)
-			wake_up(ctx->sqo_wait);
+			wake_up(&ctx->sq_data->wait);
 		submitted = to_submit;
 	} else if (to_submit) {
 		mutex_lock(&ctx->uring_lock);
-- 
2.28.0



* [PATCH 7/8] io_uring: base SQPOLL handling off io_sq_data
  2020-09-03  2:20 [PATCHSET for-next 0/8] io_uring SQPOLL improvements Jens Axboe
                   ` (5 preceding siblings ...)
  2020-09-03  2:20 ` [PATCH 6/8] io_uring: split SQPOLL data into separate structure Jens Axboe
@ 2020-09-03  2:20 ` Jens Axboe
  2020-09-03  2:20 ` [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too Jens Axboe
  7 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-09-03  2:20 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Remove the SQPOLL thread from the ctx, and use io_sq_data as the data
structure we pass in instead. io_sq_data has a list of ctx's that we can
then iterate over and handle.

As of now we're ready to handle multiple ctx's, though we're still just
handling a single one after this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 101 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 72 insertions(+), 29 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 35ea69aad9c0..5bafc7a2c65c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -231,6 +231,12 @@ struct io_restriction {
 
 struct io_sq_data {
 	refcount_t		refs;
+
+	/* ctx's that are using this sqd */
+	struct list_head	ctx_list;
+	struct list_head	ctx_new_list;
+	struct mutex		ctx_lock;
+
 	struct task_struct	*thread;
 	struct wait_queue_head	wait;
 };
@@ -290,6 +296,7 @@ struct io_ring_ctx {
 	struct files_struct	*sqo_files;
 
 	struct wait_queue_entry	sqo_wait_entry;
+	struct list_head	sqd_list;
 
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
@@ -1090,6 +1097,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		goto err;
 
 	ctx->flags = p->flags;
+	INIT_LIST_HEAD(&ctx->sqd_list);
 	init_waitqueue_head(&ctx->cq_wait);
 	INIT_LIST_HEAD(&ctx->cq_overflow_list);
 	init_completion(&ctx->ref_comp);
@@ -6712,49 +6720,74 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 	return SQT_DID_WORK;
 }
 
+static void io_sqd_init_new(struct io_sq_data *sqd)
+{
+	struct io_ring_ctx *ctx;
+
+	while (!list_empty(&sqd->ctx_new_list)) {
+		ctx = list_first_entry(&sqd->ctx_new_list, struct io_ring_ctx, sqd_list);
+		init_wait(&ctx->sqo_wait_entry);
+		ctx->sqo_wait_entry.func = io_sq_wake_function;
+		list_move_tail(&ctx->sqd_list, &sqd->ctx_list);
+		complete(&ctx->sq_thread_comp);
+	}
+}
+
 static int io_sq_thread(void *data)
 {
-	struct io_ring_ctx *ctx = data;
-	const struct cred *old_cred;
+	struct io_sq_data *sqd = data;
+	struct io_ring_ctx *ctx;
 	unsigned long start_jiffies;
 
-	init_wait(&ctx->sqo_wait_entry);
-	ctx->sqo_wait_entry.func = io_sq_wake_function;
+	start_jiffies = jiffies;
+	while (!kthread_should_park()) {
+		const struct cred *old_cred = NULL;
+		enum sq_ret ret = 0;
 
-	complete(&ctx->sq_thread_comp);
+		mutex_lock(&sqd->ctx_lock);
 
-	old_cred = override_creds(ctx->creds);
+		if (unlikely(!list_empty(&sqd->ctx_new_list)))
+			io_sqd_init_new(sqd);
 
-	task_lock(current);
-	current->files = ctx->sqo_files;
-	task_unlock(current);
+		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
+			if (current->cred != ctx->creds) {
+				if (old_cred)
+					revert_creds(old_cred);
+				old_cred = override_creds(ctx->creds);
+			}
 
-	start_jiffies = jiffies;
-	while (!kthread_should_park()) {
-		enum sq_ret ret;
+			if (current->files != ctx->sqo_files) {
+				task_lock(current);
+				current->files = ctx->sqo_files;
+				task_unlock(current);
+			}
 
-		ret = __io_sq_thread(ctx, start_jiffies);
-		switch (ret) {
-		case SQT_IDLE:
-			schedule();
-			start_jiffies = jiffies;
-			continue;
-		case SQT_SPIN:
+			ret |= __io_sq_thread(ctx, start_jiffies);
+
+			io_sq_thread_drop_mm();
+		}
+
+		mutex_unlock(&sqd->ctx_lock);
+
+		if (old_cred)
+			revert_creds(old_cred);
+
+		if (ret & SQT_SPIN) {
 			io_run_task_work();
 			cond_resched();
-			fallthrough;
-		case SQT_DID_WORK:
-			continue;
+		} else if (ret == SQT_IDLE) {
+			schedule();
+			start_jiffies = jiffies;
 		}
 	}
 
 	io_run_task_work();
 
-	io_sq_thread_drop_mm();
-	task_lock(current);
-	current->files = NULL;
-	task_unlock(current);
-	revert_creds(old_cred);
+	if (current->files) {
+		task_lock(current);
+		current->files = NULL;
+		task_unlock(current);
+	}
 
 	kthread_parkme();
 
@@ -6955,6 +6988,9 @@ static struct io_sq_data *io_get_sq_data(struct io_uring_params *p)
 		return ERR_PTR(-ENOMEM);
 
 	refcount_set(&sqd->refs, 1);
+	INIT_LIST_HEAD(&sqd->ctx_list);
+	INIT_LIST_HEAD(&sqd->ctx_new_list);
+	mutex_init(&sqd->ctx_lock);
 	init_waitqueue_head(&sqd->wait);
 	return sqd;
 }
@@ -6967,6 +7003,10 @@ static void io_sq_thread_stop(struct io_ring_ctx *ctx)
 		if (sqd->thread)
 			wait_for_completion(&ctx->sq_thread_comp);
 
+		mutex_lock(&sqd->ctx_lock);
+		list_del(&ctx->sqd_list);
+		mutex_unlock(&sqd->ctx_lock);
+
 		io_put_sq_data(sqd);
 		ctx->sq_data = NULL;
 	}
@@ -7620,6 +7660,9 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 			goto err;
 		}
 		ctx->sq_data = sqd;
+		mutex_lock(&sqd->ctx_lock);
+		list_add(&ctx->sqd_list, &sqd->ctx_new_list);
+		mutex_unlock(&sqd->ctx_lock);
 
 		/*
 		 * We will exit the sqthread before current exits, so we can
@@ -7641,10 +7684,10 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 			if (!cpu_online(cpu))
 				goto err;
 
-			sqd->thread = kthread_create_on_cpu(io_sq_thread, ctx,
+			sqd->thread = kthread_create_on_cpu(io_sq_thread, sqd,
 							cpu, "io_uring-sq");
 		} else {
-			sqd->thread = kthread_create(io_sq_thread, ctx,
+			sqd->thread = kthread_create(io_sq_thread, sqd,
 							"io_uring-sq");
 		}
 		if (IS_ERR(sqd->thread)) {
-- 
2.28.0



* [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too
  2020-09-03  2:20 [PATCHSET for-next 0/8] io_uring SQPOLL improvements Jens Axboe
                   ` (6 preceding siblings ...)
  2020-09-03  2:20 ` [PATCH 7/8] io_uring: base SQPOLL handling off io_sq_data Jens Axboe
@ 2020-09-03  2:20 ` Jens Axboe
  2020-09-07  8:56   ` Xiaoguang Wang
  7 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-09-03  2:20 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

We support using IORING_SETUP_ATTACH_WQ to share async backends between
rings created by the same process; this now also allows the same to
happen with SQPOLL. The setup procedure remains the same: the caller
sets io_uring_params->wq_fd to the 'parent' context, and the newly
created ring will then attach to that async backend.
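
As a rough userspace sketch of the intended flow (liburing helpers are
shown just for brevity and are not part of this series; the kernel-side
contract is only the flags/wq_fd fields above, and error handling is
omitted):

#include <string.h>
#include <liburing.h>

/* illustrative sketch: ring1 creates the SQPOLL thread, ring2 attaches */
static int setup_shared_sqpoll(struct io_uring *ring1, struct io_uring *ring2)
{
        struct io_uring_params p;

        /* first ring creates the SQPOLL thread */
        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL;
        p.sq_thread_idle = 1000;        /* idle timeout, in msec */
        if (io_uring_queue_init_params(8, ring1, &p) < 0)
                return -1;

        /* second ring attaches to the first ring's backend, SQPOLL included */
        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ;
        p.wq_fd = ring1->ring_fd;
        return io_uring_queue_init_params(8, ring2, &p);
}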

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5bafc7a2c65c..07e16049e62d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -232,6 +232,10 @@ struct io_restriction {
 struct io_sq_data {
 	refcount_t		refs;
 
+	/* global sqd lookup */
+	struct list_head	all_sqd_list;
+	int			attach_fd;
+
 	/* ctx's that are using this sqd */
 	struct list_head	ctx_list;
 	struct list_head	ctx_new_list;
@@ -241,6 +245,9 @@ struct io_sq_data {
 	struct wait_queue_head	wait;
 };
 
+static LIST_HEAD(sqd_list);
+static DEFINE_MUTEX(sqd_lock);
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -6975,14 +6982,38 @@ static void io_put_sq_data(struct io_sq_data *sqd)
 			kthread_stop(sqd->thread);
 		}
 
+		mutex_lock(&sqd_lock);
+		list_del(&sqd->all_sqd_list);
+		mutex_unlock(&sqd_lock);
+
 		kfree(sqd);
 	}
 }
 
+static struct io_sq_data *io_attach_sq_data(struct io_uring_params *p)
+{
+	struct io_sq_data *sqd, *ret = ERR_PTR(-ENXIO);
+
+	mutex_lock(&sqd_lock);
+	list_for_each_entry(sqd, &sqd_list, all_sqd_list) {
+		if (sqd->attach_fd == p->wq_fd) {
+			refcount_inc(&sqd->refs);
+			ret = sqd;
+			break;
+		}
+	}
+	mutex_unlock(&sqd_lock);
+
+	return ret;
+}
+
 static struct io_sq_data *io_get_sq_data(struct io_uring_params *p)
 {
 	struct io_sq_data *sqd;
 
+	if (p->flags & IORING_SETUP_ATTACH_WQ)
+		return io_attach_sq_data(p);
+
 	sqd = kzalloc(sizeof(*sqd), GFP_KERNEL);
 	if (!sqd)
 		return ERR_PTR(-ENOMEM);
@@ -6992,6 +7023,10 @@ static struct io_sq_data *io_get_sq_data(struct io_uring_params *p)
 	INIT_LIST_HEAD(&sqd->ctx_new_list);
 	mutex_init(&sqd->ctx_lock);
 	init_waitqueue_head(&sqd->wait);
+
+	mutex_lock(&sqd_lock);
+	list_add_tail(&sqd->all_sqd_list, &sqd_list);
+	mutex_unlock(&sqd_lock);
 	return sqd;
 }
 
@@ -7675,6 +7710,9 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		if (!ctx->sq_thread_idle)
 			ctx->sq_thread_idle = HZ;
 
+		if (sqd->thread)
+			goto done;
+
 		if (p->flags & IORING_SETUP_SQ_AFF) {
 			int cpu = p->sq_thread_cpu;
 
@@ -7701,6 +7739,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		goto err;
 	}
 
+done:
 	ret = io_init_wq_offload(ctx, p);
 	if (ret)
 		goto err;
@@ -8831,6 +8870,10 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (ret < 0)
 		goto err;
 
+	if ((ctx->flags & (IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ)) ==
+	    IORING_SETUP_SQPOLL)
+		ctx->sq_data->attach_fd = ret;
+
 	trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags);
 	return ret;
 err:
-- 
2.28.0



* Re: [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too
  2020-09-03  2:20 ` [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too Jens Axboe
@ 2020-09-07  8:56   ` Xiaoguang Wang
  2020-09-07 14:00     ` Pavel Begunkov
  2020-09-07 16:14     ` Jens Axboe
  0 siblings, 2 replies; 16+ messages in thread
From: Xiaoguang Wang @ 2020-09-07  8:56 UTC (permalink / raw)
  To: Jens Axboe, io-uring

hi,

First, thanks for this patchset :) I have some questions below:

1. Is a uring instance which has SQPOLL enabled but does not specify
a valid cpu meaningful in a real business product?
IMHO, in a real business product, cpu usage should be under strict management.
Say a uring instance has SQPOLL enabled but is not bound to a fixed cpu; then
this uring instance's corresponding io_sq_thread could be scheduled to any
cpu (and can be re-scheduled many times), which may greatly impact other
applications because of cpu contention, especially since this io_sq_thread
is just busy looping during its sq_thread_idle period. So in a real business
product, I wonder whether SQPOLL is only meaningful if the user specifies a
fixed cpu core.
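
For reference, the kind of pinned setup I have in mind is the minimal
sketch below (liburing used for brevity; the cpu and idle values are
just examples and error handling is omitted):

#include <string.h>
#include <liburing.h>

/* illustrative sketch: SQPOLL with the poll thread pinned to one core */
static int setup_pinned_sqpoll(struct io_uring *ring, int cpu)
{
        struct io_uring_params p;

        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
        p.sq_thread_cpu = cpu;          /* pin io_sq_thread to this core */
        p.sq_thread_idle = 1000;        /* busy-poll window before idling, msec */
        return io_uring_queue_init_params(8, ring, &p);
}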

2. Is IORING_SETUP_ATTACH_WQ really convenient?
IORING_SETUP_ATTACH_WQ always needs to ensure that a parent uring instance
exists, which means it also requires the app to regulate the creation order
of uring instances. That may not be convenient; just imagine how to integrate
these new SQPOLL improvements into the fio util.

3. When is it appropriate to set a ctx's IORING_SQ_NEED_WAKEUP flag?
In your current implementation, if a ctx is marked as SQT_IDLE, this ctx will
get the IORING_SQ_NEED_WAKEUP flag set, but if other ctxes have work to do, then
io_sq_thread is still running and does not need to be woken up, so a later wakeup
from userspace is unnecessary. I think it may be appropriate to set
IORING_SQ_NEED_WAKEUP only when all ctxes have no work to do; you can have a look
at my attached code :)
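
For context, the usual userspace pattern around IORING_SQ_NEED_WAKEUP is
roughly the sketch below (the function and variable names are only
illustrative; sq_flags is the application's own pointer at the mmap'd SQ
ring ->flags word, and the raw io_uring_enter() syscall is used directly,
assuming recent kernel headers):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* sq_flags: the app's pointer into the mmap'd SQ ring ->flags word */
static void maybe_wake_sqpoll(int ring_fd, unsigned int to_submit,
                              unsigned int *sq_flags)
{
        /* only enter the kernel if io_sq_thread has actually gone to sleep */
        if (__atomic_load_n(sq_flags, __ATOMIC_ACQUIRE) & IORING_SQ_NEED_WAKEUP)
                syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
                        IORING_ENTER_SQ_WAKEUP, NULL, 0);
}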

4. Is io_attach_sq_data really safe?
sqd_list is a global list, but its search key is an fd local to a process, and
different processes may have the same fd, so this code looks odd. It seems that
your design is to make io_sq_thread shared only inside a process.

Indeed I have sent a patch to implement a similar idea before:
https://lore.kernel.org/io-uring/20200520115648.6140-1-xiaoguang.wang@linux.alibaba.com/
And below is a new version which is implemented in our kernel.
commit 0195c87c786e034e351510e27bf1d15f2a3b9be2
Author: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Date:   Mon Jul 27 21:39:59 2020 +0800

     alios: io_uring: add percpu io sq thread support

     task #26578122

     Currently we can create multiple io_uring instances which all have SQPOLL
     enabled and bind them to the same cpu core by setting the sq_thread_cpu
     argument, but sometimes this isn't efficient. Imagine such an extreme case:
     create two io_uring instances, which both have SQPOLL enabled and are bound
     to the same cpu core. One instance submits io every 500ms, the other submits
     io continually; obviously the 1st instance will still always contend for cpu
     resources, which will impact the 2nd instance.

     To fix this issue, add a new flag IORING_SETUP_SQPOLL_PERCPU; when both
     IORING_SETUP_SQ_AFF and IORING_SETUP_SQPOLL_PERCPU are enabled, we create a
     percpu io sq_thread to handle multiple io_uring instances' io requests with
     a round-robin strategy. The improvements are very big, see below:

     IOPS:
       No of instances       1     2     4     8     16     32
       kernel unpatched   589k  487k  303k  165k  85.8k  43.7k
       kernel patched     590k  593k  581k  538k   494k   406k

     LATENCY(us):
       No of instances       1     2     4     8     16     32
       kernel unpatched    217   262   422   775  1488    2917
       kernel patched      216   215   219   237   258     313

     Link: https://lore.kernel.org/io-uring/20200520115648.6140-1-xiaoguang.wang@linux.alibaba.com/
     Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
     Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
     Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 530373a9e9c0..b8d66df002ba 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -259,7 +259,13 @@ struct io_ring_ctx {
         struct io_wq            *io_wq;
         struct task_struct      *sqo_thread;    /* if using sq thread polling */
         struct mm_struct        *sqo_mm;
-       wait_queue_head_t       sqo_wait;
+       wait_queue_head_t       *sqo_wait;
+       wait_queue_head_t       __sqo_wait;
+
+       /* Used for percpu io sq thread */
+       int                     submit_status;
+       int                     sq_thread_cpu;
+       struct list_head        node;

         /*
          * If used, fixed file set. Writers must ensure that ->refs is dead,
@@ -330,6 +336,17 @@ struct io_ring_ctx {
         struct work_struct              exit_work;
  };

+struct sq_thread_percpu {
+       struct list_head ctx_list;
+       struct mutex lock;
+       wait_queue_head_t sqo_wait;
+       struct task_struct *sqo_thread;
+       struct completion sq_thread_comp;
+       unsigned int sq_thread_idle;
+};
+
+static struct sq_thread_percpu __percpu *percpu_threads;
+
  /*
   * First field must be the file pointer in all the
   * iocb unions! See also 'struct kiocb' in <linux/fs.h>
@@ -981,9 +998,11 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
                 goto err;

         ctx->flags = p->flags;
-       init_waitqueue_head(&ctx->sqo_wait);
+       init_waitqueue_head(&ctx->__sqo_wait);
+       ctx->sqo_wait = &ctx->__sqo_wait;
         init_waitqueue_head(&ctx->cq_wait);
         INIT_LIST_HEAD(&ctx->cq_overflow_list);
+       INIT_LIST_HEAD(&ctx->node);
         init_completion(&ctx->ref_comp);
         init_completion(&ctx->sq_thread_comp);
         idr_init(&ctx->io_buffer_idr);
@@ -1210,8 +1229,8 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
  {
         if (waitqueue_active(&ctx->wait))
                 wake_up(&ctx->wait);
-       if (waitqueue_active(&ctx->sqo_wait))
-               wake_up(&ctx->sqo_wait);
+       if (waitqueue_active(ctx->sqo_wait))
+               wake_up(ctx->sqo_wait);
         if (io_should_trigger_evfd(ctx))
                 eventfd_signal(ctx->cq_ev_fd, 1);
  }
@@ -2034,8 +2053,8 @@ static void io_iopoll_req_issued(struct io_kiocb *req)
                 list_add_tail(&req->list, &ctx->poll_list);

         if ((ctx->flags & IORING_SETUP_SQPOLL) &&
-           wq_has_sleeper(&ctx->sqo_wait))
-               wake_up(&ctx->sqo_wait);
+           wq_has_sleeper(ctx->sqo_wait))
+               wake_up(ctx->sqo_wait);
  }

  static void __io_state_file_put(struct io_submit_state *state)
@@ -6068,7 +6087,7 @@ static int io_sq_thread(void *data)
                                 continue;
                         }

-                       prepare_to_wait(&ctx->sqo_wait, &wait,
+                       prepare_to_wait(ctx->sqo_wait, &wait,
                                                 TASK_INTERRUPTIBLE);
                         /*
                          * While doing polled IO, before going to sleep, we need
@@ -6079,7 +6098,7 @@ static int io_sq_thread(void *data)
                          */
                         if ((ctx->flags & IORING_SETUP_IOPOLL) &&
                             !list_empty_careful(&ctx->poll_list)) {
-                               finish_wait(&ctx->sqo_wait, &wait);
+                               finish_wait(ctx->sqo_wait, &wait);
                                 continue;
                         }

@@ -6088,25 +6107,25 @@ static int io_sq_thread(void *data)
                         to_submit = io_sqring_entries(ctx);
                         if (!to_submit || ret == -EBUSY) {
                                 if (kthread_should_park()) {
-                                       finish_wait(&ctx->sqo_wait, &wait);
+                                       finish_wait(ctx->sqo_wait, &wait);
                                         break;
                                 }
                                 if (current->task_works) {
                                         task_work_run();
-                                       finish_wait(&ctx->sqo_wait, &wait);
+                                       finish_wait(ctx->sqo_wait, &wait);
                                         io_ring_clear_wakeup_flag(ctx);
                                         continue;
                                 }
                                 if (signal_pending(current))
                                         flush_signals(current);
                                 schedule();
-                               finish_wait(&ctx->sqo_wait, &wait);
+                               finish_wait(ctx->sqo_wait, &wait);

                                 io_ring_clear_wakeup_flag(ctx);
                                 ret = 0;
                                 continue;
                         }
-                       finish_wait(&ctx->sqo_wait, &wait);
+                       finish_wait(ctx->sqo_wait, &wait);

                         io_ring_clear_wakeup_flag(ctx);
                 }
@@ -6130,6 +6149,147 @@ static int io_sq_thread(void *data)
         return 0;
  }

+static int process_ctx(struct sq_thread_percpu *t, struct io_ring_ctx *ctx)
+{
+       int ret = 0;
+       unsigned int to_submit;
+       struct io_ring_ctx *ctx2;
+
+       list_for_each_entry(ctx2, &t->ctx_list, node) {
+               if (!list_empty(&ctx2->poll_list)) {
+                       unsigned int nr_events = 0;
+
+                       mutex_lock(&ctx2->uring_lock);
+                       if (!list_empty(&ctx2->poll_list))
+                               io_iopoll_getevents(ctx2, &nr_events, 0);
+                       mutex_unlock(&ctx2->uring_lock);
+               }
+       }
+
+       to_submit = io_sqring_entries(ctx);
+       if (to_submit) {
+               mutex_lock(&ctx->uring_lock);
+               if (likely(!percpu_ref_is_dying(&ctx->refs)))
+                       ret = io_submit_sqes(ctx, to_submit, NULL, -1);
+               mutex_unlock(&ctx->uring_lock);
+       }
+       return ret;
+}
+
+static int io_sq_thread_percpu(void *data)
+{
+       struct sq_thread_percpu *t = data;
+       struct io_ring_ctx *ctx;
+       const struct cred *saved_creds, *cur_creds, *old_creds;
+       mm_segment_t old_fs;
+       DEFINE_WAIT(wait);
+       unsigned long timeout;
+       int iters = 0;
+
+       complete(&t->sq_thread_comp);
+
+       old_fs = get_fs();
+       set_fs(USER_DS);
+       timeout = jiffies + t->sq_thread_idle;
+       saved_creds = cur_creds = NULL;
+       while (!kthread_should_park()) {
+               bool needs_run, needs_wait;
+               unsigned int to_submit;
+
+               mutex_lock(&t->lock);
+again:
+               needs_run = false;
+               list_for_each_entry(ctx, &t->ctx_list, node) {
+                       if (cur_creds != ctx->creds) {
+                               old_creds = override_creds(ctx->creds);
+                               cur_creds = ctx->creds;
+                               if (saved_creds)
+                                       put_cred(old_creds);
+                               else
+                                       saved_creds = old_creds;
+                       }
+
+                       ctx->submit_status = process_ctx(t, ctx);
+
+                       if (!needs_run)
+                               to_submit = io_sqring_entries(ctx);
+                       if (!needs_run &&
+                           ((to_submit && ctx->submit_status != -EBUSY) ||
+                           !list_empty(&ctx->poll_list)))
+                               needs_run = true;
+               }
+               if (needs_run && (++iters & 7)) {
+                       if (current->task_works)
+                               task_work_run();
+                       timeout = jiffies + t->sq_thread_idle;
+                       goto again;
+               }
+               mutex_unlock(&t->lock);
+               if (needs_run || !time_after(jiffies, timeout)) {
+                       if (current->task_works)
+                               task_work_run();
+                       if (need_resched()) {
+                               io_sq_thread_drop_mm(ctx);
+                               cond_resched();
+                       }
+                       if (needs_run)
+                               timeout = jiffies + t->sq_thread_idle;
+                       continue;
+               }
+
+               needs_wait = true;
+               prepare_to_wait(&t->sqo_wait, &wait, TASK_INTERRUPTIBLE);
+               mutex_lock(&t->lock);
+               list_for_each_entry(ctx, &t->ctx_list, node) {
+                       if ((ctx->flags & IORING_SETUP_IOPOLL) &&
+                           !list_empty_careful(&ctx->poll_list)) {
+                               needs_wait = false;
+                               break;
+                       }
+                       to_submit = io_sqring_entries(ctx);
+                       if (to_submit && ctx->submit_status != -EBUSY) {
+                               needs_wait = false;
+                               break;
+                       }
+               }
+               if (needs_wait) {
+                       list_for_each_entry(ctx, &t->ctx_list, node)
+                               io_ring_set_wakeup_flag(ctx);
+               }
+
+               mutex_unlock(&t->lock);
+
+               if (needs_wait) {
+                       if (current->task_works)
+                               task_work_run();
+                       io_sq_thread_drop_mm(ctx);
+                       if (kthread_should_park()) {
+                               finish_wait(&t->sqo_wait, &wait);
+                               break;
+                       }
+                       schedule();
+                       mutex_lock(&t->lock);
+                       list_for_each_entry(ctx, &t->ctx_list, node)
+                               io_ring_clear_wakeup_flag(ctx);
+                       mutex_unlock(&t->lock);
+                       finish_wait(&t->sqo_wait, &wait);
+               } else
+                       finish_wait(&t->sqo_wait, &wait);
+               timeout = jiffies + t->sq_thread_idle;
+       }
+
+       if (current->task_works)
+               task_work_run();
+
+       set_fs(old_fs);
+       io_sq_thread_drop_mm(ctx);
+       if (saved_creds)
+               revert_creds(saved_creds);
+
+       kthread_parkme();
+
+       return 0;
+}
  struct io_wait_queue {
         struct wait_queue_entry wq;
         struct io_ring_ctx *ctx;
@@ -6291,18 +6451,25 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
         return 0;
  }

+static void destroy_sq_thread_percpu(struct io_ring_ctx *ctx, int cpu);
+
  static void io_sq_thread_stop(struct io_ring_ctx *ctx)
  {
         if (ctx->sqo_thread) {
-               wait_for_completion(&ctx->sq_thread_comp);
-               /*
-                * The park is a bit of a work-around, without it we get
-                * warning spews on shutdown with SQPOLL set and affinity
-                * set to a single CPU.
-                */
-               kthread_park(ctx->sqo_thread);
-               kthread_stop(ctx->sqo_thread);
-               ctx->sqo_thread = NULL;
+               if ((ctx->flags & IORING_SETUP_SQ_AFF) &&
+                   (ctx->flags & IORING_SETUP_SQPOLL_PERCPU) && percpu_threads) {
+                       destroy_sq_thread_percpu(ctx, ctx->sq_thread_cpu);
+               } else {
+                       wait_for_completion(&ctx->sq_thread_comp);
+                       /*
+                        * The park is a bit of a work-around, without it we get
+                        * warning spews on shutdown with SQPOLL set and affinity
+                        * set to a single CPU.
+                        */
+                       kthread_park(ctx->sqo_thread);
+                       kthread_stop(ctx->sqo_thread);
+                       ctx->sqo_thread = NULL;
+               }
         }
  }

@@ -6917,6 +7084,54 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx,
         return ret;
  }

+static void create_sq_thread_percpu(struct io_ring_ctx *ctx, int cpu)
+{
+       struct sq_thread_percpu *t;
+
+       t = per_cpu_ptr(percpu_threads, cpu);
+       mutex_lock(&t->lock);
+       if (!t->sqo_thread) {
+               t->sqo_thread = kthread_create_on_cpu(io_sq_thread_percpu, t,
+                                       cpu, "io_uring-sq-percpu");
+               if (IS_ERR(t->sqo_thread)) {
+                       ctx->sqo_thread = t->sqo_thread;
+                       t->sqo_thread = NULL;
+                       mutex_unlock(&t->lock);
+                       return;
+               }
+       }
+
+       if (t->sq_thread_idle < ctx->sq_thread_idle)
+               t->sq_thread_idle = ctx->sq_thread_idle;
+       ctx->sqo_wait = &t->sqo_wait;
+       ctx->sq_thread_cpu = cpu;
+       list_add_tail(&ctx->node, &t->ctx_list);
+       ctx->sqo_thread = t->sqo_thread;
+       mutex_unlock(&t->lock);
+}
+
+static void destroy_sq_thread_percpu(struct io_ring_ctx *ctx, int cpu)
+{
+       struct sq_thread_percpu *t;
+       struct task_struct *sqo_thread = NULL;
+
+       t = per_cpu_ptr(percpu_threads, cpu);
+       mutex_lock(&t->lock);
+       list_del(&ctx->node);
+       if (list_empty(&t->ctx_list)) {
+               sqo_thread = t->sqo_thread;
+               t->sqo_thread = NULL;
+       }
+       mutex_unlock(&t->lock);
+
+       if (sqo_thread) {
+               wait_for_completion(&t->sq_thread_comp);
+               kthread_park(sqo_thread);
+               kthread_stop(sqo_thread);
+       }
+}
+
+
  static int io_sq_offload_start(struct io_ring_ctx *ctx,
                                struct io_uring_params *p)
  {
@@ -6943,9 +7158,11 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx,
                         if (!cpu_online(cpu))
                                 goto err;

-                       ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
-                                                       ctx, cpu,
-                                                       "io_uring-sq");
+                       if ((p->flags & IORING_SETUP_SQPOLL_PERCPU) && percpu_threads)
+                               create_sq_thread_percpu(ctx, cpu);
+                       else
+                               ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread, ctx, cpu,
+                                                                       "io_uring-sq");
                 } else {
                         ctx->sqo_thread = kthread_create(io_sq_thread, ctx,
                                                         "io_uring-sq");
@@ -7632,7 +7849,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
                 if (!list_empty_careful(&ctx->cq_overflow_list))
                         io_cqring_overflow_flush(ctx, false);
                 if (flags & IORING_ENTER_SQ_WAKEUP)
-                       wake_up(&ctx->sqo_wait);
+                       wake_up(ctx->sqo_wait);
                 submitted = to_submit;
         } else if (to_submit) {
                 mutex_lock(&ctx->uring_lock);
@@ -7990,7 +8207,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)

         if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
                         IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
-                       IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ))
+                       IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |
+                       IORING_SETUP_SQPOLL_PERCPU))
                 return -EINVAL;

         return  io_uring_create(entries, &p, params);
@@ -8218,6 +8436,8 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,

  static int __init io_uring_init(void)
  {
+       int cpu;
+
  #define __BUILD_BUG_VERIFY_ELEMENT(stype, eoffset, etype, ename) do { \
         BUILD_BUG_ON(offsetof(stype, ename) != eoffset); \
         BUILD_BUG_ON(sizeof(etype) != sizeof_field(stype, ename)); \
@@ -8257,6 +8477,28 @@ static int __init io_uring_init(void)
         BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);
         BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int));
         req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
+
+
+       percpu_threads = alloc_percpu(struct sq_thread_percpu);
+       /*
+        * Don't treat this as a fatal error; if it happens, we simply
+        * won't go through the percpu io_sq_thread path.
+        */
+       if (!percpu_threads)
+               return 0;
+
+       for_each_possible_cpu(cpu) {
+               struct sq_thread_percpu *t;
+
+               t = per_cpu_ptr(percpu_threads, cpu);
+               INIT_LIST_HEAD(&t->ctx_list);
+               init_waitqueue_head(&t->sqo_wait);
+               init_completion(&t->sq_thread_comp);
+               mutex_init(&t->lock);
+               t->sqo_thread = NULL;
+               t->sq_thread_idle = 0;
+       }
+
         return 0;
  };
  __initcall(io_uring_init);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d25c42ae6052..5929b7b95368 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -94,6 +94,7 @@ enum {
  #define IORING_SETUP_CQSIZE    (1U << 3)       /* app defines CQ size */
  #define IORING_SETUP_CLAMP     (1U << 4)       /* clamp SQ/CQ ring sizes */
  #define IORING_SETUP_ATTACH_WQ (1U << 5)       /* attach to existing wq */
+#define IORING_SETUP_SQPOLL_PERCPU     (1U << 31)      /* use percpu SQ poll thread */

  enum {
         IORING_OP_NOP,

Regards,
Xiaoguang Wang


> We support using IORING_SETUP_ATTACH_WQ to share async backends between
> rings created by the same process, this now also allows the same to
> happen with SQPOLL. The setup procedure remains the same, the caller
> sets io_uring_params->wq_fd to the 'parent' context, and then the newly
> created ring will attach to that async backend.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>   fs/io_uring.c | 43 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 43 insertions(+)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 5bafc7a2c65c..07e16049e62d 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -232,6 +232,10 @@ struct io_restriction {
>   struct io_sq_data {
>   	refcount_t		refs;
>   
> +	/* global sqd lookup */
> +	struct list_head	all_sqd_list;
> +	int			attach_fd;
> +
>   	/* ctx's that are using this sqd */
>   	struct list_head	ctx_list;
>   	struct list_head	ctx_new_list;
> @@ -241,6 +245,9 @@ struct io_sq_data {
>   	struct wait_queue_head	wait;
>   };
>   
> +static LIST_HEAD(sqd_list);
> +static DEFINE_MUTEX(sqd_lock);
> +
>   struct io_ring_ctx {
>   	struct {
>   		struct percpu_ref	refs;
> @@ -6975,14 +6982,38 @@ static void io_put_sq_data(struct io_sq_data *sqd)
>   			kthread_stop(sqd->thread);
>   		}
>   
> +		mutex_lock(&sqd_lock);
> +		list_del(&sqd->all_sqd_list);
> +		mutex_unlock(&sqd_lock);
> +
>   		kfree(sqd);
>   	}
>   }
>   
> +static struct io_sq_data *io_attach_sq_data(struct io_uring_params *p)
> +{
> +	struct io_sq_data *sqd, *ret = ERR_PTR(-ENXIO);
> +
> +	mutex_lock(&sqd_lock);
> +	list_for_each_entry(sqd, &sqd_list, all_sqd_list) {
> +		if (sqd->attach_fd == p->wq_fd) {
> +			refcount_inc(&sqd->refs);
> +			ret = sqd;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&sqd_lock);
> +
> +	return ret;
> +}
> +
>   static struct io_sq_data *io_get_sq_data(struct io_uring_params *p)
>   {
>   	struct io_sq_data *sqd;
>   
> +	if (p->flags & IORING_SETUP_ATTACH_WQ)
> +		return io_attach_sq_data(p);
> +
>   	sqd = kzalloc(sizeof(*sqd), GFP_KERNEL);
>   	if (!sqd)
>   		return ERR_PTR(-ENOMEM);
> @@ -6992,6 +7023,10 @@ static struct io_sq_data *io_get_sq_data(struct io_uring_params *p)
>   	INIT_LIST_HEAD(&sqd->ctx_new_list);
>   	mutex_init(&sqd->ctx_lock);
>   	init_waitqueue_head(&sqd->wait);
> +
> +	mutex_lock(&sqd_lock);
> +	list_add_tail(&sqd->all_sqd_list, &sqd_list);
> +	mutex_unlock(&sqd_lock);
>   	return sqd;
>   }
>   
> @@ -7675,6 +7710,9 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
>   		if (!ctx->sq_thread_idle)
>   			ctx->sq_thread_idle = HZ;
>   
> +		if (sqd->thread)
> +			goto done;
> +
>   		if (p->flags & IORING_SETUP_SQ_AFF) {
>   			int cpu = p->sq_thread_cpu;
>   
> @@ -7701,6 +7739,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
>   		goto err;
>   	}
>   
> +done:
>   	ret = io_init_wq_offload(ctx, p);
>   	if (ret)
>   		goto err;
> @@ -8831,6 +8870,10 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
>   	if (ret < 0)
>   		goto err;
>   
> +	if ((ctx->flags & (IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ)) ==
> +	    IORING_SETUP_SQPOLL)
> +		ctx->sq_data->attach_fd = ret;
> +
>   	trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags);
>   	return ret;
>   err:
> 

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too
  2020-09-07  8:56   ` Xiaoguang Wang
@ 2020-09-07 14:00     ` Pavel Begunkov
  2020-09-07 16:11       ` Jens Axboe
  2020-09-07 16:14     ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Pavel Begunkov @ 2020-09-07 14:00 UTC (permalink / raw)
  To: Xiaoguang Wang, Jens Axboe, io-uring

On 07/09/2020 11:56, Xiaoguang Wang wrote:
> hi,
> 
> First thanks for this patchset:) and I have some questions below:
> 
> 1. Is a uring instance which has SQPOLL enabled but does not specify
> a valid cpu meaningful in a real business product?
> IMHO, in a real business product, cpu usage should be under strict management.
> Say a uring instance has SQPOLL enabled but is not bound to a fixed cpu; then
> this uring instance's corresponding io_sq_thread could be scheduled to any
> cpu (and can be re-scheduled many times), which may impact other applications
> greatly because of cpu contention, especially since this io_sq_thread just
> busy loops during its sq_thread_idle period. So in a real business product, I
> wonder whether SQPOLL is only meaningful if the user specifies a fixed cpu core.

It is meaningful for part of the cases; for the rest you can set
process affinities and pin each SQPOLL thread to a specific CPU, as
you'd do even without this series. And that gives you more flexibility,
IMHO.
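
For readers who haven't used it, that per-ring pinning is already
expressible at setup time via IORING_SETUP_SQ_AFF plus sq_thread_cpu.
A minimal sketch with the raw syscall (the function name and the
64-entry ring size are just for illustration, error handling is
omitted; note SQPOLL needs CAP_SYS_ADMIN, or CAP_SYS_NICE with patch
2/8 of this series):

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int setup_pinned_sqpoll_ring(int cpu)
{
        struct io_uring_params p;

        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
        p.sq_thread_cpu = cpu;          /* pin the io_uring-sq kthread to this cpu */
        p.sq_thread_idle = 1000;        /* msec of idle time before the thread sleeps */

        return syscall(__NR_io_uring_setup, 64, &p);
}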

> 2. Is IORING_SETUP_ATTACH_WQ really convenient?
> IORING_SETUP_ATTACH_WQ always needs to ensure a parent uring instance exists,
> which means it also requires the app to regulate the creation order of uring
> instances. That may not be convenient; just imagine how to integrate these new
> SQPOLL improvements into the fio util.

It may not be as convenient, but it's flexible enough and prevents isolation
breaking issues. We've discussed that in the thread with your patches.

> 
> 3. When is it appropriate to set a ctx's IORING_SQ_NEED_WAKEUP flag?
> In your current implementation, if a ctx is marked as SQT_IDLE, that ctx gets the
> IORING_SQ_NEED_WAKEUP flag set, but if other ctxes have work to do, io_sq_thread
> is still running and does not need to be woken up, so a later wakeup from userspace
> is unnecessary. I think it may be more appropriate to set IORING_SQ_NEED_WAKEUP
> only when all ctxes have no work to do; you can have a look at my attached code :)
> 
> 4. Is io_attach_sq_data really safe?
> sqd_list is a global list, but its search key is an fd local to a process, and
> different processes may have the same fd, so this code looks odd; it seems your
> design is to make io_sq_thread shared only inside a process.
> 
> Indeed, I have sent a patch implementing a similar idea before:
> https://lore.kernel.org/io-uring/20200520115648.6140-1-xiaoguang.wang@linux.alibaba.com/
> And below is a new version which is implemented in our kernel.
> commit 0195c87c786e034e351510e27bf1d15f2a3b9be2
> Author: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> Date:   Mon Jul 27 21:39:59 2020 +0800
> 
>     alios: io_uring: add percpu io sq thread support
> 
>     task #26578122
> 
>     Currently we can create multiple io_uring instances which all have SQPOLL
>     enabled and bind them to the same cpu core by setting the sq_thread_cpu
>     argument, but sometimes this isn't efficient. Imagine an extreme case: create
>     two io_uring instances which both have SQPOLL enabled and are bound to the
>     same cpu core. One instance submits io every 500ms, the other submits io
>     continually; then obviously the 1st instance still always contends for cpu
>     resources, which impacts the 2nd instance.
> 
>     To fix this issue, add a new flag IORING_SETUP_SQPOLL_PERCPU. When both
>     IORING_SETUP_SQ_AFF and IORING_SETUP_SQPOLL_PERCPU are enabled, we create a
>     percpu io sq_thread that handles multiple io_uring instances' io requests
>     with a round-robin strategy; the improvements are very big, see below:
> 
>     IOPS:
>       No of instances       1     2     4     8     16     32
>       kernel unpatched   589k  487k  303k  165k  85.8k  43.7k
>       kernel patched     590k  593k  581k  538k   494k   406k
> 
>     LATENCY(us):
>       No of instances       1     2     4     8     16     32
>       kernel unpatched    217   262   422   775  1488    2917
>       kernel patched      216   215   219   237   258     313
> 
>     Link: https://lore.kernel.org/io-uring/20200520115648.6140-1-xiaoguang.wang@linux.alibaba.com/
>     Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
>     Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
>     Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 530373a9e9c0..b8d66df002ba 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -259,7 +259,13 @@ struct io_ring_ctx {
>         struct io_wq            *io_wq;
>         struct task_struct      *sqo_thread;    /* if using sq thread polling */
>         struct mm_struct        *sqo_mm;
> -       wait_queue_head_t       sqo_wait;
> +       wait_queue_head_t       *sqo_wait;
> +       wait_queue_head_t       __sqo_wait;
> +
> +       /* Used for percpu io sq thread */
> +       int                     submit_status;
> +       int                     sq_thread_cpu;
> +       struct list_head        node;
> 
>         /*
>          * If used, fixed file set. Writers must ensure that ->refs is dead,
> @@ -330,6 +336,17 @@ struct io_ring_ctx {
>         struct work_struct              exit_work;
>  };
> 
> +struct sq_thread_percpu {
> +       struct list_head ctx_list;
> +       struct mutex lock;
> +       wait_queue_head_t sqo_wait;
> +       struct task_struct *sqo_thread;
> +       struct completion sq_thread_comp;
> +       unsigned int sq_thread_idle;
> +};
> +
> +static struct sq_thread_percpu __percpu *percpu_threads;
> +
>  /*
>   * First field must be the file pointer in all the
>   * iocb unions! See also 'struct kiocb' in <linux/fs.h>
> @@ -981,9 +998,11 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>                 goto err;
> 
>         ctx->flags = p->flags;
> -       init_waitqueue_head(&ctx->sqo_wait);
> +       init_waitqueue_head(&ctx->__sqo_wait);
> +       ctx->sqo_wait = &ctx->__sqo_wait;
>         init_waitqueue_head(&ctx->cq_wait);
>         INIT_LIST_HEAD(&ctx->cq_overflow_list);
> +       INIT_LIST_HEAD(&ctx->node);
>         init_completion(&ctx->ref_comp);
>         init_completion(&ctx->sq_thread_comp);
>         idr_init(&ctx->io_buffer_idr);
> @@ -1210,8 +1229,8 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
>  {
>         if (waitqueue_active(&ctx->wait))
>                 wake_up(&ctx->wait);
> -       if (waitqueue_active(&ctx->sqo_wait))
> -               wake_up(&ctx->sqo_wait);
> +       if (waitqueue_active(ctx->sqo_wait))
> +               wake_up(ctx->sqo_wait);
>         if (io_should_trigger_evfd(ctx))
>                 eventfd_signal(ctx->cq_ev_fd, 1);
>  }
> @@ -2034,8 +2053,8 @@ static void io_iopoll_req_issued(struct io_kiocb *req)
>                 list_add_tail(&req->list, &ctx->poll_list);
> 
>         if ((ctx->flags & IORING_SETUP_SQPOLL) &&
> -           wq_has_sleeper(&ctx->sqo_wait))
> -               wake_up(&ctx->sqo_wait);
> +           wq_has_sleeper(ctx->sqo_wait))
> +               wake_up(ctx->sqo_wait);
>  }
> 
>  static void __io_state_file_put(struct io_submit_state *state)
> @@ -6068,7 +6087,7 @@ static int io_sq_thread(void *data)
>                                 continue;
>                         }
> 
> -                       prepare_to_wait(&ctx->sqo_wait, &wait,
> +                       prepare_to_wait(ctx->sqo_wait, &wait,
>                                                 TASK_INTERRUPTIBLE);
>                         /*
>                          * While doing polled IO, before going to sleep, we need
> @@ -6079,7 +6098,7 @@ static int io_sq_thread(void *data)
>                          */
>                         if ((ctx->flags & IORING_SETUP_IOPOLL) &&
>                             !list_empty_careful(&ctx->poll_list)) {
> -                               finish_wait(&ctx->sqo_wait, &wait);
> +                               finish_wait(ctx->sqo_wait, &wait);
>                                 continue;
>                         }
> 
> @@ -6088,25 +6107,25 @@ static int io_sq_thread(void *data)
>                         to_submit = io_sqring_entries(ctx);
>                         if (!to_submit || ret == -EBUSY) {
>                                 if (kthread_should_park()) {
> -                                       finish_wait(&ctx->sqo_wait, &wait);
> +                                       finish_wait(ctx->sqo_wait, &wait);
>                                         break;
>                                 }
>                                 if (current->task_works) {
>                                         task_work_run();
> -                                       finish_wait(&ctx->sqo_wait, &wait);
> +                                       finish_wait(ctx->sqo_wait, &wait);
>                                         io_ring_clear_wakeup_flag(ctx);
>                                         continue;
>                                 }
>                                 if (signal_pending(current))
>                                         flush_signals(current);
>                                 schedule();
> -                               finish_wait(&ctx->sqo_wait, &wait);
> +                               finish_wait(ctx->sqo_wait, &wait);
> 
>                                 io_ring_clear_wakeup_flag(ctx);
>                                 ret = 0;
>                                 continue;
>                         }
> -                       finish_wait(&ctx->sqo_wait, &wait);
> +                       finish_wait(ctx->sqo_wait, &wait);
> 
>                         io_ring_clear_wakeup_flag(ctx);
>                 }
> @@ -6130,6 +6149,147 @@ static int io_sq_thread(void *data)
>         return 0;
>  }
> 
> +static int process_ctx(struct sq_thread_percpu *t, struct io_ring_ctx *ctx)
> +{
> +       int ret = 0;
> +       unsigned int to_submit;
> +       struct io_ring_ctx *ctx2;
> +
> +       list_for_each_entry(ctx2, &t->ctx_list, node) {
> +               if (!list_empty(&ctx2->poll_list)) {
> +                       unsigned int nr_events = 0;
> +
> +                       mutex_lock(&ctx2->uring_lock);
> +                       if (!list_empty(&ctx2->poll_list))
> +                               io_iopoll_getevents(ctx2, &nr_events, 0);
> +                       mutex_unlock(&ctx2->uring_lock);
> +               }
> +       }
> +
> +       to_submit = io_sqring_entries(ctx);
> +       if (to_submit) {
> +               mutex_lock(&ctx->uring_lock);
> +               if (likely(!percpu_ref_is_dying(&ctx->refs)))
> +                       ret = io_submit_sqes(ctx, to_submit, NULL, -1);
> +               mutex_unlock(&ctx->uring_lock);
> +       }
> +       return ret;
> +}
> +
> +static int io_sq_thread_percpu(void *data)
> +{
> +       struct sq_thread_percpu *t = data;
> +       struct io_ring_ctx *ctx;
> +       const struct cred *saved_creds, *cur_creds, *old_creds;
> +       mm_segment_t old_fs;
> +       DEFINE_WAIT(wait);
> +       unsigned long timeout;
> +       int iters = 0;
> +
> +       complete(&t->sq_thread_comp);
> +
> +       old_fs = get_fs();
> +       set_fs(USER_DS);
> +       timeout = jiffies + t->sq_thread_idle;
> +       saved_creds = cur_creds = NULL;
> +       while (!kthread_should_park()) {
> +               bool needs_run, needs_wait;
> +               unsigned int to_submit;
> +
> +               mutex_lock(&t->lock);
> +again:
> +               needs_run = false;
> +               list_for_each_entry(ctx, &t->ctx_list, node) {
> +                       if (cur_creds != ctx->creds) {
> +                               old_creds = override_creds(ctx->creds);
> +                               cur_creds = ctx->creds;
> +                               if (saved_creds)
> +                                       put_cred(old_creds);
> +                               else
> +                                       saved_creds = old_creds;
> +                       }
> +
> +                       ctx->submit_status = process_ctx(t, ctx);
> +
> +                       if (!needs_run)
> +                               to_submit = io_sqring_entries(ctx);
> +                       if (!needs_run &&
> +                           ((to_submit && ctx->submit_status != -EBUSY) ||
> +                           !list_empty(&ctx->poll_list)))
> +                               needs_run = true;
> +               }
> +               if (needs_run && (++iters & 7)) {
> +                       if (current->task_works)
> +                               task_work_run();
> +                       timeout = jiffies + t->sq_thread_idle;
> +                       goto again;
> +               }
> +               mutex_unlock(&t->lock);
> +               if (needs_run || !time_after(jiffies, timeout)) {
> +                       if (current->task_works)
> +                               task_work_run();
> +                       if (need_resched()) {
> +                               io_sq_thread_drop_mm(ctx);
> +                               cond_resched();
> +                       }
> +                       if (needs_run)
> +                               timeout = jiffies + t->sq_thread_idle;
> +                       continue;
> +               }
> +
> +               needs_wait = true;
> +               prepare_to_wait(&t->sqo_wait, &wait, TASK_INTERRUPTIBLE);
> +               mutex_lock(&t->lock);
> +               list_for_each_entry(ctx, &t->ctx_list, node) {
> +                       if ((ctx->flags & IORING_SETUP_IOPOLL) &&
> +                           !list_empty_careful(&ctx->poll_list)) {
> +                               needs_wait = false;
> +                               break;
> +                       }
> +                       to_submit = io_sqring_entries(ctx);
> +                       if (to_submit && ctx->submit_status != -EBUSY) {
> +                               needs_wait = false;
> +                               break;
> +                       }
> +               }
> +               if (needs_wait) {
> +                       list_for_each_entry(ctx, &t->ctx_list, node)
> +                               io_ring_set_wakeup_flag(ctx);
> +               }
> +
> +               mutex_unlock(&t->lock);
> +
> +               if (needs_wait) {
> +                       if (current->task_works)
> +                               task_work_run();
> +                       io_sq_thread_drop_mm(ctx);
> +                       if (kthread_should_park()) {
> +                               finish_wait(&t->sqo_wait, &wait);
> +                               break;
> +                       }
> +                       schedule();
> +                       mutex_lock(&t->lock);
> +                       list_for_each_entry(ctx, &t->ctx_list, node)
> +                               io_ring_clear_wakeup_flag(ctx);
> +                       mutex_unlock(&t->lock);
> +                       finish_wait(&t->sqo_wait, &wait);
> +               } else
> +                       finish_wait(&t->sqo_wait, &wait);
> +               timeout = jiffies + t->sq_thread_idle;
> +       }
> +
> +       if (current->task_works)
> +               task_work_run();
> +
> +       set_fs(old_fs);
> +       io_sq_thread_drop_mm(ctx);
> +       if (saved_creds)
> +               revert_creds(saved_creds);
> +
> +       kthread_parkme();
> +
> +       return 0;
> +}
>  struct io_wait_queue {
>         struct wait_queue_entry wq;
>         struct io_ring_ctx *ctx;
> @@ -6291,18 +6451,25 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
>         return 0;
>  }
> 
> +static void destroy_sq_thread_percpu(struct io_ring_ctx *ctx, int cpu);
> +
>  static void io_sq_thread_stop(struct io_ring_ctx *ctx)
>  {
>         if (ctx->sqo_thread) {
> -               wait_for_completion(&ctx->sq_thread_comp);
> -               /*
> -                * The park is a bit of a work-around, without it we get
> -                * warning spews on shutdown with SQPOLL set and affinity
> -                * set to a single CPU.
> -                */
> -               kthread_park(ctx->sqo_thread);
> -               kthread_stop(ctx->sqo_thread);
> -               ctx->sqo_thread = NULL;
> +               if ((ctx->flags & IORING_SETUP_SQ_AFF) &&
> +                   (ctx->flags & IORING_SETUP_SQPOLL_PERCPU) && percpu_threads) {
> +                       destroy_sq_thread_percpu(ctx, ctx->sq_thread_cpu);
> +               } else {
> +                       wait_for_completion(&ctx->sq_thread_comp);
> +                       /*
> +                        * The park is a bit of a work-around, without it we get
> +                        * warning spews on shutdown with SQPOLL set and affinity
> +                        * set to a single CPU.
> +                        */
> +                       kthread_park(ctx->sqo_thread);
> +                       kthread_stop(ctx->sqo_thread);
> +                       ctx->sqo_thread = NULL;
> +               }
>         }
>  }
> 
> @@ -6917,6 +7084,54 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx,
>         return ret;
>  }
> 
> +static void create_sq_thread_percpu(struct io_ring_ctx *ctx, int cpu)
> +{
> +       struct sq_thread_percpu *t;
> +
> +       t = per_cpu_ptr(percpu_threads, cpu);
> +       mutex_lock(&t->lock);
> +       if (!t->sqo_thread) {
> +               t->sqo_thread = kthread_create_on_cpu(io_sq_thread_percpu, t,
> +                                       cpu, "io_uring-sq-percpu");
> +               if (IS_ERR(t->sqo_thread)) {
> +                       ctx->sqo_thread = t->sqo_thread;
> +                       t->sqo_thread = NULL;
> +                       mutex_unlock(&t->lock);
> +                       return;
> +               }
> +       }
> +
> +       if (t->sq_thread_idle < ctx->sq_thread_idle)
> +               t->sq_thread_idle = ctx->sq_thread_idle;
> +       ctx->sqo_wait = &t->sqo_wait;
> +       ctx->sq_thread_cpu = cpu;
> +       list_add_tail(&ctx->node, &t->ctx_list);
> +       ctx->sqo_thread = t->sqo_thread;
> +       mutex_unlock(&t->lock);
> +}
> +
> +static void destroy_sq_thread_percpu(struct io_ring_ctx *ctx, int cpu)
> +{
> +       struct sq_thread_percpu *t;
> +       struct task_struct *sqo_thread = NULL;
> +
> +       t = per_cpu_ptr(percpu_threads, cpu);
> +       mutex_lock(&t->lock);
> +       list_del(&ctx->node);
> +       if (list_empty(&t->ctx_list)) {
> +               sqo_thread = t->sqo_thread;
> +               t->sqo_thread = NULL;
> +       }
> +       mutex_unlock(&t->lock);
> +
> +       if (sqo_thread) {
> +               wait_for_completion(&t->sq_thread_comp);
> +               kthread_park(sqo_thread);
> +               kthread_stop(sqo_thread);
> +       }
> +}
> +
> +
>  static int io_sq_offload_start(struct io_ring_ctx *ctx,
>                                struct io_uring_params *p)
>  {
> @@ -6943,9 +7158,11 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx,
>                         if (!cpu_online(cpu))
>                                 goto err;
> 
> -                       ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
> -                                                       ctx, cpu,
> -                                                       "io_uring-sq");
> +                       if ((p->flags & IORING_SETUP_SQPOLL_PERCPU) && percpu_threads)
> +                               create_sq_thread_percpu(ctx, cpu);
> +                       else
> +                               ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread, ctx, cpu,
> +                                                                       "io_uring-sq");
>                 } else {
>                         ctx->sqo_thread = kthread_create(io_sq_thread, ctx,
>                                                         "io_uring-sq");
> @@ -7632,7 +7849,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>                 if (!list_empty_careful(&ctx->cq_overflow_list))
>                         io_cqring_overflow_flush(ctx, false);
>                 if (flags & IORING_ENTER_SQ_WAKEUP)
> -                       wake_up(&ctx->sqo_wait);
> +                       wake_up(ctx->sqo_wait);
>                 submitted = to_submit;
>         } else if (to_submit) {
>                 mutex_lock(&ctx->uring_lock);
> @@ -7990,7 +8207,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
> 
>         if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
>                         IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
> -                       IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ))
> +                       IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |
> +                       IORING_SETUP_SQPOLL_PERCPU))
>                 return -EINVAL;
> 
>         return  io_uring_create(entries, &p, params);
> @@ -8218,6 +8436,8 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
> 
>  static int __init io_uring_init(void)
>  {
> +       int cpu;
> +
>  #define __BUILD_BUG_VERIFY_ELEMENT(stype, eoffset, etype, ename) do { \
>         BUILD_BUG_ON(offsetof(stype, ename) != eoffset); \
>         BUILD_BUG_ON(sizeof(etype) != sizeof_field(stype, ename)); \
> @@ -8257,6 +8477,28 @@ static int __init io_uring_init(void)
>         BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);
>         BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int));
>         req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
> +
> +
> +       percpu_threads = alloc_percpu(struct sq_thread_percpu);
> +       /*
> +        * Don't treat this as a fatal error; if it happens, we simply
> +        * won't go through the percpu io_sq_thread path.
> +        */
> +       if (!percpu_threads)
> +               return 0;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct sq_thread_percpu *t;
> +
> +               t = per_cpu_ptr(percpu_threads, cpu);
> +               INIT_LIST_HEAD(&t->ctx_list);
> +               init_waitqueue_head(&t->sqo_wait);
> +               init_completion(&t->sq_thread_comp);
> +               mutex_init(&t->lock);
> +               t->sqo_thread = NULL;
> +               t->sq_thread_idle = 0;
> +       }
> +
>         return 0;
>  };
>  __initcall(io_uring_init);
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index d25c42ae6052..5929b7b95368 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -94,6 +94,7 @@ enum {
>  #define IORING_SETUP_CQSIZE    (1U << 3)       /* app defines CQ size */
>  #define IORING_SETUP_CLAMP     (1U << 4)       /* clamp SQ/CQ ring sizes */
>  #define IORING_SETUP_ATTACH_WQ (1U << 5)       /* attach to existing wq */
> +#define IORING_SETUP_SQPOLL_PERCPU     (1U << 31)      /* use percpu SQ poll thread */
> 
>  enum {
>         IORING_OP_NOP,
> 
> Regards,
> Xiaoguang Wang
> 
> 
>> We support using IORING_SETUP_ATTACH_WQ to share async backends between
>> rings created by the same process, this now also allows the same to
>> happen with SQPOLL. The setup procedure remains the same, the caller
>> sets io_uring_params->wq_fd to the 'parent' context, and then the newly
>> created ring will attach to that async backend.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>   fs/io_uring.c | 43 +++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 43 insertions(+)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 5bafc7a2c65c..07e16049e62d 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -232,6 +232,10 @@ struct io_restriction {
>>   struct io_sq_data {
>>       refcount_t        refs;
>>   +    /* global sqd lookup */
>> +    struct list_head    all_sqd_list;
>> +    int            attach_fd;
>> +
>>       /* ctx's that are using this sqd */
>>       struct list_head    ctx_list;
>>       struct list_head    ctx_new_list;
>> @@ -241,6 +245,9 @@ struct io_sq_data {
>>       struct wait_queue_head    wait;
>>   };
>>   +static LIST_HEAD(sqd_list);
>> +static DEFINE_MUTEX(sqd_lock);
>> +
>>   struct io_ring_ctx {
>>       struct {
>>           struct percpu_ref    refs;
>> @@ -6975,14 +6982,38 @@ static void io_put_sq_data(struct io_sq_data *sqd)
>>               kthread_stop(sqd->thread);
>>           }
>>   +        mutex_lock(&sqd_lock);
>> +        list_del(&sqd->all_sqd_list);
>> +        mutex_unlock(&sqd_lock);
>> +
>>           kfree(sqd);
>>       }
>>   }
>>   +static struct io_sq_data *io_attach_sq_data(struct io_uring_params *p)
>> +{
>> +    struct io_sq_data *sqd, *ret = ERR_PTR(-ENXIO);
>> +
>> +    mutex_lock(&sqd_lock);
>> +    list_for_each_entry(sqd, &sqd_list, all_sqd_list) {
>> +        if (sqd->attach_fd == p->wq_fd) {
>> +            refcount_inc(&sqd->refs);
>> +            ret = sqd;
>> +            break;
>> +        }
>> +    }
>> +    mutex_unlock(&sqd_lock);
>> +
>> +    return ret;
>> +}
>> +
>>   static struct io_sq_data *io_get_sq_data(struct io_uring_params *p)
>>   {
>>       struct io_sq_data *sqd;
>>   +    if (p->flags & IORING_SETUP_ATTACH_WQ)
>> +        return io_attach_sq_data(p);
>> +
>>       sqd = kzalloc(sizeof(*sqd), GFP_KERNEL);
>>       if (!sqd)
>>           return ERR_PTR(-ENOMEM);
>> @@ -6992,6 +7023,10 @@ static struct io_sq_data *io_get_sq_data(struct io_uring_params *p)
>>       INIT_LIST_HEAD(&sqd->ctx_new_list);
>>       mutex_init(&sqd->ctx_lock);
>>       init_waitqueue_head(&sqd->wait);
>> +
>> +    mutex_lock(&sqd_lock);
>> +    list_add_tail(&sqd->all_sqd_list, &sqd_list);
>> +    mutex_unlock(&sqd_lock);
>>       return sqd;
>>   }
>>   @@ -7675,6 +7710,9 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
>>           if (!ctx->sq_thread_idle)
>>               ctx->sq_thread_idle = HZ;
>>   +        if (sqd->thread)
>> +            goto done;
>> +
>>           if (p->flags & IORING_SETUP_SQ_AFF) {
>>               int cpu = p->sq_thread_cpu;
>>   @@ -7701,6 +7739,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
>>           goto err;
>>       }
>>   +done:
>>       ret = io_init_wq_offload(ctx, p);
>>       if (ret)
>>           goto err;
>> @@ -8831,6 +8870,10 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
>>       if (ret < 0)
>>           goto err;
>>   +    if ((ctx->flags & (IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ)) ==
>> +        IORING_SETUP_SQPOLL)
>> +        ctx->sq_data->attach_fd = ret;
>> +
>>       trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags);
>>       return ret;
>>   err:
>>

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too
  2020-09-07 14:00     ` Pavel Begunkov
@ 2020-09-07 16:11       ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-09-07 16:11 UTC (permalink / raw)
  To: Pavel Begunkov, Xiaoguang Wang, io-uring

On 9/7/20 8:00 AM, Pavel Begunkov wrote:
> On 07/09/2020 11:56, Xiaoguang Wang wrote:
>> hi,
>>
>> First thanks for this patchset:) and I have some questions below:
>>
>> 1. Is a uring instance which has SQPOLL enabled but does not specify
>> a valid cpu meaningful in a real business product?
>> IMHO, in a real business product, cpu usage should be under strict management.
>> Say a uring instance has SQPOLL enabled but is not bound to a fixed cpu; then
>> this uring instance's corresponding io_sq_thread could be scheduled to any
>> cpu (and can be re-scheduled many times), which may impact other applications
>> greatly because of cpu contention, especially since this io_sq_thread just
>> busy loops during its sq_thread_idle period. So in a real business product, I
>> wonder whether SQPOLL is only meaningful if the user specifies a fixed cpu core.
> 
> It is meaningful for part of the cases; for the rest you can set
> process affinities and pin each SQPOLL thread to a specific CPU, as
> you'd do even without this series. And that gives you more flexibility,
> IMHO.

Right, and once we have "any ctx can use this poll thread", percpu poll
threads could be a subset of that, so there's no reason that kind of
functionality couldn't be layered on top.

You'd probably just need an extra SETUP flag for this. Xiaoguang, any
interest in attempting to layer that on top of that current code?
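
For what it's worth, if percpu threads do get layered on top, the
userspace side would presumably stay close to what the out-of-tree
patch earlier in this thread exposes. A rough sketch, assuming a flag
along the lines of the proposed IORING_SETUP_SQPOLL_PERCPU (the value
is taken from that patch; it is not in any upstream header):

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* proposed in the alios patch quoted in this thread, not upstream UAPI */
#ifndef IORING_SETUP_SQPOLL_PERCPU
#define IORING_SETUP_SQPOLL_PERCPU      (1U << 31)
#endif

static int setup_percpu_sqpoll_ring(int cpu)
{
        struct io_uring_params p;

        memset(&p, 0, sizeof(p));
        /* every ring created like this for the same cpu would share one io_sq_thread */
        p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF |
                  IORING_SETUP_SQPOLL_PERCPU;
        p.sq_thread_cpu = cpu;
        return syscall(__NR_io_uring_setup, 64, &p);
}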

>> 2. Is IORING_SETUP_ATTACH_WQ really convenient?
>> IORING_SETUP_ATTACH_WQ always needs to ensure a parent uring instance exists,
>> which means it also requires the app to regulate the creation order of uring
>> instances. That may not be convenient; just imagine how to integrate these new
>> SQPOLL improvements into the fio util.
> 
> It may not be as convenient, but it's flexible enough and prevents
> isolation breaking issues. We've discussed that in the thread with
> your patches.

I think there's one minor issue there, and that is matching on just the
fd. That could cause rings from different processes to attach to the
wrong one. I think we need this on top to match the file as well. This
is a bit different on the SQPOLL side vs the io-wq workqueue.

Something like the below incremental should do it; it makes it clear
that we need to be able to get at this file, and doesn't risk falsely
attaching globally. As a plus, it gets rid of that global list too, and
makes it consistent with io-wq for ATTACH_WQ.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0490edfcdd88..4bd18e01ae89 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -232,10 +232,6 @@ struct io_restriction {
 struct io_sq_data {
 	refcount_t		refs;
 
-	/* global sqd lookup */
-	struct list_head	all_sqd_list;
-	int			attach_fd;
-
 	/* ctx's that are using this sqd */
 	struct list_head	ctx_list;
 	struct list_head	ctx_new_list;
@@ -245,9 +241,6 @@ struct io_sq_data {
 	struct wait_queue_head	wait;
 };
 
-static LIST_HEAD(sqd_list);
-static DEFINE_MUTEX(sqd_lock);
-
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -7034,32 +7027,37 @@ static void io_put_sq_data(struct io_sq_data *sqd)
 			kthread_stop(sqd->thread);
 		}
 
-		mutex_lock(&sqd_lock);
-		list_del(&sqd->all_sqd_list);
-		mutex_unlock(&sqd_lock);
-
 		kfree(sqd);
 	}
 }
 
 static struct io_sq_data *io_attach_sq_data(struct io_uring_params *p)
 {
-	struct io_sq_data *sqd, *ret = ERR_PTR(-ENXIO);
+	struct io_ring_ctx *ctx_attach;
+	struct io_sq_data *sqd;
+	struct fd f;
 
-	mutex_lock(&sqd_lock);
-	list_for_each_entry(sqd, &sqd_list, all_sqd_list) {
-		if (sqd->attach_fd == p->wq_fd) {
-			refcount_inc(&sqd->refs);
-			ret = sqd;
-			break;
-		}
+	f = fdget(p->wq_fd);
+	if (!f.file)
+		return ERR_PTR(-ENXIO);
+	if (f.file->f_op != &io_uring_fops) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
 	}
-	mutex_unlock(&sqd_lock);
 
-	return ret;
+	ctx_attach = f.file->private_data;
+	sqd = ctx_attach->sq_data;
+	if (!sqd) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
+	}
+
+	refcount_inc(&sqd->refs);
+	fdput(f);
+	return sqd;
 }
 
-static struct io_sq_data *io_get_sq_data(struct io_uring_params *p, int ring_fd)
+static struct io_sq_data *io_get_sq_data(struct io_uring_params *p)
 {
 	struct io_sq_data *sqd;
 
@@ -7071,15 +7069,11 @@ static struct io_sq_data *io_get_sq_data(struct io_uring_params *p, int ring_fd)
 		return ERR_PTR(-ENOMEM);
 
 	refcount_set(&sqd->refs, 1);
-	sqd->attach_fd = ring_fd;
 	INIT_LIST_HEAD(&sqd->ctx_list);
 	INIT_LIST_HEAD(&sqd->ctx_new_list);
 	mutex_init(&sqd->ctx_lock);
 	init_waitqueue_head(&sqd->wait);
 
-	mutex_lock(&sqd_lock);
-	list_add_tail(&sqd->all_sqd_list, &sqd_list);
-	mutex_unlock(&sqd_lock);
 	return sqd;
 }
 
@@ -7746,7 +7740,7 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx,
 }
 
 static int io_sq_offload_create(struct io_ring_ctx *ctx,
-				struct io_uring_params *p, int ring_fd)
+				struct io_uring_params *p)
 {
 	int ret;
 
@@ -7757,7 +7751,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_NICE))
 			goto err;
 
-		sqd = io_get_sq_data(p, ring_fd);
+		sqd = io_get_sq_data(p);
 		if (IS_ERR(sqd)) {
 			ret = PTR_ERR(sqd);
 			goto err;
@@ -8972,7 +8966,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
 		goto err;
 	}
 
-	ret = io_sq_offload_create(ctx, p, fd);
+	ret = io_sq_offload_create(ctx, p);
 	if (ret)
 		goto err;
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too
  2020-09-07  8:56   ` Xiaoguang Wang
  2020-09-07 14:00     ` Pavel Begunkov
@ 2020-09-07 16:14     ` Jens Axboe
  2020-09-07 16:18       ` Jens Axboe
  2020-09-08  2:53       ` Xiaoguang Wang
  1 sibling, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2020-09-07 16:14 UTC (permalink / raw)
  To: Xiaoguang Wang, io-uring

On 9/7/20 2:56 AM, Xiaoguang Wang wrote:
> 3. When is it appropriate to set a ctx's IORING_SQ_NEED_WAKEUP flag?
> In your current implementation, if a ctx is marked as SQT_IDLE, that
> ctx gets the IORING_SQ_NEED_WAKEUP flag set, but if other ctxes have
> work to do, io_sq_thread is still running and does not need to be
> woken up, so a later wakeup from userspace is unnecessary. I think it
> may be more appropriate to set IORING_SQ_NEED_WAKEUP only when all
> ctxes have no work to do; you can have a look at my attached code :)

That's a good point, any chance I can get you to submit a patch to fix
that up?

> 4. Is io_attach_sq_data really safe? sqd_list is a global list, but
> its search key is an fd local to a process, and different processes
> may have the same fd, so this code looks odd; it seems your design is
> to make io_sq_thread shared only inside a process.

It's really meant for thread sharing, or you could pass the fd and use
it across a process too. See the incremental I just sent in a reply to
Pavel.
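
To make the intended usage concrete, the attach sequence for this
series looks roughly like the sketch below (raw syscalls, error
handling omitted; with patch 8/8 the second ring shares the first
ring's SQPOLL thread rather than only its io-wq backend):

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        struct io_uring_params p;
        int parent_fd, child_fd;

        /* parent ring owns the SQPOLL thread */
        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL;
        parent_fd = syscall(__NR_io_uring_setup, 64, &p);

        /* child ring attaches to the parent's backend via wq_fd */
        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ;
        p.wq_fd = parent_fd;
        child_fd = syscall(__NR_io_uring_setup, 64, &p);

        return (parent_fd < 0 || child_fd < 0);
}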

That said, I do think the per-cpu approach has merit, and I also think
it should be possible to layer it on top of the existing code in
for-5.10/io_uring. So I'd strongly encourage you to try and do that, so
you can get rid of that private patch and just have it upstream
instead.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too
  2020-09-07 16:14     ` Jens Axboe
@ 2020-09-07 16:18       ` Jens Axboe
  2020-09-08  2:28         ` Xiaoguang Wang
  2020-09-08  2:53       ` Xiaoguang Wang
  1 sibling, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-09-07 16:18 UTC (permalink / raw)
  To: Xiaoguang Wang, io-uring

On 9/7/20 10:14 AM, Jens Axboe wrote:
> On 9/7/20 2:56 AM, Xiaoguang Wang wrote:
>> 3. When is it appropriate to set a ctx's IORING_SQ_NEED_WAKEUP flag?
>> In your current implementation, if a ctx is marked as SQT_IDLE, that
>> ctx gets the IORING_SQ_NEED_WAKEUP flag set, but if other ctxes have
>> work to do, io_sq_thread is still running and does not need to be
>> woken up, so a later wakeup from userspace is unnecessary. I think it
>> may be more appropriate to set IORING_SQ_NEED_WAKEUP only when all
>> ctxes have no work to do; you can have a look at my attached code :)
> 
> That's a good point, any chance I can get you to submit a patch to fix
> that up?

Something like this?

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4bd18e01ae89..80913973337a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6747,8 +6747,6 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 			goto again;
 		}
 
-		io_ring_set_wakeup_flag(ctx);
-
 		to_submit = io_sqring_entries(ctx);
 		if (!to_submit || ret == -EBUSY)
 			return SQT_IDLE;
@@ -6825,6 +6823,8 @@ static int io_sq_thread(void *data)
 			io_run_task_work();
 			cond_resched();
 		} else if (ret == SQT_IDLE) {
+			list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
+				io_ring_set_wakeup_flag(ctx);
 			schedule();
 			start_jiffies = jiffies;
 		}
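
For context, the userspace contract around this flag stays the same:
once IORING_SQ_NEED_WAKEUP appears in the SQ ring flags, the submitter
has to enter the kernel to wake the thread. A sketch of that check
(ring_fd, sq_flags and to_submit stand in for the usual mmap'ed SQ ring
state; filling SQEs and bumping the SQ tail is omitted):

#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

static void sqpoll_submit(int ring_fd, unsigned *sq_flags, unsigned to_submit)
{
        /* the kernel sets this flag just before the SQPOLL thread sleeps */
        unsigned flags = __atomic_load_n(sq_flags, __ATOMIC_ACQUIRE);

        if (flags & IORING_SQ_NEED_WAKEUP)
                syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
                        IORING_ENTER_SQ_WAKEUP, NULL, 0);
}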

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too
  2020-09-07 16:18       ` Jens Axboe
@ 2020-09-08  2:28         ` Xiaoguang Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Xiaoguang Wang @ 2020-09-08  2:28 UTC (permalink / raw)
  To: Jens Axboe, io-uring

hi,

> On 9/7/20 10:14 AM, Jens Axboe wrote:
>> On 9/7/20 2:56 AM, Xiaoguang Wang wrote:
>>> 3. When is it appropriate to set a ctx's IORING_SQ_NEED_WAKEUP flag?
>>> In your current implementation, if a ctx is marked as SQT_IDLE, that
>>> ctx gets the IORING_SQ_NEED_WAKEUP flag set, but if other ctxes have
>>> work to do, io_sq_thread is still running and does not need to be
>>> woken up, so a later wakeup from userspace is unnecessary. I think it
>>> may be more appropriate to set IORING_SQ_NEED_WAKEUP only when all
>>> ctxes have no work to do; you can have a look at my attached code :)
>>
>> That's a good point, any chance I can get you to submit a patch to fix
>> that up?
> 
> Something like this?
Yes, thanks, I'll prepare a patch soon.

Regards,
Xiaoguang Wang

> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 4bd18e01ae89..80913973337a 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -6747,8 +6747,6 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
>   			goto again;
>   		}
>   
> -		io_ring_set_wakeup_flag(ctx);
> -
>   		to_submit = io_sqring_entries(ctx);
>   		if (!to_submit || ret == -EBUSY)
>   			return SQT_IDLE;
> @@ -6825,6 +6823,8 @@ static int io_sq_thread(void *data)
>   			io_run_task_work();
>   			cond_resched();
>   		} else if (ret == SQT_IDLE) {
> +			list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
> +				io_ring_set_wakeup_flag(ctx);
>   			schedule();
>   			start_jiffies = jiffies;
>   		}
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too
  2020-09-07 16:14     ` Jens Axboe
  2020-09-07 16:18       ` Jens Axboe
@ 2020-09-08  2:53       ` Xiaoguang Wang
  1 sibling, 0 replies; 16+ messages in thread
From: Xiaoguang Wang @ 2020-09-08  2:53 UTC (permalink / raw)
  To: Jens Axboe, io-uring

hi,

> On 9/7/20 2:56 AM, Xiaoguang Wang wrote:
>> 3. When is it appropriate to set a ctx's IORING_SQ_NEED_WAKEUP flag?
>> In your current implementation, if a ctx is marked as SQT_IDLE, that
>> ctx gets the IORING_SQ_NEED_WAKEUP flag set, but if other ctxes have
>> work to do, io_sq_thread is still running and does not need to be
>> woken up, so a later wakeup from userspace is unnecessary. I think it
>> may be more appropriate to set IORING_SQ_NEED_WAKEUP only when all
>> ctxes have no work to do; you can have a look at my attached code :)
> 
> That's a good point, any chance I can get you to submit a patch to fix
> that up?
> 
>> 4. Is io_attach_sq_data really safe? sqd_list is a global list, but
>> its search key is an fd local to a process, and different processes
>> may have the same fd, so this code looks odd; it seems your design is
>> to make io_sq_thread shared only inside a process.
> 
> It's really meant for thread sharing, or you could pass the fd and use
> it across a process too. See the incremental I just sent in a reply to
> Pavel.
> 
> That said, I do think the per-cpu approach has merit, and I also think
> it should be possible to layer it on top of the existing code in
> for-5.10/io_uring. So I'd strongly encourage you to try and do that, so
> you can get rid of that private patch and just have it upstream
> instead.
Thanks a lot for your encouragement; I'll do my best to get it upstream.

Regards,
Xiaoguang Wang

> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-09-08  2:54 UTC | newest]

Thread overview: 16+ messages
2020-09-03  2:20 [PATCHSET for-next 0/8] io_uring SQPOLL improvements Jens Axboe
2020-09-03  2:20 ` [PATCH 1/8] io_uring: io_sq_thread() doesn't need to flush signals Jens Axboe
2020-09-03  2:20 ` [PATCH 2/8] io_uring: allow SQPOLL with CAP_SYS_NICE privileges Jens Axboe
2020-09-03  2:20 ` [PATCH 3/8] io_uring: use private ctx wait queue entries for SQPOLL Jens Axboe
2020-09-03  2:20 ` [PATCH 4/8] io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler Jens Axboe
2020-09-03  2:20 ` [PATCH 5/8] io_uring: split work handling part of SQPOLL into helper Jens Axboe
2020-09-03  2:20 ` [PATCH 6/8] io_uring: split SQPOLL data into separate structure Jens Axboe
2020-09-03  2:20 ` [PATCH 7/8] io_uring: base SQPOLL handling off io_sq_data Jens Axboe
2020-09-03  2:20 ` [PATCH 8/8] io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too Jens Axboe
2020-09-07  8:56   ` Xiaoguang Wang
2020-09-07 14:00     ` Pavel Begunkov
2020-09-07 16:11       ` Jens Axboe
2020-09-07 16:14     ` Jens Axboe
2020-09-07 16:18       ` Jens Axboe
2020-09-08  2:28         ` Xiaoguang Wang
2020-09-08  2:53       ` Xiaoguang Wang
