* [PATCHSET RFC 0/18] Remove kthread usage from io_uring
@ 2021-02-19 17:09 Jens Axboe
  2021-02-19 17:09 ` [PATCH 01/18] io_uring: remove the need for relying on an io-wq fallback worker Jens Axboe
                   ` (19 more replies)
  0 siblings, 20 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:09 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds

Hi,

tldr - instead of using kthreads that assume the identity of the original
task for work that needs offloading to a thread, set up these workers as
threads of the original task.
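
For reference, the mechanism (patch 9) is to create each worker via
kernel_clone() with thread-like flags instead of kthread_create(). The
helper added there looks like this:

    static pid_t fork_thread(int (*fn)(void *), void *arg)
    {
        unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
                                CLONE_IO|SIGCHLD;
        struct kernel_clone_args args = {
            .flags          = ((lower_32_bits(flags) | CLONE_VM |
                                CLONE_UNTRACED) & ~CSIGNAL),
            .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
            .stack          = (unsigned long)fn,
            .stack_size     = (unsigned long)arg,
        };

        return kernel_clone(&args);
    }

The CLONE_* flags mean the worker shares files, fs, signal handlers,
thread group and io context with the creating task, so there is no longer
any per-request identity to assume or restore.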

Here's a first cut of moving away from kthreads for io_uring. It passes
the test suite and various other testing I've done with it. It also
performs better, both for workloads actually using the async offload and
in general, as we slim down structures and kill code from the hot path.

The series is roughly split into these parts:

- Patches 1-6, io_uring/io-wq prep patches
- Patches 7-8, Minor arch/kernel support
- Patches 9-15, switch from kthread to thread, remove state only needed
  for kthreads
- Patches 16-18, remove now dead/unneeded PF_IO_WORKER restrictions

Comments/suggestions welcome. I'm pretty happy with the series at this
point, and particularly with how we end up cutting a lot of code while
also unifying how sync vs async is presented.

If you prefer browsing this on cgit, find it here:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-worker.v2

 arch/alpha/kernel/process.c      |   2 +-
 arch/arc/kernel/process.c        |   2 +-
 arch/arm/kernel/process.c        |   2 +-
 arch/arm64/kernel/process.c      |   2 +-
 arch/c6x/kernel/process.c        |   2 +-
 arch/csky/kernel/process.c       |   2 +-
 arch/h8300/kernel/process.c      |   2 +-
 arch/hexagon/kernel/process.c    |   2 +-
 arch/ia64/kernel/process.c       |   2 +-
 arch/m68k/kernel/process.c       |   2 +-
 arch/microblaze/kernel/process.c |   2 +-
 arch/mips/kernel/process.c       |   2 +-
 arch/nds32/kernel/process.c      |   2 +-
 arch/nios2/kernel/process.c      |   2 +-
 arch/openrisc/kernel/process.c   |   2 +-
 arch/riscv/kernel/process.c      |   2 +-
 arch/s390/kernel/process.c       |   2 +-
 arch/sh/kernel/process_32.c      |   2 +-
 arch/sparc/kernel/process_32.c   |   2 +-
 arch/sparc/kernel/process_64.c   |   2 +-
 arch/um/kernel/process.c         |   2 +-
 arch/x86/kernel/process.c        |   2 +-
 arch/xtensa/kernel/process.c     |   2 +-
 fs/io-wq.c                       | 393 +++++--------
 fs/io-wq.h                       |  14 +-
 fs/io_uring.c                    | 917 ++++++++++---------------------
 fs/proc/self.c                   |   7 -
 fs/proc/thread_self.c            |   7 -
 include/linux/io_uring.h         |  20 +-
 include/linux/sched.h            |   3 +
 kernel/ptrace.c                  |   2 +-
 kernel/signal.c                  |   4 +-
 net/socket.c                     |  10 -
 33 files changed, 451 insertions(+), 972 deletions(-)

-- 
Jens Axboe




* [PATCH 01/18] io_uring: remove the need for relying on an io-wq fallback worker
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
@ 2021-02-19 17:09 ` Jens Axboe
  2021-02-19 20:25   ` Eric W. Biederman
  2021-02-22 13:46   ` Pavel Begunkov
  2021-02-19 17:09 ` [PATCH 02/18] io-wq: don't create any IO workers upfront Jens Axboe
                   ` (18 subsequent siblings)
  19 siblings, 2 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:09 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

We hit this case when the task is exiting, and we need somewhere to
do background cleanup of requests. Instead of relying on the io-wq
task manager to do this work for us, just stuff it somewhere where
we can safely run it ourselves directly.
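
In essence, ctx->exit_task_work is a lock-free stack: anyone who fails
task_work_add() pushes the callback with a cmpxchg loop, and the exiting
task pops the whole list and runs it inline. A simplified sketch of the
two halves (names here are illustrative; the drain in the patch below
loops on cmpxchg rather than using a single xchg):

    /* producer side: push a callback onto the ctx fallback list */
    static void fallback_push(struct io_ring_ctx *ctx,
                              struct callback_head *work)
    {
        struct callback_head *head;

        do {
            head = READ_ONCE(ctx->exit_task_work);
            work->next = head;
        } while (cmpxchg(&ctx->exit_task_work, head, work) != head);
    }

    /* consumer side: the exiting task drains and runs the list itself */
    static void fallback_run(struct io_ring_ctx *ctx)
    {
        struct callback_head *work = xchg(&ctx->exit_task_work, NULL);

        while (work) {
            struct callback_head *next = work->next;

            work->func(work);
            work = next;
        }
    }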

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c    | 12 ------------
 fs/io-wq.h    |  2 --
 fs/io_uring.c | 38 +++++++++++++++++++++++++++++++++++---
 3 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index c36bbcd823ce..800b299f9772 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -16,7 +16,6 @@
 #include <linux/kthread.h>
 #include <linux/rculist_nulls.h>
 #include <linux/fs_struct.h>
-#include <linux/task_work.h>
 #include <linux/blk-cgroup.h>
 #include <linux/audit.h>
 #include <linux/cpu.h>
@@ -775,9 +774,6 @@ static int io_wq_manager(void *data)
 	complete(&wq->done);
 
 	while (!kthread_should_stop()) {
-		if (current->task_works)
-			task_work_run();
-
 		for_each_node(node) {
 			struct io_wqe *wqe = wq->wqes[node];
 			bool fork_worker[2] = { false, false };
@@ -800,9 +796,6 @@ static int io_wq_manager(void *data)
 		schedule_timeout(HZ);
 	}
 
-	if (current->task_works)
-		task_work_run();
-
 out:
 	if (refcount_dec_and_test(&wq->refs)) {
 		complete(&wq->done);
@@ -1160,11 +1153,6 @@ void io_wq_destroy(struct io_wq *wq)
 		__io_wq_destroy(wq);
 }
 
-struct task_struct *io_wq_get_task(struct io_wq *wq)
-{
-	return wq->manager;
-}
-
 static bool io_wq_worker_affinity(struct io_worker *worker, void *data)
 {
 	struct task_struct *task = worker->task;
diff --git a/fs/io-wq.h b/fs/io-wq.h
index 096f1021018e..a1610702f222 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -124,8 +124,6 @@ typedef bool (work_cancel_fn)(struct io_wq_work *, void *);
 enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
 					void *data, bool cancel_all);
 
-struct task_struct *io_wq_get_task(struct io_wq *wq);
-
 #if defined(CONFIG_IO_WQ)
 extern void io_wq_worker_sleeping(struct task_struct *);
 extern void io_wq_worker_running(struct task_struct *);
diff --git a/fs/io_uring.c b/fs/io_uring.c
index d951acb95117..bbd1ec7aa9e9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -455,6 +455,9 @@ struct io_ring_ctx {
 
 	struct io_restriction		restrictions;
 
+	/* exit task_work */
+	struct callback_head		*exit_task_work;
+
 	/* Keep this last, we don't need it for the fast path */
 	struct work_struct		exit_work;
 };
@@ -2313,11 +2316,14 @@ static int io_req_task_work_add(struct io_kiocb *req)
 static void io_req_task_work_add_fallback(struct io_kiocb *req,
 					  task_work_func_t cb)
 {
-	struct task_struct *tsk = io_wq_get_task(req->ctx->io_wq);
+	struct io_ring_ctx *ctx = req->ctx;
+	struct callback_head *head;
 
 	init_task_work(&req->task_work, cb);
-	task_work_add(tsk, &req->task_work, TWA_NONE);
-	wake_up_process(tsk);
+	do {
+		head = ctx->exit_task_work;
+		req->task_work.next = head;
+	} while (cmpxchg(&ctx->exit_task_work, head, &req->task_work) != head);
 }
 
 static void __io_req_task_cancel(struct io_kiocb *req, int error)
@@ -9258,6 +9264,30 @@ void __io_uring_task_cancel(void)
 	io_uring_remove_task_files(tctx);
 }
 
+static void io_run_ctx_fallback(struct io_ring_ctx *ctx)
+{
+	struct callback_head *work, *head, *next;
+
+	do {
+		do {
+			head = NULL;
+			work = READ_ONCE(ctx->exit_task_work);
+			if (!work)
+				break;
+		} while (cmpxchg(&ctx->exit_task_work, work, head) != work);
+
+		if (!work)
+			break;
+
+		do {
+			next = work->next;
+			work->func(work);
+			work = next;
+			cond_resched();
+		} while (work);
+	} while (1);
+}
+
 static int io_uring_flush(struct file *file, void *data)
 {
 	struct io_uring_task *tctx = current->io_uring;
@@ -9268,6 +9298,8 @@ static int io_uring_flush(struct file *file, void *data)
 		io_req_caches_free(ctx, current);
 	}
 
+	io_run_ctx_fallback(ctx);
+
 	if (!tctx)
 		return 0;
 
-- 
2.30.0



* [PATCH 02/18] io-wq: don't create any IO workers upfront
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
  2021-02-19 17:09 ` [PATCH 01/18] io_uring: remove the need for relying on an io-wq fallback worker Jens Axboe
@ 2021-02-19 17:09 ` Jens Axboe
  2021-02-19 17:09 ` [PATCH 03/18] io_uring: disable io-wq attaching Jens Axboe
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:09 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

When the manager thread starts up, it creates a worker per node for
the given context. Just let these get created dynamically, like we do
for adding further workers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 800b299f9772..e9e218274c76 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -759,18 +759,7 @@ static int io_wq_manager(void *data)
 	struct io_wq *wq = data;
 	int node;
 
-	/* create fixed workers */
 	refcount_set(&wq->refs, 1);
-	for_each_node(node) {
-		if (!node_online(node))
-			continue;
-		if (create_io_worker(wq, wq->wqes[node], IO_WQ_ACCT_BOUND))
-			continue;
-		set_bit(IO_WQ_BIT_ERROR, &wq->state);
-		set_bit(IO_WQ_BIT_EXIT, &wq->state);
-		goto out;
-	}
-
 	complete(&wq->done);
 
 	while (!kthread_should_stop()) {
@@ -796,7 +785,6 @@ static int io_wq_manager(void *data)
 		schedule_timeout(HZ);
 	}
 
-out:
 	if (refcount_dec_and_test(&wq->refs)) {
 		complete(&wq->done);
 		return 0;
-- 
2.30.0



* [PATCH 03/18] io_uring: disable io-wq attaching
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
  2021-02-19 17:09 ` [PATCH 01/18] io_uring: remove the need for relying on an io-wq fallback worker Jens Axboe
  2021-02-19 17:09 ` [PATCH 02/18] io-wq: don't create any IO workers upfront Jens Axboe
@ 2021-02-19 17:09 ` Jens Axboe
  2021-02-19 17:09 ` [PATCH 04/18] io-wq: get rid of wq->use_refs Jens Axboe
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:09 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

We're moving towards making the io_wq per ring per task, so we can't
really share it between rings. That's fine, since we've now dropped some
of that fat from it.

Retain compatibility with how attaching works, so that any attempt to
attach to an fd that doesn't exist, or isn't an io_uring fd, will fail
like it did before.
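
For context, attaching is requested from userspace through
io_uring_params; after this patch the wq_fd is only validated, never
actually shared. A hypothetical setup call (error handling elided):

    struct io_uring_params p = { };

    p.flags = IORING_SETUP_ATTACH_WQ;
    p.wq_fd = existing_ring_fd;    /* an already set up io_uring fd */

    /* fails if wq_fd isn't a valid io_uring fd (ENXIO/EINVAL per
     * the check added below), as attach attempts did before */
    ring_fd = syscall(__NR_io_uring_setup, entries, &p);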

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 55 +++++++++++++++++++++------------------------------
 1 file changed, 22 insertions(+), 33 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index bbd1ec7aa9e9..0eeb2a1596c2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8130,12 +8130,9 @@ static struct io_wq_work *io_free_work(struct io_wq_work *work)
 	return req ? &req->work : NULL;
 }
 
-static int io_init_wq_offload(struct io_ring_ctx *ctx,
-			      struct io_uring_params *p)
+static int io_init_wq_offload(struct io_ring_ctx *ctx)
 {
 	struct io_wq_data data;
-	struct fd f;
-	struct io_ring_ctx *ctx_attach;
 	unsigned int concurrency;
 	int ret = 0;
 
@@ -8143,37 +8140,15 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx,
 	data.free_work = io_free_work;
 	data.do_work = io_wq_submit_work;
 
-	if (!(p->flags & IORING_SETUP_ATTACH_WQ)) {
-		/* Do QD, or 4 * CPUS, whatever is smallest */
-		concurrency = min(ctx->sq_entries, 4 * num_online_cpus());
-
-		ctx->io_wq = io_wq_create(concurrency, &data);
-		if (IS_ERR(ctx->io_wq)) {
-			ret = PTR_ERR(ctx->io_wq);
-			ctx->io_wq = NULL;
-		}
-		return ret;
-	}
-
-	f = fdget(p->wq_fd);
-	if (!f.file)
-		return -EBADF;
-
-	if (f.file->f_op != &io_uring_fops) {
-		ret = -EINVAL;
-		goto out_fput;
-	}
+	/* Do QD, or 4 * CPUS, whatever is smallest */
+	concurrency = min(ctx->sq_entries, 4 * num_online_cpus());
 
-	ctx_attach = f.file->private_data;
-	/* @io_wq is protected by holding the fd */
-	if (!io_wq_get(ctx_attach->io_wq, &data)) {
-		ret = -EINVAL;
-		goto out_fput;
+	ctx->io_wq = io_wq_create(concurrency, &data);
+	if (IS_ERR(ctx->io_wq)) {
+		ret = PTR_ERR(ctx->io_wq);
+		ctx->io_wq = NULL;
 	}
 
-	ctx->io_wq = ctx_attach->io_wq;
-out_fput:
-	fdput(f);
 	return ret;
 }
 
@@ -8225,6 +8200,20 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 {
 	int ret;
 
+	/* Retain compatibility with failing for an invalid attach attempt */
+	if ((ctx->flags & (IORING_SETUP_ATTACH_WQ | IORING_SETUP_SQPOLL)) ==
+				IORING_SETUP_ATTACH_WQ) {
+		struct fd f;
+
+		f = fdget(p->wq_fd);
+		if (!f.file)
+			return -ENXIO;
+		if (f.file->f_op != &io_uring_fops) {
+			fdput(f);
+			return -EINVAL;
+		}
+		fdput(f);
+	}
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
 		struct io_sq_data *sqd;
 
@@ -8282,7 +8271,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 	}
 
 done:
-	ret = io_init_wq_offload(ctx, p);
+	ret = io_init_wq_offload(ctx);
 	if (ret)
 		goto err;
 
-- 
2.30.0



* [PATCH 04/18] io-wq: get rid of wq->use_refs
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (2 preceding siblings ...)
  2021-02-19 17:09 ` [PATCH 03/18] io_uring: disable io-wq attaching Jens Axboe
@ 2021-02-19 17:09 ` Jens Axboe
  2021-02-19 17:09 ` [PATCH 05/18] io_uring: tie async worker side to the task context Jens Axboe
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:09 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

We don't support attach anymore, so it doesn't make sense to carry the
use_refs reference count. Get rid of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c | 19 +------------------
 fs/io-wq.h |  1 -
 2 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index e9e218274c76..0c47febfed9b 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -122,8 +122,6 @@ struct io_wq {
 	struct completion done;
 
 	struct hlist_node cpuhp_node;
-
-	refcount_t use_refs;
 };
 
 static enum cpuhp_state io_wq_online;
@@ -1086,7 +1084,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 			ret = -ENOMEM;
 			goto err;
 		}
-		refcount_set(&wq->use_refs, 1);
 		reinit_completion(&wq->done);
 		return wq;
 	}
@@ -1104,15 +1101,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	return ERR_PTR(ret);
 }
 
-bool io_wq_get(struct io_wq *wq, struct io_wq_data *data)
-{
-	if (data->free_work != wq->free_work || data->do_work != wq->do_work)
-		return false;
-
-	return refcount_inc_not_zero(&wq->use_refs);
-}
-
-static void __io_wq_destroy(struct io_wq *wq)
+void io_wq_destroy(struct io_wq *wq)
 {
 	int node;
 
@@ -1135,12 +1124,6 @@ static void __io_wq_destroy(struct io_wq *wq)
 	kfree(wq);
 }
 
-void io_wq_destroy(struct io_wq *wq)
-{
-	if (refcount_dec_and_test(&wq->use_refs))
-		__io_wq_destroy(wq);
-}
-
 static bool io_wq_worker_affinity(struct io_worker *worker, void *data)
 {
 	struct task_struct *task = worker->task;
diff --git a/fs/io-wq.h b/fs/io-wq.h
index a1610702f222..d2cf284b4641 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -108,7 +108,6 @@ struct io_wq_data {
 };
 
 struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data);
-bool io_wq_get(struct io_wq *wq, struct io_wq_data *data);
 void io_wq_destroy(struct io_wq *wq);
 
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work);
-- 
2.30.0



* [PATCH 05/18] io_uring: tie async worker side to the task context
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (3 preceding siblings ...)
  2021-02-19 17:09 ` [PATCH 04/18] io-wq: get rid of wq->use_refs Jens Axboe
@ 2021-02-19 17:09 ` Jens Axboe
  2021-02-20  8:11   ` Hao Xu
  2021-02-19 17:09 ` [PATCH 06/18] io-wq: don't pass 'wqe' needlessly around Jens Axboe
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:09 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

Move the io-wq out of the io_ring_ctx, and tie it to the io_uring task
context instead.
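
Concretely, the io_wq pointer moves from the ring to the per-task
context, roughly (other fields elided):

    struct io_ring_ctx {
        ...
        /* struct io_wq *io_wq is gone from here */
    };

    struct io_uring_task {
        struct xarray           xa;
        struct wait_queue_head  wait;
        struct file             *last;
        void                    *io_wq;     /* struct io_wq * */
        ...
    };

Each submitting task now gets its own io_wq, created lazily in
io_uring_alloc_task_context().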

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c            | 84 ++++++++++++++++------------------------
 include/linux/io_uring.h |  1 +
 2 files changed, 35 insertions(+), 50 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0eeb2a1596c2..6ad3e1df6504 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -365,9 +365,6 @@ struct io_ring_ctx {
 
 	struct io_rings	*rings;
 
-	/* IO offload */
-	struct io_wq		*io_wq;
-
 	/*
 	 * For SQPOLL usage - we hold a reference to the parent task, so we
 	 * have access to the ->files
@@ -1619,10 +1616,11 @@ static struct io_kiocb *__io_queue_async_work(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_kiocb *link = io_prep_linked_timeout(req);
+	struct io_uring_task *tctx = req->task->io_uring;
 
 	trace_io_uring_queue_async_work(ctx, io_wq_is_hashed(&req->work), req,
 					&req->work, req->flags);
-	io_wq_enqueue(ctx->io_wq, &req->work);
+	io_wq_enqueue(tctx->io_wq, &req->work);
 	return link;
 }
 
@@ -5969,12 +5967,15 @@ static bool io_cancel_cb(struct io_wq_work *work, void *data)
 	return req->user_data == (unsigned long) data;
 }
 
-static int io_async_cancel_one(struct io_ring_ctx *ctx, void *sqe_addr)
+static int io_async_cancel_one(struct io_uring_task *tctx, void *sqe_addr)
 {
 	enum io_wq_cancel cancel_ret;
 	int ret = 0;
 
-	cancel_ret = io_wq_cancel_cb(ctx->io_wq, io_cancel_cb, sqe_addr, false);
+	if (!tctx->io_wq)
+		return -ENOENT;
+
+	cancel_ret = io_wq_cancel_cb(tctx->io_wq, io_cancel_cb, sqe_addr, false);
 	switch (cancel_ret) {
 	case IO_WQ_CANCEL_OK:
 		ret = 0;
@@ -5997,7 +5998,8 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx,
 	unsigned long flags;
 	int ret;
 
-	ret = io_async_cancel_one(ctx, (void *) (unsigned long) sqe_addr);
+	ret = io_async_cancel_one(req->task->io_uring,
+					(void *) (unsigned long) sqe_addr);
 	if (ret != -ENOENT) {
 		spin_lock_irqsave(&ctx->completion_lock, flags);
 		goto done;
@@ -7562,16 +7564,6 @@ static void io_sq_thread_stop(struct io_ring_ctx *ctx)
 	}
 }
 
-static void io_finish_async(struct io_ring_ctx *ctx)
-{
-	io_sq_thread_stop(ctx);
-
-	if (ctx->io_wq) {
-		io_wq_destroy(ctx->io_wq);
-		ctx->io_wq = NULL;
-	}
-}
-
 #if defined(CONFIG_UNIX)
 /*
  * Ensure the UNIX gc is aware of our file set, so we are certain that
@@ -8130,11 +8122,10 @@ static struct io_wq_work *io_free_work(struct io_wq_work *work)
 	return req ? &req->work : NULL;
 }
 
-static int io_init_wq_offload(struct io_ring_ctx *ctx)
+static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx)
 {
 	struct io_wq_data data;
 	unsigned int concurrency;
-	int ret = 0;
 
 	data.user = ctx->user;
 	data.free_work = io_free_work;
@@ -8143,16 +8134,11 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx)
 	/* Do QD, or 4 * CPUS, whatever is smallest */
 	concurrency = min(ctx->sq_entries, 4 * num_online_cpus());
 
-	ctx->io_wq = io_wq_create(concurrency, &data);
-	if (IS_ERR(ctx->io_wq)) {
-		ret = PTR_ERR(ctx->io_wq);
-		ctx->io_wq = NULL;
-	}
-
-	return ret;
+	return io_wq_create(concurrency, &data);
 }
 
-static int io_uring_alloc_task_context(struct task_struct *task)
+static int io_uring_alloc_task_context(struct task_struct *task,
+				       struct io_ring_ctx *ctx)
 {
 	struct io_uring_task *tctx;
 	int ret;
@@ -8167,6 +8153,14 @@ static int io_uring_alloc_task_context(struct task_struct *task)
 		return ret;
 	}
 
+	tctx->io_wq = io_init_wq_offload(ctx);
+	if (IS_ERR(tctx->io_wq)) {
+		ret = PTR_ERR(tctx->io_wq);
+		percpu_counter_destroy(&tctx->inflight);
+		kfree(tctx);
+		return ret;
+	}
+
 	xa_init(&tctx->xa);
 	init_waitqueue_head(&tctx->wait);
 	tctx->last = NULL;
@@ -8239,7 +8233,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 			ctx->sq_thread_idle = HZ;
 
 		if (sqd->thread)
-			goto done;
+			return 0;
 
 		if (p->flags & IORING_SETUP_SQ_AFF) {
 			int cpu = p->sq_thread_cpu;
@@ -8261,7 +8255,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 			sqd->thread = NULL;
 			goto err;
 		}
-		ret = io_uring_alloc_task_context(sqd->thread);
+		ret = io_uring_alloc_task_context(sqd->thread, ctx);
 		if (ret)
 			goto err;
 	} else if (p->flags & IORING_SETUP_SQ_AFF) {
@@ -8270,14 +8264,9 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		goto err;
 	}
 
-done:
-	ret = io_init_wq_offload(ctx);
-	if (ret)
-		goto err;
-
 	return 0;
 err:
-	io_finish_async(ctx);
+	io_sq_thread_stop(ctx);
 	return ret;
 }
 
@@ -8752,7 +8741,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	mutex_lock(&ctx->uring_lock);
 	mutex_unlock(&ctx->uring_lock);
 
-	io_finish_async(ctx);
+	io_sq_thread_stop(ctx);
 	io_sqe_buffers_unregister(ctx);
 
 	if (ctx->sqo_task) {
@@ -8872,13 +8861,6 @@ static void io_ring_exit_work(struct work_struct *work)
 	io_ring_ctx_free(ctx);
 }
 
-static bool io_cancel_ctx_cb(struct io_wq_work *work, void *data)
-{
-	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
-
-	return req->ctx == data;
-}
-
 static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 {
 	mutex_lock(&ctx->uring_lock);
@@ -8897,9 +8879,6 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	io_kill_timeouts(ctx, NULL, NULL);
 	io_poll_remove_all(ctx, NULL, NULL);
 
-	if (ctx->io_wq)
-		io_wq_cancel_cb(ctx->io_wq, io_cancel_ctx_cb, ctx, true);
-
 	/* if we failed setting up the ctx, we might not have any rings */
 	io_iopoll_try_reap_events(ctx);
 
@@ -8978,13 +8957,14 @@ static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 					 struct files_struct *files)
 {
 	struct io_task_cancel cancel = { .task = task, .files = files, };
+	struct io_uring_task *tctx = current->io_uring;
 
 	while (1) {
 		enum io_wq_cancel cret;
 		bool ret = false;
 
-		if (ctx->io_wq) {
-			cret = io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb,
+		if (tctx && tctx->io_wq) {
+			cret = io_wq_cancel_cb(tctx->io_wq, io_cancel_task_cb,
 					       &cancel, true);
 			ret |= (cret != IO_WQ_CANCEL_NOTFOUND);
 		}
@@ -9096,7 +9076,7 @@ static int io_uring_add_task_file(struct io_ring_ctx *ctx, struct file *file)
 	int ret;
 
 	if (unlikely(!tctx)) {
-		ret = io_uring_alloc_task_context(current);
+		ret = io_uring_alloc_task_context(current, ctx);
 		if (unlikely(ret))
 			return ret;
 		tctx = current->io_uring;
@@ -9166,8 +9146,12 @@ void __io_uring_files_cancel(struct files_struct *files)
 		io_uring_cancel_task_requests(file->private_data, files);
 	atomic_dec(&tctx->in_idle);
 
-	if (files)
+	if (files) {
 		io_uring_remove_task_files(tctx);
+	} else if (tctx->io_wq && current->flags & PF_EXITING) {
+		io_wq_destroy(tctx->io_wq);
+		tctx->io_wq = NULL;
+	}
 }
 
 static s64 tctx_inflight(struct io_uring_task *tctx)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 2eb6d19de336..0e95398998b6 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -36,6 +36,7 @@ struct io_uring_task {
 	struct xarray		xa;
 	struct wait_queue_head	wait;
 	struct file		*last;
+	void			*io_wq;
 	struct percpu_counter	inflight;
 	struct io_identity	__identity;
 	struct io_identity	*identity;
-- 
2.30.0



* [PATCH 06/18] io-wq: don't pass 'wqe' needlessly around
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (4 preceding siblings ...)
  2021-02-19 17:09 ` [PATCH 05/18] io_uring: tie async worker side to the task context Jens Axboe
@ 2021-02-19 17:09 ` Jens Axboe
  2021-02-19 17:09 ` [PATCH 07/18] arch: setup PF_IO_WORKER threads like PF_KTHREAD Jens Axboe
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:09 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

Just grab it from the worker itself, which we're already passing in.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 0c47febfed9b..ec7f1106b659 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -201,9 +201,10 @@ static inline struct io_wqe_acct *io_work_get_acct(struct io_wqe *wqe,
 	return &wqe->acct[IO_WQ_ACCT_BOUND];
 }
 
-static inline struct io_wqe_acct *io_wqe_get_acct(struct io_wqe *wqe,
-						  struct io_worker *worker)
+static inline struct io_wqe_acct *io_wqe_get_acct(struct io_worker *worker)
 {
+	struct io_wqe *wqe = worker->wqe;
+
 	if (worker->flags & IO_WORKER_F_BOUND)
 		return &wqe->acct[IO_WQ_ACCT_BOUND];
 
@@ -213,7 +214,7 @@ static inline struct io_wqe_acct *io_wqe_get_acct(struct io_wqe *wqe,
 static void io_worker_exit(struct io_worker *worker)
 {
 	struct io_wqe *wqe = worker->wqe;
-	struct io_wqe_acct *acct = io_wqe_get_acct(wqe, worker);
+	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 
 	/*
 	 * If we're not at zero, someone else is holding a brief reference
@@ -303,23 +304,24 @@ static void io_wqe_wake_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
 		wake_up_process(wqe->wq->manager);
 }
 
-static void io_wqe_inc_running(struct io_wqe *wqe, struct io_worker *worker)
+static void io_wqe_inc_running(struct io_worker *worker)
 {
-	struct io_wqe_acct *acct = io_wqe_get_acct(wqe, worker);
+	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 
 	atomic_inc(&acct->nr_running);
 }
 
-static void io_wqe_dec_running(struct io_wqe *wqe, struct io_worker *worker)
+static void io_wqe_dec_running(struct io_worker *worker)
 	__must_hold(wqe->lock)
 {
-	struct io_wqe_acct *acct = io_wqe_get_acct(wqe, worker);
+	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
+	struct io_wqe *wqe = worker->wqe;
 
 	if (atomic_dec_and_test(&acct->nr_running) && io_wqe_run_queue(wqe))
 		io_wqe_wake_worker(wqe, acct);
 }
 
-static void io_worker_start(struct io_wqe *wqe, struct io_worker *worker)
+static void io_worker_start(struct io_worker *worker)
 {
 	allow_kernel_signal(SIGINT);
 
@@ -329,7 +331,7 @@ static void io_worker_start(struct io_wqe *wqe, struct io_worker *worker)
 
 	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
 	worker->restore_nsproxy = current->nsproxy;
-	io_wqe_inc_running(wqe, worker);
+	io_wqe_inc_running(worker);
 }
 
 /*
@@ -354,7 +356,7 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker,
 	worker_bound = (worker->flags & IO_WORKER_F_BOUND) != 0;
 	work_bound = (work->flags & IO_WQ_WORK_UNBOUND) == 0;
 	if (worker_bound != work_bound) {
-		io_wqe_dec_running(wqe, worker);
+		io_wqe_dec_running(worker);
 		if (work_bound) {
 			worker->flags |= IO_WORKER_F_BOUND;
 			wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers--;
@@ -366,7 +368,7 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker,
 			wqe->acct[IO_WQ_ACCT_BOUND].nr_workers--;
 			atomic_inc(&wqe->wq->user->processes);
 		}
-		io_wqe_inc_running(wqe, worker);
+		io_wqe_inc_running(worker);
 	 }
 }
 
@@ -589,7 +591,7 @@ static int io_wqe_worker(void *data)
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 
-	io_worker_start(wqe, worker);
+	io_worker_start(worker);
 
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		set_current_state(TASK_INTERRUPTIBLE);
@@ -634,14 +636,13 @@ static int io_wqe_worker(void *data)
 void io_wq_worker_running(struct task_struct *tsk)
 {
 	struct io_worker *worker = kthread_data(tsk);
-	struct io_wqe *wqe = worker->wqe;
 
 	if (!(worker->flags & IO_WORKER_F_UP))
 		return;
 	if (worker->flags & IO_WORKER_F_RUNNING)
 		return;
 	worker->flags |= IO_WORKER_F_RUNNING;
-	io_wqe_inc_running(wqe, worker);
+	io_wqe_inc_running(worker);
 }
 
 /*
@@ -662,7 +663,7 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
 	worker->flags &= ~IO_WORKER_F_RUNNING;
 
 	raw_spin_lock_irq(&wqe->lock);
-	io_wqe_dec_running(wqe, worker);
+	io_wqe_dec_running(worker);
 	raw_spin_unlock_irq(&wqe->lock);
 }
 
-- 
2.30.0



* [PATCH 07/18] arch: setup PF_IO_WORKER threads like PF_KTHREAD
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (5 preceding siblings ...)
  2021-02-19 17:09 ` [PATCH 06/18] io-wq: don't pass 'wqe' needlessly around Jens Axboe
@ 2021-02-19 17:09 ` Jens Axboe
  2021-02-19 22:21   ` Eric W. Biederman
  2021-02-19 17:10 ` [PATCH 08/18] kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signals Jens Axboe
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:09 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

PF_IO_WORKER threads are kernel threads too, but they aren't PF_KTHREAD
in the sense that we don't assign ->set_child_tid with our own structure.
Just ensure that every arch sets up the PF_IO_WORKER threads like
kthreads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/alpha/kernel/process.c      | 2 +-
 arch/arc/kernel/process.c        | 2 +-
 arch/arm/kernel/process.c        | 2 +-
 arch/arm64/kernel/process.c      | 2 +-
 arch/c6x/kernel/process.c        | 2 +-
 arch/csky/kernel/process.c       | 2 +-
 arch/h8300/kernel/process.c      | 2 +-
 arch/hexagon/kernel/process.c    | 2 +-
 arch/ia64/kernel/process.c       | 2 +-
 arch/m68k/kernel/process.c       | 2 +-
 arch/microblaze/kernel/process.c | 2 +-
 arch/mips/kernel/process.c       | 2 +-
 arch/nds32/kernel/process.c      | 2 +-
 arch/nios2/kernel/process.c      | 2 +-
 arch/openrisc/kernel/process.c   | 2 +-
 arch/riscv/kernel/process.c      | 2 +-
 arch/s390/kernel/process.c       | 2 +-
 arch/sh/kernel/process_32.c      | 2 +-
 arch/sparc/kernel/process_32.c   | 2 +-
 arch/sparc/kernel/process_64.c   | 2 +-
 arch/um/kernel/process.c         | 2 +-
 arch/x86/kernel/process.c        | 2 +-
 arch/xtensa/kernel/process.c     | 2 +-
 23 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 6c71554206cc..5112ab996394 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -249,7 +249,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1;	/* set FEN, clear everything else */
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index 37f724ad5e39..d838d0d57696 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -191,7 +191,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childksp[0] = 0;			/* fp */
 	childksp[1] = (unsigned long)ret_from_fork; /* blink */
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		memset(c_regs, 0, sizeof(struct pt_regs));
 
 		c_callee->r13 = kthread_arg;
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index ee3aee69e444..5199a2bb4111 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -243,7 +243,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 	thread->cpu_domain = get_domain();
 #endif
 
-	if (likely(!(p->flags & PF_KTHREAD))) {
+	if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
 		*childregs = *current_pt_regs();
 		childregs->ARM_r0 = 0;
 		if (stack_start)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 6616486a58fe..05f001b401a5 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -398,7 +398,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 
 	ptrauth_thread_init_kernel(p);
 
-	if (likely(!(p->flags & PF_KTHREAD))) {
+	if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
 		*childregs = *current_pt_regs();
 		childregs->regs[0] = 0;
 
diff --git a/arch/c6x/kernel/process.c b/arch/c6x/kernel/process.c
index 9f4fd6a40a10..403ad4ce3db0 100644
--- a/arch/c6x/kernel/process.c
+++ b/arch/c6x/kernel/process.c
@@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	childregs = task_pt_regs(p);
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* case of  __kernel_thread: we return to supervisor space */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->sp = (unsigned long)(childregs + 1);
diff --git a/arch/csky/kernel/process.c b/arch/csky/kernel/process.c
index 69af6bc87e64..3d0ca22cd0e2 100644
--- a/arch/csky/kernel/process.c
+++ b/arch/csky/kernel/process.c
@@ -49,7 +49,7 @@ int copy_thread(unsigned long clone_flags,
 	/* setup thread.sp for switch_to !!! */
 	p->thread.sp = (unsigned long)childstack;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childstack->r15 = (unsigned long) ret_from_kernel_thread;
 		childstack->r10 = kthread_arg;
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index bc1364db58fe..46b1342ce515 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	childregs = (struct pt_regs *) (THREAD_SIZE + task_stack_page(p)) - 1;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->retpc = (unsigned long) ret_from_kernel_thread;
 		childregs->er4 = topstk; /* arg */
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index 6a980cba7b29..c61165c99ae0 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -73,7 +73,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 						    sizeof(*ss));
 	ss->lr = (unsigned long)ret_from_fork;
 	p->thread.switch_sp = ss;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		/* r24 <- fn, r25 <- arg */
 		ss->r24 = usp;
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index 4ebbfa076a26..7e1a1525e202 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -338,7 +338,7 @@ copy_thread(unsigned long clone_flags, unsigned long user_stack_base,
 
 	ia64_drop_fpu(p);	/* don't pick up stale state from a CPU's fph */
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		if (unlikely(!user_stack_base)) {
 			/* fork_idle() called us */
 			return 0;
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index 08359a6e058f..da83cc83e791 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -157,7 +157,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	 */
 	p->thread.fs = get_fs().seg;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
 		memset(frame, 0, sizeof(struct fork_frame));
 		frame->regs.sr = PS_S;
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 657c2beb665e..62aa237180b6 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -59,7 +59,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct thread_info *ti = task_thread_info(p);
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* if we're creating a new kernel thread then just zeroing all
 		 * the registers. That's OK for a brand new thread.*/
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index d7e288f3a1e7..f69434015be7 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -135,7 +135,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	/*  Put the stack after the struct pt_regs.  */
 	childksp = (unsigned long) childregs;
 	p->thread.cp0_status = (read_c0_status() & ~(ST0_CU2|ST0_CU1)) | ST0_KERNEL_CUMASK;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
 		unsigned long status = p->thread.cp0_status;
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c
index e01ad5d17224..c1327e552ec6 100644
--- a/arch/nds32/kernel/process.c
+++ b/arch/nds32/kernel/process.c
@@ -156,7 +156,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 
 	memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		/* kernel thread fn */
 		p->thread.cpu_context.r6 = stack_start;
diff --git a/arch/nios2/kernel/process.c b/arch/nios2/kernel/process.c
index 50b4eb19a6cc..c5f916ca6845 100644
--- a/arch/nios2/kernel/process.c
+++ b/arch/nios2/kernel/process.c
@@ -109,7 +109,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct switch_stack *childstack =
 		((struct switch_stack *)childregs) - 1;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
 
diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index 3c98728cce24..83fba4ee4453 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -167,7 +167,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	sp -= sizeof(struct pt_regs);
 	kregs = (struct pt_regs *)sp;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		memset(kregs, 0, sizeof(struct pt_regs));
 		kregs->gpr[20] = usp; /* fn, kernel thread */
 		kregs->gpr[22] = arg;
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index dd5f985b1f40..06d326caa7d8 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct pt_regs *childregs = task_pt_regs(p);
 
 	/* p->thread holds context to be restored by __switch_to() */
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* Kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->gp = gp_in_global;
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index bc3ca54edfb4..ac7a06d5e230 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -114,7 +114,7 @@ int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
 	frame->sf.gprs[9] = (unsigned long) frame;
 
 	/* Store access registers to kernel stack of new process. */
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
 		memset(&frame->childregs, 0, sizeof(struct pt_regs));
 		frame->childregs.psw.mask = PSW_KERNEL_BITS | PSW_MASK_DAT |
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index 80a5d1c66a51..1aa508eb0823 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -114,7 +114,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 
 	childregs = task_pt_regs(p);
 	p->thread.sp = (unsigned long) childregs;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.pc = (unsigned long) ret_from_kernel_thread;
 		childregs->regs[4] = arg;
diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
index a02363735915..0f9c606e1e78 100644
--- a/arch/sparc/kernel/process_32.c
+++ b/arch/sparc/kernel/process_32.c
@@ -309,7 +309,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	ti->ksp = (unsigned long) new_stack;
 	p->thread.kregs = childregs;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		extern int nwindows;
 		unsigned long psr;
 		memset(new_stack, 0, STACKFRAME_SZ + TRACEREG_SZ);
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index 6f8c7822fc06..7afd0a859a78 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -597,7 +597,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 				       sizeof(struct sparc_stackf));
 	t->fpsaved[0] = 0;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		memset(child_trap_frame, 0, child_stack_sz);
 		__thread_flag_byte_ptr(t)[TI_FLAG_BYTE_CWP] = 
 			(current_pt_regs()->tstate + 1) & TSTATE_CWP;
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 81d508daf67c..c5011064b5dd 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -157,7 +157,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 		unsigned long arg, struct task_struct * p, unsigned long tls)
 {
 	void (*handler)(void);
-	int kthread = current->flags & PF_KTHREAD;
+	int kthread = current->flags & (PF_KTHREAD | PF_IO_WORKER);
 	int ret = 0;
 
 	p->thread = (struct thread_struct) INIT_THREAD;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 145a7ac0c19a..9c214d7085a4 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -161,7 +161,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 #endif
 
 	/* Kernel thread ? */
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		kthread_frame_init(frame, sp, arg);
 		return 0;
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index 397a7de56377..9534ef515d74 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -217,7 +217,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
 
 	p->thread.sp = (unsigned long)childregs;
 
-	if (!(p->flags & PF_KTHREAD)) {
+	if (!(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		struct pt_regs *regs = current_pt_regs();
 		unsigned long usp = usp_thread_fn ?
 			usp_thread_fn : regs->areg[1];
-- 
2.30.0



* [PATCH 08/18] kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signals
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (6 preceding siblings ...)
  2021-02-19 17:09 ` [PATCH 07/18] arch: setup PF_IO_WORKER threads like PF_KTHREAD Jens Axboe
@ 2021-02-19 17:10 ` Jens Axboe
  2021-02-19 17:10 ` [PATCH 09/18] io-wq: fork worker threads from original task Jens Axboe
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:10 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 kernel/ptrace.c | 2 +-
 kernel/signal.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 61db50f7ca86..821cf1723814 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -375,7 +375,7 @@ static int ptrace_attach(struct task_struct *task, long request,
 	audit_ptrace(task);
 
 	retval = -EPERM;
-	if (unlikely(task->flags & PF_KTHREAD))
+	if (unlikely(task->flags & (PF_KTHREAD | PF_IO_WORKER)))
 		goto out;
 	if (same_thread_group(task, current))
 		goto out;
diff --git a/kernel/signal.c b/kernel/signal.c
index 5ad8566534e7..ba4d1ef39a9e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -91,7 +91,7 @@ static bool sig_task_ignored(struct task_struct *t, int sig, bool force)
 		return true;
 
 	/* Only allow kernel generated signals to this kthread */
-	if (unlikely((t->flags & PF_KTHREAD) &&
+	if (unlikely((t->flags & (PF_KTHREAD | PF_IO_WORKER)) &&
 		     (handler == SIG_KTHREAD_KERNEL) && !force))
 		return true;
 
@@ -1096,7 +1096,7 @@ static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struc
 	/*
 	 * Skip useless siginfo allocation for SIGKILL and kernel threads.
 	 */
-	if ((sig == SIGKILL) || (t->flags & PF_KTHREAD))
+	if ((sig == SIGKILL) || (t->flags & (PF_KTHREAD | PF_IO_WORKER)))
 		goto out_set;
 
 	/*
-- 
2.30.0



* [PATCH 09/18] io-wq: fork worker threads from original task
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (7 preceding siblings ...)
  2021-02-19 17:10 ` [PATCH 08/18] kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signals Jens Axboe
@ 2021-02-19 17:10 ` Jens Axboe
  2021-03-04 12:23   ` Stefan Metzmacher
  2021-02-19 17:10 ` [PATCH 10/18] io-wq: worker idling always returns false Jens Axboe
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:10 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

Instead of using regular kthreads, create kernel threads that look just
like a thread the original task would have created itself. This ensures
that we get all the context that we need, without having to carry that
state around. This greatly reduces the code complexity, and the risk of
missing state for a given request type.

With the move away from kthreads, we can also drop everything related to
assigning per-request state to the new threads.
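
The heart of the change is in create_io_worker(). Instead of

    worker->task = kthread_create_on_node(io_wqe_worker, worker, wqe->node,
                            "io_wqe_worker-%d/%d", index, wqe->node);

the worker is now forked as a sibling thread of the original task:

    if (index == IO_WQ_ACCT_BOUND)
        pid = fork_thread(task_thread_bound, worker);
    else
        pid = fork_thread(task_thread_unbound, worker);

where fork_thread() wraps kernel_clone() with CLONE_VM plus the usual
thread flags (CLONE_FS, CLONE_FILES, CLONE_SIGHAND, CLONE_THREAD,
CLONE_IO), which is what lets all of the impersonation code below be
deleted.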

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c            | 301 +++++++++++++++---------------------------
 fs/io_uring.c         |   7 +
 include/linux/sched.h |   3 +
 3 files changed, 114 insertions(+), 197 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index ec7f1106b659..b53f569b5b4e 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -13,12 +13,9 @@
 #include <linux/sched/mm.h>
 #include <linux/percpu.h>
 #include <linux/slab.h>
-#include <linux/kthread.h>
 #include <linux/rculist_nulls.h>
-#include <linux/fs_struct.h>
-#include <linux/blk-cgroup.h>
-#include <linux/audit.h>
 #include <linux/cpu.h>
+#include <linux/tracehook.h>
 
 #include "../kernel/sched/sched.h"
 #include "io-wq.h"
@@ -57,13 +54,6 @@ struct io_worker {
 	spinlock_t lock;
 
 	struct rcu_head rcu;
-	struct mm_struct *mm;
-#ifdef CONFIG_BLK_CGROUP
-	struct cgroup_subsys_state *blkcg_css;
-#endif
-	const struct cred *cur_creds;
-	const struct cred *saved_creds;
-	struct nsproxy *restore_nsproxy;
 };
 
 #if BITS_PER_LONG == 64
@@ -122,6 +112,8 @@ struct io_wq {
 	struct completion done;
 
 	struct hlist_node cpuhp_node;
+
+	pid_t task_pid;
 };
 
 static enum cpuhp_state io_wq_online;
@@ -137,61 +129,6 @@ static void io_worker_release(struct io_worker *worker)
 		wake_up_process(worker->task);
 }
 
-/*
- * Note: drops the wqe->lock if returning true! The caller must re-acquire
- * the lock in that case. Some callers need to restart handling if this
- * happens, so we can't just re-acquire the lock on behalf of the caller.
- */
-static bool __io_worker_unuse(struct io_wqe *wqe, struct io_worker *worker)
-{
-	bool dropped_lock = false;
-
-	if (worker->saved_creds) {
-		revert_creds(worker->saved_creds);
-		worker->cur_creds = worker->saved_creds = NULL;
-	}
-
-	if (current->files) {
-		__acquire(&wqe->lock);
-		raw_spin_unlock_irq(&wqe->lock);
-		dropped_lock = true;
-
-		task_lock(current);
-		current->files = NULL;
-		current->nsproxy = worker->restore_nsproxy;
-		task_unlock(current);
-	}
-
-	if (current->fs)
-		current->fs = NULL;
-
-	/*
-	 * If we have an active mm, we need to drop the wq lock before unusing
-	 * it. If we do, return true and let the caller retry the idle loop.
-	 */
-	if (worker->mm) {
-		if (!dropped_lock) {
-			__acquire(&wqe->lock);
-			raw_spin_unlock_irq(&wqe->lock);
-			dropped_lock = true;
-		}
-		__set_current_state(TASK_RUNNING);
-		kthread_unuse_mm(worker->mm);
-		mmput(worker->mm);
-		worker->mm = NULL;
-	}
-
-#ifdef CONFIG_BLK_CGROUP
-	if (worker->blkcg_css) {
-		kthread_associate_blkcg(NULL);
-		worker->blkcg_css = NULL;
-	}
-#endif
-	if (current->signal->rlim[RLIMIT_FSIZE].rlim_cur != RLIM_INFINITY)
-		current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
-	return dropped_lock;
-}
-
 static inline struct io_wqe_acct *io_work_get_acct(struct io_wqe *wqe,
 						   struct io_wq_work *work)
 {
@@ -237,10 +174,6 @@ static void io_worker_exit(struct io_worker *worker)
 	raw_spin_lock_irq(&wqe->lock);
 	hlist_nulls_del_rcu(&worker->nulls_node);
 	list_del_rcu(&worker->all_list);
-	if (__io_worker_unuse(wqe, worker)) {
-		__release(&wqe->lock);
-		raw_spin_lock_irq(&wqe->lock);
-	}
 	acct->nr_workers--;
 	raw_spin_unlock_irq(&wqe->lock);
 
@@ -323,14 +256,7 @@ static void io_wqe_dec_running(struct io_worker *worker)
 
 static void io_worker_start(struct io_worker *worker)
 {
-	allow_kernel_signal(SIGINT);
-
-	current->flags |= PF_IO_WORKER;
-	current->fs = NULL;
-	current->files = NULL;
-
 	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
-	worker->restore_nsproxy = current->nsproxy;
 	io_wqe_inc_running(worker);
 }
 
@@ -387,7 +313,7 @@ static bool __io_worker_idle(struct io_wqe *wqe, struct io_worker *worker)
 		hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
 	}
 
-	return __io_worker_unuse(wqe, worker);
+	return false;
 }
 
 static inline unsigned int io_get_work_hash(struct io_wq_work *work)
@@ -426,96 +352,23 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe)
 	return NULL;
 }
 
-static void io_wq_switch_mm(struct io_worker *worker, struct io_wq_work *work)
+static void io_flush_signals(void)
 {
-	if (worker->mm) {
-		kthread_unuse_mm(worker->mm);
-		mmput(worker->mm);
-		worker->mm = NULL;
+	if (unlikely(test_tsk_thread_flag(current, TIF_NOTIFY_SIGNAL))) {
+		if (current->task_works)
+			task_work_run();
+		clear_tsk_thread_flag(current, TIF_NOTIFY_SIGNAL);
 	}
-
-	if (mmget_not_zero(work->identity->mm)) {
-		kthread_use_mm(work->identity->mm);
-		worker->mm = work->identity->mm;
-		return;
-	}
-
-	/* failed grabbing mm, ensure work gets cancelled */
-	work->flags |= IO_WQ_WORK_CANCEL;
-}
-
-static inline void io_wq_switch_blkcg(struct io_worker *worker,
-				      struct io_wq_work *work)
-{
-#ifdef CONFIG_BLK_CGROUP
-	if (!(work->flags & IO_WQ_WORK_BLKCG))
-		return;
-	if (work->identity->blkcg_css != worker->blkcg_css) {
-		kthread_associate_blkcg(work->identity->blkcg_css);
-		worker->blkcg_css = work->identity->blkcg_css;
-	}
-#endif
-}
-
-static void io_wq_switch_creds(struct io_worker *worker,
-			       struct io_wq_work *work)
-{
-	const struct cred *old_creds = override_creds(work->identity->creds);
-
-	worker->cur_creds = work->identity->creds;
-	if (worker->saved_creds)
-		put_cred(old_creds); /* creds set by previous switch */
-	else
-		worker->saved_creds = old_creds;
-}
-
-static void io_impersonate_work(struct io_worker *worker,
-				struct io_wq_work *work)
-{
-	if ((work->flags & IO_WQ_WORK_FILES) &&
-	    current->files != work->identity->files) {
-		task_lock(current);
-		current->files = work->identity->files;
-		current->nsproxy = work->identity->nsproxy;
-		task_unlock(current);
-		if (!work->identity->files) {
-			/* failed grabbing files, ensure work gets cancelled */
-			work->flags |= IO_WQ_WORK_CANCEL;
-		}
-	}
-	if ((work->flags & IO_WQ_WORK_FS) && current->fs != work->identity->fs)
-		current->fs = work->identity->fs;
-	if ((work->flags & IO_WQ_WORK_MM) && work->identity->mm != worker->mm)
-		io_wq_switch_mm(worker, work);
-	if ((work->flags & IO_WQ_WORK_CREDS) &&
-	    worker->cur_creds != work->identity->creds)
-		io_wq_switch_creds(worker, work);
-	if (work->flags & IO_WQ_WORK_FSIZE)
-		current->signal->rlim[RLIMIT_FSIZE].rlim_cur = work->identity->fsize;
-	else if (current->signal->rlim[RLIMIT_FSIZE].rlim_cur != RLIM_INFINITY)
-		current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
-	io_wq_switch_blkcg(worker, work);
-#ifdef CONFIG_AUDIT
-	current->loginuid = work->identity->loginuid;
-	current->sessionid = work->identity->sessionid;
-#endif
 }
 
 static void io_assign_current_work(struct io_worker *worker,
 				   struct io_wq_work *work)
 {
 	if (work) {
-		/* flush pending signals before assigning new work */
-		if (signal_pending(current))
-			flush_signals(current);
+		io_flush_signals();
 		cond_resched();
 	}
 
-#ifdef CONFIG_AUDIT
-	current->loginuid = KUIDT_INIT(AUDIT_UID_UNSET);
-	current->sessionid = AUDIT_SID_UNSET;
-#endif
-
 	spin_lock_irq(&worker->lock);
 	worker->cur_work = work;
 	spin_unlock_irq(&worker->lock);
@@ -556,7 +409,6 @@ static void io_worker_handle_work(struct io_worker *worker)
 			unsigned int hash = io_get_work_hash(work);
 
 			next_hashed = wq_next_work(work);
-			io_impersonate_work(worker, work);
 			wq->do_work(work);
 			io_assign_current_work(worker, NULL);
 
@@ -608,10 +460,11 @@ static int io_wqe_worker(void *data)
 			goto loop;
 		}
 		raw_spin_unlock_irq(&wqe->lock);
-		if (signal_pending(current))
-			flush_signals(current);
+		io_flush_signals();
 		if (schedule_timeout(WORKER_IDLE_TIMEOUT))
 			continue;
+		if (fatal_signal_pending(current))
+			break;
 		/* timed out, exit unless we're the fixed worker */
 		if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
 		    !(worker->flags & IO_WORKER_F_FIXED))
@@ -635,8 +488,10 @@ static int io_wqe_worker(void *data)
  */
 void io_wq_worker_running(struct task_struct *tsk)
 {
-	struct io_worker *worker = kthread_data(tsk);
+	struct io_worker *worker = tsk->pf_io_worker;
 
+	if (!worker)
+		return;
 	if (!(worker->flags & IO_WORKER_F_UP))
 		return;
 	if (worker->flags & IO_WORKER_F_RUNNING)
@@ -652,9 +507,10 @@ void io_wq_worker_running(struct task_struct *tsk)
  */
 void io_wq_worker_sleeping(struct task_struct *tsk)
 {
-	struct io_worker *worker = kthread_data(tsk);
-	struct io_wqe *wqe = worker->wqe;
+	struct io_worker *worker = tsk->pf_io_worker;
 
+	if (!worker)
+		return;
 	if (!(worker->flags & IO_WORKER_F_UP))
 		return;
 	if (!(worker->flags & IO_WORKER_F_RUNNING))
@@ -662,32 +518,27 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
 
 	worker->flags &= ~IO_WORKER_F_RUNNING;
 
-	raw_spin_lock_irq(&wqe->lock);
+	raw_spin_lock_irq(&worker->wqe->lock);
 	io_wqe_dec_running(worker);
-	raw_spin_unlock_irq(&wqe->lock);
+	raw_spin_unlock_irq(&worker->wqe->lock);
 }
 
-static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
+static int task_thread(void *data, int index)
 {
+	struct io_worker *worker = data;
+	struct io_wqe *wqe = worker->wqe;
 	struct io_wqe_acct *acct = &wqe->acct[index];
-	struct io_worker *worker;
+	struct io_wq *wq = wqe->wq;
+	char buf[TASK_COMM_LEN];
 
-	worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, wqe->node);
-	if (!worker)
-		return false;
+	sprintf(buf, "iou-wrk-%d", wq->task_pid);
+	set_task_comm(current, buf);
 
-	refcount_set(&worker->ref, 1);
-	worker->nulls_node.pprev = NULL;
-	worker->wqe = wqe;
-	spin_lock_init(&worker->lock);
+	current->pf_io_worker = worker;
+	worker->task = current;
 
-	worker->task = kthread_create_on_node(io_wqe_worker, worker, wqe->node,
-				"io_wqe_worker-%d/%d", index, wqe->node);
-	if (IS_ERR(worker->task)) {
-		kfree(worker);
-		return false;
-	}
-	kthread_bind_mask(worker->task, cpumask_of_node(wqe->node));
+	set_cpus_allowed_ptr(current, cpumask_of_node(wqe->node));
+	current->flags |= PF_NO_SETAFFINITY;
 
 	raw_spin_lock_irq(&wqe->lock);
 	hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
@@ -703,8 +554,58 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 	if (index == IO_WQ_ACCT_UNBOUND)
 		atomic_inc(&wq->user->processes);
 
+	io_wqe_worker(data);
+	do_exit(0);
+}
+
+static int task_thread_bound(void *data)
+{
+	return task_thread(data, IO_WQ_ACCT_BOUND);
+}
+
+static int task_thread_unbound(void *data)
+{
+	return task_thread(data, IO_WQ_ACCT_UNBOUND);
+}
+
+static pid_t fork_thread(int (*fn)(void *), void *arg)
+{
+	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
+				CLONE_IO|SIGCHLD;
+	struct kernel_clone_args args = {
+		.flags		= ((lower_32_bits(flags) | CLONE_VM |
+				    CLONE_UNTRACED) & ~CSIGNAL),
+		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
+		.stack		= (unsigned long)fn,
+		.stack_size	= (unsigned long)arg,
+	};
+
+	return kernel_clone(&args);
+}
+
+static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
+{
+	struct io_worker *worker;
+	pid_t pid;
+
+	worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, wqe->node);
+	if (!worker)
+		return false;
+
+	refcount_set(&worker->ref, 1);
+	worker->nulls_node.pprev = NULL;
+	worker->wqe = wqe;
+	spin_lock_init(&worker->lock);
+
+	if (index == IO_WQ_ACCT_BOUND)
+		pid = fork_thread(task_thread_bound, worker);
+	else
+		pid = fork_thread(task_thread_unbound, worker);
+	if (pid < 0) {
+		kfree(worker);
+		return false;
+	}
 	refcount_inc(&wq->refs);
-	wake_up_process(worker->task);
 	return true;
 }
 
@@ -756,12 +657,17 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 static int io_wq_manager(void *data)
 {
 	struct io_wq *wq = data;
+	char buf[TASK_COMM_LEN];
 	int node;
 
-	refcount_set(&wq->refs, 1);
+	sprintf(buf, "iou-mgr-%d", wq->task_pid);
+	set_task_comm(current, buf);
+	current->flags |= PF_IO_WORKER;
+	wq->manager = current;
+
 	complete(&wq->done);
 
-	while (!kthread_should_stop()) {
+	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		for_each_node(node) {
 			struct io_wqe *wqe = wq->wqes[node];
 			bool fork_worker[2] = { false, false };
@@ -782,11 +688,13 @@ static int io_wq_manager(void *data)
 		}
 		set_current_state(TASK_INTERRUPTIBLE);
 		schedule_timeout(HZ);
+		if (fatal_signal_pending(current))
+			set_bit(IO_WQ_BIT_EXIT, &wq->state);
 	}
 
 	if (refcount_dec_and_test(&wq->refs)) {
 		complete(&wq->done);
-		return 0;
+		do_exit(0);
 	}
 	/* if ERROR is set and we get here, we have workers to wake */
 	if (test_bit(IO_WQ_BIT_ERROR, &wq->state)) {
@@ -795,7 +703,7 @@ static int io_wq_manager(void *data)
 			io_wq_for_each_worker(wq->wqes[node], io_wq_worker_wake, NULL);
 		rcu_read_unlock();
 	}
-	return 0;
+	do_exit(0);
 }
 
 static bool io_wq_can_queue(struct io_wqe *wqe, struct io_wqe_acct *acct,
@@ -919,7 +827,7 @@ static bool io_wq_worker_cancel(struct io_worker *worker, void *data)
 	spin_lock_irqsave(&worker->lock, flags);
 	if (worker->cur_work &&
 	    match->fn(worker->cur_work, match->data)) {
-		send_sig(SIGINT, worker->task, 1);
+		set_notify_signal(worker->task);
 		match->nr_running++;
 	}
 	spin_unlock_irqrestore(&worker->lock, flags);
@@ -1075,22 +983,21 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 		INIT_LIST_HEAD(&wqe->all_list);
 	}
 
+	wq->task_pid = current->pid;
 	init_completion(&wq->done);
+	refcount_set(&wq->refs, 1);
 
-	wq->manager = kthread_create(io_wq_manager, wq, "io_wq_manager");
-	if (!IS_ERR(wq->manager)) {
-		wake_up_process(wq->manager);
+	current->flags |= PF_IO_WORKER;
+	ret = fork_thread(io_wq_manager, wq);
+	current->flags &= ~PF_IO_WORKER;
+	if (ret >= 0) {
 		wait_for_completion(&wq->done);
-		if (test_bit(IO_WQ_BIT_ERROR, &wq->state)) {
-			ret = -ENOMEM;
-			goto err;
-		}
 		reinit_completion(&wq->done);
 		return wq;
 	}
 
-	ret = PTR_ERR(wq->manager);
-	complete(&wq->done);
+	if (refcount_dec_and_test(&wq->refs))
+		complete(&wq->done);
 err:
 	cpuhp_state_remove_instance_nocalls(io_wq_online, &wq->cpuhp_node);
 	for_each_node(node)
@@ -1110,7 +1017,7 @@ void io_wq_destroy(struct io_wq *wq)
 
 	set_bit(IO_WQ_BIT_EXIT, &wq->state);
 	if (wq->manager)
-		kthread_stop(wq->manager);
+		wake_up_process(wq->manager);
 
 	rcu_read_lock();
 	for_each_node(node)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6ad3e1df6504..b0a7a2d3ab4f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1618,6 +1618,9 @@ static struct io_kiocb *__io_queue_async_work(struct io_kiocb *req)
 	struct io_kiocb *link = io_prep_linked_timeout(req);
 	struct io_uring_task *tctx = req->task->io_uring;
 
+	BUG_ON(!tctx);
+	BUG_ON(!tctx->io_wq);
+
 	trace_io_uring_queue_async_work(ctx, io_wq_is_hashed(&req->work), req,
 					&req->work, req->flags);
 	io_wq_enqueue(tctx->io_wq, &req->work);
@@ -9266,6 +9269,10 @@ static int io_uring_flush(struct file *file, void *data)
 	struct io_uring_task *tctx = current->io_uring;
 	struct io_ring_ctx *ctx = file->private_data;
 
+	/* Ignore helper thread files exit */
+	if (current->flags & PF_IO_WORKER)
+		return 0;
+
 	if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) {
 		io_uring_cancel_task_requests(ctx, NULL);
 		io_req_caches_free(ctx, current);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..a6a9f0323102 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -895,6 +895,9 @@ struct task_struct {
 	/* CLONE_CHILD_CLEARTID: */
 	int __user			*clear_child_tid;
 
+	/* PF_IO_WORKER */
+	void				*pf_io_worker;
+
 	u64				utime;
 	u64				stime;
 #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 10/18] io-wq: worker idling always returns false
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (8 preceding siblings ...)
  2021-02-19 17:10 ` [PATCH 09/18] io-wq: fork worker threads from original task Jens Axboe
@ 2021-02-19 17:10 ` Jens Axboe
  2021-02-19 17:10 ` [PATCH 11/18] io_uring: remove any grabbing of context Jens Axboe
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:10 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

Remove the bool return value, and the check for it in the caller.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index b53f569b5b4e..41042119bf0f 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -305,15 +305,13 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker,
  * retry the loop in that case (we changed task state), we don't regrab
  * the lock if we return success.
  */
-static bool __io_worker_idle(struct io_wqe *wqe, struct io_worker *worker)
+static void __io_worker_idle(struct io_wqe *wqe, struct io_worker *worker)
 	__must_hold(wqe->lock)
 {
 	if (!(worker->flags & IO_WORKER_F_FREE)) {
 		worker->flags |= IO_WORKER_F_FREE;
 		hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
 	}
-
-	return false;
 }
 
 static inline unsigned int io_get_work_hash(struct io_wq_work *work)
@@ -454,11 +452,7 @@ static int io_wqe_worker(void *data)
 			io_worker_handle_work(worker);
 			goto loop;
 		}
-		/* drops the lock on success, retry */
-		if (__io_worker_idle(wqe, worker)) {
-			__release(&wqe->lock);
-			goto loop;
-		}
+		__io_worker_idle(wqe, worker);
 		raw_spin_unlock_irq(&wqe->lock);
 		io_flush_signals();
 		if (schedule_timeout(WORKER_IDLE_TIMEOUT))
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 11/18] io_uring: remove any grabbing of context
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (9 preceding siblings ...)
  2021-02-19 17:10 ` [PATCH 10/18] io-wq: worker idling always returns false Jens Axboe
@ 2021-02-19 17:10 ` Jens Axboe
  2021-02-19 17:10 ` [PATCH 12/18] io_uring: remove io_identity Jens Axboe
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:10 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

The async workers are siblings of the task itself, so by definition we
have all the state that we need. Remove all of the state grabbing that
we do, and the per-request flagging of what state each request needs.
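
For illustration, a minimal userspace sketch of the same sharing set
(worker_fn, the stack size, and the pipe are just for demonstration;
real code would use pthreads rather than raw clone()). A thread cloned
with these flags sees the parent's memory and file table directly,
which is why no per-request state grabbing is needed:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

static int pipefd[2];

static int worker_fn(void *arg)
{
        /* shares the parent's fd table, so pipefd[1] is valid here */
        const char *msg = "hello from sibling\n";

        write(pipefd[1], msg, strlen(msg));
        /* a CLONE_THREAD child must exit the thread only, not the group */
        syscall(SYS_exit, 0);
        return 0;
}

int main(void)
{
        /* the kernel side additionally sets CLONE_IO to share io context */
        int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                    CLONE_THREAD;
        char *stack = malloc(64 * 1024);
        char buf[64];
        ssize_t n;

        if (!stack || pipe(pipefd) < 0)
                return 1;
        if (clone(worker_fn, stack + 64 * 1024, flags, NULL) < 0)
                return 1;
        n = read(pipefd[0], buf, sizeof(buf));
        if (n > 0)
                fwrite(buf, 1, n, stdout);
        return 0;
}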

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.h    |   7 --
 fs/io_uring.c | 215 ++------------------------------------------------
 2 files changed, 7 insertions(+), 215 deletions(-)

diff --git a/fs/io-wq.h b/fs/io-wq.h
index d2cf284b4641..ab8029bf77b8 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -11,13 +11,6 @@ enum {
 	IO_WQ_WORK_UNBOUND	= 4,
 	IO_WQ_WORK_CONCURRENT	= 16,
 
-	IO_WQ_WORK_FILES	= 32,
-	IO_WQ_WORK_FS		= 64,
-	IO_WQ_WORK_MM		= 128,
-	IO_WQ_WORK_CREDS	= 256,
-	IO_WQ_WORK_BLKCG	= 512,
-	IO_WQ_WORK_FSIZE	= 1024,
-
 	IO_WQ_HASH_SHIFT	= 24,	/* upper 8 bits are used for hash key */
 };
 
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b0a7a2d3ab4f..872d2f1c6ea5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -837,7 +837,6 @@ struct io_op_def {
 	unsigned		plug : 1;
 	/* size of async data needed, if any */
 	unsigned short		async_size;
-	unsigned		work_flags;
 };
 
 static const struct io_op_def io_op_defs[] = {
@@ -850,7 +849,6 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_async_data	= 1,
 		.plug			= 1,
 		.async_size		= sizeof(struct io_async_rw),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_WRITEV] = {
 		.needs_file		= 1,
@@ -860,12 +858,9 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_async_data	= 1,
 		.plug			= 1,
 		.async_size		= sizeof(struct io_async_rw),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
-						IO_WQ_WORK_FSIZE,
 	},
 	[IORING_OP_FSYNC] = {
 		.needs_file		= 1,
-		.work_flags		= IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_READ_FIXED] = {
 		.needs_file		= 1,
@@ -873,7 +868,6 @@ static const struct io_op_def io_op_defs[] = {
 		.pollin			= 1,
 		.plug			= 1,
 		.async_size		= sizeof(struct io_async_rw),
-		.work_flags		= IO_WQ_WORK_BLKCG | IO_WQ_WORK_MM,
 	},
 	[IORING_OP_WRITE_FIXED] = {
 		.needs_file		= 1,
@@ -882,8 +876,6 @@ static const struct io_op_def io_op_defs[] = {
 		.pollout		= 1,
 		.plug			= 1,
 		.async_size		= sizeof(struct io_async_rw),
-		.work_flags		= IO_WQ_WORK_BLKCG | IO_WQ_WORK_FSIZE |
-						IO_WQ_WORK_MM,
 	},
 	[IORING_OP_POLL_ADD] = {
 		.needs_file		= 1,
@@ -892,7 +884,6 @@ static const struct io_op_def io_op_defs[] = {
 	[IORING_OP_POLL_REMOVE] = {},
 	[IORING_OP_SYNC_FILE_RANGE] = {
 		.needs_file		= 1,
-		.work_flags		= IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_SENDMSG] = {
 		.needs_file		= 1,
@@ -900,8 +891,6 @@ static const struct io_op_def io_op_defs[] = {
 		.pollout		= 1,
 		.needs_async_data	= 1,
 		.async_size		= sizeof(struct io_async_msghdr),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
-						IO_WQ_WORK_FS,
 	},
 	[IORING_OP_RECVMSG] = {
 		.needs_file		= 1,
@@ -910,29 +899,23 @@ static const struct io_op_def io_op_defs[] = {
 		.buffer_select		= 1,
 		.needs_async_data	= 1,
 		.async_size		= sizeof(struct io_async_msghdr),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
-						IO_WQ_WORK_FS,
 	},
 	[IORING_OP_TIMEOUT] = {
 		.needs_async_data	= 1,
 		.async_size		= sizeof(struct io_timeout_data),
-		.work_flags		= IO_WQ_WORK_MM,
 	},
 	[IORING_OP_TIMEOUT_REMOVE] = {
 		/* used by timeout updates' prep() */
-		.work_flags		= IO_WQ_WORK_MM,
 	},
 	[IORING_OP_ACCEPT] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.pollin			= 1,
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_FILES,
 	},
 	[IORING_OP_ASYNC_CANCEL] = {},
 	[IORING_OP_LINK_TIMEOUT] = {
 		.needs_async_data	= 1,
 		.async_size		= sizeof(struct io_timeout_data),
-		.work_flags		= IO_WQ_WORK_MM,
 	},
 	[IORING_OP_CONNECT] = {
 		.needs_file		= 1,
@@ -940,25 +923,17 @@ static const struct io_op_def io_op_defs[] = {
 		.pollout		= 1,
 		.needs_async_data	= 1,
 		.async_size		= sizeof(struct io_async_connect),
-		.work_flags		= IO_WQ_WORK_MM,
 	},
 	[IORING_OP_FALLOCATE] = {
 		.needs_file		= 1,
-		.work_flags		= IO_WQ_WORK_BLKCG | IO_WQ_WORK_FSIZE,
 	},
 	[IORING_OP_OPENAT] = {
-		.work_flags		= IO_WQ_WORK_FILES | IO_WQ_WORK_BLKCG |
-						IO_WQ_WORK_FS | IO_WQ_WORK_MM,
 	},
 	[IORING_OP_CLOSE] = {
-		.work_flags		= IO_WQ_WORK_FILES | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_FILES_UPDATE] = {
-		.work_flags		= IO_WQ_WORK_FILES | IO_WQ_WORK_MM,
 	},
 	[IORING_OP_STATX] = {
-		.work_flags		= IO_WQ_WORK_FILES | IO_WQ_WORK_MM |
-						IO_WQ_WORK_FS | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_READ] = {
 		.needs_file		= 1,
@@ -967,7 +942,6 @@ static const struct io_op_def io_op_defs[] = {
 		.buffer_select		= 1,
 		.plug			= 1,
 		.async_size		= sizeof(struct io_async_rw),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_WRITE] = {
 		.needs_file		= 1,
@@ -975,42 +949,32 @@ static const struct io_op_def io_op_defs[] = {
 		.pollout		= 1,
 		.plug			= 1,
 		.async_size		= sizeof(struct io_async_rw),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
-						IO_WQ_WORK_FSIZE,
 	},
 	[IORING_OP_FADVISE] = {
 		.needs_file		= 1,
-		.work_flags		= IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_MADVISE] = {
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_SEND] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.pollout		= 1,
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_RECV] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.pollin			= 1,
 		.buffer_select		= 1,
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_OPENAT2] = {
-		.work_flags		= IO_WQ_WORK_FILES | IO_WQ_WORK_FS |
-						IO_WQ_WORK_BLKCG | IO_WQ_WORK_MM,
 	},
 	[IORING_OP_EPOLL_CTL] = {
 		.unbound_nonreg_file	= 1,
-		.work_flags		= IO_WQ_WORK_FILES,
 	},
 	[IORING_OP_SPLICE] = {
 		.needs_file		= 1,
 		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
-		.work_flags		= IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_PROVIDE_BUFFERS] = {},
 	[IORING_OP_REMOVE_BUFFERS] = {},
@@ -1023,12 +987,8 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 	},
 	[IORING_OP_RENAMEAT] = {
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_FILES |
-						IO_WQ_WORK_FS | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_UNLINKAT] = {
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_FILES |
-						IO_WQ_WORK_FS | IO_WQ_WORK_BLKCG,
 	},
 };
 
@@ -1126,8 +1086,7 @@ static bool io_match_task(struct io_kiocb *head,
 			continue;
 		if (req->file && req->file->f_op == &io_uring_fops)
 			return true;
-		if ((req->work.flags & IO_WQ_WORK_FILES) &&
-		    req->work.identity->files == files)
+		if (req->work.identity->files == files)
 			return true;
 	}
 	return false;
@@ -1204,20 +1163,15 @@ static int __io_sq_thread_acquire_mm(struct io_ring_ctx *ctx)
 static int __io_sq_thread_acquire_mm_files(struct io_ring_ctx *ctx,
 					   struct io_kiocb *req)
 {
-	const struct io_op_def *def = &io_op_defs[req->opcode];
 	int ret;
 
-	if (def->work_flags & IO_WQ_WORK_MM) {
-		ret = __io_sq_thread_acquire_mm(ctx);
-		if (unlikely(ret))
-			return ret;
-	}
+	ret = __io_sq_thread_acquire_mm(ctx);
+	if (unlikely(ret))
+		return ret;
 
-	if (def->needs_file || (def->work_flags & IO_WQ_WORK_FILES)) {
-		ret = __io_sq_thread_acquire_files(ctx);
-		if (unlikely(ret))
-			return ret;
-	}
+	ret = __io_sq_thread_acquire_files(ctx);
+	if (unlikely(ret))
+		return ret;
 
 	return 0;
 }
@@ -1401,28 +1355,6 @@ static void io_req_clean_work(struct io_kiocb *req)
 	if (!(req->flags & REQ_F_WORK_INITIALIZED))
 		return;
 
-	if (req->work.flags & IO_WQ_WORK_MM)
-		mmdrop(req->work.identity->mm);
-#ifdef CONFIG_BLK_CGROUP
-	if (req->work.flags & IO_WQ_WORK_BLKCG)
-		css_put(req->work.identity->blkcg_css);
-#endif
-	if (req->work.flags & IO_WQ_WORK_CREDS)
-		put_cred(req->work.identity->creds);
-	if (req->work.flags & IO_WQ_WORK_FS) {
-		struct fs_struct *fs = req->work.identity->fs;
-
-		spin_lock(&req->work.identity->fs->lock);
-		if (--fs->users)
-			fs = NULL;
-		spin_unlock(&req->work.identity->fs->lock);
-		if (fs)
-			free_fs_struct(fs);
-	}
-	if (req->work.flags & IO_WQ_WORK_FILES) {
-		put_files_struct(req->work.identity->files);
-		put_nsproxy(req->work.identity->nsproxy);
-	}
 	if (req->flags & REQ_F_INFLIGHT) {
 		struct io_ring_ctx *ctx = req->ctx;
 		struct io_uring_task *tctx = req->task->io_uring;
@@ -1437,56 +1369,9 @@ static void io_req_clean_work(struct io_kiocb *req)
 	}
 
 	req->flags &= ~REQ_F_WORK_INITIALIZED;
-	req->work.flags &= ~(IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG | IO_WQ_WORK_FS |
-			     IO_WQ_WORK_CREDS | IO_WQ_WORK_FILES);
 	io_put_identity(req->task->io_uring, req);
 }
 
-/*
- * Create a private copy of io_identity, since some fields don't match
- * the current context.
- */
-static bool io_identity_cow(struct io_kiocb *req)
-{
-	struct io_uring_task *tctx = current->io_uring;
-	const struct cred *creds = NULL;
-	struct io_identity *id;
-
-	if (req->work.flags & IO_WQ_WORK_CREDS)
-		creds = req->work.identity->creds;
-
-	id = kmemdup(req->work.identity, sizeof(*id), GFP_KERNEL);
-	if (unlikely(!id)) {
-		req->work.flags |= IO_WQ_WORK_CANCEL;
-		return false;
-	}
-
-	/*
-	 * We can safely just re-init the creds we copied  Either the field
-	 * matches the current one, or we haven't grabbed it yet. The only
-	 * exception is ->creds, through registered personalities, so handle
-	 * that one separately.
-	 */
-	io_init_identity(id);
-	if (creds)
-		id->creds = creds;
-
-	/* add one for this request */
-	refcount_inc(&id->count);
-
-	/* drop tctx and req identity references, if needed */
-	if (tctx->identity != &tctx->__identity &&
-	    refcount_dec_and_test(&tctx->identity->count))
-		kfree(tctx->identity);
-	if (req->work.identity != &tctx->__identity &&
-	    refcount_dec_and_test(&req->work.identity->count))
-		kfree(req->work.identity);
-
-	req->work.identity = id;
-	tctx->identity = id;
-	return true;
-}
-
 static void io_req_track_inflight(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
@@ -1501,79 +1386,6 @@ static void io_req_track_inflight(struct io_kiocb *req)
 	}
 }
 
-static bool io_grab_identity(struct io_kiocb *req)
-{
-	const struct io_op_def *def = &io_op_defs[req->opcode];
-	struct io_identity *id = req->work.identity;
-
-	if (def->work_flags & IO_WQ_WORK_FSIZE) {
-		if (id->fsize != rlimit(RLIMIT_FSIZE))
-			return false;
-		req->work.flags |= IO_WQ_WORK_FSIZE;
-	}
-#ifdef CONFIG_BLK_CGROUP
-	if (!(req->work.flags & IO_WQ_WORK_BLKCG) &&
-	    (def->work_flags & IO_WQ_WORK_BLKCG)) {
-		rcu_read_lock();
-		if (id->blkcg_css != blkcg_css()) {
-			rcu_read_unlock();
-			return false;
-		}
-		/*
-		 * This should be rare, either the cgroup is dying or the task
-		 * is moving cgroups. Just punt to root for the handful of ios.
-		 */
-		if (css_tryget_online(id->blkcg_css))
-			req->work.flags |= IO_WQ_WORK_BLKCG;
-		rcu_read_unlock();
-	}
-#endif
-	if (!(req->work.flags & IO_WQ_WORK_CREDS)) {
-		if (id->creds != current_cred())
-			return false;
-		get_cred(id->creds);
-		req->work.flags |= IO_WQ_WORK_CREDS;
-	}
-#ifdef CONFIG_AUDIT
-	if (!uid_eq(current->loginuid, id->loginuid) ||
-	    current->sessionid != id->sessionid)
-		return false;
-#endif
-	if (!(req->work.flags & IO_WQ_WORK_FS) &&
-	    (def->work_flags & IO_WQ_WORK_FS)) {
-		if (current->fs != id->fs)
-			return false;
-		spin_lock(&id->fs->lock);
-		if (!id->fs->in_exec) {
-			id->fs->users++;
-			req->work.flags |= IO_WQ_WORK_FS;
-		} else {
-			req->work.flags |= IO_WQ_WORK_CANCEL;
-		}
-		spin_unlock(&current->fs->lock);
-	}
-	if (!(req->work.flags & IO_WQ_WORK_FILES) &&
-	    (def->work_flags & IO_WQ_WORK_FILES) &&
-	    !(req->flags & REQ_F_NO_FILE_TABLE)) {
-		if (id->files != current->files ||
-		    id->nsproxy != current->nsproxy)
-			return false;
-		atomic_inc(&id->files->count);
-		get_nsproxy(id->nsproxy);
-		req->work.flags |= IO_WQ_WORK_FILES;
-		io_req_track_inflight(req);
-	}
-	if (!(req->work.flags & IO_WQ_WORK_MM) &&
-	    (def->work_flags & IO_WQ_WORK_MM)) {
-		if (id->mm != current->mm)
-			return false;
-		mmgrab(id->mm);
-		req->work.flags |= IO_WQ_WORK_MM;
-	}
-
-	return true;
-}
-
 static void io_prep_async_work(struct io_kiocb *req)
 {
 	const struct io_op_def *def = &io_op_defs[req->opcode];
@@ -1591,17 +1403,6 @@ static void io_prep_async_work(struct io_kiocb *req)
 		if (def->unbound_nonreg_file)
 			req->work.flags |= IO_WQ_WORK_UNBOUND;
 	}
-
-	/* if we fail grabbing identity, we must COW, regrab, and retry */
-	if (io_grab_identity(req))
-		return;
-
-	if (!io_identity_cow(req))
-		return;
-
-	/* can't fail at this point */
-	if (!io_grab_identity(req))
-		WARN_ON(1);
 }
 
 static void io_prep_async_link(struct io_kiocb *req)
@@ -6592,7 +6393,6 @@ static void __io_queue_sqe(struct io_kiocb *req)
 	int ret;
 
 	if ((req->flags & REQ_F_WORK_INITIALIZED) &&
-	    (req->work.flags & IO_WQ_WORK_CREDS) &&
 	    req->work.identity->creds != current_cred())
 		old_creds = override_creds(req->work.identity->creds);
 
@@ -6732,7 +6532,6 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		__io_req_init_async(req);
 		get_cred(iod->creds);
 		req->work.identity = iod;
-		req->work.flags |= IO_WQ_WORK_CREDS;
 	}
 
 	state = &ctx->submit_state;
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 12/18] io_uring: remove io_identity
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (10 preceding siblings ...)
  2021-02-19 17:10 ` [PATCH 11/18] io_uring: remove any grabbing of context Jens Axboe
@ 2021-02-19 17:10 ` Jens Axboe
  2021-02-19 17:10 ` [PATCH 13/18] io-wq: only remove worker from free_list, if it was there Jens Axboe
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:10 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

We are no longer grabbing state, so no need to maintain an IO identity
that we COW if there are changes.
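
The userspace flow is unchanged; for reference, a minimal liburing
sketch (the queue depth and the nop request are arbitrary) of
registering a personality and tagging a request with it:

#include <liburing.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        int id;

        if (io_uring_queue_init(8, &ring, 0) < 0)
                return 1;

        /* snapshots the current creds, returns a personality id */
        id = io_uring_register_personality(&ring);
        if (id < 0)
                return 1;

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_nop(sqe);
        /* issue this request with the registered creds */
        sqe->personality = id;
        io_uring_submit(&ring);

        io_uring_queue_exit(&ring);
        return 0;
}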

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c               |  26 ++++++++++
 fs/io-wq.h               |   2 +-
 fs/io_uring.c            | 104 ++++++++++-----------------------------
 include/linux/io_uring.h |  19 -------
 4 files changed, 52 insertions(+), 99 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 41042119bf0f..acc67ed3a52c 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -53,6 +53,9 @@ struct io_worker {
 	struct io_wq_work *cur_work;
 	spinlock_t lock;
 
+	const struct cred *cur_creds;
+	const struct cred *saved_creds;
+
 	struct rcu_head rcu;
 };
 
@@ -171,6 +174,11 @@ static void io_worker_exit(struct io_worker *worker)
 	worker->flags = 0;
 	preempt_enable();
 
+	if (worker->saved_creds) {
+		revert_creds(worker->saved_creds);
+		worker->cur_creds = worker->saved_creds = NULL;
+	}
+
 	raw_spin_lock_irq(&wqe->lock);
 	hlist_nulls_del_rcu(&worker->nulls_node);
 	list_del_rcu(&worker->all_list);
@@ -312,6 +320,10 @@ static void __io_worker_idle(struct io_wqe *wqe, struct io_worker *worker)
 		worker->flags |= IO_WORKER_F_FREE;
 		hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
 	}
+	if (worker->saved_creds) {
+		revert_creds(worker->saved_creds);
+		worker->cur_creds = worker->saved_creds = NULL;
+	}
 }
 
 static inline unsigned int io_get_work_hash(struct io_wq_work *work)
@@ -359,6 +371,18 @@ static void io_flush_signals(void)
 	}
 }
 
+static void io_wq_switch_creds(struct io_worker *worker,
+			       struct io_wq_work *work)
+{
+	const struct cred *old_creds = override_creds(work->creds);
+
+	worker->cur_creds = work->creds;
+	if (worker->saved_creds)
+		put_cred(old_creds); /* creds set by previous switch */
+	else
+		worker->saved_creds = old_creds;
+}
+
 static void io_assign_current_work(struct io_worker *worker,
 				   struct io_wq_work *work)
 {
@@ -407,6 +431,8 @@ static void io_worker_handle_work(struct io_worker *worker)
 			unsigned int hash = io_get_work_hash(work);
 
 			next_hashed = wq_next_work(work);
+			if (work->creds && worker->cur_creds != work->creds)
+				io_wq_switch_creds(worker, work);
 			wq->do_work(work);
 			io_assign_current_work(worker, NULL);
 
diff --git a/fs/io-wq.h b/fs/io-wq.h
index ab8029bf77b8..c187d54dc5cd 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -78,7 +78,7 @@ static inline void wq_list_del(struct io_wq_work_list *list,
 
 struct io_wq_work {
 	struct io_wq_work_node list;
-	struct io_identity *identity;
+	const struct cred *creds;
 	unsigned flags;
 };
 
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 872d2f1c6ea5..980c62762359 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1086,7 +1086,7 @@ static bool io_match_task(struct io_kiocb *head,
 			continue;
 		if (req->file && req->file->f_op == &io_uring_fops)
 			return true;
-		if (req->work.identity->files == files)
+		if (req->task->files == files)
 			return true;
 	}
 	return false;
@@ -1210,31 +1210,6 @@ static inline void req_set_fail_links(struct io_kiocb *req)
 		req->flags |= REQ_F_FAIL_LINK;
 }
 
-/*
- * None of these are dereferenced, they are simply used to check if any of
- * them have changed. If we're under current and check they are still the
- * same, we're fine to grab references to them for actual out-of-line use.
- */
-static void io_init_identity(struct io_identity *id)
-{
-	id->files = current->files;
-	id->mm = current->mm;
-#ifdef CONFIG_BLK_CGROUP
-	rcu_read_lock();
-	id->blkcg_css = blkcg_css();
-	rcu_read_unlock();
-#endif
-	id->creds = current_cred();
-	id->nsproxy = current->nsproxy;
-	id->fs = current->fs;
-	id->fsize = rlimit(RLIMIT_FSIZE);
-#ifdef CONFIG_AUDIT
-	id->loginuid = current->loginuid;
-	id->sessionid = current->sessionid;
-#endif
-	refcount_set(&id->count, 1);
-}
-
 static inline void __io_req_init_async(struct io_kiocb *req)
 {
 	memset(&req->work, 0, sizeof(req->work));
@@ -1247,17 +1222,10 @@ static inline void __io_req_init_async(struct io_kiocb *req)
  */
 static inline void io_req_init_async(struct io_kiocb *req)
 {
-	struct io_uring_task *tctx = current->io_uring;
-
 	if (req->flags & REQ_F_WORK_INITIALIZED)
 		return;
 
 	__io_req_init_async(req);
-
-	/* Grab a ref if this isn't our static identity */
-	req->work.identity = tctx->identity;
-	if (tctx->identity != &tctx->__identity)
-		refcount_inc(&req->work.identity->count);
 }
 
 static void io_ring_ctx_ref_free(struct percpu_ref *ref)
@@ -1342,19 +1310,15 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
 	return false;
 }
 
-static void io_put_identity(struct io_uring_task *tctx, struct io_kiocb *req)
-{
-	if (req->work.identity == &tctx->__identity)
-		return;
-	if (refcount_dec_and_test(&req->work.identity->count))
-		kfree(req->work.identity);
-}
-
 static void io_req_clean_work(struct io_kiocb *req)
 {
 	if (!(req->flags & REQ_F_WORK_INITIALIZED))
 		return;
 
+	if (req->work.creds) {
+		put_cred(req->work.creds);
+		req->work.creds = NULL;
+	}
 	if (req->flags & REQ_F_INFLIGHT) {
 		struct io_ring_ctx *ctx = req->ctx;
 		struct io_uring_task *tctx = req->task->io_uring;
@@ -1369,7 +1333,6 @@ static void io_req_clean_work(struct io_kiocb *req)
 	}
 
 	req->flags &= ~REQ_F_WORK_INITIALIZED;
-	io_put_identity(req->task->io_uring, req);
 }
 
 static void io_req_track_inflight(struct io_kiocb *req)
@@ -1403,6 +1366,8 @@ static void io_prep_async_work(struct io_kiocb *req)
 		if (def->unbound_nonreg_file)
 			req->work.flags |= IO_WQ_WORK_UNBOUND;
 	}
+	if (!req->work.creds)
+		req->work.creds = get_current_cred();
 }
 
 static void io_prep_async_link(struct io_kiocb *req)
@@ -6392,9 +6357,9 @@ static void __io_queue_sqe(struct io_kiocb *req)
 	const struct cred *old_creds = NULL;
 	int ret;
 
-	if ((req->flags & REQ_F_WORK_INITIALIZED) &&
-	    req->work.identity->creds != current_cred())
-		old_creds = override_creds(req->work.identity->creds);
+	if ((req->flags & REQ_F_WORK_INITIALIZED) && req->work.creds &&
+	    req->work.creds != current_cred())
+		old_creds = override_creds(req->work.creds);
 
 	ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
 
@@ -6522,16 +6487,11 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 
 	id = READ_ONCE(sqe->personality);
 	if (id) {
-		struct io_identity *iod;
-
-		iod = idr_find(&ctx->personality_idr, id);
-		if (unlikely(!iod))
-			return -EINVAL;
-		refcount_inc(&iod->count);
-
 		__io_req_init_async(req);
-		get_cred(iod->creds);
-		req->work.identity = iod;
+		req->work.creds = idr_find(&ctx->personality_idr, id);
+		if (unlikely(!req->work.creds))
+			return -EINVAL;
+		get_cred(req->work.creds);
 	}
 
 	state = &ctx->submit_state;
@@ -7968,8 +7928,6 @@ static int io_uring_alloc_task_context(struct task_struct *task,
 	tctx->last = NULL;
 	atomic_set(&tctx->in_idle, 0);
 	tctx->sqpoll = false;
-	io_init_identity(&tctx->__identity);
-	tctx->identity = &tctx->__identity;
 	task->io_uring = tctx;
 	spin_lock_init(&tctx->task_lock);
 	INIT_WQ_LIST(&tctx->task_list);
@@ -7983,9 +7941,6 @@ void __io_uring_free(struct task_struct *tsk)
 	struct io_uring_task *tctx = tsk->io_uring;
 
 	WARN_ON_ONCE(!xa_empty(&tctx->xa));
-	WARN_ON_ONCE(refcount_read(&tctx->identity->count) != 1);
-	if (tctx->identity != &tctx->__identity)
-		kfree(tctx->identity);
 	percpu_counter_destroy(&tctx->inflight);
 	kfree(tctx);
 	tsk->io_uring = NULL;
@@ -8625,13 +8580,11 @@ static int io_uring_fasync(int fd, struct file *file, int on)
 
 static int io_unregister_personality(struct io_ring_ctx *ctx, unsigned id)
 {
-	struct io_identity *iod;
+	const struct cred *creds;
 
-	iod = idr_remove(&ctx->personality_idr, id);
-	if (iod) {
-		put_cred(iod->creds);
-		if (refcount_dec_and_test(&iod->count))
-			kfree(iod);
+	creds = idr_remove(&ctx->personality_idr, id);
+	if (creds) {
+		put_cred(creds);
 		return 0;
 	}
 
@@ -9333,8 +9286,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 #ifdef CONFIG_PROC_FS
 static int io_uring_show_cred(int id, void *p, void *data)
 {
-	struct io_identity *iod = p;
-	const struct cred *cred = iod->creds;
+	const struct cred *cred = p;
 	struct seq_file *m = data;
 	struct user_namespace *uns = seq_user_ns(m);
 	struct group_info *gi;
@@ -9765,21 +9717,15 @@ static int io_probe(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args)
 
 static int io_register_personality(struct io_ring_ctx *ctx)
 {
-	struct io_identity *id;
+	const struct cred *creds;
 	int ret;
 
-	id = kmalloc(sizeof(*id), GFP_KERNEL);
-	if (unlikely(!id))
-		return -ENOMEM;
-
-	io_init_identity(id);
-	id->creds = get_current_cred();
+	creds = get_current_cred();
 
-	ret = idr_alloc_cyclic(&ctx->personality_idr, id, 1, USHRT_MAX, GFP_KERNEL);
-	if (ret < 0) {
-		put_cred(id->creds);
-		kfree(id);
-	}
+	ret = idr_alloc_cyclic(&ctx->personality_idr, (void *) creds, 1,
+				USHRT_MAX, GFP_KERNEL);
+	if (ret < 0)
+		put_cred(creds);
 	return ret;
 }
 
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 0e95398998b6..c48fcbdc2ea8 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -5,23 +5,6 @@
 #include <linux/sched.h>
 #include <linux/xarray.h>
 
-struct io_identity {
-	struct files_struct		*files;
-	struct mm_struct		*mm;
-#ifdef CONFIG_BLK_CGROUP
-	struct cgroup_subsys_state	*blkcg_css;
-#endif
-	const struct cred		*creds;
-	struct nsproxy			*nsproxy;
-	struct fs_struct		*fs;
-	unsigned long			fsize;
-#ifdef CONFIG_AUDIT
-	kuid_t				loginuid;
-	unsigned int			sessionid;
-#endif
-	refcount_t			count;
-};
-
 struct io_wq_work_node {
 	struct io_wq_work_node *next;
 };
@@ -38,8 +21,6 @@ struct io_uring_task {
 	struct file		*last;
 	void			*io_wq;
 	struct percpu_counter	inflight;
-	struct io_identity	__identity;
-	struct io_identity	*identity;
 	atomic_t		in_idle;
 	bool			sqpoll;
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 13/18] io-wq: only remove worker from free_list, if it was there
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (11 preceding siblings ...)
  2021-02-19 17:10 ` [PATCH 12/18] io_uring: remove io_identity Jens Axboe
@ 2021-02-19 17:10 ` Jens Axboe
  2021-02-19 17:10 ` [PATCH 14/18] io-wq: make io_wq_fork_thread() available to other users Jens Axboe
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:10 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

If the worker isn't on the free_list, don't attempt to delete it from
there. A worker is only on the free_list while IO_WORKER_F_FREE is set,
so sample the flags before clearing them and only do the hlist_nulls
removal if the free flag was set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index acc67ed3a52c..3a506f1c7838 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -155,6 +155,7 @@ static void io_worker_exit(struct io_worker *worker)
 {
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
+	unsigned flags;
 
 	/*
 	 * If we're not at zero, someone else is holding a brief reference
@@ -167,9 +168,11 @@ static void io_worker_exit(struct io_worker *worker)
 
 	preempt_disable();
 	current->flags &= ~PF_IO_WORKER;
-	if (worker->flags & IO_WORKER_F_RUNNING)
+	flags = worker->flags;
+	worker->flags = 0;
+	if (flags & IO_WORKER_F_RUNNING)
 		atomic_dec(&acct->nr_running);
-	if (!(worker->flags & IO_WORKER_F_BOUND))
+	if (!(flags & IO_WORKER_F_BOUND))
 		atomic_dec(&wqe->wq->user->processes);
 	worker->flags = 0;
 	preempt_enable();
@@ -180,7 +183,8 @@ static void io_worker_exit(struct io_worker *worker)
 	}
 
 	raw_spin_lock_irq(&wqe->lock);
-	hlist_nulls_del_rcu(&worker->nulls_node);
+	if (flags & IO_WORKER_F_FREE)
+		hlist_nulls_del_rcu(&worker->nulls_node);
 	list_del_rcu(&worker->all_list);
 	acct->nr_workers--;
 	raw_spin_unlock_irq(&wqe->lock);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 14/18] io-wq: make io_wq_fork_thread() available to other users
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (12 preceding siblings ...)
  2021-02-19 17:10 ` [PATCH 13/18] io-wq: only remove worker from free_list, if it was there Jens Axboe
@ 2021-02-19 17:10 ` Jens Axboe
  2021-02-19 17:10 ` [PATCH 15/18] io_uring: move SQPOLL thread io-wq forked worker Jens Axboe
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:10 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

We want to use this in io_uring proper as well, for the SQPOLL thread.
Rename it from fork_thread() to io_wq_fork_thread(), and make it
available through the io-wq.h header.
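
For reference, a sketch of the intended calling pattern, mirroring how
the SQPOLL conversion in the next patch uses the export (start_helper,
my_ctx, and my_thread_fn are made-up names):

static int start_helper(struct my_ctx *ctx)
{
        pid_t pid;

        /* the child inherits PF_IO_WORKER from us across the fork */
        current->flags |= PF_IO_WORKER;
        pid = io_wq_fork_thread(my_thread_fn, ctx);
        current->flags &= ~PF_IO_WORKER;

        return pid < 0 ? pid : 0;
}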

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c | 8 ++++----
 fs/io-wq.h | 2 ++
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 3a506f1c7838..b0d09f60200b 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -592,7 +592,7 @@ static int task_thread_unbound(void *data)
 	return task_thread(data, IO_WQ_ACCT_UNBOUND);
 }
 
-static pid_t fork_thread(int (*fn)(void *), void *arg)
+pid_t io_wq_fork_thread(int (*fn)(void *), void *arg)
 {
 	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
 				CLONE_IO|SIGCHLD;
@@ -622,9 +622,9 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 	spin_lock_init(&worker->lock);
 
 	if (index == IO_WQ_ACCT_BOUND)
-		pid = fork_thread(task_thread_bound, worker);
+		pid = io_wq_fork_thread(task_thread_bound, worker);
 	else
-		pid = fork_thread(task_thread_unbound, worker);
+		pid = io_wq_fork_thread(task_thread_unbound, worker);
 	if (pid < 0) {
 		kfree(worker);
 		return false;
@@ -1012,7 +1012,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	refcount_set(&wq->refs, 1);
 
 	current->flags |= PF_IO_WORKER;
-	ret = fork_thread(io_wq_manager, wq);
+	ret = io_wq_fork_thread(io_wq_manager, wq);
 	current->flags &= ~PF_IO_WORKER;
 	if (ret >= 0) {
 		wait_for_completion(&wq->done);
diff --git a/fs/io-wq.h b/fs/io-wq.h
index c187d54dc5cd..3c63a99d1629 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -106,6 +106,8 @@ void io_wq_destroy(struct io_wq *wq);
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work);
 void io_wq_hash_work(struct io_wq_work *work, void *val);
 
+pid_t io_wq_fork_thread(int (*fn)(void *), void *arg);
+
 static inline bool io_wq_is_hashed(struct io_wq_work *work)
 {
 	return work->flags & IO_WQ_WORK_HASHED;
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 15/18] io_uring: move SQPOLL thread io-wq forked worker
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (13 preceding siblings ...)
  2021-02-19 17:10 ` [PATCH 14/18] io-wq: make io_wq_fork_thread() available to other users Jens Axboe
@ 2021-02-19 17:10 ` Jens Axboe
  2021-02-19 17:10 ` [PATCH 16/18] Revert "proc: don't allow async path resolution of /proc/thread-self components" Jens Axboe
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:10 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

Don't use a kthread for SQPOLL; use a forked worker just like the io-wq
workers. With that done, we can drop the various context grabbing we do
for SQPOLL, as the thread already has everything it needs.
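
Userspace setup is unchanged; a minimal liburing sketch follows (queue
depth and idle time are arbitrary, and SQPOLL may still require
elevated privileges). With this patch, the poller shows up as an
iou-sqp-<pid> thread of the creating task rather than a kernel thread:

#include <liburing.h>
#include <string.h>

int main(void)
{
        struct io_uring_params p;
        struct io_uring ring;

        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL;
        p.sq_thread_idle = 2000;        /* ms before the thread idles off */

        if (io_uring_queue_init_params(8, &ring, &p) < 0)
                return 1;

        /* ... submit work; the sibling thread reaps the SQ ring ... */
        io_uring_queue_exit(&ring);
        return 0;
}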

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 446 ++++++++++++++++++--------------------------------
 1 file changed, 162 insertions(+), 284 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 980c62762359..239eacec3f3a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -57,7 +57,6 @@
 #include <linux/mman.h>
 #include <linux/percpu.h>
 #include <linux/slab.h>
-#include <linux/kthread.h>
 #include <linux/blkdev.h>
 #include <linux/bvec.h>
 #include <linux/net.h>
@@ -253,6 +252,11 @@ struct io_restriction {
 	bool registered;
 };
 
+enum {
+	IO_SQ_THREAD_SHOULD_STOP = 0,
+	IO_SQ_THREAD_SHOULD_PARK,
+};
+
 struct io_sq_data {
 	refcount_t		refs;
 	struct mutex		lock;
@@ -266,6 +270,12 @@ struct io_sq_data {
 	struct wait_queue_head	wait;
 
 	unsigned		sq_thread_idle;
+	int			sq_cpu;
+	pid_t			task_pid;
+
+	unsigned long		state;
+	struct completion	completion;
+	struct completion	exited;
 };
 
 #define IO_IOPOLL_BATCH			8
@@ -366,18 +376,13 @@ struct io_ring_ctx {
 	struct io_rings	*rings;
 
 	/*
-	 * For SQPOLL usage - we hold a reference to the parent task, so we
-	 * have access to the ->files
+	 * For SQPOLL usage
 	 */
 	struct task_struct	*sqo_task;
 
 	/* Only used for accounting purposes */
 	struct mm_struct	*mm_account;
 
-#ifdef CONFIG_BLK_CGROUP
-	struct cgroup_subsys_state	*sqo_blkcg_css;
-#endif
-
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
 	struct wait_queue_head	sqo_sq_wait;
@@ -397,13 +402,6 @@ struct io_ring_ctx {
 
 	struct user_struct	*user;
 
-	const struct cred	*creds;
-
-#ifdef CONFIG_AUDIT
-	kuid_t			loginuid;
-	unsigned int		sessionid;
-#endif
-
 	struct completion	ref_comp;
 	struct completion	sq_thread_comp;
 
@@ -995,6 +993,7 @@ static const struct io_op_def io_op_defs[] = {
 static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 					 struct task_struct *task,
 					 struct files_struct *files);
+static void io_uring_cancel_sqpoll(struct io_ring_ctx *ctx);
 static void destroy_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node);
 static struct fixed_rsrc_ref_node *alloc_fixed_rsrc_ref_node(
 			struct io_ring_ctx *ctx);
@@ -1092,118 +1091,6 @@ static bool io_match_task(struct io_kiocb *head,
 	return false;
 }
 
-static void io_sq_thread_drop_mm_files(void)
-{
-	struct files_struct *files = current->files;
-	struct mm_struct *mm = current->mm;
-
-	if (mm) {
-		kthread_unuse_mm(mm);
-		mmput(mm);
-		current->mm = NULL;
-	}
-	if (files) {
-		struct nsproxy *nsproxy = current->nsproxy;
-
-		task_lock(current);
-		current->files = NULL;
-		current->nsproxy = NULL;
-		task_unlock(current);
-		put_files_struct(files);
-		put_nsproxy(nsproxy);
-	}
-}
-
-static int __io_sq_thread_acquire_files(struct io_ring_ctx *ctx)
-{
-	if (!current->files) {
-		struct files_struct *files;
-		struct nsproxy *nsproxy;
-
-		task_lock(ctx->sqo_task);
-		files = ctx->sqo_task->files;
-		if (!files) {
-			task_unlock(ctx->sqo_task);
-			return -EOWNERDEAD;
-		}
-		atomic_inc(&files->count);
-		get_nsproxy(ctx->sqo_task->nsproxy);
-		nsproxy = ctx->sqo_task->nsproxy;
-		task_unlock(ctx->sqo_task);
-
-		task_lock(current);
-		current->files = files;
-		current->nsproxy = nsproxy;
-		task_unlock(current);
-	}
-	return 0;
-}
-
-static int __io_sq_thread_acquire_mm(struct io_ring_ctx *ctx)
-{
-	struct mm_struct *mm;
-
-	if (current->mm)
-		return 0;
-
-	task_lock(ctx->sqo_task);
-	mm = ctx->sqo_task->mm;
-	if (unlikely(!mm || !mmget_not_zero(mm)))
-		mm = NULL;
-	task_unlock(ctx->sqo_task);
-
-	if (mm) {
-		kthread_use_mm(mm);
-		return 0;
-	}
-
-	return -EFAULT;
-}
-
-static int __io_sq_thread_acquire_mm_files(struct io_ring_ctx *ctx,
-					   struct io_kiocb *req)
-{
-	int ret;
-
-	ret = __io_sq_thread_acquire_mm(ctx);
-	if (unlikely(ret))
-		return ret;
-
-	ret = __io_sq_thread_acquire_files(ctx);
-	if (unlikely(ret))
-		return ret;
-
-	return 0;
-}
-
-static inline int io_sq_thread_acquire_mm_files(struct io_ring_ctx *ctx,
-						struct io_kiocb *req)
-{
-	if (!(ctx->flags & IORING_SETUP_SQPOLL))
-		return 0;
-	return __io_sq_thread_acquire_mm_files(ctx, req);
-}
-
-static void io_sq_thread_associate_blkcg(struct io_ring_ctx *ctx,
-					 struct cgroup_subsys_state **cur_css)
-
-{
-#ifdef CONFIG_BLK_CGROUP
-	/* puts the old one when swapping */
-	if (*cur_css != ctx->sqo_blkcg_css) {
-		kthread_associate_blkcg(ctx->sqo_blkcg_css);
-		*cur_css = ctx->sqo_blkcg_css;
-	}
-#endif
-}
-
-static void io_sq_thread_unassociate_blkcg(void)
-{
-#ifdef CONFIG_BLK_CGROUP
-	kthread_associate_blkcg(NULL);
-#endif
-}
-
 static inline void req_set_fail_links(struct io_kiocb *req)
 {
 	if ((req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) == REQ_F_LINK)
@@ -2124,15 +2011,11 @@ static void __io_req_task_submit(struct io_kiocb *req)
 
 	/* ctx stays valid until unlock, even if we drop all ours ctx->refs */
 	mutex_lock(&ctx->uring_lock);
-	if (!ctx->sqo_dead && !(current->flags & PF_EXITING) &&
-	    !io_sq_thread_acquire_mm_files(ctx, req))
+	if (!ctx->sqo_dead && !(current->flags & PF_EXITING))
 		__io_queue_sqe(req);
 	else
 		__io_req_task_cancel(req, -EFAULT);
 	mutex_unlock(&ctx->uring_lock);
-
-	if (ctx->flags & IORING_SETUP_SQPOLL)
-		io_sq_thread_drop_mm_files();
 }
 
 static void io_req_task_submit(struct callback_head *cb)
@@ -2596,7 +2479,6 @@ static bool io_rw_reissue(struct io_kiocb *req)
 {
 #ifdef CONFIG_BLOCK
 	umode_t mode = file_inode(req->file)->i_mode;
-	int ret;
 
 	if (!S_ISBLK(mode) && !S_ISREG(mode))
 		return false;
@@ -2605,9 +2487,7 @@ static bool io_rw_reissue(struct io_kiocb *req)
 
 	lockdep_assert_held(&req->ctx->uring_lock);
 
-	ret = io_sq_thread_acquire_mm_files(req->ctx, req);
-
-	if (!ret && io_resubmit_prep(req)) {
+	if (io_resubmit_prep(req)) {
 		refcount_inc(&req->refs);
 		io_queue_async_work(req);
 		return true;
@@ -6475,9 +6355,6 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	if (unlikely(req->opcode >= IORING_OP_LAST))
 		return -EINVAL;
 
-	if (unlikely(io_sq_thread_acquire_mm_files(ctx, req)))
-		return -EFAULT;
-
 	if (unlikely(!io_check_restriction(ctx, req, sqe_flags)))
 		return -EACCES;
 
@@ -6793,41 +6670,81 @@ static void io_sqd_init_new(struct io_sq_data *sqd)
 	io_sqd_update_thread_idle(sqd);
 }
 
+static bool io_sq_thread_should_stop(struct io_sq_data *sqd)
+{
+	return test_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
+}
+
+static bool io_sq_thread_should_park(struct io_sq_data *sqd)
+{
+	return test_bit(IO_SQ_THREAD_SHOULD_PARK, &sqd->state);
+}
+
+static void io_sq_thread_parkme(struct io_sq_data *sqd)
+{
+	for (;;) {
+		/*
+		 * TASK_PARKED is a special state; we must serialize against
+		 * possible pending wakeups to avoid store-store collisions on
+		 * task->state.
+		 *
+		 * Such a collision might possibly result in the task state
+		 * changing from TASK_PARKED and us failing the
+		 * wait_task_inactive() in kthread_park().
+		 */
+		set_special_state(TASK_PARKED);
+		if (!test_bit(IO_SQ_THREAD_SHOULD_PARK, &sqd->state))
+			break;
+
+		/*
+		 * Thread is going to call schedule(), do not preempt it,
+		 * or the caller of kthread_park() may spend more time in
+		 * wait_task_inactive().
+		 */
+		preempt_disable();
+		complete(&sqd->completion);
+		schedule_preempt_disabled();
+		preempt_enable();
+	}
+	__set_current_state(TASK_RUNNING);
+}
+
 static int io_sq_thread(void *data)
 {
-	struct cgroup_subsys_state *cur_css = NULL;
-	struct files_struct *old_files = current->files;
-	struct nsproxy *old_nsproxy = current->nsproxy;
-	const struct cred *old_cred = NULL;
 	struct io_sq_data *sqd = data;
 	struct io_ring_ctx *ctx;
 	unsigned long timeout = 0;
+	char buf[TASK_COMM_LEN];
 	DEFINE_WAIT(wait);
 
-	task_lock(current);
-	current->files = NULL;
-	current->nsproxy = NULL;
-	task_unlock(current);
+	sprintf(buf, "iou-sqp-%d", sqd->task_pid);
+	set_task_comm(current, buf);
+	sqd->thread = current;
+	current->pf_io_worker = NULL;
+
+	if (sqd->sq_cpu != -1)
+		set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
+	else
+		set_cpus_allowed_ptr(current, cpu_online_mask);
+	current->flags |= PF_NO_SETAFFINITY;
+
+	complete(&sqd->completion);
 
-	while (!kthread_should_stop()) {
+	while (!io_sq_thread_should_stop(sqd)) {
 		int ret;
 		bool cap_entries, sqt_spin, needs_sched;
 
 		/*
 		 * Any changes to the sqd lists are synchronized through the
-		 * kthread parking. This synchronizes the thread vs users,
+		 * thread parking. This synchronizes the thread vs users,
 		 * the users are synchronized on the sqd->ctx_lock.
 		 */
-		if (kthread_should_park()) {
-			kthread_parkme();
-			/*
-			 * When sq thread is unparked, in case the previous park operation
-			 * comes from io_put_sq_data(), which means that sq thread is going
-			 * to be stopped, so here needs to have a check.
-			 */
-			if (kthread_should_stop())
-				break;
+		if (io_sq_thread_should_park(sqd)) {
+			io_sq_thread_parkme(sqd);
+			continue;
 		}
+		if (fatal_signal_pending(current))
+			break;
 
 		if (unlikely(!list_empty(&sqd->ctx_new_list))) {
 			io_sqd_init_new(sqd);
@@ -6837,27 +6754,13 @@ static int io_sq_thread(void *data)
 		sqt_spin = false;
 		cap_entries = !list_is_singular(&sqd->ctx_list);
 		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
-			if (current->cred != ctx->creds) {
-				if (old_cred)
-					revert_creds(old_cred);
-				old_cred = override_creds(ctx->creds);
-			}
-			io_sq_thread_associate_blkcg(ctx, &cur_css);
-#ifdef CONFIG_AUDIT
-			current->loginuid = ctx->loginuid;
-			current->sessionid = ctx->sessionid;
-#endif
-
 			ret = __io_sq_thread(ctx, cap_entries);
 			if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list)))
 				sqt_spin = true;
-
-			io_sq_thread_drop_mm_files();
 		}
 
 		if (sqt_spin || !time_after(jiffies, timeout)) {
 			io_run_task_work();
-			io_sq_thread_drop_mm_files();
 			cond_resched();
 			if (sqt_spin)
 				timeout = jiffies + sqd->sq_thread_idle;
@@ -6878,7 +6781,7 @@ static int io_sq_thread(void *data)
 			}
 		}
 
-		if (needs_sched && !kthread_should_park()) {
+		if (needs_sched && !io_sq_thread_should_park(sqd)) {
 			list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
 				io_ring_set_wakeup_flag(ctx);
 
@@ -6891,22 +6794,14 @@ static int io_sq_thread(void *data)
 		timeout = jiffies + sqd->sq_thread_idle;
 	}
 
-	io_run_task_work();
-	io_sq_thread_drop_mm_files();
+	list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
+		io_uring_cancel_sqpoll(ctx);
 
-	if (cur_css)
-		io_sq_thread_unassociate_blkcg();
-	if (old_cred)
-		revert_creds(old_cred);
-
-	task_lock(current);
-	current->files = old_files;
-	current->nsproxy = old_nsproxy;
-	task_unlock(current);
-
-	kthread_parkme();
+	io_run_task_work();
 
-	return 0;
+	complete_all(&sqd->completion);
+	complete(&sqd->exited);
+	do_exit(0);
 }
 
 struct io_wait_queue {
@@ -7214,20 +7109,78 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	return 0;
 }
 
+static void io_sq_thread_unpark(struct io_sq_data *sqd)
+	__releases(&sqd->lock)
+{
+	if (!sqd->thread)
+		return;
+	if (sqd->thread == current)
+		return;
+	clear_bit(IO_SQ_THREAD_SHOULD_PARK, &sqd->state);
+	wake_up_state(sqd->thread, TASK_PARKED);
+	mutex_unlock(&sqd->lock);
+}
+
+static void io_sq_thread_park(struct io_sq_data *sqd)
+	__acquires(&sqd->lock)
+{
+	if (!sqd->thread)
+		return;
+	if (sqd->thread == current)
+		return;
+	mutex_lock(&sqd->lock);
+	set_bit(IO_SQ_THREAD_SHOULD_PARK, &sqd->state);
+	wake_up_process(sqd->thread);
+	wait_for_completion(&sqd->completion);
+}
+
+static void io_sq_thread_stop(struct io_sq_data *sqd)
+{
+	if (!sqd->thread)
+		return;
+
+	set_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
+	WARN_ON_ONCE(test_bit(IO_SQ_THREAD_SHOULD_PARK, &sqd->state));
+	wake_up_process(sqd->thread);
+	wait_for_completion(&sqd->exited);
+}
+
 static void io_put_sq_data(struct io_sq_data *sqd)
 {
 	if (refcount_dec_and_test(&sqd->refs)) {
-		/*
-		 * The park is a bit of a work-around, without it we get
-		 * warning spews on shutdown with SQPOLL set and affinity
-		 * set to a single CPU.
-		 */
+		io_sq_thread_stop(sqd);
+		kfree(sqd);
+	}
+}
+
+static void io_sq_thread_finish(struct io_ring_ctx *ctx)
+{
+	struct io_sq_data *sqd = ctx->sq_data;
+
+	if (sqd) {
 		if (sqd->thread) {
-			kthread_park(sqd->thread);
-			kthread_stop(sqd->thread);
+			/*
+			 * We may arrive here from the error branch in
+			 * io_sq_offload_create() where the kthread is created
+			 * without being waked up, thus wake it up now to make
+			 * sure the wait will complete.
+			 */
+			wake_up_process(sqd->thread);
+			wait_for_completion(&ctx->sq_thread_comp);
+
+			io_sq_thread_park(sqd);
 		}
 
-		kfree(sqd);
+		mutex_lock(&sqd->ctx_lock);
+		list_del(&ctx->sqd_list);
+		io_sqd_update_thread_idle(sqd);
+		mutex_unlock(&sqd->ctx_lock);
+
+		if (sqd->thread)
+			io_sq_thread_unpark(sqd);
+
+		io_put_sq_data(sqd);
+		ctx->sq_data = NULL;
 	}
 }
 
@@ -7274,58 +7227,11 @@ static struct io_sq_data *io_get_sq_data(struct io_uring_params *p)
 	mutex_init(&sqd->ctx_lock);
 	mutex_init(&sqd->lock);
 	init_waitqueue_head(&sqd->wait);
+	init_completion(&sqd->completion);
+	init_completion(&sqd->exited);
 	return sqd;
 }
 
-static void io_sq_thread_unpark(struct io_sq_data *sqd)
-	__releases(&sqd->lock)
-{
-	if (!sqd->thread)
-		return;
-	kthread_unpark(sqd->thread);
-	mutex_unlock(&sqd->lock);
-}
-
-static void io_sq_thread_park(struct io_sq_data *sqd)
-	__acquires(&sqd->lock)
-{
-	if (!sqd->thread)
-		return;
-	mutex_lock(&sqd->lock);
-	kthread_park(sqd->thread);
-}
-
-static void io_sq_thread_stop(struct io_ring_ctx *ctx)
-{
-	struct io_sq_data *sqd = ctx->sq_data;
-
-	if (sqd) {
-		if (sqd->thread) {
-			/*
-			 * We may arrive here from the error branch in
-			 * io_sq_offload_create() where the kthread is created
-			 * without being waked up, thus wake it up now to make
-			 * sure the wait will complete.
-			 */
-			wake_up_process(sqd->thread);
-			wait_for_completion(&ctx->sq_thread_comp);
-
-			io_sq_thread_park(sqd);
-		}
-
-		mutex_lock(&sqd->ctx_lock);
-		list_del(&ctx->sqd_list);
-		io_sqd_update_thread_idle(sqd);
-		mutex_unlock(&sqd->ctx_lock);
-
-		if (sqd->thread)
-			io_sq_thread_unpark(sqd);
-
-		io_put_sq_data(sqd);
-		ctx->sq_data = NULL;
-	}
-}
-
 #if defined(CONFIG_UNIX)
 /*
  * Ensure the UNIX gc is aware of our file set, so we are certain that
@@ -8001,17 +7907,20 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 			if (!cpu_online(cpu))
 				goto err;
 
-			sqd->thread = kthread_create_on_cpu(io_sq_thread, sqd,
-							cpu, "io_uring-sq");
+			sqd->sq_cpu = cpu;
 		} else {
-			sqd->thread = kthread_create(io_sq_thread, sqd,
-							"io_uring-sq");
+			sqd->sq_cpu = -1;
 		}
-		if (IS_ERR(sqd->thread)) {
-			ret = PTR_ERR(sqd->thread);
+
+		sqd->task_pid = current->pid;
+		current->flags |= PF_IO_WORKER;
+		ret = io_wq_fork_thread(io_sq_thread, sqd);
+		current->flags &= ~PF_IO_WORKER;
+		if (ret < 0) {
 			sqd->thread = NULL;
 			goto err;
 		}
+		wait_for_completion(&sqd->completion);
 		ret = io_uring_alloc_task_context(sqd->thread, ctx);
 		if (ret)
 			goto err;
@@ -8023,7 +7932,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 	return 0;
 err:
-	io_sq_thread_stop(ctx);
+	io_sq_thread_finish(ctx);
 	return ret;
 }
 
@@ -8498,21 +8407,14 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	mutex_lock(&ctx->uring_lock);
 	mutex_unlock(&ctx->uring_lock);
 
-	io_sq_thread_stop(ctx);
+	io_sq_thread_finish(ctx);
 	io_sqe_buffers_unregister(ctx);
 
-	if (ctx->sqo_task) {
-		put_task_struct(ctx->sqo_task);
-		ctx->sqo_task = NULL;
+	if (ctx->mm_account) {
 		mmdrop(ctx->mm_account);
 		ctx->mm_account = NULL;
 	}
 
-#ifdef CONFIG_BLK_CGROUP
-	if (ctx->sqo_blkcg_css)
-		css_put(ctx->sqo_blkcg_css);
-#endif
-
 	mutex_lock(&ctx->uring_lock);
 	io_sqe_files_unregister(ctx);
 	mutex_unlock(&ctx->uring_lock);
@@ -8532,7 +8434,6 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 
 	percpu_ref_exit(&ctx->refs);
 	free_uid(ctx->user);
-	put_cred(ctx->creds);
 	io_req_caches_free(ctx, NULL);
 	kfree(ctx->cancel_hash);
 	kfree(ctx);
@@ -9544,12 +9445,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
 	ctx->compat = in_compat_syscall();
 	ctx->limit_mem = !capable(CAP_IPC_LOCK);
 	ctx->user = user;
-	ctx->creds = get_current_cred();
-#ifdef CONFIG_AUDIT
-	ctx->loginuid = current->loginuid;
-	ctx->sessionid = current->sessionid;
-#endif
-	ctx->sqo_task = get_task_struct(current);
+	ctx->sqo_task = current;
 
 	/*
 	 * This is just grabbed for accounting purposes. When a process exits,
@@ -9560,24 +9456,6 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
 	mmgrab(current->mm);
 	ctx->mm_account = current->mm;
 
-#ifdef CONFIG_BLK_CGROUP
-	/*
-	 * The sq thread will belong to the original cgroup it was inited in.
-	 * If the cgroup goes offline (e.g. disabling the io controller), then
-	 * issued bios will be associated with the closest cgroup later in the
-	 * block layer.
-	 */
-	rcu_read_lock();
-	ctx->sqo_blkcg_css = blkcg_css();
-	ret = css_tryget_online(ctx->sqo_blkcg_css);
-	rcu_read_unlock();
-	if (!ret) {
-		/* don't init against a dying cgroup, have the user try again */
-		ctx->sqo_blkcg_css = NULL;
-		ret = -ENODEV;
-		goto err;
-	}
-#endif
 	ret = io_allocate_scq_urings(ctx, p);
 	if (ret)
 		goto err;
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 16/18] Revert "proc: don't allow async path resolution of /proc/thread-self components"
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (14 preceding siblings ...)
  2021-02-19 17:10 ` [PATCH 15/18] io_uring: move SQPOLL thread io-wq forked worker Jens Axboe
@ 2021-02-19 17:10 ` Jens Axboe
  2021-02-19 17:10 ` [PATCH 17/18] Revert "proc: don't allow async path resolution of /proc/self components" Jens Axboe
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:10 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

This reverts commit 0d4370cfe36b7f1719123b621a4ec4d9c7a25f89.

No longer needed, as the io-wq worker threads have the right identity.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/proc/self.c        | 2 +-
 fs/proc/thread_self.c | 7 -------
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/fs/proc/self.c b/fs/proc/self.c
index a4012154e109..cc71ce3466dc 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -20,7 +20,7 @@ static const char *proc_self_get_link(struct dentry *dentry,
 	 * Not currently supported. Once we can inherit all of struct pid,
 	 * we can allow this.
 	 */
-	if (current->flags & PF_IO_WORKER)
+	if (current->flags & PF_KTHREAD)
 		return ERR_PTR(-EOPNOTSUPP);
 
 	if (!tgid)
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index d56681d86d28..a553273fbd41 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -17,13 +17,6 @@ static const char *proc_thread_self_get_link(struct dentry *dentry,
 	pid_t pid = task_pid_nr_ns(current, ns);
 	char *name;
 
-	/*
-	 * Not currently supported. Once we can inherit all of struct pid,
-	 * we can allow this.
-	 */
-	if (current->flags & PF_IO_WORKER)
-		return ERR_PTR(-EOPNOTSUPP);
-
 	if (!pid)
 		return ERR_PTR(-ENOENT);
 	name = kmalloc(10 + 6 + 10 + 1, dentry ? GFP_KERNEL : GFP_ATOMIC);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 17/18] Revert "proc: don't allow async path resolution of /proc/self components"
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (15 preceding siblings ...)
  2021-02-19 17:10 ` [PATCH 16/18] Revert "proc: don't allow async path resolution of /proc/thread-self components" Jens Axboe
@ 2021-02-19 17:10 ` Jens Axboe
  2021-02-19 17:10 ` [PATCH 18/18] net: remove cmsg restriction from io_uring based send/recvmsg calls Jens Axboe
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:10 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

This reverts commit 8d4c3e76e3be11a64df95ddee52e99092d42fc19.

No longer needed, as the io-wq worker threads have the right identity.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/proc/self.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/proc/self.c b/fs/proc/self.c
index cc71ce3466dc..72cd69bcaf4a 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -16,13 +16,6 @@ static const char *proc_self_get_link(struct dentry *dentry,
 	pid_t tgid = task_tgid_nr_ns(current, ns);
 	char *name;
 
-	/*
-	 * Not currently supported. Once we can inherit all of struct pid,
-	 * we can allow this.
-	 */
-	if (current->flags & PF_KTHREAD)
-		return ERR_PTR(-EOPNOTSUPP);
-
 	if (!tgid)
 		return ERR_PTR(-ENOENT);
 	/* max length of unsigned int in decimal + NULL term */
-- 
2.30.0
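
For context, with the check gone, a /proc/self lookup issued from the
io-wq async path now resolves against the submitting task — e.g. (a
hypothetical liburing-style sketch):

	/* Before this series, resolving /proc/self from an io-wq worker
	 * hit the -EOPNOTSUPP check removed above; now the worker is a
	 * thread of the original task, so the link resolves correctly. */
	io_uring_prep_openat(sqe, AT_FDCWD, "/proc/self/status", O_RDONLY, 0);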


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 18/18] net: remove cmsg restriction from io_uring based send/recvmsg calls
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (16 preceding siblings ...)
  2021-02-19 17:10 ` [PATCH 17/18] Revert "proc: don't allow async path resolution of /proc/self components" Jens Axboe
@ 2021-02-19 17:10 ` Jens Axboe
  2021-02-19 23:44 ` [PATCHSET RFC 0/18] Remove kthread usage from io_uring Stefan Metzmacher
  2021-02-21  5:04 ` Linus Torvalds
  19 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 17:10 UTC (permalink / raw)
  To: io-uring; +Cc: ebiederm, viro, torvalds, Jens Axboe

No need to restrict these anymore, as the worker threads are direct
clones of the original task. Hence we know for a fact that we can
support anything that the regular task can.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 net/socket.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 33e8b6c4e1d3..71fb2af118f5 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2408,10 +2408,6 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
 long __sys_sendmsg_sock(struct socket *sock, struct msghdr *msg,
 			unsigned int flags)
 {
-	/* disallow ancillary data requests from this path */
-	if (msg->msg_control || msg->msg_controllen)
-		return -EINVAL;
-
 	return ____sys_sendmsg(sock, msg, flags, NULL, 0);
 }
 
@@ -2620,12 +2616,6 @@ long __sys_recvmsg_sock(struct socket *sock, struct msghdr *msg,
 			struct user_msghdr __user *umsg,
 			struct sockaddr __user *uaddr, unsigned int flags)
 {
-	if (msg->msg_control || msg->msg_controllen) {
-		/* disallow ancillary data reqs unless cmsg is plain data */
-		if (!(sock->ops->flags & PROTO_CMSG_DATA_ONLY))
-			return -EINVAL;
-	}
-
 	return ____sys_recvmsg(sock, msg, umsg, uaddr, flags, 0);
 }
 
-- 
2.30.0
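
For context, this lifts the restriction for things like SCM_RIGHTS fd
passing over an io_uring sendmsg — a hypothetical liburing-style sketch
of a request that previously got -EINVAL on this path:

	struct msghdr msg = { 0 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct cmsghdr *cmsg;

	msg.msg_control = u.buf;	/* previously rejected outright */
	msg.msg_controllen = sizeof(u.buf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

	io_uring_prep_sendmsg(sqe, sockfd, &msg, 0);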


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/18] io_uring: remove the need for relying on an io-wq fallback worker
  2021-02-19 17:09 ` [PATCH 01/18] io_uring: remove the need for relying on an io-wq fallback worker Jens Axboe
@ 2021-02-19 20:25   ` Eric W. Biederman
  2021-02-19 20:37     ` Jens Axboe
  2021-02-22 13:46   ` Pavel Begunkov
  1 sibling, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2021-02-19 20:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, viro, torvalds

Jens Axboe <axboe@kernel.dk> writes:

> We hit this case when the task is exiting, and we need somewhere to
> do background cleanup of requests. Instead of relying on the io-wq
> task manager to do this work for us, just stuff it somewhere where
> we can safely run it ourselves directly.

Minor nits below.

Eric

>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/io-wq.c    | 12 ------------
>  fs/io-wq.h    |  2 --
>  fs/io_uring.c | 38 +++++++++++++++++++++++++++++++++++---
>  3 files changed, 35 insertions(+), 17 deletions(-)
>
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index c36bbcd823ce..800b299f9772 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -16,7 +16,6 @@
>  #include <linux/kthread.h>
>  #include <linux/rculist_nulls.h>
>  #include <linux/fs_struct.h>
> -#include <linux/task_work.h>
>  #include <linux/blk-cgroup.h>
>  #include <linux/audit.h>
>  #include <linux/cpu.h>
> @@ -775,9 +774,6 @@ static int io_wq_manager(void *data)
>  	complete(&wq->done);
>  
>  	while (!kthread_should_stop()) {
> -		if (current->task_works)
> -			task_work_run();
> -
>  		for_each_node(node) {
>  			struct io_wqe *wqe = wq->wqes[node];
>  			bool fork_worker[2] = { false, false };
> @@ -800,9 +796,6 @@ static int io_wq_manager(void *data)
>  		schedule_timeout(HZ);
>  	}
>  
> -	if (current->task_works)
> -		task_work_run();
> -
>  out:
>  	if (refcount_dec_and_test(&wq->refs)) {
>  		complete(&wq->done);
> @@ -1160,11 +1153,6 @@ void io_wq_destroy(struct io_wq *wq)
>  		__io_wq_destroy(wq);
>  }
>  
> -struct task_struct *io_wq_get_task(struct io_wq *wq)
> -{
> -	return wq->manager;
> -}
> -
>  static bool io_wq_worker_affinity(struct io_worker *worker, void *data)
>  {
>  	struct task_struct *task = worker->task;
> diff --git a/fs/io-wq.h b/fs/io-wq.h
> index 096f1021018e..a1610702f222 100644
> --- a/fs/io-wq.h
> +++ b/fs/io-wq.h
> @@ -124,8 +124,6 @@ typedef bool (work_cancel_fn)(struct io_wq_work *, void *);
>  enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
>  					void *data, bool cancel_all);
>  
> -struct task_struct *io_wq_get_task(struct io_wq *wq);
> -
>  #if defined(CONFIG_IO_WQ)
>  extern void io_wq_worker_sleeping(struct task_struct *);
>  extern void io_wq_worker_running(struct task_struct *);
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index d951acb95117..bbd1ec7aa9e9 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -455,6 +455,9 @@ struct io_ring_ctx {
>  
>  	struct io_restriction		restrictions;
>  
> +	/* exit task_work */
> +	struct callback_head		*exit_task_work;
> +
>  	/* Keep this last, we don't need it for the fast path */
>  	struct work_struct		exit_work;
>  };
> @@ -2313,11 +2316,14 @@ static int io_req_task_work_add(struct io_kiocb *req)
>  static void io_req_task_work_add_fallback(struct io_kiocb *req,
>  					  task_work_func_t cb)
>  {
> -	struct task_struct *tsk = io_wq_get_task(req->ctx->io_wq);
> +	struct io_ring_ctx *ctx = req->ctx;
> +	struct callback_head *head;
>  
>  	init_task_work(&req->task_work, cb);
> -	task_work_add(tsk, &req->task_work, TWA_NONE);
> -	wake_up_process(tsk);
> +	do {
> +		head = ctx->exit_task_work;
                       ^^^^^^^^^^^^^^^^^^^^
This feels like it should be READ_ONCE() to prevent torn reads.

You use READ_ONCE on this same variable below which really suggests
this should be a READ_ONCE.
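
For reference, the add-side loop with that fix applied would read like
this (a sketch of the suggestion, not the final committed form):

	do {
		head = READ_ONCE(ctx->exit_task_work);
		req->task_work.next = head;
	} while (cmpxchg(&ctx->exit_task_work, head, &req->task_work) != head);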


> +		req->task_work.next = head;
> +	} while (cmpxchg(&ctx->exit_task_work, head, &req->task_work) != head);
>  }
>  
>  static void __io_req_task_cancel(struct io_kiocb *req, int error)
> @@ -9258,6 +9264,30 @@ void __io_uring_task_cancel(void)
>  	io_uring_remove_task_files(tctx);
>  }
>  
> +static void io_run_ctx_fallback(struct io_ring_ctx *ctx)
> +{
> +	struct callback_head *work, *head, *next;
> +
> +	do {
> +		do {
> +			head = NULL;
> +			work = READ_ONCE(ctx->exit_task_work);
> +			if (!work)
> +				break;
> +		} while (cmpxchg(&ctx->exit_task_work, work, head) != work);
> +
> +		if (!work)
> +			break;

Why the double break on "!work"?  It seems like either the first should
be goto out, or only the second should be here.

> +
> +		do {
> +			next = work->next;
> +			work->func(work);
> +			work = next;
> +			cond_resched();
> +		} while (work);
> +	} while (1);
> +}
> +
>  static int io_uring_flush(struct file *file, void *data)
>  {
>  	struct io_uring_task *tctx = current->io_uring;
> @@ -9268,6 +9298,8 @@ static int io_uring_flush(struct file *file, void *data)
>  		io_req_caches_free(ctx, current);
>  	}
>  
> +	io_run_ctx_fallback(ctx);
> +
>  	if (!tctx)
>  		return 0;

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/18] io_uring: remove the need for relying on an io-wq fallback worker
  2021-02-19 20:25   ` Eric W. Biederman
@ 2021-02-19 20:37     ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 20:37 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: io-uring, viro, torvalds

On 2/19/21 1:25 PM, Eric W. Biederman wrote:
>> @@ -2313,11 +2316,14 @@ static int io_req_task_work_add(struct io_kiocb *req)
>>  static void io_req_task_work_add_fallback(struct io_kiocb *req,
>>  					  task_work_func_t cb)
>>  {
>> -	struct task_struct *tsk = io_wq_get_task(req->ctx->io_wq);
>> +	struct io_ring_ctx *ctx = req->ctx;
>> +	struct callback_head *head;
>>  
>>  	init_task_work(&req->task_work, cb);
>> -	task_work_add(tsk, &req->task_work, TWA_NONE);
>> -	wake_up_process(tsk);
>> +	do {
>> +		head = ctx->exit_task_work;
>                        ^^^^^^^^^^^^^^^^^^^^
> This feels like this should be READ_ONCE to prevent tearing reads.
> 
> You use READ_ONCE on this same variable below which really suggests
> this should be a READ_ONCE.

It should, added.

>> +		req->task_work.next = head;
>> +	} while (cmpxchg(&ctx->exit_task_work, head, &req->task_work) != head);
>>  }
>>  
>>  static void __io_req_task_cancel(struct io_kiocb *req, int error)
>> @@ -9258,6 +9264,30 @@ void __io_uring_task_cancel(void)
>>  	io_uring_remove_task_files(tctx);
>>  }
>>  
>> +static void io_run_ctx_fallback(struct io_ring_ctx *ctx)
>> +{
>> +	struct callback_head *work, *head, *next;
>> +
>> +	do {
>> +		do {
>> +			head = NULL;
>> +			work = READ_ONCE(ctx->exit_task_work);
>> +			if (!work)
>> +				break;
>> +		} while (cmpxchg(&ctx->exit_task_work, work, head) != work);
>> +
>> +		if (!work)
>> +			break;
> 
> Why the double break on "!work"?  It seems like either the first should
> be goto out, or only the second should be here.

Yes good point, the first one should go away. I've made that change,
thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 07/18] arch: setup PF_IO_WORKER threads like PF_KTHREAD
  2021-02-19 17:09 ` [PATCH 07/18] arch: setup PF_IO_WORKER threads like PF_KTHREAD Jens Axboe
@ 2021-02-19 22:21   ` Eric W. Biederman
  2021-02-19 23:26     ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2021-02-19 22:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, viro, torvalds, Christian Brauner

Jens Axboe <axboe@kernel.dk> writes:

> PF_IO_WORKER are kernel threads too, but they aren't PF_KTHREAD in the
> sense that we don't assign ->set_child_tid with our own structure. Just
> ensure that every arch sets up the PF_IO_WORKER threads like kthreads.

I think it is worth calling out that this is only for the arch
implementation of copy_thread.

This looks good for now.  But I am wondering if eventually we want to
refactor the copy_thread interface to more cleanly handle the
difference between tasks that only run in the kernel and userspace
tasks.

Eric

> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  arch/alpha/kernel/process.c      | 2 +-
>  arch/arc/kernel/process.c        | 2 +-
>  arch/arm/kernel/process.c        | 2 +-
>  arch/arm64/kernel/process.c      | 2 +-
>  arch/c6x/kernel/process.c        | 2 +-
>  arch/csky/kernel/process.c       | 2 +-
>  arch/h8300/kernel/process.c      | 2 +-
>  arch/hexagon/kernel/process.c    | 2 +-
>  arch/ia64/kernel/process.c       | 2 +-
>  arch/m68k/kernel/process.c       | 2 +-
>  arch/microblaze/kernel/process.c | 2 +-
>  arch/mips/kernel/process.c       | 2 +-
>  arch/nds32/kernel/process.c      | 2 +-
>  arch/nios2/kernel/process.c      | 2 +-
>  arch/openrisc/kernel/process.c   | 2 +-
>  arch/riscv/kernel/process.c      | 2 +-
>  arch/s390/kernel/process.c       | 2 +-
>  arch/sh/kernel/process_32.c      | 2 +-
>  arch/sparc/kernel/process_32.c   | 2 +-
>  arch/sparc/kernel/process_64.c   | 2 +-
>  arch/um/kernel/process.c         | 2 +-
>  arch/x86/kernel/process.c        | 2 +-
>  arch/xtensa/kernel/process.c     | 2 +-
>  23 files changed, 23 insertions(+), 23 deletions(-)
>
> diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
> index 6c71554206cc..5112ab996394 100644
> --- a/arch/alpha/kernel/process.c
> +++ b/arch/alpha/kernel/process.c
> @@ -249,7 +249,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
>  	childti->pcb.ksp = (unsigned long) childstack;
>  	childti->pcb.flags = 1;	/* set FEN, clear everything else */
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		/* kernel thread */
>  		memset(childstack, 0,
>  			sizeof(struct switch_stack) + sizeof(struct pt_regs));
> diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
> index 37f724ad5e39..d838d0d57696 100644
> --- a/arch/arc/kernel/process.c
> +++ b/arch/arc/kernel/process.c
> @@ -191,7 +191,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
>  	childksp[0] = 0;			/* fp */
>  	childksp[1] = (unsigned long)ret_from_fork; /* blink */
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		memset(c_regs, 0, sizeof(struct pt_regs));
>  
>  		c_callee->r13 = kthread_arg;
> diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
> index ee3aee69e444..5199a2bb4111 100644
> --- a/arch/arm/kernel/process.c
> +++ b/arch/arm/kernel/process.c
> @@ -243,7 +243,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
>  	thread->cpu_domain = get_domain();
>  #endif
>  
> -	if (likely(!(p->flags & PF_KTHREAD))) {
> +	if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
>  		*childregs = *current_pt_regs();
>  		childregs->ARM_r0 = 0;
>  		if (stack_start)
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 6616486a58fe..05f001b401a5 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -398,7 +398,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
>  
>  	ptrauth_thread_init_kernel(p);
>  
> -	if (likely(!(p->flags & PF_KTHREAD))) {
> +	if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
>  		*childregs = *current_pt_regs();
>  		childregs->regs[0] = 0;
>  
> diff --git a/arch/c6x/kernel/process.c b/arch/c6x/kernel/process.c
> index 9f4fd6a40a10..403ad4ce3db0 100644
> --- a/arch/c6x/kernel/process.c
> +++ b/arch/c6x/kernel/process.c
> @@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
>  
>  	childregs = task_pt_regs(p);
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		/* case of  __kernel_thread: we return to supervisor space */
>  		memset(childregs, 0, sizeof(struct pt_regs));
>  		childregs->sp = (unsigned long)(childregs + 1);
> diff --git a/arch/csky/kernel/process.c b/arch/csky/kernel/process.c
> index 69af6bc87e64..3d0ca22cd0e2 100644
> --- a/arch/csky/kernel/process.c
> +++ b/arch/csky/kernel/process.c
> @@ -49,7 +49,7 @@ int copy_thread(unsigned long clone_flags,
>  	/* setup thread.sp for switch_to !!! */
>  	p->thread.sp = (unsigned long)childstack;
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		memset(childregs, 0, sizeof(struct pt_regs));
>  		childstack->r15 = (unsigned long) ret_from_kernel_thread;
>  		childstack->r10 = kthread_arg;
> diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
> index bc1364db58fe..46b1342ce515 100644
> --- a/arch/h8300/kernel/process.c
> +++ b/arch/h8300/kernel/process.c
> @@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
>  
>  	childregs = (struct pt_regs *) (THREAD_SIZE + task_stack_page(p)) - 1;
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		memset(childregs, 0, sizeof(struct pt_regs));
>  		childregs->retpc = (unsigned long) ret_from_kernel_thread;
>  		childregs->er4 = topstk; /* arg */
> diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
> index 6a980cba7b29..c61165c99ae0 100644
> --- a/arch/hexagon/kernel/process.c
> +++ b/arch/hexagon/kernel/process.c
> @@ -73,7 +73,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
>  						    sizeof(*ss));
>  	ss->lr = (unsigned long)ret_from_fork;
>  	p->thread.switch_sp = ss;
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		memset(childregs, 0, sizeof(struct pt_regs));
>  		/* r24 <- fn, r25 <- arg */
>  		ss->r24 = usp;
> diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
> index 4ebbfa076a26..7e1a1525e202 100644
> --- a/arch/ia64/kernel/process.c
> +++ b/arch/ia64/kernel/process.c
> @@ -338,7 +338,7 @@ copy_thread(unsigned long clone_flags, unsigned long user_stack_base,
>  
>  	ia64_drop_fpu(p);	/* don't pick up stale state from a CPU's fph */
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		if (unlikely(!user_stack_base)) {
>  			/* fork_idle() called us */
>  			return 0;
> diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
> index 08359a6e058f..da83cc83e791 100644
> --- a/arch/m68k/kernel/process.c
> +++ b/arch/m68k/kernel/process.c
> @@ -157,7 +157,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
>  	 */
>  	p->thread.fs = get_fs().seg;
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		/* kernel thread */
>  		memset(frame, 0, sizeof(struct fork_frame));
>  		frame->regs.sr = PS_S;
> diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
> index 657c2beb665e..62aa237180b6 100644
> --- a/arch/microblaze/kernel/process.c
> +++ b/arch/microblaze/kernel/process.c
> @@ -59,7 +59,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
>  	struct pt_regs *childregs = task_pt_regs(p);
>  	struct thread_info *ti = task_thread_info(p);
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		/* if we're creating a new kernel thread then just zeroing all
>  		 * the registers. That's OK for a brand new thread.*/
>  		memset(childregs, 0, sizeof(struct pt_regs));
> diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
> index d7e288f3a1e7..f69434015be7 100644
> --- a/arch/mips/kernel/process.c
> +++ b/arch/mips/kernel/process.c
> @@ -135,7 +135,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
>  	/*  Put the stack after the struct pt_regs.  */
>  	childksp = (unsigned long) childregs;
>  	p->thread.cp0_status = (read_c0_status() & ~(ST0_CU2|ST0_CU1)) | ST0_KERNEL_CUMASK;
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		/* kernel thread */
>  		unsigned long status = p->thread.cp0_status;
>  		memset(childregs, 0, sizeof(struct pt_regs));
> diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c
> index e01ad5d17224..c1327e552ec6 100644
> --- a/arch/nds32/kernel/process.c
> +++ b/arch/nds32/kernel/process.c
> @@ -156,7 +156,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
>  
>  	memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		memset(childregs, 0, sizeof(struct pt_regs));
>  		/* kernel thread fn */
>  		p->thread.cpu_context.r6 = stack_start;
> diff --git a/arch/nios2/kernel/process.c b/arch/nios2/kernel/process.c
> index 50b4eb19a6cc..c5f916ca6845 100644
> --- a/arch/nios2/kernel/process.c
> +++ b/arch/nios2/kernel/process.c
> @@ -109,7 +109,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
>  	struct switch_stack *childstack =
>  		((struct switch_stack *)childregs) - 1;
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		memset(childstack, 0,
>  			sizeof(struct switch_stack) + sizeof(struct pt_regs));
>  
> diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
> index 3c98728cce24..83fba4ee4453 100644
> --- a/arch/openrisc/kernel/process.c
> +++ b/arch/openrisc/kernel/process.c
> @@ -167,7 +167,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
>  	sp -= sizeof(struct pt_regs);
>  	kregs = (struct pt_regs *)sp;
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		memset(kregs, 0, sizeof(struct pt_regs));
>  		kregs->gpr[20] = usp; /* fn, kernel thread */
>  		kregs->gpr[22] = arg;
> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
> index dd5f985b1f40..06d326caa7d8 100644
> --- a/arch/riscv/kernel/process.c
> +++ b/arch/riscv/kernel/process.c
> @@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
>  	struct pt_regs *childregs = task_pt_regs(p);
>  
>  	/* p->thread holds context to be restored by __switch_to() */
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		/* Kernel thread */
>  		memset(childregs, 0, sizeof(struct pt_regs));
>  		childregs->gp = gp_in_global;
> diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
> index bc3ca54edfb4..ac7a06d5e230 100644
> --- a/arch/s390/kernel/process.c
> +++ b/arch/s390/kernel/process.c
> @@ -114,7 +114,7 @@ int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
>  	frame->sf.gprs[9] = (unsigned long) frame;
>  
>  	/* Store access registers to kernel stack of new process. */
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		/* kernel thread */
>  		memset(&frame->childregs, 0, sizeof(struct pt_regs));
>  		frame->childregs.psw.mask = PSW_KERNEL_BITS | PSW_MASK_DAT |
> diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
> index 80a5d1c66a51..1aa508eb0823 100644
> --- a/arch/sh/kernel/process_32.c
> +++ b/arch/sh/kernel/process_32.c
> @@ -114,7 +114,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
>  
>  	childregs = task_pt_regs(p);
>  	p->thread.sp = (unsigned long) childregs;
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		memset(childregs, 0, sizeof(struct pt_regs));
>  		p->thread.pc = (unsigned long) ret_from_kernel_thread;
>  		childregs->regs[4] = arg;
> diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
> index a02363735915..0f9c606e1e78 100644
> --- a/arch/sparc/kernel/process_32.c
> +++ b/arch/sparc/kernel/process_32.c
> @@ -309,7 +309,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>  	ti->ksp = (unsigned long) new_stack;
>  	p->thread.kregs = childregs;
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		extern int nwindows;
>  		unsigned long psr;
>  		memset(new_stack, 0, STACKFRAME_SZ + TRACEREG_SZ);
> diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
> index 6f8c7822fc06..7afd0a859a78 100644
> --- a/arch/sparc/kernel/process_64.c
> +++ b/arch/sparc/kernel/process_64.c
> @@ -597,7 +597,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>  				       sizeof(struct sparc_stackf));
>  	t->fpsaved[0] = 0;
>  
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		memset(child_trap_frame, 0, child_stack_sz);
>  		__thread_flag_byte_ptr(t)[TI_FLAG_BYTE_CWP] = 
>  			(current_pt_regs()->tstate + 1) & TSTATE_CWP;
> diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
> index 81d508daf67c..c5011064b5dd 100644
> --- a/arch/um/kernel/process.c
> +++ b/arch/um/kernel/process.c
> @@ -157,7 +157,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
>  		unsigned long arg, struct task_struct * p, unsigned long tls)
>  {
>  	void (*handler)(void);
> -	int kthread = current->flags & PF_KTHREAD;
> +	int kthread = current->flags & (PF_KTHREAD | PF_IO_WORKER);
>  	int ret = 0;
>  
>  	p->thread = (struct thread_struct) INIT_THREAD;
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 145a7ac0c19a..9c214d7085a4 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -161,7 +161,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>  #endif
>  
>  	/* Kernel thread ? */
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		memset(childregs, 0, sizeof(struct pt_regs));
>  		kthread_frame_init(frame, sp, arg);
>  		return 0;
> diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
> index 397a7de56377..9534ef515d74 100644
> --- a/arch/xtensa/kernel/process.c
> +++ b/arch/xtensa/kernel/process.c
> @@ -217,7 +217,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
>  
>  	p->thread.sp = (unsigned long)childregs;
>  
> -	if (!(p->flags & PF_KTHREAD)) {
> +	if (!(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		struct pt_regs *regs = current_pt_regs();
>  		unsigned long usp = usp_thread_fn ?
>  			usp_thread_fn : regs->areg[1];

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 07/18] arch: setup PF_IO_WORKER threads like PF_KTHREAD
  2021-02-19 22:21   ` Eric W. Biederman
@ 2021-02-19 23:26     ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 23:26 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: io-uring, viro, torvalds, Christian Brauner

On 2/19/21 3:21 PM, Eric W. Biederman wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> PF_IO_WORKER are kernel threads too, but they aren't PF_KTHREAD in the
>> sense that we don't assign ->set_child_tid with our own structure. Just
>> ensure that every arch sets up the PF_IO_WORKER threads like kthreads.
> 
> I think it is worth calling out that this is only for the arch
> implementation of copy_thread.

True, that would make it clearer. I'll add that to the commit message.

> This looks good for now.  But I am wondering if eventually we want to
> refactor the copy_thread interface to more cleanly handle the
> difference between tasks that only run in the kernel and userspace
> tasks.

Probably would be a worthwhile future cleanup of the code in general.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCHSET RFC 0/18] Remove kthread usage from io_uring
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (17 preceding siblings ...)
  2021-02-19 17:10 ` [PATCH 18/18] net: remove cmsg restriction from io_uring based send/recvmsg calls Jens Axboe
@ 2021-02-19 23:44 ` Stefan Metzmacher
  2021-02-19 23:51   ` Jens Axboe
  2021-02-21  5:04 ` Linus Torvalds
  19 siblings, 1 reply; 49+ messages in thread
From: Stefan Metzmacher @ 2021-02-19 23:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: ebiederm, viro, torvalds


Hi Jens,

> tldr - instead of using kthreads that assume the identity of the original
> tasks for work that needs offloading to a thread, setup these workers as
> threads of the original task.
> 
> Here's a first cut of moving away from kthreads for io_uring. It passes
> the test suite and various other testing I've done with it. It also
> performs better, both for workloads actually using the async offload, but
> also in general as we slim down structures and kill code from the hot path.
> 
> The series is roughly split into these parts:
> 
> - Patches 1-6, io_uring/io-wq prep patches
> - Patches 7-8, Minor arch/kernel support
> - Patches 9-15, switch from kthread to thread, remove state only needed
>   for kthreads
> - Patches 16-18, remove now dead/unneeded PF_IO_WORKER restrictions
> 
> Comments/suggestions welcome. I'm pretty happy with the series at this
> point, and particularly with how we end up cutting a lot of code while
> also unifying how sync vs async is presented.

Thanks a lot! I was thinking hard about how to make all this easier to understand
and perform better in order to have the whole context available natively for
sendmsg/recvmsg, but also for the upcoming uring_cmd().

And now all that code magically disappeared completely, wonderful :-)

metze



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCHSET RFC 0/18] Remove kthread usage from io_uring
  2021-02-19 23:44 ` [PATCHSET RFC 0/18] Remove kthread usage from io_uring Stefan Metzmacher
@ 2021-02-19 23:51   ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-19 23:51 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring; +Cc: ebiederm, viro, torvalds

On 2/19/21 4:44 PM, Stefan Metzmacher wrote:
> Hi Jens,
> 
>> tldr - instead of using kthreads that assume the identity of the original
>> tasks for work that needs offloading to a thread, setup these workers as
>> threads of the original task.
>>
>> Here's a first cut of moving away from kthreads for io_uring. It passes
>> the test suite and various other testing I've done with it. It also
>> performs better, both for workloads actually using the async offload, but
>> also in general as we slim down structures and kill code from the hot path.
>>
>> The series is roughly split into these parts:
>>
>> - Patches 1-6, io_uring/io-wq prep patches
>> - Patches 7-8, Minor arch/kernel support
>> - Patches 9-15, switch from kthread to thread, remove state only needed
>>   for kthreads
>> - Patches 16-18, remove now dead/unneeded PF_IO_WORKER restrictions
>>
>> Comments/suggestions welcome. I'm pretty happy with the series at this
>> point, and particularly with how we end up cutting a lot of code while
>> also unifying how sync vs async is presented.
> 
> Thanks a lot! I was thinking hard about how to make all this easier to
> understand and perform better in order to have the whole context
> available natively for sendmsg/recvmsg, but also for the upcoming
> uring_cmd().
> 
> And now all that code magically disappeared completely, wonderful :-)

Glad to hear you like the approach! Yes, this will help readability,
performance, and maintainability. Pretty much a win all
around imho.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] io_uring: tie async worker side to the task context
  2021-02-19 17:09 ` [PATCH 05/18] io_uring: tie async worker side to the task context Jens Axboe
@ 2021-02-20  8:11   ` Hao Xu
  2021-02-20 14:38     ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Hao Xu @ 2021-02-20  8:11 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: ebiederm, viro, torvalds

On 2/20/21 1:09 AM, Jens Axboe wrote:
> Move it outside of the io_ring_ctx, and tie it to the io_uring task
> context.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>   fs/io_uring.c            | 84 ++++++++++++++++------------------------
>   include/linux/io_uring.h |  1 +
>   2 files changed, 35 insertions(+), 50 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 0eeb2a1596c2..6ad3e1df6504 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -365,9 +365,6 @@ struct io_ring_ctx {
>   
>   	struct io_rings	*rings;
>   
> -	/* IO offload */
> -	struct io_wq		*io_wq;
> -
>   	/*
>   	 * For SQPOLL usage - we hold a reference to the parent task, so we
>   	 * have access to the ->files
> @@ -1619,10 +1616,11 @@ static struct io_kiocb *__io_queue_async_work(struct io_kiocb *req)
>   {
>   	struct io_ring_ctx *ctx = req->ctx;
>   	struct io_kiocb *link = io_prep_linked_timeout(req);
> +	struct io_uring_task *tctx = req->task->io_uring;
>   
>   	trace_io_uring_queue_async_work(ctx, io_wq_is_hashed(&req->work), req,
>   					&req->work, req->flags);
> -	io_wq_enqueue(ctx->io_wq, &req->work);
> +	io_wq_enqueue(tctx->io_wq, &req->work);
>   	return link;
>   }
>   
> @@ -5969,12 +5967,15 @@ static bool io_cancel_cb(struct io_wq_work *work, void *data)
>   	return req->user_data == (unsigned long) data;
>   }
>   
> -static int io_async_cancel_one(struct io_ring_ctx *ctx, void *sqe_addr)
> +static int io_async_cancel_one(struct io_uring_task *tctx, void *sqe_addr)
>   {
>   	enum io_wq_cancel cancel_ret;
>   	int ret = 0;
>   
> -	cancel_ret = io_wq_cancel_cb(ctx->io_wq, io_cancel_cb, sqe_addr, false);
> +	if (!tctx->io_wq)
> +		return -ENOENT;
> +
> +	cancel_ret = io_wq_cancel_cb(tctx->io_wq, io_cancel_cb, sqe_addr, false);
>   	switch (cancel_ret) {
>   	case IO_WQ_CANCEL_OK:
>   		ret = 0;
> @@ -5997,7 +5998,8 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx,
>   	unsigned long flags;
>   	int ret;
>   
> -	ret = io_async_cancel_one(ctx, (void *) (unsigned long) sqe_addr);
> +	ret = io_async_cancel_one(req->task->io_uring,
> +					(void *) (unsigned long) sqe_addr);
>   	if (ret != -ENOENT) {
>   		spin_lock_irqsave(&ctx->completion_lock, flags);
>   		goto done;
> @@ -7562,16 +7564,6 @@ static void io_sq_thread_stop(struct io_ring_ctx *ctx)
>   	}
>   }
>   
> -static void io_finish_async(struct io_ring_ctx *ctx)
> -{
> -	io_sq_thread_stop(ctx);
> -
> -	if (ctx->io_wq) {
> -		io_wq_destroy(ctx->io_wq);
> -		ctx->io_wq = NULL;
> -	}
> -}
> -
>   #if defined(CONFIG_UNIX)
>   /*
>    * Ensure the UNIX gc is aware of our file set, so we are certain that
> @@ -8130,11 +8122,10 @@ static struct io_wq_work *io_free_work(struct io_wq_work *work)
>   	return req ? &req->work : NULL;
>   }
>   
> -static int io_init_wq_offload(struct io_ring_ctx *ctx)
> +static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx)
>   {
>   	struct io_wq_data data;
>   	unsigned int concurrency;
> -	int ret = 0;
>   
>   	data.user = ctx->user;
>   	data.free_work = io_free_work;
> @@ -8143,16 +8134,11 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx)
>   	/* Do QD, or 4 * CPUS, whatever is smallest */
>   	concurrency = min(ctx->sq_entries, 4 * num_online_cpus());
>   
> -	ctx->io_wq = io_wq_create(concurrency, &data);
> -	if (IS_ERR(ctx->io_wq)) {
> -		ret = PTR_ERR(ctx->io_wq);
> -		ctx->io_wq = NULL;
> -	}
> -
> -	return ret;
> +	return io_wq_create(concurrency, &data);
>   }
>   
> -static int io_uring_alloc_task_context(struct task_struct *task)
> +static int io_uring_alloc_task_context(struct task_struct *task,
> +				       struct io_ring_ctx *ctx)
>   {
>   	struct io_uring_task *tctx;
>   	int ret;
> @@ -8167,6 +8153,14 @@ static int io_uring_alloc_task_context(struct task_struct *task)
>   		return ret;
>   	}
>   
> +	tctx->io_wq = io_init_wq_offload(ctx);
> +	if (IS_ERR(tctx->io_wq)) {
> +		ret = PTR_ERR(tctx->io_wq);
> +		percpu_counter_destroy(&tctx->inflight);
> +		kfree(tctx);
> +		return ret;
> +	}
> +
How about putting this before initing tctx->inflight so that
we don't need to destroy tctx->inflight in the error path?
>   	xa_init(&tctx->xa);
>   	init_waitqueue_head(&tctx->wait);
>   	tctx->last = NULL;
> @@ -8239,7 +8233,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
>   			ctx->sq_thread_idle = HZ;
>   
>   		if (sqd->thread)
> -			goto done;
> +			return 0;
>   
>   		if (p->flags & IORING_SETUP_SQ_AFF) {
>   			int cpu = p->sq_thread_cpu;
> @@ -8261,7 +8255,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
>   			sqd->thread = NULL;
>   			goto err;
>   		}
> -		ret = io_uring_alloc_task_context(sqd->thread);
> +		ret = io_uring_alloc_task_context(sqd->thread, ctx);
>   		if (ret)
>   			goto err;
>   	} else if (p->flags & IORING_SETUP_SQ_AFF) {
> @@ -8270,14 +8264,9 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
>   		goto err;
>   	}
>   
> -done:
> -	ret = io_init_wq_offload(ctx);
> -	if (ret)
> -		goto err;
> -
>   	return 0;
>   err:
> -	io_finish_async(ctx);
> +	io_sq_thread_stop(ctx);
>   	return ret;
>   }
>   
> @@ -8752,7 +8741,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
>   	mutex_lock(&ctx->uring_lock);
>   	mutex_unlock(&ctx->uring_lock);
>   
> -	io_finish_async(ctx);
> +	io_sq_thread_stop(ctx);
>   	io_sqe_buffers_unregister(ctx);
>   
>   	if (ctx->sqo_task) {
> @@ -8872,13 +8861,6 @@ static void io_ring_exit_work(struct work_struct *work)
>   	io_ring_ctx_free(ctx);
>   }
>   
> -static bool io_cancel_ctx_cb(struct io_wq_work *work, void *data)
> -{
> -	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
> -
> -	return req->ctx == data;
> -}
> -
>   static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
>   {
>   	mutex_lock(&ctx->uring_lock);
> @@ -8897,9 +8879,6 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
>   	io_kill_timeouts(ctx, NULL, NULL);
>   	io_poll_remove_all(ctx, NULL, NULL);
>   
> -	if (ctx->io_wq)
> -		io_wq_cancel_cb(ctx->io_wq, io_cancel_ctx_cb, ctx, true);
> -
>   	/* if we failed setting up the ctx, we might not have any rings */
>   	io_iopoll_try_reap_events(ctx);
>   
> @@ -8978,13 +8957,14 @@ static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>   					 struct files_struct *files)
>   {
>   	struct io_task_cancel cancel = { .task = task, .files = files, };
> +	struct io_uring_task *tctx = current->io_uring;
>   
>   	while (1) {
>   		enum io_wq_cancel cret;
>   		bool ret = false;
>   
> -		if (ctx->io_wq) {
> -			cret = io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb,
> +		if (tctx && tctx->io_wq) {
> +			cret = io_wq_cancel_cb(tctx->io_wq, io_cancel_task_cb,
>   					       &cancel, true);
>   			ret |= (cret != IO_WQ_CANCEL_NOTFOUND);
>   		}
> @@ -9096,7 +9076,7 @@ static int io_uring_add_task_file(struct io_ring_ctx *ctx, struct file *file)
>   	int ret;
>   
>   	if (unlikely(!tctx)) {
> -		ret = io_uring_alloc_task_context(current);
> +		ret = io_uring_alloc_task_context(current, ctx);
>   		if (unlikely(ret))
>   			return ret;
>   		tctx = current->io_uring;
> @@ -9166,8 +9146,12 @@ void __io_uring_files_cancel(struct files_struct *files)
>   		io_uring_cancel_task_requests(file->private_data, files);
>   	atomic_dec(&tctx->in_idle);
>   
> -	if (files)
> +	if (files) {
>   		io_uring_remove_task_files(tctx);
> +	} else if (tctx->io_wq && current->flags & PF_EXITING) {
> +		io_wq_destroy(tctx->io_wq);
> +		tctx->io_wq = NULL;
> +	}
>   }
>   
>   static s64 tctx_inflight(struct io_uring_task *tctx)
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index 2eb6d19de336..0e95398998b6 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -36,6 +36,7 @@ struct io_uring_task {
>   	struct xarray		xa;
>   	struct wait_queue_head	wait;
>   	struct file		*last;
> +	void			*io_wq;
>   	struct percpu_counter	inflight;
>   	struct io_identity	__identity;
>   	struct io_identity	*identity;
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] io_uring: tie async worker side to the task context
  2021-02-20  8:11   ` Hao Xu
@ 2021-02-20 14:38     ` Jens Axboe
  2021-02-21  9:16       ` Hao Xu
  0 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2021-02-20 14:38 UTC (permalink / raw)
  To: Hao Xu, io-uring; +Cc: ebiederm, viro, torvalds

On 2/20/21 1:11 AM, Hao Xu wrote:
>> @@ -8167,6 +8153,14 @@ static int io_uring_alloc_task_context(struct task_struct *task)
>>   		return ret;
>>   	}
>>   
>> +	tctx->io_wq = io_init_wq_offload(ctx);
>> +	if (IS_ERR(tctx->io_wq)) {
>> +		ret = PTR_ERR(tctx->io_wq);
>> +		percpu_counter_destroy(&tctx->inflight);
>> +		kfree(tctx);
>> +		return ret;
>> +	}
>> +
> How about putting this before initing tctx->inflight so that
> we don't need to destroy tctx->inflight in the error path?

Sure, but then we'd need to destroy the workqueue in the error path if
percpu_counter_init() fails instead.

Can you elaborate on why you think that'd be an improvement to the error
path?
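
I.e. swapping the order wouldn't remove the cleanup, just move it — a
sketch of the alternative for comparison (hypothetical, not code from
the series):

	tctx->io_wq = io_init_wq_offload(ctx);
	if (IS_ERR(tctx->io_wq)) {
		ret = PTR_ERR(tctx->io_wq);
		kfree(tctx);
		return ret;
	}

	ret = percpu_counter_init(&tctx->inflight, 0, GFP_KERNEL);
	if (ret) {
		/* now it's the workqueue that needs tearing down */
		io_wq_destroy(tctx->io_wq);
		kfree(tctx);
		return ret;
	}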

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCHSET RFC 0/18] Remove kthread usage from io_uring
  2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
                   ` (18 preceding siblings ...)
  2021-02-19 23:44 ` [PATCHSET RFC 0/18] Remove kthread usage from io_uring Stefan Metzmacher
@ 2021-02-21  5:04 ` Linus Torvalds
  2021-02-21 21:22   ` Jens Axboe
  19 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2021-02-21  5:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Eric W. Biederman, Al Viro

On Fri, Feb 19, 2021 at 9:10 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> tldr - instead of using kthreads that assume the identity of the original
> tasks for work that needs offloading to a thread, setup these workers as
> threads of the original task.

Ok, from a quick look-through of the patch series this most definitely
seems to be the right way to go, getting rid of a lot of subtle (and
not-so-subtle) issues.

                Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] io_uring: tie async worker side to the task context
  2021-02-20 14:38     ` Jens Axboe
@ 2021-02-21  9:16       ` Hao Xu
  0 siblings, 0 replies; 49+ messages in thread
From: Hao Xu @ 2021-02-21  9:16 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: ebiederm, viro, torvalds

On 2/20/21 10:38 PM, Jens Axboe wrote:
> On 2/20/21 1:11 AM, Hao Xu wrote:
>>> @@ -8167,6 +8153,14 @@ static int io_uring_alloc_task_context(struct task_struct *task)
>>>    		return ret;
>>>    	}
>>>    
>>> +	tctx->io_wq = io_init_wq_offload(ctx);
>>> +	if (IS_ERR(tctx->io_wq)) {
>>> +		ret = PTR_ERR(tctx->io_wq);
>>> +		percpu_counter_destroy(&tctx->inflight);
>>> +		kfree(tctx);
>>> +		return ret;
>>> +	}
>>> +
>> How about putting this before initing tctx->inflight so that
>> we don't need to destroy tctx->inflight in the error path?
> 
> Sure, but then we'd need to destroy the workqueue in the error path if
> percpu_counter_init() fails instead.
> 
> Can you elaborate on why you think that'd be an improvement to the error
> path?
> 
You're right. I didn't realize it.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCHSET RFC 0/18] Remove kthread usage from io_uring
  2021-02-21  5:04 ` Linus Torvalds
@ 2021-02-21 21:22   ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-02-21 21:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring, Eric W. Biederman, Al Viro

On 2/20/21 10:04 PM, Linus Torvalds wrote:
> On Fri, Feb 19, 2021 at 9:10 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> tldr - instead of using kthreads that assume the identity of the original
>> tasks for work that needs offloading to a thread, setup these workers as
>> threads of the original task.
> 
> Ok, from a quick look-through of the patch series this most definitely
> seems to be the right way to go, getting rid of a lot of subtle (and
> not-so-subtle) issues.

Yeah, it's hard to find any downsides to doing this... FWIW, ran it
through prod testing and so far so good. I may actually push this for
5.12-rc1 if things continue looking good. As a separate pull request, so
you can make the call on whether you want it simmering for another
release or not.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/18] io_uring: remove the need for relying on an io-wq fallback worker
  2021-02-19 17:09 ` [PATCH 01/18] io_uring: remove the need for relying on an io-wq fallback worker Jens Axboe
  2021-02-19 20:25   ` Eric W. Biederman
@ 2021-02-22 13:46   ` Pavel Begunkov
  1 sibling, 0 replies; 49+ messages in thread
From: Pavel Begunkov @ 2021-02-22 13:46 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: ebiederm, viro, torvalds

On 19/02/2021 17:09, Jens Axboe wrote:
> We hit this case when the task is exiting, and we need somewhere to
> do background cleanup of requests. Instead of relying on the io-wq
> task manager to do this work for us, just stuff it somewhere where
> we can safely run it ourselves directly.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/io-wq.c    | 12 ------------
>  fs/io-wq.h    |  2 --
>  fs/io_uring.c | 38 +++++++++++++++++++++++++++++++++++---
>  3 files changed, 35 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index c36bbcd823ce..800b299f9772 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -16,7 +16,6 @@
>  #include <linux/kthread.h>
>  #include <linux/rculist_nulls.h>
>  #include <linux/fs_struct.h>
> -#include <linux/task_work.h>
>  #include <linux/blk-cgroup.h>
>  #include <linux/audit.h>
>  #include <linux/cpu.h>
> @@ -775,9 +774,6 @@ static int io_wq_manager(void *data)
>  	complete(&wq->done);
>  
>  	while (!kthread_should_stop()) {
> -		if (current->task_works)
> -			task_work_run();
> -
>  		for_each_node(node) {
>  			struct io_wqe *wqe = wq->wqes[node];
>  			bool fork_worker[2] = { false, false };
> @@ -800,9 +796,6 @@ static int io_wq_manager(void *data)
>  		schedule_timeout(HZ);
>  	}
>  
> -	if (current->task_works)
> -		task_work_run();
> -
>  out:
>  	if (refcount_dec_and_test(&wq->refs)) {
>  		complete(&wq->done);
> @@ -1160,11 +1153,6 @@ void io_wq_destroy(struct io_wq *wq)
>  		__io_wq_destroy(wq);
>  }
>  
> -struct task_struct *io_wq_get_task(struct io_wq *wq)
> -{
> -	return wq->manager;
> -}
> -
>  static bool io_wq_worker_affinity(struct io_worker *worker, void *data)
>  {
>  	struct task_struct *task = worker->task;
> diff --git a/fs/io-wq.h b/fs/io-wq.h
> index 096f1021018e..a1610702f222 100644
> --- a/fs/io-wq.h
> +++ b/fs/io-wq.h
> @@ -124,8 +124,6 @@ typedef bool (work_cancel_fn)(struct io_wq_work *, void *);
>  enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
>  					void *data, bool cancel_all);
>  
> -struct task_struct *io_wq_get_task(struct io_wq *wq);
> -
>  #if defined(CONFIG_IO_WQ)
>  extern void io_wq_worker_sleeping(struct task_struct *);
>  extern void io_wq_worker_running(struct task_struct *);
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index d951acb95117..bbd1ec7aa9e9 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -455,6 +455,9 @@ struct io_ring_ctx {
>  
>  	struct io_restriction		restrictions;
>  
> +	/* exit task_work */
> +	struct callback_head		*exit_task_work;
> +
>  	/* Keep this last, we don't need it for the fast path */
>  	struct work_struct		exit_work;
>  };
> @@ -2313,11 +2316,14 @@ static int io_req_task_work_add(struct io_kiocb *req)
>  static void io_req_task_work_add_fallback(struct io_kiocb *req,
>  					  task_work_func_t cb)
>  {
> -	struct task_struct *tsk = io_wq_get_task(req->ctx->io_wq);
> +	struct io_ring_ctx *ctx = req->ctx;
> +	struct callback_head *head;
>  
>  	init_task_work(&req->task_work, cb);
> -	task_work_add(tsk, &req->task_work, TWA_NONE);
> -	wake_up_process(tsk);
> +	do {
> +		head = ctx->exit_task_work;
> +		req->task_work.next = head;
> +	} while (cmpxchg(&ctx->exit_task_work, head, &req->task_work) != head);
>  }
>  
>  static void __io_req_task_cancel(struct io_kiocb *req, int error)
> @@ -9258,6 +9264,30 @@ void __io_uring_task_cancel(void)
>  	io_uring_remove_task_files(tctx);
>  }
>  
> +static void io_run_ctx_fallback(struct io_ring_ctx *ctx)
> +{
> +	struct callback_head *work, *head, *next;
> +
> +	do {
> +		do {
> +			head = NULL;
> +			work = READ_ONCE(ctx->exit_task_work);
> +			if (!work)
> +				break;
> +		} while (cmpxchg(&ctx->exit_task_work, work, head) != work);

Looking at io_uring-worker.v3, it's actually just xchg() without the do-while loop.

work = xchg(&ctx->exit_task_work, NULL);
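
Applied to the quoted function, the whole thing then collapses into a
single detach-and-run loop — a sketch of that shape (not a verbatim
quote of the v3 tree):

	static void io_run_ctx_fallback(struct io_ring_ctx *ctx)
	{
		struct callback_head *work, *next;

		/* atomically detach the whole list; no cmpxchg retry needed */
		while ((work = xchg(&ctx->exit_task_work, NULL)) != NULL) {
			do {
				next = work->next;
				work->func(work);
				work = next;
				cond_resched();
			} while (work);
		}
	}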

> +
> +		if (!work)
> +			break;
> +
> +		do {
> +			next = work->next;
> +			work->func(work);
> +			work = next;
> +			cond_resched();
> +		} while (work);
> +	} while (1);
> +}
> +
>  static int io_uring_flush(struct file *file, void *data)
>  {
>  	struct io_uring_task *tctx = current->io_uring;
> @@ -9268,6 +9298,8 @@ static int io_uring_flush(struct file *file, void *data)
>  		io_req_caches_free(ctx, current);
>  	}
>  
> +	io_run_ctx_fallback(ctx);
> +
>  	if (!tctx)
>  		return 0;
>  
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-02-19 17:10 ` [PATCH 09/18] io-wq: fork worker threads from original task Jens Axboe
@ 2021-03-04 12:23   ` Stefan Metzmacher
  2021-03-04 13:05     ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Metzmacher @ 2021-03-04 12:23 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: ebiederm, viro, torvalds


Hi Jens,

> +static pid_t fork_thread(int (*fn)(void *), void *arg)
> +{
> +	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
> +				CLONE_IO|SIGCHLD;
> +	struct kernel_clone_args args = {
> +		.flags		= ((lower_32_bits(flags) | CLONE_VM |
> +				    CLONE_UNTRACED) & ~CSIGNAL),
> +		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
> +		.stack		= (unsigned long)fn,
> +		.stack_size	= (unsigned long)arg,
> +	};
> +
> +	return kernel_clone(&args);
> +}

Can you please explain why CLONE_SIGHAND is used here?

Will the userspace signal handlers be executed from the kernel thread?

Will SIGCHLD be posted to the userspace signal handlers in a userspace
process? Will wait() from userspace see the exit of a thread?

Thanks!
metze

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 12:23   ` Stefan Metzmacher
@ 2021-03-04 13:05     ` Jens Axboe
  2021-03-04 13:19       ` Stefan Metzmacher
  2021-03-05 19:00       ` Eric W. Biederman
  0 siblings, 2 replies; 49+ messages in thread
From: Jens Axboe @ 2021-03-04 13:05 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring; +Cc: ebiederm, viro, torvalds

On 3/4/21 5:23 AM, Stefan Metzmacher wrote:
> 
> Hi Jens,
> 
>> +static pid_t fork_thread(int (*fn)(void *), void *arg)
>> +{
>> +	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
>> +				CLONE_IO|SIGCHLD;
>> +	struct kernel_clone_args args = {
>> +		.flags		= ((lower_32_bits(flags) | CLONE_VM |
>> +				    CLONE_UNTRACED) & ~CSIGNAL),
>> +		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
>> +		.stack		= (unsigned long)fn,
>> +		.stack_size	= (unsigned long)arg,
>> +	};
>> +
>> +	return kernel_clone(&args);
>> +}
> 
> Can you please explain why CLONE_SIGHAND is used here?

We can't have CLONE_THREAD without CLONE_SIGHAND... The io-wq workers
don't really care about signals, we don't use them internally.
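
For reference, copy_process() in kernel/fork.c rejects that flag
combination outright, roughly:

	/*
	 * Thread groups must share signals as well, and detached threads
	 * can only be started up within the thread group.
	 */
	if ((clone_flags & (CLONE_THREAD|CLONE_SIGHAND)) == CLONE_THREAD)
		return ERR_PTR(-EINVAL);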

> Will the userspace signal handlers be executed from the kernel thread?

No

> Will SIGCHLD be posted to the userspace signal handlers in a userspace
> process? Will wait() from userspace see the exit of a thread?

Currently it actually does, but I think that's just an oversight. As far
as I can tell, we want to add something like the below. Untested... I'll
give this a spin in a bit.

diff --git a/kernel/signal.c b/kernel/signal.c
index ba4d1ef39a9e..e5db1d8f18e5 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1912,6 +1912,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	bool autoreap = false;
 	u64 utime, stime;
 
+	/* Don't notify a parent task if an io_uring worker exits */
+	if (tsk->flags & PF_IO_WORKER)
+		return true;
+
 	BUG_ON(sig == -1);
 
  	/* do_notify_parent_cldstop should have been called instead.  */

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 13:05     ` Jens Axboe
@ 2021-03-04 13:19       ` Stefan Metzmacher
  2021-03-04 16:13         ` Stefan Metzmacher
  2021-03-05 19:00       ` Eric W. Biederman
  1 sibling, 1 reply; 49+ messages in thread
From: Stefan Metzmacher @ 2021-03-04 13:19 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: ebiederm, viro, torvalds

Hi Jens,

>> Can you please explain why CLONE_SIGHAND is used here?
> 
> We can't have CLONE_THREAD without CLONE_SIGHAND... The io-wq workers
> don't really care about signals, we don't use them internally.

I'm not 100% sure, but I've heard rumors that in some situations signals get
randomly delivered to any thread of a userspace process.

My fear was that the related logic may select a kernel thread if they
share the same signal handlers.

>> Will the userspace signal handlers be executed from the kernel thread?
> 
> No

Good.

Are these threads immune to signals from userspace?

>> Will SIGCHLD be posted to the userspace signal handlers in a userspace
>> process? Will wait() from userspace see the exit of a thread?
> 
> Currently actually it does, but I think that's just an oversight. As far
> as I can tell, we want to add something like the below. Untested... I'll
> give this a spin in a bit.
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index ba4d1ef39a9e..e5db1d8f18e5 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1912,6 +1912,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
>  	bool autoreap = false;
>  	u64 utime, stime;
>  
> +	/* Don't notify a parent task if an io_uring worker exits */
> +	if (tsk->flags & PF_IO_WORKER)
> +		return true;
> +
>  	BUG_ON(sig == -1);
>  
>   	/* do_notify_parent_cldstop should have been called instead.  */
> 

Ok, thanks!

metze

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 13:19       ` Stefan Metzmacher
@ 2021-03-04 16:13         ` Stefan Metzmacher
  2021-03-04 16:42           ` Jens Axboe
  2021-03-05 19:16           ` Eric W. Biederman
  0 siblings, 2 replies; 49+ messages in thread
From: Stefan Metzmacher @ 2021-03-04 16:13 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: ebiederm, viro, torvalds


On 3/4/21 2:19 PM, Stefan Metzmacher wrote:
> Hi Jens,
> 
>>> Can you please explain why CLONE_SIGHAND is used here?
>>
>> We can't have CLONE_THREAD without CLONE_SIGHAND... The io-wq workers
>> don't really care about signals, we don't use them internally.
> 
> I'm not 100% sure, but I've heard rumors that in some situations signals get
> randomly delivered to any thread of a userspace process.

Ok, from task_struct:

        /* Signal handlers: */
        struct signal_struct            *signal;
        struct sighand_struct __rcu             *sighand;
        sigset_t                        blocked;
        sigset_t                        real_blocked;
        /* Restored if set_restore_sigmask() was used: */
        sigset_t                        saved_sigmask;
        struct sigpending               pending;

The signal handlers are shared, but 'blocked' is per thread/task.

> My fear was that the related logic may select a kernel thread if they
> share the same signal handlers.

I found the related logic in the interaction between
complete_signal() and wants_signal().

static inline bool wants_signal(int sig, struct task_struct *p)
{
        if (sigismember(&p->blocked, sig))
                return false;

...

Would it make sense to set up task->blocked to block all signals?

Something like this:

--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -611,15 +611,15 @@ pid_t io_wq_fork_thread(int (*fn)(void *), void *arg)
 {
        unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
                                CLONE_IO|SIGCHLD;
-       struct kernel_clone_args args = {
-               .flags          = ((lower_32_bits(flags) | CLONE_VM |
-                                   CLONE_UNTRACED) & ~CSIGNAL),
-               .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
-               .stack          = (unsigned long)fn,
-               .stack_size     = (unsigned long)arg,
-       };
+       sigset_t mask, oldmask;
+       pid_t pid;

-       return kernel_clone(&args);
+       sigfillset(&mask);
+       sigprocmask(SIG_BLOCK, &mask, &oldmask);
+       pid = kernel_thread(fn, arg, flags);
+       sigprocmask(SIG_SETMASK, &oldmask, NULL);
+
+       return pid;
 }

I think using kernel_thread() would be a good simplification anyway.

sig_task_ignored() has some PF_IO_WORKER logic.

Or is there already some PF_IO_WORKER related logic that gets an io_wq
thread excluded in complete_signal()?

Or PF_IO_WORKER could teach kernel_clone() to ignore CLONE_SIGHAND and
create a fresh signal handler table, altering the copy_signal() and
copy_sighand() checks...

metze

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 16:13         ` Stefan Metzmacher
@ 2021-03-04 16:42           ` Jens Axboe
  2021-03-04 17:09             ` Stefan Metzmacher
  2021-03-05 19:16           ` Eric W. Biederman
  1 sibling, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2021-03-04 16:42 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring; +Cc: ebiederm, viro, torvalds

On 3/4/21 9:13 AM, Stefan Metzmacher wrote:
> 
> On 04.03.21 at 14:19, Stefan Metzmacher wrote:
>> Hi Jens,
>>
>>>> Can you please explain why CLONE_SIGHAND is used here?
>>>
>>> We can't have CLONE_THREAD without CLONE_SIGHAND... The io-wq workers
>>> don't really care about signals, we don't use them internally.
>>
>> I'm not 100% sure, but I heard rumors that in some situations signals get
>> randomly delivered to any thread of a userspace process.
> 
> Ok, from task_struct:
> 
>         /* Signal handlers: */
>         struct signal_struct            *signal;
>         struct sighand_struct __rcu             *sighand;
>         sigset_t                        blocked;
>         sigset_t                        real_blocked;
>         /* Restored if set_restore_sigmask() was used: */
>         sigset_t                        saved_sigmask;
>         struct sigpending               pending;
> 
> The signal handlers are shared, but 'blocked' is per thread/task.
> 
>> My fear was that the related logic may select a kernel thread if they
>> share the same signal handlers.
> 
> I found the related logic in the interaction between
> complete_signal() and wants_signal().
> 
> static inline bool wants_signal(int sig, struct task_struct *p)
> {
>         if (sigismember(&p->blocked, sig))
>                 return false;
> 
> ...
> 
> Would it make sense to set up task->blocked to block all signals?
> 
> Something like this:
> 
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -611,15 +611,15 @@ pid_t io_wq_fork_thread(int (*fn)(void *), void *arg)
>  {
>         unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
>                                 CLONE_IO|SIGCHLD;
> -       struct kernel_clone_args args = {
> -               .flags          = ((lower_32_bits(flags) | CLONE_VM |
> -                                   CLONE_UNTRACED) & ~CSIGNAL),
> -               .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
> -               .stack          = (unsigned long)fn,
> -               .stack_size     = (unsigned long)arg,
> -       };
> +       sigset_t mask, oldmask;
> +       pid_t pid;
> 
> -       return kernel_clone(&args);
> +       sigfillset(&mask);
> +       sigprocmask(SIG_BLOCK, &mask, &oldmask);
> +       pid = kernel_thread(fn, arg, flags);
> +       sigprocmask(SIG_SETMASK, &oldmask, NULL);
> +
> +       return pid;
>  }
> 
> I think using kernel_thread() would be a good simplification anyway.

I like this approach, we're really not interested in signals for those
threads, and this makes it explicit. Ditto on just using the kernel_thread()
helper, looks fine too. I'll run this through the testing. Do you want to
send this as a "real" patch, or should I just attribute you in the commit
message?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 16:42           ` Jens Axboe
@ 2021-03-04 17:09             ` Stefan Metzmacher
  2021-03-04 17:32               ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Metzmacher @ 2021-03-04 17:09 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: ebiederm, viro, torvalds


On 04.03.21 at 17:42, Jens Axboe wrote:
> On 3/4/21 9:13 AM, Stefan Metzmacher wrote:
>>
>> On 04.03.21 at 14:19, Stefan Metzmacher wrote:
>>> Hi Jens,
>>>
>>>>> Can you please explain why CLONE_SIGHAND is used here?
>>>>
>>>> We can't have CLONE_THREAD without CLONE_SIGHAND... The io-wq workers
>>>> don't really care about signals, we don't use them internally.
>>>
>>> I'm not 100% sure, but I heard rumors that in some situations signals get
>>> randomly delivered to any thread of a userspace process.
>>
>> Ok, from task_struct:
>>
>>         /* Signal handlers: */
>>         struct signal_struct            *signal;
>>         struct sighand_struct __rcu             *sighand;
>>         sigset_t                        blocked;
>>         sigset_t                        real_blocked;
>>         /* Restored if set_restore_sigmask() was used: */
>>         sigset_t                        saved_sigmask;
>>         struct sigpending               pending;
>>
>> The signal handlers are shared, but 'blocked' is per thread/task.
>>
>>> My fear was that the related logic may select a kernel thread if they
>>> share the same signal handlers.
>>
>> I found the related logic in the interaction between
>> complete_signal() and wants_signal().
>>
>> static inline bool wants_signal(int sig, struct task_struct *p)
>> {
>>         if (sigismember(&p->blocked, sig))
>>                 return false;
>>
>> ...
>>
>> Would it make sense to set up task->blocked to block all signals?
>>
>> Something like this:
>>
>> --- a/fs/io-wq.c
>> +++ b/fs/io-wq.c
>> @@ -611,15 +611,15 @@ pid_t io_wq_fork_thread(int (*fn)(void *), void *arg)
>>  {
>>         unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
>>                                 CLONE_IO|SIGCHLD;
>> -       struct kernel_clone_args args = {
>> -               .flags          = ((lower_32_bits(flags) | CLONE_VM |
>> -                                   CLONE_UNTRACED) & ~CSIGNAL),
>> -               .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
>> -               .stack          = (unsigned long)fn,
>> -               .stack_size     = (unsigned long)arg,
>> -       };
>> +       sigset_t mask, oldmask;
>> +       pid_t pid;
>>
>> -       return kernel_clone(&args);
>> +       sigfillset(&mask);
>> +       sigprocmask(SIG_BLOCK, &mask, &oldmask);
>> +       pid = kernel_thread(fn, arg, flags);
>> +       sigprocmask(SIG_SETMASK, &oldmask, NULL);
>> +
>> +       return pid;
>>  }
>>
>> I think using kernel_thread() would be a good simplification anyway.
> 
> I like this approach, we're really not interested in signals for those
> threads, and this makes it explicit. Ditto on just using the kernel_thread()
> helper, looks fine too. I'll run this through the testing. Do you want to
> send this as a "real" patch, or should I just attribute you in the commit
> message?

You can do the patch, it was mostly an example.
I'm not sure if sigprocmask() is the correct function here.

Or whether we'd better use something like this:

        set_restore_sigmask();
        current->saved_sigmask = current->blocked;
        set_current_blocked(&kmask);
        pid = kernel_thread(fn, arg, flags);
        restore_saved_sigmask();

I think current->flags |= PF_IO_WORKER;
should also move into io_wq_fork_thread(),
and maybe be passed differently to kernel_clone() rather than
abusing current->flags (where current is not an IO_WORKER).
In general I think it would be better to handle all this within kernel_clone()
natively, rather than temporarily modifying current->flags or current->blocked.

Would there be problems with handling everything in copy_process() and related
helpers, avoiding the CLONE_SIGHAND behavior for PF_IO_WORKER tasks?

kernel_clone_args could get an unsigned int task_flags to fill p->flags in copy_process().
Then kernel_thread() could also get a task_flags argument, and all other callers
would fill that with current->flags.

metze

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 17:09             ` Stefan Metzmacher
@ 2021-03-04 17:32               ` Jens Axboe
  2021-03-04 18:19                 ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2021-03-04 17:32 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring; +Cc: ebiederm, viro, torvalds

On 3/4/21 10:09 AM, Stefan Metzmacher wrote:
> 
> On 04.03.21 at 17:42, Jens Axboe wrote:
>> On 3/4/21 9:13 AM, Stefan Metzmacher wrote:
>>>
>>> On 04.03.21 at 14:19, Stefan Metzmacher wrote:
>>>> Hi Jens,
>>>>
>>>>>> Can you please explain why CLONE_SIGHAND is used here?
>>>>>
>>>>> We can't have CLONE_THREAD without CLONE_SIGHAND... The io-wq workers
>>>>> don't really care about signals, we don't use them internally.
>>>>
>>>> I'm not 100% sure, but I heard rumors that in some situations signals get
>>>> randomly delivered to any thread of a userspace process.
>>>
>>> Ok, from task_struct:
>>>
>>>         /* Signal handlers: */
>>>         struct signal_struct            *signal;
>>>         struct sighand_struct __rcu             *sighand;
>>>         sigset_t                        blocked;
>>>         sigset_t                        real_blocked;
>>>         /* Restored if set_restore_sigmask() was used: */
>>>         sigset_t                        saved_sigmask;
>>>         struct sigpending               pending;
>>>
>>> The signal handlers are shared, but 'blocked' is per thread/task.
>>>
>>>> My fear was that the related logic may select a kernel thread if they
>>>> share the same signal handlers.
>>>
>>> I found the related logic in the interaction between
>>> complete_signal() and wants_signal().
>>>
>>> static inline bool wants_signal(int sig, struct task_struct *p)
>>> {
>>>         if (sigismember(&p->blocked, sig))
>>>                 return false;
>>>
>>> ...
>>>
>>> Would it make sense to set up task->blocked to block all signals?
>>>
>>> Something like this:
>>>
>>> --- a/fs/io-wq.c
>>> +++ b/fs/io-wq.c
>>> @@ -611,15 +611,15 @@ pid_t io_wq_fork_thread(int (*fn)(void *), void *arg)
>>>  {
>>>         unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
>>>                                 CLONE_IO|SIGCHLD;
>>> -       struct kernel_clone_args args = {
>>> -               .flags          = ((lower_32_bits(flags) | CLONE_VM |
>>> -                                   CLONE_UNTRACED) & ~CSIGNAL),
>>> -               .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
>>> -               .stack          = (unsigned long)fn,
>>> -               .stack_size     = (unsigned long)arg,
>>> -       };
>>> +       sigset_t mask, oldmask;
>>> +       pid_t pid;
>>>
>>> -       return kernel_clone(&args);
>>> +       sigfillset(&mask);
>>> +       sigprocmask(SIG_BLOCK, &mask, &oldmask);
>>> +       pid = kernel_thread(fn, arg, flags);
>>> +       sigprocmask(SIG_SETMASK, &oldmask, NULL);
>>> +
>>> +       return pid;
>>>  }
>>>
>>> I think using kernel_thread() would be a good simplification anyway.
>>
>> I like this approach, we're really not interested in signals for those
>> threads, and this makes it explicit. Ditto on just using the kernel_thread()
>> helper, looks fine too. I'll run this through the testing. Do you want to
>> send this as a "real" patch, or should I just attribute you in the commit
>> message?
> 
> You can do the patch, it was mostly an example.
> I'm not sure if sigprocmask() is the correct function here.
> 
> Or whether we'd better use something like this:
> 
>         set_restore_sigmask();
>         current->saved_sigmask = current->blocked;
>         set_current_blocked(&kmask);
>         pid = kernel_thread(fn, arg, flags);
>         restore_saved_sigmask();

Might be cleaner, and allows fatal signals.

> I think current->flags |= PF_IO_WORKER;
> should also move into io_wq_fork_thread(),
> and maybe be passed differently to kernel_clone() rather than
> abusing current->flags (where current is not an IO_WORKER).
> In general I think it would be better to handle all this within kernel_clone()
> natively, rather than temporarily modifying current->flags or current->blocked.
> 
> Would there be problems with handling everything in copy_process() and related
> helpers, avoiding the CLONE_SIGHAND behavior for PF_IO_WORKER tasks?
> 
> kernel_clone_args could get an unsigned int task_flags to fill p->flags in copy_process().
> Then kernel_thread() could also get a task_flags argument, and all other callers
> would fill that with current->flags.

I agree there are cleanups possible there, but I'd rather defer those until all
the dust has settled.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 17:32               ` Jens Axboe
@ 2021-03-04 18:19                 ` Jens Axboe
  2021-03-04 18:56                   ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2021-03-04 18:19 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring; +Cc: ebiederm, viro, torvalds

On 3/4/21 10:32 AM, Jens Axboe wrote:
> On 3/4/21 10:09 AM, Stefan Metzmacher wrote:
>>
>> On 04.03.21 at 17:42, Jens Axboe wrote:
>>> On 3/4/21 9:13 AM, Stefan Metzmacher wrote:
>>>>
>>>> On 04.03.21 at 14:19, Stefan Metzmacher wrote:
>>>>> Hi Jens,
>>>>>
>>>>>>> Can you please explain why CLONE_SIGHAND is used here?
>>>>>>
>>>>>> We can't have CLONE_THREAD without CLONE_SIGHAND... The io-wq workers
>>>>>> don't really care about signals, we don't use them internally.
>>>>>
>>>>> I'm not 100% sure, but I heard rumors that in some situations signals get
>>>>> randomly delivered to any thread of a userspace process.
>>>>
>>>> Ok, from task_struct:
>>>>
>>>>         /* Signal handlers: */
>>>>         struct signal_struct            *signal;
>>>>         struct sighand_struct __rcu             *sighand;
>>>>         sigset_t                        blocked;
>>>>         sigset_t                        real_blocked;
>>>>         /* Restored if set_restore_sigmask() was used: */
>>>>         sigset_t                        saved_sigmask;
>>>>         struct sigpending               pending;
>>>>
>>>> The signal handlers are shared, but 'blocked' is per thread/task.
>>>>
>>>>> My fear was that the related logic may select a kernel thread if they
>>>>> share the same signal handlers.
>>>>
>>>> I found the related logic in the interaction between
>>>> complete_signal() and wants_signal().
>>>>
>>>> static inline bool wants_signal(int sig, struct task_struct *p)
>>>> {
>>>>         if (sigismember(&p->blocked, sig))
>>>>                 return false;
>>>>
>>>> ...
>>>>
>>>> Would it make sense to set up task->blocked to block all signals?
>>>>
>>>> Something like this:
>>>>
>>>> --- a/fs/io-wq.c
>>>> +++ b/fs/io-wq.c
>>>> @@ -611,15 +611,15 @@ pid_t io_wq_fork_thread(int (*fn)(void *), void *arg)
>>>>  {
>>>>         unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
>>>>                                 CLONE_IO|SIGCHLD;
>>>> -       struct kernel_clone_args args = {
>>>> -               .flags          = ((lower_32_bits(flags) | CLONE_VM |
>>>> -                                   CLONE_UNTRACED) & ~CSIGNAL),
>>>> -               .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
>>>> -               .stack          = (unsigned long)fn,
>>>> -               .stack_size     = (unsigned long)arg,
>>>> -       };
>>>> +       sigset_t mask, oldmask;
>>>> +       pid_t pid;
>>>>
>>>> -       return kernel_clone(&args);
>>>> +       sigfillset(&mask);
>>>> +       sigprocmask(SIG_BLOCK, &mask, &oldmask);
>>>> +       pid = kernel_thread(fn, arg, flags);
>>>> +       sigprocmask(SIG_SETMASK, &oldmask, NULL);
>>>> +
>>>> +       return pid;
>>>>  }
>>>>
>>>> I think using kernel_thread() would be a good simplification anyway.
>>>
>>> I like this approach, we're really not interested in signals for those
>>> threads, and this makes it explicit. Ditto on just using the kernel_thread()
>>> helper, looks fine too. I'll run this through the testing. Do you want to
>>> send this as a "real" patch, or should I just attribute you in the commit
>>> message?
>>
>> You can do the patch, it was mostly an example.
>> I'm not sure if sigprocmask() is the correct function here.
>>
>> Or whether we'd better use something like this:
>>
>>         set_restore_sigmask();
>>         current->saved_sigmask = current->blocked;
>>         set_current_blocked(&kmask);
>>         pid = kernel_thread(fn, arg, flags);
>>         restore_saved_sigmask();
> 
> Might be cleaner, and allows fatal signals.

How about this - it moves the signal fiddling into the task
itself, and leaves the parent alone. Also allows future cleanups
of how we wait for thread creation. And it moves the PF_IO_WORKER
handling into a contained spot, instead of having it spread throughout
where we call the worker fork.

Later cleanups can focus on having copy_process() do the right
thing, and we can kill this PF_IO_WORKER dance completely.


commit eeb485abb7a189058858f941fb3432bee945a861
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu Mar 4 09:46:37 2021 -0700

    io-wq: block signals by default for any io-wq worker
    
    We're not interested in signals, so let's make it explicit and block
    them for any worker. Wrap the thread creation in our own handler, so we
    can set blocked signals and serialize creation of them. A future cleanup
    can now simplify the waiting on thread creation from the other paths,
    just pushing the 'startup' completion further down.
    
    Move the PF_IO_WORKER flag dance in there as well. This will go away in
    the future when we teach copy_process() how to deal with this
    automatically.
    
    Reported-by: Stefan Metzmacher <metze@samba.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 19f18389ead2..bf5df1a31a0e 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -607,19 +607,54 @@ static int task_thread_unbound(void *data)
 	return task_thread(data, IO_WQ_ACCT_UNBOUND);
 }
 
+struct thread_start_data {
+	struct completion startup;
+	int (*fn)(void *);
+	void *arg;
+};
+
+static int thread_start(void *data)
+{
+	struct thread_start_data *d = data;
+	int (*fn)(void *) = d->fn;
+	void *arg = d->arg;
+	sigset_t mask;
+	int ret;
+
+	sigfillset(&mask);
+	set_restore_sigmask();
+	current->saved_sigmask = current->blocked;
+	set_current_blocked(&mask);
+	current->flags |= PF_IO_WORKER;
+	complete(&d->startup);
+	ret = fn(arg);
+	restore_saved_sigmask();
+	return ret;
+}
+
 pid_t io_wq_fork_thread(int (*fn)(void *), void *arg)
 {
 	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
 				CLONE_IO|SIGCHLD;
-	struct kernel_clone_args args = {
-		.flags		= ((lower_32_bits(flags) | CLONE_VM |
-				    CLONE_UNTRACED) & ~CSIGNAL),
-		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
-		.stack		= (unsigned long)fn,
-		.stack_size	= (unsigned long)arg,
+	struct thread_start_data data = {
+		.startup	= COMPLETION_INITIALIZER_ONSTACK(data.startup),
+		.fn		= fn,
+		.arg		= arg
 	};
+	bool clear = false;
+	pid_t pid;
 
-	return kernel_clone(&args);
+	/* task path doesn't have it, manager path does */
+	if (!(current->flags & PF_IO_WORKER)) {
+		current->flags |= PF_IO_WORKER;
+		clear = true;
+	}
+	pid = kernel_thread(thread_start, &data, flags);
+	if (clear)
+		current->flags &= ~PF_IO_WORKER;
+	if (pid >= 0)
+		wait_for_completion(&data.startup);
+	return pid;
 }
 
 static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
@@ -752,7 +787,6 @@ static int io_wq_manager(void *data)
 
 	sprintf(buf, "iou-mgr-%d", wq->task_pid);
 	set_task_comm(current, buf);
-	current->flags |= PF_IO_WORKER;
 	wq->manager = get_task_struct(current);
 
 	complete(&wq->started);
@@ -821,9 +855,7 @@ static int io_wq_fork_manager(struct io_wq *wq)
 		return 0;
 
 	reinit_completion(&wq->worker_done);
-	current->flags |= PF_IO_WORKER;
 	ret = io_wq_fork_thread(io_wq_manager, wq);
-	current->flags &= ~PF_IO_WORKER;
 	if (ret >= 0) {
 		wait_for_completion(&wq->started);
 		return 0;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e55369555e5c..995f506e3a60 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7824,9 +7824,7 @@ static int io_sq_thread_fork(struct io_sq_data *sqd, struct io_ring_ctx *ctx)
 	reinit_completion(&sqd->completion);
 	ctx->sqo_exec = 0;
 	sqd->task_pid = current->pid;
-	current->flags |= PF_IO_WORKER;
 	ret = io_wq_fork_thread(io_sq_thread, sqd);
-	current->flags &= ~PF_IO_WORKER;
 	if (ret < 0) {
 		sqd->thread = NULL;
 		return ret;
@@ -7896,9 +7894,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		}
 
 		sqd->task_pid = current->pid;
-		current->flags |= PF_IO_WORKER;
 		ret = io_wq_fork_thread(io_sq_thread, sqd);
-		current->flags &= ~PF_IO_WORKER;
 		if (ret < 0) {
 			sqd->thread = NULL;
 			goto err;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 18:19                 ` Jens Axboe
@ 2021-03-04 18:56                   ` Linus Torvalds
  2021-03-04 19:19                     ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2021-03-04 18:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Stefan Metzmacher, io-uring, Eric W. Biederman, Al Viro

On Thu, Mar 4, 2021 at 10:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> How about this - it moves the signal fiddling into the task
> itself, and leaves the parent alone. Also allows future cleanups
> of how we wait for thread creation.

Ugh, I think this is wrong.

You shouldn't use kernel_thread() at all, and you shouldn't need to set
the sigmask in the parent, only to have it copied to the child, and
then restore it in the parent.

You shouldn't have to have that silly extra scheduling rendezvous with
the completion, which forces two schedules (first a schedule to the
child to do what it wants to do, and then "complete()" there to wake
up the parent that is waiting for the completion).

The thing is, our internal thread creation functionality is already
written explicitly to not need any of this: the creation of a new
thread is a separate phase, and then you do some setup, and then you
actually tell the new thread "ok, go go go".

See the kernel_clone() function kernel/fork.c for the structure of this all.

You really should just do

 (a) copy_thread() to create a new child that is inactive and cannot yet run

 (b) do any setup in that new child (like setting the signal mask in
it, but also perhaps setting the PF_IO_WORKER flag etc)

 (c) actually say "go go go": wake_up_new_task(p);

and you're done. No completions, no "set temporary mask in parent to
be copied", no nothing like that.

And for the IO worker threads, you really don't want all the other
stuff that kernel_clone() does. You don't want the magic VFORK "wait
for the child to release the VM we gave it". You don't want the clone
ptrace setup, because you can't ptrace those IO worker threads
anyway. You might want a tracepoint, but you probably want a
_different_ tracepoint than the "normal clone" one. You don't want the
latent entropy addition, because honestly, the thing has no entropy to
add either.

So I think you really want to just add a new "create_io_thread()"
inside kernel/fork.c, which is a very cut-down and specialized version
of kernel_clone().

It's actually going to be _less_ code than what you have now, and it's
going to avoid all the problems with any half-way state or "set
parent state to something that gets copied and then undo the parent
state after the copy".

                   Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 18:56                   ` Linus Torvalds
@ 2021-03-04 19:19                     ` Jens Axboe
  2021-03-04 19:46                       ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2021-03-04 19:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stefan Metzmacher, io-uring, Eric W. Biederman, Al Viro

On 3/4/21 11:56 AM, Linus Torvalds wrote:
> On Thu, Mar 4, 2021 at 10:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> How about this - it moves the signal fiddling into the task
>> itself, and leaves the parent alone. Also allows future cleanups
>> of how we wait for thread creation.
> 
> Ugh, I think this is wrong.
> 
> You shouldn't use kernel_thread() at all, and you shouldn't need to set
> the sigmask in the parent, only to have it copied to the child, and
> then restore it in the parent.
> 
> You shouldn't have to have that silly extra scheduling rendezvous with
> the completion, which forces two schedules (first a schedule to the
> child to do what it wants to do, and then "complete()" there to wake
> up the parent that is waiting for the completion).
> 
> The thing is, our internal thread creation functionality is already
> written explicitly to not need any of this: the creation of a new
> thread is a separate phase, and then you do some setup, and then you
> actually tell the new thread "ok, go go go".
> 
> See the kernel_clone() function kernel/fork.c for the structure of this all.
> 
> You really should just do
> 
>  (a) copy_thread() to create a new child that is inactive and cannot yet run
> 
>  (b) do any setup in that new child (like setting the signal mask in
> it, but also perhaps setting the PF_IO_WORKER flag etc)
> 
>  (c) actually say "go go go": wake_up_new_task(p);
> 
> and you're done. No completions, no "set temporary mask in parent to
> be copied", no nothing like that.
> 
> And for the IO worker threads, you really don't want all the other
> stuff that kernel_clone() does. You don't want the magic VFORK "wait
> for the child to release the VM we gave it". You don't want the clone
> ptrace setup, because you can't ptrace those IO worker threads
> anyway. You might want a tracepoint, but you probably want a
> _different_ tracepoint than the "normal clone" one. You don't want the
> latent entropy addition, because honestly, the thing has no entropy to
> add either.
> 
> So I think you really want to just add a new "create_io_thread()"
> inside kernel/fork.c, which is a very cut-down and specialized version
> of kernel_clone().
> 
> It's actually going to be _less_ code than what you have now, and it's
> going to avoid all the problems with any half-way state or "set
> parent state to something that gets copied and then undo the parent
> state after the copy".

Took a quick look at this, and I agree that's _much_ better. In fact, it
boils down to just calling copy_process() and then having the caller do
wake_up_new_task(). So not sure if it's worth adding a
create_io_thread() helper, or just make copy_process() available
instead. This is ignoring the trace point for now...

I'll try and spin this up, should be pretty trivial and indeed remove
even more code and useless wait_for_completion+complete slowdowns...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 19:19                     ` Jens Axboe
@ 2021-03-04 19:46                       ` Linus Torvalds
  2021-03-04 19:54                         ` Jens Axboe
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2021-03-04 19:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Stefan Metzmacher, io-uring, Eric W. Biederman, Al Viro

On Thu, Mar 4, 2021 at 11:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Took a quick look at this, and I agree that's _much_ better. In fact, it
> boils down to just calling copy_process() and then having the caller do
> wake_up_new_task(). So not sure if it's worth adding a
> create_io_thread() helper, or just make copy_process() available
> instead. This is ignoring the trace point for now...

I really don't want to expose copy_process() outside of kernel/fork.c.

The whole three-phase "copy - setup - activate" model is a really
really good thing, and it's how we've done things internally almost
forever, but I really don't want to expose those middle stages to any
outsiders.

So I'd really prefer a simple new "create_io_worker()", even if it's
literally just some four-line function that does

   p = copy_process(..);
   if (!IS_ERR(p)) {
      block all signals in p
      set PF_IO_WORKER flag
      wake_up_new_task(p);
   }
   return p;
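
(Where "block all signals in p" is presumably just

   sigfillset(&p->blocked);

on the not-yet-running task.)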

I very much want that to be inside kernel/fork.c and have all these
rules about creating new threads localized there.

              Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 19:46                       ` Linus Torvalds
@ 2021-03-04 19:54                         ` Jens Axboe
  2021-03-04 20:00                           ` Jens Axboe
  2021-03-04 20:50                           ` Linus Torvalds
  0 siblings, 2 replies; 49+ messages in thread
From: Jens Axboe @ 2021-03-04 19:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stefan Metzmacher, io-uring, Eric W. Biederman, Al Viro

[-- Attachment #1: Type: text/plain, Size: 1803 bytes --]

On 3/4/21 12:46 PM, Linus Torvalds wrote:
> On Thu, Mar 4, 2021 at 11:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Took a quick look at this, and I agree that's _much_ better. In fact, it
>> boils down to just calling copy_process() and then having the caller do
>> wake_up_new_task(). So not sure if it's worth adding a
>> create_io_thread() helper, or just make copy_process() available
>> instead. This is ignoring the trace point for now...
> 
> I really don't want to expose copy_process() outside of kernel/fork.c.
> 
> The whole three-phase "copy - setup - activate" model is a really
> really good thing, and it's how we've done things internally almost
> forever, but I really don't want to expose those middle stages to any
> outsiders.
> 
> So I'd really prefer a simple new "create_io_worker()", even if it's
> literally just some four-line function that does
> 
>    p = copy_process(..);
>    if (!IS_ERR(p)) {
>       block all signals in p
>       set PF_IO_WORKER flag
>       wake_up_new_task(p);
>    }
>    return p;
> 
> I very much want that to be inside kernel/fork.c and have all these
> rules about creating new threads localized there.

I agree, here are the two current patches. Just need to add the signal
blocking, which I'd love to do in create_io_thread(), but that seems to
require either an allocation or a helper to do it in the thread
itself (with an on-stack mask).
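
For reference, the in-thread variant would be something like this (untested
sketch, helper name made up):

	static void io_thread_block_signals(void)
	{
		sigset_t mask;			/* on-stack, as mentioned above */

		sigfillset(&mask);
		set_current_blocked(&mask);	/* won't block SIGKILL/SIGSTOP */
	}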

Removes code, even with comment added.

 fs/io-wq.c                 | 68 ++++++++++++++++---------------------------------------------
 fs/io-wq.h                 |  2 --
 fs/io_uring.c              | 29 ++++++++++++++------------
 include/linux/sched/task.h |  2 ++
 kernel/fork.c              | 24 ++++++++++++++++++++++
 5 files changed, 59 insertions(+), 66 deletions(-)


-- 
Jens Axboe


[-- Attachment #2: 0001-kernel-provide-create_io_thread-helper.patch --]
[-- Type: text/x-patch, Size: 2660 bytes --]

From 396142d9878cc1a02149616c7032b3e647205341 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Thu, 4 Mar 2021 12:21:05 -0700
Subject: [PATCH 1/2] kernel: provide create_io_thread() helper

Provide a generic helper for setting up an io_uring worker. Returns a
task_struct so that the caller can do whatever setup is needed, then call
wake_up_new_task() to kick it into gear.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/sched/task.h |  2 ++
 kernel/fork.c              | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index c0f71f2e7160..ef02be869cf2 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -31,6 +31,7 @@ struct kernel_clone_args {
 	/* Number of elements in *set_tid */
 	size_t set_tid_size;
 	int cgroup;
+	int io_thread;
 	struct cgroup *cgrp;
 	struct css_set *cset;
 };
@@ -82,6 +83,7 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
+struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index d66cd1014211..549acc6324f0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1940,6 +1940,8 @@ static __latent_entropy struct task_struct *copy_process(
 	p = dup_task_struct(current, node);
 	if (!p)
 		goto fork_out;
+	if (args->io_thread)
+		p->flags |= PF_IO_WORKER;
 
 	/*
 	 * This _must_ happen before we call free_task(), i.e. before we jump
@@ -2410,6 +2412,28 @@ struct mm_struct *copy_init_mm(void)
 	return dup_mm(NULL, &init_mm);
 }
 
+/*
+ * This is like kernel_clone(), but shaved down and tailored to just
+ * creating io_uring workers. It returns a created task, or an error pointer.
+ * The returned task is inactive, and the caller must fire it up through
+ * wake_up_new_task(p).
+ */
+struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
+{
+	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
+				CLONE_IO|SIGCHLD;
+	struct kernel_clone_args args = {
+		.flags		= ((lower_32_bits(flags) | CLONE_VM |
+				    CLONE_UNTRACED) & ~CSIGNAL),
+		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
+		.stack		= (unsigned long)fn,
+		.stack_size	= (unsigned long)arg,
+		.io_thread	= 1,
+	};
+
+	return copy_process(NULL, 0, node, &args);
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
2.30.1


[-- Attachment #3: 0002-io_uring-move-to-using-create_io_thread.patch --]
[-- Type: text/x-patch, Size: 7839 bytes --]

From 9dee8128025806e74c7fd3915294649dc0b11f5f Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Thu, 4 Mar 2021 12:39:36 -0700
Subject: [PATCH 2/2] io_uring: move to using create_io_thread()

This allows us to do task creation and setup without needing to use
completions to try and synchronize with the starting thread.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c    | 68 +++++++++++++--------------------------------------
 fs/io-wq.h    |  2 --
 fs/io_uring.c | 29 ++++++++++++----------
 3 files changed, 33 insertions(+), 66 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 19f18389ead2..693239ed4de5 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -54,7 +54,6 @@ struct io_worker {
 	spinlock_t lock;
 
 	struct completion ref_done;
-	struct completion started;
 
 	struct rcu_head rcu;
 };
@@ -116,7 +115,6 @@ struct io_wq {
 	struct io_wq_hash *hash;
 
 	refcount_t refs;
-	struct completion started;
 	struct completion exited;
 
 	atomic_t worker_refs;
@@ -273,14 +271,6 @@ static void io_wqe_dec_running(struct io_worker *worker)
 		io_wqe_wake_worker(wqe, acct);
 }
 
-static void io_worker_start(struct io_worker *worker)
-{
-	current->flags |= PF_NOFREEZE;
-	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
-	io_wqe_inc_running(worker);
-	complete(&worker->started);
-}
-
 /*
  * Worker will start processing some work. Move it to the busy list, if
  * it's currently on the freelist
@@ -490,8 +480,6 @@ static int io_wqe_worker(void *data)
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 
-	io_worker_start(worker);
-
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		set_current_state(TASK_INTERRUPTIBLE);
 loop:
@@ -576,12 +564,6 @@ static int task_thread(void *data, int index)
 	sprintf(buf, "iou-wrk-%d", wq->task_pid);
 	set_task_comm(current, buf);
 
-	current->pf_io_worker = worker;
-	worker->task = current;
-
-	set_cpus_allowed_ptr(current, cpumask_of_node(wqe->node));
-	current->flags |= PF_NO_SETAFFINITY;
-
 	raw_spin_lock_irq(&wqe->lock);
 	hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
 	list_add_tail_rcu(&worker->all_list, &wqe->all_list);
@@ -607,25 +589,10 @@ static int task_thread_unbound(void *data)
 	return task_thread(data, IO_WQ_ACCT_UNBOUND);
 }
 
-pid_t io_wq_fork_thread(int (*fn)(void *), void *arg)
-{
-	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
-				CLONE_IO|SIGCHLD;
-	struct kernel_clone_args args = {
-		.flags		= ((lower_32_bits(flags) | CLONE_VM |
-				    CLONE_UNTRACED) & ~CSIGNAL),
-		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
-		.stack		= (unsigned long)fn,
-		.stack_size	= (unsigned long)arg,
-	};
-
-	return kernel_clone(&args);
-}
-
 static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 {
 	struct io_worker *worker;
-	pid_t pid;
+	struct task_struct *tsk;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -638,21 +605,26 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 	worker->wqe = wqe;
 	spin_lock_init(&worker->lock);
 	init_completion(&worker->ref_done);
-	init_completion(&worker->started);
 
 	atomic_inc(&wq->worker_refs);
 
 	if (index == IO_WQ_ACCT_BOUND)
-		pid = io_wq_fork_thread(task_thread_bound, worker);
+		tsk = create_io_thread(task_thread_bound, worker, wqe->node);
 	else
-		pid = io_wq_fork_thread(task_thread_unbound, worker);
-	if (pid < 0) {
+		tsk = create_io_thread(task_thread_unbound, worker, wqe->node);
+	if (IS_ERR(tsk)) {
 		if (atomic_dec_and_test(&wq->worker_refs))
 			complete(&wq->worker_done);
 		kfree(worker);
 		return false;
 	}
-	wait_for_completion(&worker->started);
+	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
+	io_wqe_inc_running(worker);
+	tsk->pf_io_worker = worker;
+	worker->task = tsk;
+	set_cpus_allowed_ptr(tsk, cpumask_of_node(wqe->node));
+	tsk->flags |= PF_NOFREEZE | PF_NO_SETAFFINITY;
+	wake_up_new_task(tsk);
 	return true;
 }
 
@@ -752,10 +724,6 @@ static int io_wq_manager(void *data)
 
 	sprintf(buf, "iou-mgr-%d", wq->task_pid);
 	set_task_comm(current, buf);
-	current->flags |= PF_IO_WORKER;
-	wq->manager = get_task_struct(current);
-
-	complete(&wq->started);
 
 	do {
 		set_current_state(TASK_INTERRUPTIBLE);
@@ -815,21 +783,20 @@ static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
 
 static int io_wq_fork_manager(struct io_wq *wq)
 {
-	int ret;
+	struct task_struct *tsk;
 
 	if (wq->manager)
 		return 0;
 
 	reinit_completion(&wq->worker_done);
-	current->flags |= PF_IO_WORKER;
-	ret = io_wq_fork_thread(io_wq_manager, wq);
-	current->flags &= ~PF_IO_WORKER;
-	if (ret >= 0) {
-		wait_for_completion(&wq->started);
+	tsk = create_io_thread(io_wq_manager, wq, NUMA_NO_NODE);
+	if (!IS_ERR(tsk)) {
+		wq->manager = get_task_struct(tsk);
+		wake_up_new_task(tsk);
 		return 0;
 	}
 
-	return ret;
+	return PTR_ERR(tsk);
 }
 
 static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
@@ -1062,7 +1029,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	}
 
 	wq->task_pid = current->pid;
-	init_completion(&wq->started);
 	init_completion(&wq->exited);
 	refcount_set(&wq->refs, 1);
 
diff --git a/fs/io-wq.h b/fs/io-wq.h
index 42f0be64a84d..5fbf7997149e 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -119,8 +119,6 @@ void io_wq_put_and_exit(struct io_wq *wq);
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work);
 void io_wq_hash_work(struct io_wq_work *work, void *val);
 
-pid_t io_wq_fork_thread(int (*fn)(void *), void *arg);
-
 static inline bool io_wq_is_hashed(struct io_wq_work *work)
 {
 	return work->flags & IO_WQ_WORK_HASHED;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e55369555e5c..04f04ac3c4cf 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6677,8 +6677,6 @@ static int io_sq_thread(void *data)
 		set_cpus_allowed_ptr(current, cpu_online_mask);
 	current->flags |= PF_NO_SETAFFINITY;
 
-	complete(&sqd->completion);
-
 	wait_for_completion(&sqd->startup);
 
 	while (!io_sq_thread_should_stop(sqd)) {
@@ -7818,21 +7816,24 @@ void __io_uring_free(struct task_struct *tsk)
 
 static int io_sq_thread_fork(struct io_sq_data *sqd, struct io_ring_ctx *ctx)
 {
+	struct task_struct *tsk;
 	int ret;
 
 	clear_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
 	reinit_completion(&sqd->completion);
 	ctx->sqo_exec = 0;
 	sqd->task_pid = current->pid;
-	current->flags |= PF_IO_WORKER;
-	ret = io_wq_fork_thread(io_sq_thread, sqd);
-	current->flags &= ~PF_IO_WORKER;
-	if (ret < 0) {
+	tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+	if (IS_ERR(tsk)) {
 		sqd->thread = NULL;
 		return ret;
 	}
-	wait_for_completion(&sqd->completion);
-	return io_uring_alloc_task_context(sqd->thread, ctx);
+	sqd->thread = tsk;
+	ret = io_uring_alloc_task_context(tsk, ctx);
+	if (ret)
+		set_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
+	wake_up_new_task(tsk);
+	return ret;
 }
 
 static int io_sq_offload_create(struct io_ring_ctx *ctx,
@@ -7855,6 +7856,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		fdput(f);
 	}
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		struct task_struct *tsk;
 		struct io_sq_data *sqd;
 
 		ret = -EPERM;
@@ -7896,15 +7898,16 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		}
 
 		sqd->task_pid = current->pid;
-		current->flags |= PF_IO_WORKER;
-		ret = io_wq_fork_thread(io_sq_thread, sqd);
-		current->flags &= ~PF_IO_WORKER;
-		if (ret < 0) {
+		tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+		if (IS_ERR(tsk)) {
 			sqd->thread = NULL;
 			goto err;
 		}
-		wait_for_completion(&sqd->completion);
 		ret = io_uring_alloc_task_context(sqd->thread, ctx);
+		if (ret)
+			set_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
+		sqd->thread = tsk;
+		wake_up_new_task(tsk);
 		if (ret)
 			goto err;
 	} else if (p->flags & IORING_SETUP_SQ_AFF) {
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 19:54                         ` Jens Axboe
@ 2021-03-04 20:00                           ` Jens Axboe
  2021-03-04 20:23                             ` Jens Axboe
  2021-03-04 20:50                           ` Linus Torvalds
  1 sibling, 1 reply; 49+ messages in thread
From: Jens Axboe @ 2021-03-04 20:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stefan Metzmacher, io-uring, Eric W. Biederman, Al Viro

On 3/4/21 12:54 PM, Jens Axboe wrote:
> On 3/4/21 12:46 PM, Linus Torvalds wrote:
>> On Thu, Mar 4, 2021 at 11:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> Took a quick look at this, and I agree that's _much_ better. In fact, it
>>> boils down to just calling copy_process() and then having the caller do
>>> wake_up_new_task(). So not sure if it's worth adding a
>>> create_io_thread() helper, or just make copy_process() available
>>> instead. This is ignoring the trace point for now...
>>
>> I really don't want to expose copy_process() outside of kernel/fork.c.
>>
>> The whole three-phase "copy - setup - activate" model is a really
>> really good thing, and it's how we've done things internally almost
>> forever, but I really don't want to expose those middle stages to any
>> outsiders.
>>
>> So I'd really prefer a simple new "create_io_worker()", even if it's
>> literally just some four-line function that does
>>
>>    p = copy_process(..);
>>    if (!IS_ERR(p)) {
>>       block all signals in p
>>       set PF_IO_WORKER flag
>>       wake_up_new_task(p);
>>    }
>>    return p;
>>
>> I very much want that to be inside kernel/fork.c and have all these
>> rules about creating new threads localized there.
> 
> I agree, here are the two current patches. Just need to add the signal
> blocking, which I'd love to do in create_io_thread(), but that seems to
> require either an allocation or a helper to do it in the thread
> itself (with an on-stack mask).

Nevermind, it's actually copied, so we can do it in create_io_thread().
I know you'd prefer not to expose the 'task created but not active' state,
but:

1) That allows us to do further setup in the creator and hence eliminate
   wait+complete for that

2) It's not exported, so not available to drivers etc.
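
IOW, create_io_thread() then just ends up doing (as in the updated patch in
the next mail):

	tsk = copy_process(NULL, 0, node, &args);
	if (!IS_ERR(tsk))
		sigfillset(&tsk->blocked);
	return tsk;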

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 20:00                           ` Jens Axboe
@ 2021-03-04 20:23                             ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-03-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stefan Metzmacher, io-uring, Eric W. Biederman, Al Viro

[-- Attachment #1: Type: text/plain, Size: 2037 bytes --]

On 3/4/21 1:00 PM, Jens Axboe wrote:
> On 3/4/21 12:54 PM, Jens Axboe wrote:
>> On 3/4/21 12:46 PM, Linus Torvalds wrote:
>>> On Thu, Mar 4, 2021 at 11:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> Took a quick look at this, and I agree that's _much_ better. In fact, it
>>>> boils down to just calling copy_process() and then having the caller do
>>>> wake_up_new_task(). So not sure if it's worth adding a
>>>> create_io_thread() helper, or just make copy_process() available
>>>> instead. This is ignoring the trace point for now...
>>>
>>> I really don't want to expose copy_process() outside of kernel/fork.c.
>>>
>>> The whole three-phase "copy - setup - activate" model is a really
>>> really good thing, and it's how we've done things internally almost
>>> forever, but I really don't want to expose those middle stages to any
>>> outsiders.
>>>
>>> So I'd really prefer a simple new "create_io_worker()", even if it's
>>> literally just some four-line function that does
>>>
>>>    p = copy_process(..);
>>>    if (!IS_ERR(p)) {
>>>       block all signals in p
>>>       set PF_IO_WORKER flag
>>>       wake_up_new_task(p);
>>>    }
>>>    return p;
>>>
>>> I very much want that to be inside kernel/fork.c and have all these
>>> rules about creating new threads localized there.
>>
>> I agree, here are the two current patches. Just need to add the signal
>> blocking, which I'd love to do in create_io_thread(), but that seems to
>> require either an allocation or a helper to do it in the thread
>> itself (with an on-stack mask).
> 
> Nevermind, it's actually copied, so we can do it in create_io_thread().
> I know you'd prefer not to expose the 'task created but not active' state,
> but:
> 
> 1) That allows us to do further setup in the creator and hence eliminate
>    wait+complete for that
> 
> 2) It's not exported, so not available to drivers etc.
> 

Here's a version that includes the signal blocking too, inside the
create_io_thread() helper. I'll run this through the usual testing.

-- 
Jens Axboe


[-- Attachment #2: 0001-kernel-provide-create_io_thread-helper.patch --]
[-- Type: text/x-patch, Size: 2906 bytes --]

From 81910fbd73e7eecea2827c407dbcaab49085c5e3 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Thu, 4 Mar 2021 12:21:05 -0700
Subject: [PATCH 1/2] kernel: provide create_io_thread() helper

Provide a generic helper for setting up an io_uring worker. Returns a
task_struct so that the caller can do whatever setup is needed, then call
wake_up_new_task() to kick it into gear.

Add a kernel_clone_args member, io_thread, which tells copy_process() to
mark the task with PF_IO_WORKER.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/sched/task.h |  2 ++
 kernel/fork.c              | 28 ++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index c0f71f2e7160..ef02be869cf2 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -31,6 +31,7 @@ struct kernel_clone_args {
 	/* Number of elements in *set_tid */
 	size_t set_tid_size;
 	int cgroup;
+	int io_thread;
 	struct cgroup *cgrp;
 	struct css_set *cset;
 };
@@ -82,6 +83,7 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
+struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index d66cd1014211..08708865c58f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1940,6 +1940,8 @@ static __latent_entropy struct task_struct *copy_process(
 	p = dup_task_struct(current, node);
 	if (!p)
 		goto fork_out;
+	if (args->io_thread)
+		p->flags |= PF_IO_WORKER;
 
 	/*
 	 * This _must_ happen before we call free_task(), i.e. before we jump
@@ -2410,6 +2412,32 @@ struct mm_struct *copy_init_mm(void)
 	return dup_mm(NULL, &init_mm);
 }
 
+/*
+ * This is like kernel_clone(), but shaved down and tailored to just
+ * creating io_uring workers. It returns a created task, or an error pointer.
+ * The returned task is inactive, and the caller must fire it up through
+ * wake_up_new_task(p). All signals are blocked in the created task.
+ */
+struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
+{
+	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
+				CLONE_IO|SIGCHLD;
+	struct kernel_clone_args args = {
+		.flags		= ((lower_32_bits(flags) | CLONE_VM |
+				    CLONE_UNTRACED) & ~CSIGNAL),
+		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
+		.stack		= (unsigned long)fn,
+		.stack_size	= (unsigned long)arg,
+		.io_thread	= 1,
+	};
+	struct task_struct *tsk;
+
+	tsk = copy_process(NULL, 0, node, &args);
+	if (!IS_ERR(tsk))
+		sigfillset(&tsk->blocked);
+	return tsk;
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
2.30.1


[-- Attachment #3: 0002-io_uring-move-to-using-create_io_thread.patch --]
[-- Type: text/x-patch, Size: 8493 bytes --]

From f11913b472cdc46082466dbb6cd56f105e5dcdd7 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Thu, 4 Mar 2021 12:39:36 -0700
Subject: [PATCH 2/2] io_uring: move to using create_io_thread()

This allows us to do task creation and setup without needing to use
completions to try and synchronize with the starting thread. Get rid of
the old io_wq_fork_thread() wrapper, and the 'wq' and 'worker' startup
completion events - we can now do setup before the task is running.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c    | 69 ++++++++++++++-------------------------------------
 fs/io-wq.h    |  2 --
 fs/io_uring.c | 36 +++++++++++++--------------
 3 files changed, 35 insertions(+), 72 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 19f18389ead2..cee41b81747c 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -54,7 +54,6 @@ struct io_worker {
 	spinlock_t lock;
 
 	struct completion ref_done;
-	struct completion started;
 
 	struct rcu_head rcu;
 };
@@ -116,7 +115,6 @@ struct io_wq {
 	struct io_wq_hash *hash;
 
 	refcount_t refs;
-	struct completion started;
 	struct completion exited;
 
 	atomic_t worker_refs;
@@ -273,14 +271,6 @@ static void io_wqe_dec_running(struct io_worker *worker)
 		io_wqe_wake_worker(wqe, acct);
 }
 
-static void io_worker_start(struct io_worker *worker)
-{
-	current->flags |= PF_NOFREEZE;
-	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
-	io_wqe_inc_running(worker);
-	complete(&worker->started);
-}
-
 /*
  * Worker will start processing some work. Move it to the busy list, if
  * it's currently on the freelist
@@ -490,8 +480,6 @@ static int io_wqe_worker(void *data)
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 
-	io_worker_start(worker);
-
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		set_current_state(TASK_INTERRUPTIBLE);
 loop:
@@ -576,12 +564,6 @@ static int task_thread(void *data, int index)
 	sprintf(buf, "iou-wrk-%d", wq->task_pid);
 	set_task_comm(current, buf);
 
-	current->pf_io_worker = worker;
-	worker->task = current;
-
-	set_cpus_allowed_ptr(current, cpumask_of_node(wqe->node));
-	current->flags |= PF_NO_SETAFFINITY;
-
 	raw_spin_lock_irq(&wqe->lock);
 	hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
 	list_add_tail_rcu(&worker->all_list, &wqe->all_list);
@@ -607,25 +589,10 @@ static int task_thread_unbound(void *data)
 	return task_thread(data, IO_WQ_ACCT_UNBOUND);
 }
 
-pid_t io_wq_fork_thread(int (*fn)(void *), void *arg)
-{
-	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
-				CLONE_IO|SIGCHLD;
-	struct kernel_clone_args args = {
-		.flags		= ((lower_32_bits(flags) | CLONE_VM |
-				    CLONE_UNTRACED) & ~CSIGNAL),
-		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
-		.stack		= (unsigned long)fn,
-		.stack_size	= (unsigned long)arg,
-	};
-
-	return kernel_clone(&args);
-}
-
 static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 {
 	struct io_worker *worker;
-	pid_t pid;
+	struct task_struct *tsk;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -638,21 +605,26 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 	worker->wqe = wqe;
 	spin_lock_init(&worker->lock);
 	init_completion(&worker->ref_done);
-	init_completion(&worker->started);
 
 	atomic_inc(&wq->worker_refs);
 
 	if (index == IO_WQ_ACCT_BOUND)
-		pid = io_wq_fork_thread(task_thread_bound, worker);
+		tsk = create_io_thread(task_thread_bound, worker, wqe->node);
 	else
-		pid = io_wq_fork_thread(task_thread_unbound, worker);
-	if (pid < 0) {
+		tsk = create_io_thread(task_thread_unbound, worker, wqe->node);
+	if (IS_ERR(tsk)) {
 		if (atomic_dec_and_test(&wq->worker_refs))
 			complete(&wq->worker_done);
 		kfree(worker);
 		return false;
 	}
-	wait_for_completion(&worker->started);
+	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
+	io_wqe_inc_running(worker);
+	tsk->pf_io_worker = worker;
+	worker->task = tsk;
+	set_cpus_allowed_ptr(tsk, cpumask_of_node(wqe->node));
+	tsk->flags |= PF_NOFREEZE | PF_NO_SETAFFINITY;
+	wake_up_new_task(tsk);
 	return true;
 }
 
@@ -696,6 +668,7 @@ static bool io_wq_for_each_worker(struct io_wqe *wqe,
 
 static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 {
+	set_notify_signal(worker->task);
 	wake_up_process(worker->task);
 	return false;
 }
@@ -752,10 +725,6 @@ static int io_wq_manager(void *data)
 
 	sprintf(buf, "iou-mgr-%d", wq->task_pid);
 	set_task_comm(current, buf);
-	current->flags |= PF_IO_WORKER;
-	wq->manager = get_task_struct(current);
-
-	complete(&wq->started);
 
 	do {
 		set_current_state(TASK_INTERRUPTIBLE);
@@ -815,21 +784,20 @@ static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
 
 static int io_wq_fork_manager(struct io_wq *wq)
 {
-	int ret;
+	struct task_struct *tsk;
 
 	if (wq->manager)
 		return 0;
 
 	reinit_completion(&wq->worker_done);
-	current->flags |= PF_IO_WORKER;
-	ret = io_wq_fork_thread(io_wq_manager, wq);
-	current->flags &= ~PF_IO_WORKER;
-	if (ret >= 0) {
-		wait_for_completion(&wq->started);
+	tsk = create_io_thread(io_wq_manager, wq, NUMA_NO_NODE);
+	if (!IS_ERR(tsk)) {
+		wq->manager = get_task_struct(tsk);
+		wake_up_new_task(tsk);
 		return 0;
 	}
 
-	return ret;
+	return PTR_ERR(tsk);
 }
 
 static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
@@ -1062,7 +1030,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	}
 
 	wq->task_pid = current->pid;
-	init_completion(&wq->started);
 	init_completion(&wq->exited);
 	refcount_set(&wq->refs, 1);
 
diff --git a/fs/io-wq.h b/fs/io-wq.h
index 42f0be64a84d..5fbf7997149e 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -119,8 +119,6 @@ void io_wq_put_and_exit(struct io_wq *wq);
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work);
 void io_wq_hash_work(struct io_wq_work *work, void *val);
 
-pid_t io_wq_fork_thread(int (*fn)(void *), void *arg);
-
 static inline bool io_wq_is_hashed(struct io_wq_work *work)
 {
 	return work->flags & IO_WQ_WORK_HASHED;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e55369555e5c..d885fbd53bbc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6668,7 +6668,6 @@ static int io_sq_thread(void *data)
 
 	sprintf(buf, "iou-sqp-%d", sqd->task_pid);
 	set_task_comm(current, buf);
-	sqd->thread = current;
 	current->pf_io_worker = NULL;
 
 	if (sqd->sq_cpu != -1)
@@ -6677,8 +6676,6 @@ static int io_sq_thread(void *data)
 		set_cpus_allowed_ptr(current, cpu_online_mask);
 	current->flags |= PF_NO_SETAFFINITY;
 
-	complete(&sqd->completion);
-
 	wait_for_completion(&sqd->startup);
 
 	while (!io_sq_thread_should_stop(sqd)) {
@@ -7818,21 +7815,22 @@ void __io_uring_free(struct task_struct *tsk)
 
 static int io_sq_thread_fork(struct io_sq_data *sqd, struct io_ring_ctx *ctx)
 {
+	struct task_struct *tsk;
 	int ret;
 
 	clear_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
 	reinit_completion(&sqd->completion);
 	ctx->sqo_exec = 0;
 	sqd->task_pid = current->pid;
-	current->flags |= PF_IO_WORKER;
-	ret = io_wq_fork_thread(io_sq_thread, sqd);
-	current->flags &= ~PF_IO_WORKER;
-	if (ret < 0) {
-		sqd->thread = NULL;
+	tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+	if (IS_ERR(tsk))
 		return ret;
-	}
-	wait_for_completion(&sqd->completion);
-	return io_uring_alloc_task_context(sqd->thread, ctx);
+	ret = io_uring_alloc_task_context(tsk, ctx);
+	if (ret)
+		set_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
+	sqd->thread = tsk;
+	wake_up_new_task(tsk);
+	return ret;
 }
 
 static int io_sq_offload_create(struct io_ring_ctx *ctx,
@@ -7855,6 +7853,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		fdput(f);
 	}
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		struct task_struct *tsk;
 		struct io_sq_data *sqd;
 
 		ret = -EPERM;
@@ -7896,15 +7895,16 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		}
 
 		sqd->task_pid = current->pid;
-		current->flags |= PF_IO_WORKER;
-		ret = io_wq_fork_thread(io_sq_thread, sqd);
-		current->flags &= ~PF_IO_WORKER;
-		if (ret < 0) {
-			sqd->thread = NULL;
-			goto err;
-		}
+		tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+		if (IS_ERR(tsk)) {
+			ret = PTR_ERR(tsk);
+			goto err;
+		}
-		wait_for_completion(&sqd->completion);
-		ret = io_uring_alloc_task_context(sqd->thread, ctx);
+		ret = io_uring_alloc_task_context(tsk, ctx);
+		if (ret)
+			set_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
+		sqd->thread = tsk;
+		wake_up_new_task(tsk);
 		if (ret)
 			goto err;
 	} else if (p->flags & IORING_SETUP_SQ_AFF) {
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread
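
[A note for readers without the full series at hand: create_io_thread()
is added by an earlier patch in this set, in kernel/fork.c. Unlike the
io_wq_fork_thread() helper it replaces, it returns the new task_struct
without scheduling it, so the caller can finish setup (CPU affinity,
->pf_io_worker, task context) before calling wake_up_new_task(). Below
is a minimal sketch of the helper, reconstructed from the fork_thread()
code quoted later in this thread and from what eventually landed; treat
the details as approximate, and note the SIGCHLD flag is questioned and
dropped downthread:

struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
{
	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
				CLONE_IO|SIGCHLD;
	struct kernel_clone_args args = {
		.flags		= ((lower_32_bits(flags) | CLONE_VM |
				    CLONE_UNTRACED) & ~CSIGNAL),
		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
		.stack		= (unsigned long)fn,	/* reused: thread fn */
		.stack_size	= (unsigned long)arg,	/* reused: fn argument */
		.io_thread	= 1,			/* marks PF_IO_WORKER */
	};

	/* copy_process() creates the task stopped; the caller wakes it */
	return copy_process(NULL, 0, node, &args);
}]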

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 19:54                         ` Jens Axboe
  2021-03-04 20:00                           ` Jens Axboe
@ 2021-03-04 20:50                           ` Linus Torvalds
  2021-03-04 20:54                             ` Jens Axboe
  1 sibling, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2021-03-04 20:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Stefan Metzmacher, io-uring, Eric W. Biederman, Al Viro

On Thu, Mar 4, 2021 at 11:54 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> I agree, here are the two current patches. Just need to add the signal
> blocking, which I'd love to do in create_io_thread(), but seems to
> require either an allocation or provide a helper to do it in the thread
> itself (with an on-stack mask).

Hmm. Why do you set SIGCHLD in create_io_thread()? Do you actually use
it? Shouldn't IO thread exit be a silent thing?

And why do you have those task_thread_bound() and
task_thread_unbound() functions? As far as I can tell, you have those
two functions just to set the process worker flags.

Why don't you just do that now inside create_io_worker(), and the
whole task_thread_[un]bound() thing goes away, and you just always
start the IO thread in task_thread() itself?

              Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 20:50                           ` Linus Torvalds
@ 2021-03-04 20:54                             ` Jens Axboe
  0 siblings, 0 replies; 49+ messages in thread
From: Jens Axboe @ 2021-03-04 20:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stefan Metzmacher, io-uring, Eric W. Biederman, Al Viro

On 3/4/21 1:50 PM, Linus Torvalds wrote:
> On Thu, Mar 4, 2021 at 11:54 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> I agree, here are the two current patches. Just need to add the signal
>> blocking, which I'd love to do in create_io_thread(), but seems to
>> require either an allocation or provide a helper to do it in the thread
>> itself (with an on-stack mask).
> 
> Hmm. Why do you set SIGCHLD in create_io_thread()? Do you actually use
> it? Shouldn't IO thread exit be a silent thing?

Good catch, I will drop that.

> And why do you have those task_thread_bound() and
> task_thread_unbound() functions? As far as I can tell, you have those
> two functions just to set the process worker flags.

Already dropped that, it's not in the current set.

> Why don't you just do that now inside create_io_worker(), and the
> whole task_thread_[un]bound() thing goes away, and you just always
> start the IO thread in task_thread() itself?

That's exactly what I did:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.12&id=dda3b920327093a2bb8c2e5db26db9203d7f60e6

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 49+ messages in thread
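
[For reference, dropping SIGCHLD amounts to removing it from the clone
flags in create_io_thread(): exit_signal is computed as
lower_32_bits(flags) & CSIGNAL, so it becomes 0 and the worker's exit is
silent. A hypothetical hunk illustrating the change, not quoted from the
actual commit:

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
-				CLONE_IO|SIGCHLD;
+				CLONE_IO;]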

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 13:05     ` Jens Axboe
  2021-03-04 13:19       ` Stefan Metzmacher
@ 2021-03-05 19:00       ` Eric W. Biederman
  1 sibling, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2021-03-05 19:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Stefan Metzmacher, io-uring, viro, torvalds

Jens Axboe <axboe@kernel.dk> writes:

> On 3/4/21 5:23 AM, Stefan Metzmacher wrote:
>> 
>> Hi Jens,
>> 
>>> +static pid_t fork_thread(int (*fn)(void *), void *arg)
>>> +{
>>> +	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
>>> +				CLONE_IO|SIGCHLD;
>>> +	struct kernel_clone_args args = {
>>> +		.flags		= ((lower_32_bits(flags) | CLONE_VM |
>>> +				    CLONE_UNTRACED) & ~CSIGNAL),
>>> +		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
>>> +		.stack		= (unsigned long)fn,
>>> +		.stack_size	= (unsigned long)arg,
>>> +	};
>>> +
>>> +	return kernel_clone(&args);
>>> +}
>> 
>> Can you please explain why CLONE_SIGHAND is used here?
>
> We can't have CLONE_THREAD without CLONE_SIGHAND... The io-wq workers
> don't really care about signals, we don't use them internally.
>
>> Will the userspace signal handlers be executed from the kernel thread?
>
> No
>
>> Will SIGCHLD be posted to the userspace signal handlers in a userspace
>> process? Will wait() from userspace see the exit of a thread?
>
> Currently it actually does, but I think that's just an oversight. As far
> as I can tell, we want to add something like the below. Untested... I'll
> give this a spin in a bit.

How do you mean?  Where do you see do_notify_parent being called?

It should not happen in exit_notify, as the new threads should
be neither ptraced nor the thread_group_leader.  Nor should
do_notify_parent be called from wait_task_zombie as PF_IO_WORKERS
are not ptraceable.  Nor should do_notify_parent be called from
reparent_leader, as the PF_IO_WORKER is not the thread_group_leader.
Non-leader threads always autoreap and their exit_state is either 0
or EXIT_DEAD.

Which leaves calling do_notify_parent in release_task which is perfectly
appropriate if the io_worker is the last thread in the thread_group.

I can see modifying eligible_child so __WCLONE will not cause wait to
show the kernel thread.  I don't think wait_task_stopped or
wait_task_continued will register on a PF_IO_WORKER thread if it does not
process signals, but I only skimmed those two functions when I was
looking.

It definitely looks like it would be worth modifying do_signal_stop so
that the PF_IO_WORKERs are not included.  Or else modifying the
PF_IO_WORKER threads to stop with the rest of the process in that case.

Eric

> diff --git a/kernel/signal.c b/kernel/signal.c
> index ba4d1ef39a9e..e5db1d8f18e5 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1912,6 +1912,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
>  	bool autoreap = false;
>  	u64 utime, stime;
>  
> +	/* Don't notify a parent task if an io_uring worker exits */
> +	if (tsk->flags & PF_IO_WORKER)
> +		return true;
> +
>  	BUG_ON(sig == -1);
>  
>   	/* do_notify_parent_cldstop should have been called instead.  */

^ permalink raw reply	[flat|nested] 49+ messages in thread
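
[A sketch of the eligible_child() change Eric floats above -- hypothetical,
not part of the posted series: refuse to report PF_IO_WORKER threads from
wait(), whether or not __WCLONE is passed:

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ static int eligible_child(struct wait_opts *wo, bool ptrace,
 			  struct task_struct *p)
 {
+	/*
+	 * PF_IO_WORKER threads are kernel-managed; never let userspace
+	 * wait() see them, even with __WCLONE.
+	 */
+	if (p->flags & PF_IO_WORKER)
+		return 0;
+
 	if (!eligible_pid(wo, p))
 		return 0;]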

* Re: [PATCH 09/18] io-wq: fork worker threads from original task
  2021-03-04 16:13         ` Stefan Metzmacher
  2021-03-04 16:42           ` Jens Axboe
@ 2021-03-05 19:16           ` Eric W. Biederman
  1 sibling, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2021-03-05 19:16 UTC (permalink / raw)
  To: Stefan Metzmacher; +Cc: Jens Axboe, io-uring, viro, torvalds

Stefan Metzmacher <metze@samba.org> writes:

> On 04.03.21 at 14:19, Stefan Metzmacher wrote:
>> Hi Jens,
>> 
>>>> Can you please explain why CLONE_SIGHAND is used here?
>>>
>>> We can't have CLONE_THREAD without CLONE_SIGHAND... The io-wq workers
>>> don't really care about signals, we don't use them internally.
>> 
>> I'm not 100% sure, but I heard rumors that in some situations signals get
>> randomly delivered to any thread of a userspace process.
>
> Ok, from task_struct:
>
>         /* Signal handlers: */
>         struct signal_struct            *signal;
>         struct sighand_struct __rcu             *sighand;
>         sigset_t                        blocked;
>         sigset_t                        real_blocked;
>         /* Restored if set_restore_sigmask() was used: */
>         sigset_t                        saved_sigmask;
>         struct sigpending               pending;
>
> The signal handlers are shared, but 'blocked' is per thread/task.

Doing something so that wants_signal won't try and route
a signal to a PF_IO_WORKER seems sensible.

Either blocking the signal or modifying wants_signal.

>> My fear was that the related logic may select a kernel thread if they
>> share the same signal handlers.
>
> I found the related logic in the interaction between
> complete_signal() and wants_signal().
>
> static inline bool wants_signal(int sig, struct task_struct *p)
> {
>         if (sigismember(&p->blocked, sig))
>                 return false;
>
> ...
>
> Would it make sense to set up task->blocked to block all signals?
>
> Something like this:
>
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -611,15 +611,15 @@ pid_t io_wq_fork_thread(int (*fn)(void *), void *arg)
>  {
>         unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
>                                 CLONE_IO|SIGCHLD;
> -       struct kernel_clone_args args = {
> -               .flags          = ((lower_32_bits(flags) | CLONE_VM |
> -                                   CLONE_UNTRACED) & ~CSIGNAL),
> -               .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
> -               .stack          = (unsigned long)fn,
> -               .stack_size     = (unsigned long)arg,
> -       };
> +       sigset_t mask, oldmask;
> +       pid_t pid;
>
> -       return kernel_clone(&args);
> +       sigfillset(&mask);
> +       sigprocmask(SIG_BLOCK, &mask, &oldmask);
> +       pid = kernel_thread(fn, arg, flags);
> +       sigprocmask(SIG_SETMASK, &oldmask, NULL);
> +
> +       return pid;
>  }
>
> I think using kernel_thread() would be a good simplification anyway.

I have a memory of kernel_thread having a built-in assumption that it is
being called from kthreadd, but I am not seeing it now, so that would
be a nice simplification if we can do that.

> sig_task_ignored() has some PF_IO_WORKER logic.
>
> Or is there any PF_IO_WORKER-related logic that prevents
> an io_wq thread from being selected in complete_signal()?
>
> Or PF_IO_WORKER would teach kernel_clone to ignore CLONE_SIGHAND
> and create a fresh handler and alter the copy_signal() and copy_sighand()
> checks...

I believe it is desirable for SIGKILL to the process to kill all of its
PF_IO_WORKERS as well.

All that wants_signal allows/prevents is a wake-up to request the task
to call get_signal.  No matter what complete_signal suggests, any thread
can still dequeue the signal and process it.

It probably makes sense to block everything except SIGKILL (and
SIGSTOP?) in task_thread so that wants_signal doesn't fail to wake up an
ordinary thread that could handle the signal when the signal arrives.

Eric


^ permalink raw reply	[flat|nested] 49+ messages in thread
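
[A minimal sketch of that last suggestion, run early from the io thread
itself; the helper name is hypothetical, and the locking follows the
usual pattern for updating ->blocked:

static void io_thread_block_signals(void)
{
	spin_lock_irq(&current->sighand->siglock);
	/* Block everything except SIGKILL and SIGSTOP. */
	siginitsetinv(&current->blocked,
		      sigmask(SIGKILL) | sigmask(SIGSTOP));
	recalc_sigpending();
	spin_unlock_irq(&current->sighand->siglock);
}

With this mask in place, wants_signal() sees every catchable signal as
blocked on the worker, so complete_signal() routes those to ordinary
threads instead, while SIGKILL still takes the whole thread group down,
io workers included.]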

end of thread, other threads:[~2021-03-05 19:17 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
2021-02-19 17:09 ` [PATCH 01/18] io_uring: remove the need for relying on an io-wq fallback worker Jens Axboe
2021-02-19 20:25   ` Eric W. Biederman
2021-02-19 20:37     ` Jens Axboe
2021-02-22 13:46   ` Pavel Begunkov
2021-02-19 17:09 ` [PATCH 02/18] io-wq: don't create any IO workers upfront Jens Axboe
2021-02-19 17:09 ` [PATCH 03/18] io_uring: disable io-wq attaching Jens Axboe
2021-02-19 17:09 ` [PATCH 04/18] io-wq: get rid of wq->use_refs Jens Axboe
2021-02-19 17:09 ` [PATCH 05/18] io_uring: tie async worker side to the task context Jens Axboe
2021-02-20  8:11   ` Hao Xu
2021-02-20 14:38     ` Jens Axboe
2021-02-21  9:16       ` Hao Xu
2021-02-19 17:09 ` [PATCH 06/18] io-wq: don't pass 'wqe' needlessly around Jens Axboe
2021-02-19 17:09 ` [PATCH 07/18] arch: setup PF_IO_WORKER threads like PF_KTHREAD Jens Axboe
2021-02-19 22:21   ` Eric W. Biederman
2021-02-19 23:26     ` Jens Axboe
2021-02-19 17:10 ` [PATCH 08/18] kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signals Jens Axboe
2021-02-19 17:10 ` [PATCH 09/18] io-wq: fork worker threads from original task Jens Axboe
2021-03-04 12:23   ` Stefan Metzmacher
2021-03-04 13:05     ` Jens Axboe
2021-03-04 13:19       ` Stefan Metzmacher
2021-03-04 16:13         ` Stefan Metzmacher
2021-03-04 16:42           ` Jens Axboe
2021-03-04 17:09             ` Stefan Metzmacher
2021-03-04 17:32               ` Jens Axboe
2021-03-04 18:19                 ` Jens Axboe
2021-03-04 18:56                   ` Linus Torvalds
2021-03-04 19:19                     ` Jens Axboe
2021-03-04 19:46                       ` Linus Torvalds
2021-03-04 19:54                         ` Jens Axboe
2021-03-04 20:00                           ` Jens Axboe
2021-03-04 20:23                             ` Jens Axboe
2021-03-04 20:50                           ` Linus Torvalds
2021-03-04 20:54                             ` Jens Axboe
2021-03-05 19:16           ` Eric W. Biederman
2021-03-05 19:00       ` Eric W. Biederman
2021-02-19 17:10 ` [PATCH 10/18] io-wq: worker idling always returns false Jens Axboe
2021-02-19 17:10 ` [PATCH 11/18] io_uring: remove any grabbing of context Jens Axboe
2021-02-19 17:10 ` [PATCH 12/18] io_uring: remove io_identity Jens Axboe
2021-02-19 17:10 ` [PATCH 13/18] io-wq: only remove worker from free_list, if it was there Jens Axboe
2021-02-19 17:10 ` [PATCH 14/18] io-wq: make io_wq_fork_thread() available to other users Jens Axboe
2021-02-19 17:10 ` [PATCH 15/18] io_uring: move SQPOLL thread io-wq forked worker Jens Axboe
2021-02-19 17:10 ` [PATCH 16/18] Revert "proc: don't allow async path resolution of /proc/thread-self components" Jens Axboe
2021-02-19 17:10 ` [PATCH 17/18] Revert "proc: don't allow async path resolution of /proc/self components" Jens Axboe
2021-02-19 17:10 ` [PATCH 18/18] net: remove cmsg restriction from io_uring based send/recvmsg calls Jens Axboe
2021-02-19 23:44 ` [PATCHSET RFC 0/18] Remove kthread usage from io_uring Stefan Metzmacher
2021-02-19 23:51   ` Jens Axboe
2021-02-21  5:04 ` Linus Torvalds
2021-02-21 21:22   ` Jens Axboe
