io-uring.vger.kernel.org archive mirror
* [PATCHSET 0/9] io_uring: use polled async retry
@ 2020-02-20 20:31 Jens Axboe
  2020-02-20 20:31 ` [PATCH 1/9] io_uring: consider any io_read/write -EAGAIN as final Jens Axboe
                   ` (8 more replies)
  0 siblings, 9 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 20:31 UTC (permalink / raw)
  To: io-uring; +Cc: glauber, peterz, asml.silence

We currently need to go async (meaning punt to a worker thread helper)
when we complete a poll request, if we have a linked request after it.
This isn't as fast as it could be. Similarly, if we try to read from
a socket (or similar) and we get -EAGAIN, we punt to an async worker
thread.

This clearly isn't optimal, both in terms of latency and system
resources.

This patchset attempts to rectify that by revamping the poll setup
infrastructure, and using that same infrastructure to handle async IO
on file types that support polling for availability of data and/or
space. The end result is a lot faster than it was before: in an echo
server example, I see about a 4x improvement in throughput for the
single-client case.

Just as important, this also means that an application can simply issue
an IORING_OP_RECV or IORING_OP_RECVMSG and have it complete when data
is available. It's no longer needed (or useful) to use a poll link
prior to the receive. Once data becomes available, it is read
immediately. Honestly, this almost feels like magic! This can completely
replace setups that currently use epoll to poll for data availability,
and then need to issue a receive after that. Just one system call for
the whole operation. This isn't specific to receive; that is just an
example. The send side works the same way.
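
As a rough illustration, here is a minimal liburing-based sketch of the
single-syscall receive (the connected socket and buffer setup are assumed
and are not part of this series):

#include <liburing.h>

/*
 * Sketch: the recv completes once data arrives, with no separate poll
 * link and no epoll round trip.
 */
static int recv_once(struct io_uring *ring, int sockfd, void *buf, size_t len)
{
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int ret;

        sqe = io_uring_get_sqe(ring);
        if (!sqe)
                return -EBUSY;
        io_uring_prep_recv(sqe, sockfd, buf, len, 0);

        /* one submission covers both "wait for data" and "read it" */
        ret = io_uring_submit(ring);
        if (ret < 0)
                return ret;

        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
                return ret;

        ret = cqe->res;         /* bytes received, or -errno */
        io_uring_cqe_seen(ring, cqe);
        return ret;
}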

This is accomplished by adding a per-task sched_work handler. The work
queued there is automatically run when a task is scheduled in or out.
When a poll request completes (either an explicit one, or one just armed
on behalf of a request that would otherwise block), the bottom half side
of the work is queued as sched_work and the task is woken.
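
For reference, the wakeup side boils down to queueing the completion on
the owning task and waking it. Condensed from patches 6 and 7 below
(tracing and the event-match check omitted), so treat it as a sketch
rather than the exact code:

static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode,
                        int sync, void *key)
{
        struct io_poll_iocb *poll = wait->private;
        struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);

        list_del_init(&poll->wait.entry);

        /* hand the completion to the task that owns the request... */
        req->result = key_to_poll(key);
        init_task_work(&req->sched_work, io_poll_task_func);
        sched_work_add(req->task, &req->sched_work);

        /* ...and wake it so the sched_work hook runs the bottom half */
        wake_up_process(req->task);
        return 1;
}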

This patchset passes my test suite, but I'd be hugely surprised if there
aren't a few corner cases that still need fixing.

-- 
Jens Axboe





* [PATCH 1/9] io_uring: consider any io_read/write -EAGAIN as final
  2020-02-20 20:31 [PATCHSET 0/9] io_uring: use polled async retry Jens Axboe
@ 2020-02-20 20:31 ` Jens Axboe
  2020-02-20 20:31 ` [PATCH 2/9] io_uring: io_accept() should hold on to submit reference on retry Jens Axboe
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 20:31 UTC (permalink / raw)
  To: io-uring; +Cc: glauber, peterz, asml.silence, Jens Axboe

If the -EAGAIN happens because of a static condition, then a poll
or later retry won't fix it. We must retry the request from a blocking
context. Play it safe and ensure that any -EAGAIN condition from read
or write is retried from async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6e249aa97ba3..bd3a39b0f4ee 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2255,10 +2255,8 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
 	 * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so
 	 * we know to async punt it even if it was opened O_NONBLOCK
 	 */
-	if (force_nonblock && !io_file_supports_async(req->file)) {
-		req->flags |= REQ_F_MUST_PUNT;
+	if (force_nonblock && !io_file_supports_async(req->file))
 		goto copy_iov;
-	}
 
 	iov_count = iov_iter_count(&iter);
 	ret = rw_verify_area(READ, req->file, &kiocb->ki_pos, iov_count);
@@ -2279,6 +2277,8 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
 						inline_vecs, &iter);
 			if (ret)
 				goto out_free;
+			/* any defer here is final, must blocking retry */
+			req->flags |= REQ_F_MUST_PUNT;
 			return -EAGAIN;
 		}
 	}
@@ -2344,10 +2344,8 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt,
 	 * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so
 	 * we know to async punt it even if it was opened O_NONBLOCK
 	 */
-	if (force_nonblock && !io_file_supports_async(req->file)) {
-		req->flags |= REQ_F_MUST_PUNT;
+	if (force_nonblock && !io_file_supports_async(req->file))
 		goto copy_iov;
-	}
 
 	/* file path doesn't support NOWAIT for non-direct_IO */
 	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT) &&
@@ -2392,6 +2390,8 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt,
 						inline_vecs, &iter);
 			if (ret)
 				goto out_free;
+			/* any defer here is final, must blocking retry */
+			req->flags |= REQ_F_MUST_PUNT;
 			return -EAGAIN;
 		}
 	}
-- 
2.25.1



* [PATCH 2/9] io_uring: io_accept() should hold on to submit reference on retry
  2020-02-20 20:31 [PATCHSET 0/9] io_uring: use polled async retry Jens Axboe
  2020-02-20 20:31 ` [PATCH 1/9] io_uring: consider any io_read/write -EAGAIN as final Jens Axboe
@ 2020-02-20 20:31 ` Jens Axboe
  2020-02-20 20:31 ` [PATCH 3/9] sched: move io-wq/workqueue worker sched in/out into helpers Jens Axboe
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 20:31 UTC (permalink / raw)
  To: io-uring; +Cc: glauber, peterz, asml.silence, Jens Axboe

Don't drop the reference early; hang on to it and let the caller drop
it. This makes it behave more like "regular" requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index bd3a39b0f4ee..c3fe2022e343 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3353,6 +3353,8 @@ static void io_accept_finish(struct io_wq_work **workptr)
 	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
 	struct io_kiocb *nxt = NULL;
 
+	io_put_req(req);
+
 	if (io_req_cancelled(req))
 		return;
 	__io_accept(req, &nxt, false);
@@ -3370,7 +3372,6 @@ static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt,
 	ret = __io_accept(req, nxt, force_nonblock);
 	if (ret == -EAGAIN && force_nonblock) {
 		req->work.func = io_accept_finish;
-		io_put_req(req);
 		return -EAGAIN;
 	}
 	return 0;
-- 
2.25.1



* [PATCH 3/9] sched: move io-wq/workqueue worker sched in/out into helpers
  2020-02-20 20:31 [PATCHSET 0/9] io_uring: use polled async retry Jens Axboe
  2020-02-20 20:31 ` [PATCH 1/9] io_uring: consider any io_read/write -EAGAIN as final Jens Axboe
  2020-02-20 20:31 ` [PATCH 2/9] io_uring: io_accept() should hold on to submit reference on retry Jens Axboe
@ 2020-02-20 20:31 ` Jens Axboe
  2020-02-20 20:31 ` [PATCH 4/9] task_work_run: don't take ->pi_lock unconditionally Jens Axboe
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 20:31 UTC (permalink / raw)
  To: io-uring; +Cc: glauber, peterz, asml.silence, Jens Axboe

We already have sched_update_worker(), which calls the "I woke up" handler
for io-wq and workqueue threads; rename it to sched_in_update(). The code
that is called when the threads are going to sleep is moved into a
matching helper, sched_out_update(), so that it mirrors the schedule-in
side.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 kernel/sched/core.c | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a9983da4408..c7bab13f9caa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4102,11 +4102,8 @@ void __noreturn do_task_dead(void)
 		cpu_relax();
 }
 
-static inline void sched_submit_work(struct task_struct *tsk)
+static void sched_out_update(struct task_struct *tsk)
 {
-	if (!tsk->state)
-		return;
-
 	/*
 	 * If a worker went to sleep, notify and ask workqueue whether
 	 * it wants to wake up a task to maintain concurrency.
@@ -4122,6 +4119,24 @@ static inline void sched_submit_work(struct task_struct *tsk)
 			io_wq_worker_sleeping(tsk);
 		preempt_enable_no_resched();
 	}
+}
+
+static void sched_in_update(struct task_struct *tsk)
+{
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+		if (tsk->flags & PF_WQ_WORKER)
+			wq_worker_running(tsk);
+		else
+			io_wq_worker_running(tsk);
+	}
+}
+
+static inline void sched_submit_work(struct task_struct *tsk)
+{
+	if (!tsk->state)
+		return;
+
+	sched_out_update(tsk);
 
 	if (tsk_is_pi_blocked(tsk))
 		return;
@@ -4134,16 +4149,6 @@ static inline void sched_submit_work(struct task_struct *tsk)
 		blk_schedule_flush_plug(tsk);
 }
 
-static void sched_update_worker(struct task_struct *tsk)
-{
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
-		if (tsk->flags & PF_WQ_WORKER)
-			wq_worker_running(tsk);
-		else
-			io_wq_worker_running(tsk);
-	}
-}
-
 asmlinkage __visible void __sched schedule(void)
 {
 	struct task_struct *tsk = current;
@@ -4154,7 +4159,7 @@ asmlinkage __visible void __sched schedule(void)
 		__schedule(false);
 		sched_preempt_enable_no_resched();
 	} while (need_resched());
-	sched_update_worker(tsk);
+	sched_in_update(tsk);
 }
 EXPORT_SYMBOL(schedule);
 
-- 
2.25.1



* [PATCH 4/9] task_work_run: don't take ->pi_lock unconditionally
  2020-02-20 20:31 [PATCHSET 0/9] io_uring: use polled async retry Jens Axboe
                   ` (2 preceding siblings ...)
  2020-02-20 20:31 ` [PATCH 3/9] sched: move io-wq/workqueue worker sched in/out into helpers Jens Axboe
@ 2020-02-20 20:31 ` Jens Axboe
  2020-02-20 20:31 ` [PATCH 5/9] kernel: abstract out task work helpers Jens Axboe
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 20:31 UTC (permalink / raw)
  To: io-uring; +Cc: glauber, peterz, asml.silence, Oleg Nesterov, Jens Axboe

From: Oleg Nesterov <oleg@redhat.com>

As Peter pointed out, task_work_run() can avoid ->pi_lock and cmpxchg()
if task->task_works == NULL && !PF_EXITING.

And in fact the only reason why task_work_run() needs ->pi_lock is
the possible race with task_work_cancel(); we can optimize this code
and make the locking clearer.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 kernel/task_work.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 0fef395662a6..825f28259a19 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -97,16 +97,26 @@ void task_work_run(void)
 		 * work->func() can do task_work_add(), do not set
 		 * work_exited unless the list is empty.
 		 */
-		raw_spin_lock_irq(&task->pi_lock);
 		do {
+			head = NULL;
 			work = READ_ONCE(task->task_works);
-			head = !work && (task->flags & PF_EXITING) ?
-				&work_exited : NULL;
+			if (!work) {
+				if (task->flags & PF_EXITING)
+					head = &work_exited;
+				else
+					break;
+			}
 		} while (cmpxchg(&task->task_works, work, head) != work);
-		raw_spin_unlock_irq(&task->pi_lock);
 
 		if (!work)
 			break;
+		/*
+		 * Synchronize with task_work_cancel(). It can not remove
+		 * the first entry == work, cmpxchg(task_works) must fail.
+		 * But it can remove another entry from the ->next list.
+		 */
+		raw_spin_lock_irq(&task->pi_lock);
+		raw_spin_unlock_irq(&task->pi_lock);
 
 		do {
 			next = work->next;
-- 
2.25.1



* [PATCH 5/9] kernel: abstract out task work helpers
  2020-02-20 20:31 [PATCHSET 0/9] io_uring: use polled async retry Jens Axboe
                   ` (3 preceding siblings ...)
  2020-02-20 20:31 ` [PATCH 4/9] task_work_run: don't take ->pi_lock unconditionally Jens Axboe
@ 2020-02-20 20:31 ` Jens Axboe
  2020-02-20 21:07   ` Peter Zijlstra
  2020-02-20 20:31 ` [PATCH 6/9] sched: add a sched_work list Jens Axboe
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 20:31 UTC (permalink / raw)
  To: io-uring; +Cc: glauber, peterz, asml.silence, Jens Axboe

This is in preparation for adding a matching sched_work list.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 kernel/task_work.c | 88 +++++++++++++++++++++++++++++-----------------
 1 file changed, 56 insertions(+), 32 deletions(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 825f28259a19..3445421266e7 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -5,6 +5,22 @@
 
 static struct callback_head work_exited; /* all we need is ->next == NULL */
 
+static int __task_work_add(struct task_struct *task,
+			   struct callback_head **headptr,
+			   struct callback_head *work)
+{
+	struct callback_head *head;
+
+	do {
+		head = READ_ONCE(*headptr);
+		if (unlikely(head == &work_exited))
+			return -ESRCH;
+		work->next = head;
+	} while (cmpxchg(headptr, head, work) != head);
+
+	return 0;
+}
+
 /**
  * task_work_add - ask the @task to execute @work->func()
  * @task: the task which should run the callback
@@ -27,39 +43,25 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
 int
 task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
-	struct callback_head *head;
+	int ret;
 
-	do {
-		head = READ_ONCE(task->task_works);
-		if (unlikely(head == &work_exited))
-			return -ESRCH;
-		work->next = head;
-	} while (cmpxchg(&task->task_works, head, work) != head);
+	ret = __task_work_add(task, &task->task_works, work);
 
 	if (notify)
 		set_notify_resume(task);
-	return 0;
+
+	return ret;
 }
 
-/**
- * task_work_cancel - cancel a pending work added by task_work_add()
- * @task: the task which should execute the work
- * @func: identifies the work to remove
- *
- * Find the last queued pending work with ->func == @func and remove
- * it from queue.
- *
- * RETURNS:
- * The found work or NULL if not found.
- */
-struct callback_head *
-task_work_cancel(struct task_struct *task, task_work_func_t func)
+static struct callback_head *__task_work_cancel(struct task_struct *task,
+						struct callback_head **headptr,
+						task_work_func_t func)
 {
-	struct callback_head **pprev = &task->task_works;
+	struct callback_head **pprev = headptr;
 	struct callback_head *work;
 	unsigned long flags;
 
-	if (likely(!task->task_works))
+	if (likely(!(*headptr)))
 		return NULL;
 	/*
 	 * If cmpxchg() fails we continue without updating pprev.
@@ -80,16 +82,25 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
 }
 
 /**
- * task_work_run - execute the works added by task_work_add()
+ * task_work_cancel - cancel a pending work added by task_work_add()
+ * @task: the task which should execute the work
+ * @func: identifies the work to remove
  *
- * Flush the pending works. Should be used by the core kernel code.
- * Called before the task returns to the user-mode or stops, or when
- * it exits. In the latter case task_work_add() can no longer add the
- * new work after task_work_run() returns.
+ * Find the last queued pending work with ->func == @func and remove
+ * it from queue.
+ *
+ * RETURNS:
+ * The found work or NULL if not found.
  */
-void task_work_run(void)
+struct callback_head *
+task_work_cancel(struct task_struct *task, task_work_func_t func)
+{
+	return __task_work_cancel(task, &task->task_works, func);
+}
+
+static void __task_work_run(struct task_struct *task,
+			    struct callback_head **headptr)
 {
-	struct task_struct *task = current;
 	struct callback_head *work, *head, *next;
 
 	for (;;) {
@@ -99,14 +110,14 @@ void task_work_run(void)
 		 */
 		do {
 			head = NULL;
-			work = READ_ONCE(task->task_works);
+			work = READ_ONCE(*headptr);
 			if (!work) {
 				if (task->flags & PF_EXITING)
 					head = &work_exited;
 				else
 					break;
 			}
-		} while (cmpxchg(&task->task_works, work, head) != work);
+		} while (cmpxchg(headptr, work, head) != work);
 
 		if (!work)
 			break;
@@ -126,3 +137,16 @@ void task_work_run(void)
 		} while (work);
 	}
 }
+
+/**
+ * task_work_run - execute the works added by task_work_add()
+ *
+ * Flush the pending works. Should be used by the core kernel code.
+ * Called before the task returns to the user-mode or stops, or when
+ * it exits. In the latter case task_work_add() can no longer add the
+ * new work after task_work_run() returns.
+ */
+void task_work_run(void)
+{
+	__task_work_run(current, &current->task_works);
+}
-- 
2.25.1



* [PATCH 6/9] sched: add a sched_work list
  2020-02-20 20:31 [PATCHSET 0/9] io_uring: use polled async retry Jens Axboe
                   ` (4 preceding siblings ...)
  2020-02-20 20:31 ` [PATCH 5/9] kernel: abstract out task work helpers Jens Axboe
@ 2020-02-20 20:31 ` Jens Axboe
  2020-02-20 21:17   ` Peter Zijlstra
  2020-02-20 20:31 ` [PATCH 7/9] io_uring: add per-task callback handler Jens Axboe
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 20:31 UTC (permalink / raw)
  To: io-uring; +Cc: glauber, peterz, asml.silence, Jens Axboe

This is similar to the task_works, and uses the same infrastructure, but
the sched_work list is run when the task is being scheduled in or out.

The intended use case here is for core code to be able to add work
that should be automatically run by the task, without the task needing
to do anything. The work is added from outside of the task; one example
would be waitqueue handlers, or anything else that is invoked out-of-band
from the task itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/sched.h     |  4 ++-
 include/linux/task_work.h |  5 ++++
 kernel/sched/core.c       | 16 ++++++++--
 kernel/task_work.c        | 62 ++++++++++++++++++++++++++++++++++++---
 4 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..da15112c1140 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -648,6 +648,7 @@ struct task_struct {
 	/* Per task flags (PF_*), defined further below: */
 	unsigned int			flags;
 	unsigned int			ptrace;
+	int				on_rq;
 
 #ifdef CONFIG_SMP
 	struct llist_node		wake_entry;
@@ -670,13 +671,14 @@ struct task_struct {
 	int				recent_used_cpu;
 	int				wake_cpu;
 #endif
-	int				on_rq;
 
 	int				prio;
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
 
+	struct callback_head		*sched_work;
+
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index bd9a6a91c097..e0c56f461df6 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -17,9 +17,14 @@ int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
 void task_work_run(void);
 
+int sched_work_add(struct task_struct *task, struct callback_head *work);
+struct callback_head *sched_work_cancel(struct task_struct *, task_work_func_t);
+void sched_work_run(void);
+
 static inline void exit_task_work(struct task_struct *task)
 {
 	task_work_run();
+	sched_work_run();
 }
 
 #endif	/* _LINUX_TASK_WORK_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7bab13f9caa..9e0f754e0630 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2678,6 +2678,7 @@ int wake_up_state(struct task_struct *p, unsigned int state)
 static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	p->on_rq			= 0;
+	p->sched_work			= NULL;
 
 	p->se.on_rq			= 0;
 	p->se.exec_start		= 0;
@@ -4102,8 +4103,13 @@ void __noreturn do_task_dead(void)
 		cpu_relax();
 }
 
-static void sched_out_update(struct task_struct *tsk)
+static bool sched_out_update(struct task_struct *tsk)
 {
+	if (unlikely(tsk->sched_work)) {
+		sched_work_run();
+		return true;
+	}
+
 	/*
 	 * If a worker went to sleep, notify and ask workqueue whether
 	 * it wants to wake up a task to maintain concurrency.
@@ -4119,6 +4125,8 @@ static void sched_out_update(struct task_struct *tsk)
 			io_wq_worker_sleeping(tsk);
 		preempt_enable_no_resched();
 	}
+
+	return false;
 }
 
 static void sched_in_update(struct task_struct *tsk)
@@ -4129,6 +4137,8 @@ static void sched_in_update(struct task_struct *tsk)
 		else
 			io_wq_worker_running(tsk);
 	}
+	if (unlikely(tsk->sched_work))
+		sched_work_run();
 }
 
 static inline void sched_submit_work(struct task_struct *tsk)
@@ -4136,7 +4146,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	if (!tsk->state)
 		return;
 
-	sched_out_update(tsk);
+	/* if we processed work, we could be runnable again. check. */
+	if (sched_out_update(tsk) && !tsk->state)
+		return;
 
 	if (tsk_is_pi_blocked(tsk))
 		return;
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 3445421266e7..ba62485d5b3d 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -3,7 +3,14 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
-static struct callback_head work_exited; /* all we need is ->next == NULL */
+static void task_exit_func(struct callback_head *head)
+{
+}
+
+static struct callback_head work_exited = {
+	.next	= NULL,
+	.func	= task_exit_func,
+};
 
 static int __task_work_add(struct task_struct *task,
 			   struct callback_head **headptr,
@@ -53,6 +60,28 @@ task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 	return ret;
 }
 
+/**
+ * sched_work_add - ask the @task to execute @work->func()
+ * @task: the task which should run the callback
+ * @work: the callback to run
+ * @notify: send the notification if true
+ *
+ * Queue @work for sched_work_run() below.
+ * Fails if the @task is exiting/exited and thus it can't process this @work.
+ * Otherwise @work->func() will be called when the @task is either scheduled
+ * in or out.
+ *
+ * Note: there is no ordering guarantee on works queued here.
+ *
+ * RETURNS:
+ * 0 if succeeds or -ESRCH.
+ */
+int
+sched_work_add(struct task_struct *task, struct callback_head *work)
+{
+	return __task_work_add(task, &task->sched_work, work);
+}
+
 static struct callback_head *__task_work_cancel(struct task_struct *task,
 						struct callback_head **headptr,
 						task_work_func_t func)
@@ -98,10 +127,27 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
 	return __task_work_cancel(task, &task->task_works, func);
 }
 
-static void __task_work_run(struct task_struct *task,
-			    struct callback_head **headptr)
+/**
+ * sched_work_cancel - cancel a pending work added by sched_work_add()
+ * @task: the task which should execute the work
+ * @func: identifies the work to remove
+ *
+ * Find the last queued pending work with ->func == @func and remove
+ * it from queue.
+ *
+ * RETURNS:
+ * The found work or NULL if not found.
+ */
+struct callback_head *
+sched_work_cancel(struct task_struct *task, task_work_func_t func)
+{
+	return __task_work_cancel(task, &task->sched_work, func);
+}
+
+static void __task_work_run(struct callback_head **headptr)
 {
 	struct callback_head *work, *head, *next;
+	struct task_struct *task = current;
 
 	for (;;) {
 		/*
@@ -148,5 +194,13 @@ static void __task_work_run(struct task_struct *task,
  */
 void task_work_run(void)
 {
-	__task_work_run(current, &current->task_works);
+	__task_work_run(&current->task_works);
+}
+
+/**
+ * sched_work_run - execute the works added by sched_work_add()
+ */
+void sched_work_run()
+{
+	__task_work_run(&current->sched_work);
 }
-- 
2.25.1



* [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 20:31 [PATCHSET 0/9] io_uring: use polled async retry Jens Axboe
                   ` (5 preceding siblings ...)
  2020-02-20 20:31 ` [PATCH 6/9] sched: add a sched_work list Jens Axboe
@ 2020-02-20 20:31 ` Jens Axboe
  2020-02-20 22:02   ` Jann Horn
  2020-02-21 13:51   ` Pavel Begunkov
  2020-02-20 20:31 ` [PATCH 8/9] io_uring: mark requests that we can do poll async in io_op_defs Jens Axboe
  2020-02-20 20:31 ` [PATCH 9/9] io_uring: use poll driven retry for files that support it Jens Axboe
  8 siblings, 2 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 20:31 UTC (permalink / raw)
  To: io-uring; +Cc: glauber, peterz, asml.silence, Jens Axboe

For poll requests, it's not uncommon to link a read (or write) after
the poll to execute immediately after the file is marked as ready.
Since the poll completion is called inside the waitqueue wake up handler,
we have to punt that linked request to async context. This slows down
the processing, and actually means it's faster to not use a link for this
use case.

We also run into problems if the completion_lock is contended, as we're
using a different lock ordering than the issue side. Hence we have to
do a trylock for completion and, if that fails, go async. Poll removal
needs to go async as well, for the same reason.

eventfd notification needs special casing as well, to avoid stack-blowing
recursion or deadlocks.

These are all deficiencies that were inherited from the aio poll
implementation, but I think we can do better. When a poll completes,
simply queue it up in the task poll list. When the task completes the
list, we can run dependent links inline as well. This means we never
have to go async, and we can remove a bunch of code associated with
that, and optimizations to try and make that run faster. The diffstat
speaks for itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 189 ++++++++++++++------------------------------------
 1 file changed, 53 insertions(+), 136 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c3fe2022e343..5991bcc24387 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -76,6 +76,7 @@
 #include <linux/fadvise.h>
 #include <linux/eventpoll.h>
 #include <linux/fs_struct.h>
+#include <linux/task_work.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -295,7 +296,6 @@ struct io_ring_ctx {
 
 	struct {
 		spinlock_t		completion_lock;
-		struct llist_head	poll_llist;
 
 		/*
 		 * ->poll_list is protected by the ctx->uring_lock for
@@ -552,10 +552,6 @@ struct io_kiocb {
 	};
 
 	struct io_async_ctx		*io;
-	/*
-	 * llist_node is only used for poll deferred completions
-	 */
-	struct llist_node		llist_node;
 	bool				in_async;
 	bool				needs_fixed_file;
 	u8				opcode;
@@ -574,7 +570,17 @@ struct io_kiocb {
 
 	struct list_head	inflight_entry;
 
-	struct io_wq_work	work;
+	union {
+		/*
+		 * Only commands that never go async can use the below fields,
+		 * obviously. Right now only IORING_OP_POLL_ADD uses them.
+		 */
+		struct {
+			struct task_struct	*task;
+			struct callback_head	sched_work;
+		};
+		struct io_wq_work	work;
+	};
 };
 
 #define IO_PLUG_THRESHOLD		2
@@ -764,6 +770,8 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 static int io_grab_files(struct io_kiocb *req);
 static void io_ring_file_ref_flush(struct fixed_file_data *data);
 static void io_cleanup_req(struct io_kiocb *req);
+static void __io_queue_sqe(struct io_kiocb *req,
+			   const struct io_uring_sqe *sqe);
 
 static struct kmem_cache *req_cachep;
 
@@ -834,7 +842,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
-	init_llist_head(&ctx->poll_llist);
 	INIT_LIST_HEAD(&ctx->poll_list);
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
@@ -1056,24 +1063,19 @@ static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
 		return false;
 	if (!ctx->eventfd_async)
 		return true;
-	return io_wq_current_is_worker() || in_interrupt();
+	return io_wq_current_is_worker();
 }
 
-static void __io_cqring_ev_posted(struct io_ring_ctx *ctx, bool trigger_ev)
+static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
 {
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
 	if (waitqueue_active(&ctx->sqo_wait))
 		wake_up(&ctx->sqo_wait);
-	if (trigger_ev)
+	if (io_should_trigger_evfd(ctx))
 		eventfd_signal(ctx->cq_ev_fd, 1);
 }
 
-static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
-{
-	__io_cqring_ev_posted(ctx, io_should_trigger_evfd(ctx));
-}
-
 /* Returns true if there are no backlogged entries after the flush */
 static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 {
@@ -3450,18 +3452,27 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt,
 #endif
 }
 
-static void io_poll_remove_one(struct io_kiocb *req)
+static bool io_poll_remove_one(struct io_kiocb *req)
 {
 	struct io_poll_iocb *poll = &req->poll;
+	bool do_complete = false;
 
 	spin_lock(&poll->head->lock);
 	WRITE_ONCE(poll->canceled, true);
 	if (!list_empty(&poll->wait.entry)) {
 		list_del_init(&poll->wait.entry);
-		io_queue_async_work(req);
+		do_complete = true;
 	}
 	spin_unlock(&poll->head->lock);
 	hash_del(&req->hash_node);
+	if (do_complete) {
+		io_cqring_fill_event(req, -ECANCELED);
+		io_commit_cqring(req->ctx);
+		req->flags |= REQ_F_COMP_LOCKED;
+		io_put_req(req);
+	}
+
+	return do_complete;
 }
 
 static void io_poll_remove_all(struct io_ring_ctx *ctx)
@@ -3479,6 +3490,8 @@ static void io_poll_remove_all(struct io_ring_ctx *ctx)
 			io_poll_remove_one(req);
 	}
 	spin_unlock_irq(&ctx->completion_lock);
+
+	io_cqring_ev_posted(ctx);
 }
 
 static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr)
@@ -3488,10 +3501,11 @@ static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr)
 
 	list = &ctx->cancel_hash[hash_long(sqe_addr, ctx->cancel_hash_bits)];
 	hlist_for_each_entry(req, list, hash_node) {
-		if (sqe_addr == req->user_data) {
-			io_poll_remove_one(req);
+		if (sqe_addr != req->user_data)
+			continue;
+		if (io_poll_remove_one(req))
 			return 0;
-		}
+		return -EALREADY;
 	}
 
 	return -ENOENT;
@@ -3544,92 +3558,28 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
 	io_commit_cqring(ctx);
 }
 
-static void io_poll_complete_work(struct io_wq_work **workptr)
+static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt)
 {
-	struct io_wq_work *work = *workptr;
-	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
-	struct io_poll_iocb *poll = &req->poll;
-	struct poll_table_struct pt = { ._key = poll->events };
 	struct io_ring_ctx *ctx = req->ctx;
-	struct io_kiocb *nxt = NULL;
-	__poll_t mask = 0;
-	int ret = 0;
 
-	if (work->flags & IO_WQ_WORK_CANCEL) {
-		WRITE_ONCE(poll->canceled, true);
-		ret = -ECANCELED;
-	} else if (READ_ONCE(poll->canceled)) {
-		ret = -ECANCELED;
-	}
-
-	if (ret != -ECANCELED)
-		mask = vfs_poll(poll->file, &pt) & poll->events;
-
-	/*
-	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
-	 * calling ->ki_cancel.  We need the ctx_lock roundtrip here to
-	 * synchronize with them.  In the cancellation case the list_del_init
-	 * itself is not actually needed, but harmless so we keep it in to
-	 * avoid further branches in the fast path.
-	 */
 	spin_lock_irq(&ctx->completion_lock);
-	if (!mask && ret != -ECANCELED) {
-		add_wait_queue(poll->head, &poll->wait);
-		spin_unlock_irq(&ctx->completion_lock);
-		return;
-	}
 	hash_del(&req->hash_node);
-	io_poll_complete(req, mask, ret);
-	spin_unlock_irq(&ctx->completion_lock);
-
-	io_cqring_ev_posted(ctx);
-
-	if (ret < 0)
-		req_set_fail_links(req);
-	io_put_req_find_next(req, &nxt);
-	if (nxt)
-		io_wq_assign_next(workptr, nxt);
-}
-
-static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes)
-{
-	struct io_kiocb *req, *tmp;
-	struct req_batch rb;
-
-	rb.to_free = rb.need_iter = 0;
-	spin_lock_irq(&ctx->completion_lock);
-	llist_for_each_entry_safe(req, tmp, nodes, llist_node) {
-		hash_del(&req->hash_node);
-		io_poll_complete(req, req->result, 0);
-
-		if (refcount_dec_and_test(&req->refs) &&
-		    !io_req_multi_free(&rb, req)) {
-			req->flags |= REQ_F_COMP_LOCKED;
-			io_free_req(req);
-		}
-	}
+	io_poll_complete(req, req->result, 0);
+	req->flags |= REQ_F_COMP_LOCKED;
+	io_put_req_find_next(req, nxt);
 	spin_unlock_irq(&ctx->completion_lock);
 
 	io_cqring_ev_posted(ctx);
-	io_free_req_many(ctx, &rb);
-}
-
-static void io_poll_flush(struct io_wq_work **workptr)
-{
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-	struct llist_node *nodes;
-
-	nodes = llist_del_all(&req->ctx->poll_llist);
-	if (nodes)
-		__io_poll_flush(req->ctx, nodes);
 }
 
-static void io_poll_trigger_evfd(struct io_wq_work **workptr)
+static void io_poll_task_func(struct callback_head *cb)
 {
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
+	struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
+	struct io_kiocb *nxt = NULL;
 
-	eventfd_signal(req->ctx->cq_ev_fd, 1);
-	io_put_req(req);
+	io_poll_task_handler(req, &nxt);
+	if (nxt)
+		__io_queue_sqe(nxt, NULL);
 }
 
 static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
@@ -3637,8 +3587,8 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 {
 	struct io_poll_iocb *poll = wait->private;
 	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
-	struct io_ring_ctx *ctx = req->ctx;
 	__poll_t mask = key_to_poll(key);
+	struct task_struct *tsk;
 
 	/* for instances that support it check for an event match first: */
 	if (mask && !(mask & poll->events))
@@ -3646,46 +3596,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 
 	list_del_init(&poll->wait.entry);
 
-	/*
-	 * Run completion inline if we can. We're using trylock here because
-	 * we are violating the completion_lock -> poll wq lock ordering.
-	 * If we have a link timeout we're going to need the completion_lock
-	 * for finalizing the request, mark us as having grabbed that already.
-	 */
-	if (mask) {
-		unsigned long flags;
-
-		if (llist_empty(&ctx->poll_llist) &&
-		    spin_trylock_irqsave(&ctx->completion_lock, flags)) {
-			bool trigger_ev;
-
-			hash_del(&req->hash_node);
-			io_poll_complete(req, mask, 0);
-
-			trigger_ev = io_should_trigger_evfd(ctx);
-			if (trigger_ev && eventfd_signal_count()) {
-				trigger_ev = false;
-				req->work.func = io_poll_trigger_evfd;
-			} else {
-				req->flags |= REQ_F_COMP_LOCKED;
-				io_put_req(req);
-				req = NULL;
-			}
-			spin_unlock_irqrestore(&ctx->completion_lock, flags);
-			__io_cqring_ev_posted(ctx, trigger_ev);
-		} else {
-			req->result = mask;
-			req->llist_node.next = NULL;
-			/* if the list wasn't empty, we're done */
-			if (!llist_add(&req->llist_node, &ctx->poll_llist))
-				req = NULL;
-			else
-				req->work.func = io_poll_flush;
-		}
-	}
-	if (req)
-		io_queue_async_work(req);
-
+	tsk = req->task;
+	req->result = mask;
+	init_task_work(&req->sched_work, io_poll_task_func);
+	sched_work_add(tsk, &req->sched_work);
+	wake_up_process(tsk);
 	return 1;
 }
 
@@ -3733,6 +3648,9 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
 
 	events = READ_ONCE(sqe->poll_events);
 	poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
+
+	/* task will wait for requests on exit, don't need a ref */
+	req->task = current;
 	return 0;
 }
 
@@ -3744,7 +3662,6 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
 	bool cancel = false;
 	__poll_t mask;
 
-	INIT_IO_WORK(&req->work, io_poll_complete_work);
 	INIT_HLIST_NODE(&req->hash_node);
 
 	poll->head = NULL;
-- 
2.25.1



* [PATCH 8/9] io_uring: mark requests that we can do poll async in io_op_defs
  2020-02-20 20:31 [PATCHSET 0/9] io_uring: use polled async retry Jens Axboe
                   ` (6 preceding siblings ...)
  2020-02-20 20:31 ` [PATCH 7/9] io_uring: add per-task callback handler Jens Axboe
@ 2020-02-20 20:31 ` Jens Axboe
  2020-02-20 20:31 ` [PATCH 9/9] io_uring: use poll driven retry for files that support it Jens Axboe
  8 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 20:31 UTC (permalink / raw)
  To: io-uring; +Cc: glauber, peterz, asml.silence, Jens Axboe

Add pollin/pollout fields to the io_op_defs request table, and mark the
commands that we can safely poll for.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5991bcc24387..ca96e0206132 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -624,6 +624,9 @@ struct io_op_def {
 	unsigned		file_table : 1;
 	/* needs ->fs */
 	unsigned		needs_fs : 1;
+	/* set if opcode supports polled "wait" */
+	unsigned		pollin : 1;
+	unsigned		pollout : 1;
 };
 
 static const struct io_op_def io_op_defs[] = {
@@ -633,6 +636,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_WRITEV] = {
 		.async_ctx		= 1,
@@ -640,6 +644,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_FSYNC] = {
 		.needs_file		= 1,
@@ -647,11 +652,13 @@ static const struct io_op_def io_op_defs[] = {
 	[IORING_OP_READ_FIXED] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_WRITE_FIXED] = {
 		.needs_file		= 1,
 		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_POLL_ADD] = {
 		.needs_file		= 1,
@@ -667,6 +674,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.needs_fs		= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_RECVMSG] = {
 		.async_ctx		= 1,
@@ -674,6 +682,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.needs_fs		= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_TIMEOUT] = {
 		.async_ctx		= 1,
@@ -685,6 +694,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.file_table		= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_ASYNC_CANCEL] = {},
 	[IORING_OP_LINK_TIMEOUT] = {
@@ -696,6 +706,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_FALLOCATE] = {
 		.needs_file		= 1,
@@ -724,11 +735,13 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_WRITE] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_FADVISE] = {
 		.needs_file		= 1,
@@ -740,11 +753,13 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_RECV] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_OPENAT2] = {
 		.needs_file		= 1,
-- 
2.25.1



* [PATCH 9/9] io_uring: use poll driven retry for files that support it
  2020-02-20 20:31 [PATCHSET 0/9] io_uring: use polled async retry Jens Axboe
                   ` (7 preceding siblings ...)
  2020-02-20 20:31 ` [PATCH 8/9] io_uring: mark requests that we can do poll async in io_op_defs Jens Axboe
@ 2020-02-20 20:31 ` Jens Axboe
  8 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 20:31 UTC (permalink / raw)
  To: io-uring; +Cc: glauber, peterz, asml.silence, Jens Axboe

Currently io_uring tries any request in a non-blocking manner, if it can,
and then retries from a worker thread if we got -EAGAIN. Now that we have
a new and fancy poll based retry backend, use that to retry requests if
the file supports it.

This means that, for example, an IORING_OP_RECVMSG on a socket no longer
requires an async thread to complete the IO. If we get -EAGAIN reading
from the socket in a non-blocking manner, we arm a poll handler for
notification on when the socket becomes readable. When it does, the
pending read is executed directly by the task again, through the io_uring
scheduler handlers.

Note that this is very much a work-in-progress, but it does pass the full
test suite. Notable missing features:

- Need to double check the req->apoll lifetime.

- Probably a lot I don't quite recall right now...

It does work for the basic read/write, send/recv, etc testing I've
tried as well.

The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
pollable IO is fast, and that poll<link>other_op is fast as well.
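
For completeness, a liburing-based sketch of how an application could
probe for the new feature at ring setup time (the 64-entry ring size is
arbitrary):

#include <stdbool.h>
#include <liburing.h>

/* Returns true if the kernel advertises poll driven retry. */
static bool ring_has_fast_poll(struct io_uring *ring)
{
        struct io_uring_params p = { };

        if (io_uring_queue_init_params(64, ring, &p) < 0)
                return false;

        return p.features & IORING_FEAT_FAST_POLL;
}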

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                   | 398 +++++++++++++++++++++++++-------
 include/trace/events/io_uring.h |  80 +++++++
 include/uapi/linux/io_uring.h   |   1 +
 3 files changed, 397 insertions(+), 82 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca96e0206132..39939b4935ca 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -482,6 +482,8 @@ enum {
 	REQ_F_COMP_LOCKED_BIT,
 	REQ_F_NEED_CLEANUP_BIT,
 	REQ_F_OVERFLOW_BIT,
+	REQ_F_WORK_BIT,
+	REQ_F_POLLED_BIT,
 };
 
 enum {
@@ -524,6 +526,15 @@ enum {
 	REQ_F_NEED_CLEANUP	= BIT(REQ_F_NEED_CLEANUP_BIT),
 	/* in overflow list */
 	REQ_F_OVERFLOW		= BIT(REQ_F_OVERFLOW_BIT),
+	/* ->work is valid */
+	REQ_F_WORK		= BIT(REQ_F_WORK_BIT),
+	/* already went through poll handler */
+	REQ_F_POLLED		= BIT(REQ_F_POLLED_BIT),
+};
+
+struct async_poll {
+	struct io_poll_iocb	poll;
+	struct io_wq_work	work;
 };
 
 /*
@@ -557,10 +568,7 @@ struct io_kiocb {
 	u8				opcode;
 
 	struct io_ring_ctx	*ctx;
-	union {
-		struct list_head	list;
-		struct hlist_node	hash_node;
-	};
+	struct list_head	list;
 	struct list_head	link_list;
 	unsigned int		flags;
 	refcount_t		refs;
@@ -570,14 +578,17 @@ struct io_kiocb {
 
 	struct list_head	inflight_entry;
 
+	struct task_struct	*task;
+
 	union {
 		/*
 		 * Only commands that never go async can use the below fields,
 		 * obviously. Right now only IORING_OP_POLL_ADD uses them.
 		 */
 		struct {
-			struct task_struct	*task;
 			struct callback_head	sched_work;
+			struct hlist_node	hash_node;
+			struct async_poll	*apoll;
 		};
 		struct io_wq_work	work;
 	};
@@ -953,10 +964,13 @@ static inline void io_req_work_grab_env(struct io_kiocb *req,
 	}
 	if (!req->work.task_pid)
 		req->work.task_pid = task_pid_vnr(current);
+	req->flags |= REQ_F_WORK;
 }
 
 static inline void io_req_work_drop_env(struct io_kiocb *req)
 {
+	if (!(req->flags & REQ_F_WORK))
+		return;
 	if (req->work.mm) {
 		mmdrop(req->work.mm);
 		req->work.mm = NULL;
@@ -3467,9 +3481,199 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt,
 #endif
 }
 
-static bool io_poll_remove_one(struct io_kiocb *req)
+struct io_poll_table {
+	struct poll_table_struct pt;
+	struct io_kiocb *req;
+	int error;
+};
+
+static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt,
+			    struct wait_queue_head *head)
+{
+	if (unlikely(poll->head)) {
+		pt->error = -EINVAL;
+		return;
+	}
+
+	pt->error = 0;
+	poll->head = head;
+	add_wait_queue(head, &poll->wait);
+}
+
+static void io_async_queue_proc(struct file *file, struct wait_queue_head *head,
+			       struct poll_table_struct *p)
+{
+	struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
+
+	__io_queue_proc(&pt->req->apoll->poll, pt, head);
+}
+
+static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
+			   __poll_t mask, task_work_func_t func)
+{
+	struct task_struct *tsk;
+
+	trace_io_uring_task_add(req->ctx, req->opcode, req->user_data, mask);
+
+	/* for instances that support it check for an event match first: */
+	if (mask && !(mask & poll->events))
+		return 0;
+
+	list_del_init(&poll->wait.entry);
+
+	tsk = req->task;
+	req->result = mask;
+	init_task_work(&req->sched_work, func);
+	sched_work_add(tsk, &req->sched_work);
+	wake_up_process(tsk);
+	return 1;
+}
+
+static void io_async_task_func(struct callback_head *cb)
+{
+	struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
+	void *to_free;
+
+	if (hash_hashed(&req->hash_node)) {
+		struct io_ring_ctx *ctx = req->ctx;
+
+		spin_lock_irq(&ctx->completion_lock);
+		hash_del(&req->hash_node);
+		spin_unlock_irq(&ctx->completion_lock);
+	}
+
+	to_free = req->apoll;
+	WARN_ON_ONCE(!list_empty(&req->apoll->poll.wait.entry));
+
+	__set_current_state(TASK_RUNNING);
+	__io_queue_sqe(req, NULL);
+
+	kfree(to_free);
+}
+
+static int io_async_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
+			void *key)
+{
+	struct io_kiocb *req = wait->private;
+	struct io_poll_iocb *poll = &req->apoll->poll;
+
+	trace_io_uring_poll_wake(req->ctx, req->opcode, req->user_data,
+					key_to_poll(key));
+
+	return __io_async_wake(req, poll, key_to_poll(key), io_async_task_func);
+}
+
+static void io_poll_req_insert(struct io_kiocb *req)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct hlist_head *list;
+
+	list = &ctx->cancel_hash[hash_long(req->user_data, ctx->cancel_hash_bits)];
+	hlist_add_head(&req->hash_node, list);
+}
+
+static __poll_t __io_arm_poll_handler(struct io_kiocb *req,
+				      struct io_poll_iocb *poll,
+				      struct io_poll_table *ipt, __poll_t mask,
+				      wait_queue_func_t wake_func, void *priv)
+	__acquires(&ctx->completion_lock)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	bool cancel = false;
+
+	poll->file = req->file;
+	poll->head = NULL;
+	poll->done = poll->canceled = false;
+	poll->events = mask;
+
+	ipt->pt._key = mask;
+	ipt->req = req;
+	ipt->error = -EINVAL;
+
+	INIT_LIST_HEAD(&poll->wait.entry);
+	init_waitqueue_func_entry(&poll->wait, wake_func);
+	poll->wait.private = priv;
+
+	mask = vfs_poll(req->file, &ipt->pt) & poll->events;
+
+	spin_lock_irq(&ctx->completion_lock);
+	if (likely(poll->head)) {
+		spin_lock(&poll->head->lock);
+		if (unlikely(list_empty(&poll->wait.entry))) {
+			if (ipt->error)
+				cancel = true;
+			ipt->error = 0;
+			mask = 0;
+		}
+		if (mask || ipt->error)
+			list_del_init(&poll->wait.entry);
+		else if (cancel)
+			WRITE_ONCE(poll->canceled, true);
+		else if (!poll->done) /* actually waiting for an event */
+			io_poll_req_insert(req);
+		spin_unlock(&poll->head->lock);
+	}
+
+	return mask;
+}
+
+static bool io_arm_poll_handler(struct io_kiocb *req)
+{
+	const struct io_op_def *def = &io_op_defs[req->opcode];
+	struct io_ring_ctx *ctx = req->ctx;
+	struct async_poll *apoll;
+	struct io_poll_table ipt;
+	__poll_t mask, ret;
+
+	if (!req->file || !file_can_poll(req->file))
+		return false;
+	if (req->flags & (REQ_F_MUST_PUNT | REQ_F_WORK))
+		return false;
+	if (req->flags & REQ_F_POLLED) {
+		memcpy(&req->work, &req->apoll->work, sizeof(req->work));
+		return false;
+	}
+	if (!def->pollin && !def->pollout)
+		return false;
+
+	apoll = kmalloc(sizeof(*apoll), GFP_ATOMIC);
+	if (unlikely(!apoll))
+		return false;
+
+	req->flags |= REQ_F_POLLED;
+	memcpy(&apoll->work, &req->work, sizeof(req->work));
+
+	req->task = current;
+	req->apoll = apoll;
+	INIT_HLIST_NODE(&req->hash_node);
+
+	if (def->pollin)
+		mask = POLLIN | POLLRDNORM;
+	if (def->pollout)
+		mask |= POLLOUT | POLLWRNORM;
+	mask |= POLLERR | POLLPRI;
+
+	ipt.pt._qproc = io_async_queue_proc;
+
+	ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask,
+					io_async_wake, req);
+	if (ret) {
+		ipt.error = 0;
+		apoll->poll.done = true;
+		spin_unlock_irq(&ctx->completion_lock);
+		memcpy(&req->work, &apoll->work, sizeof(req->work));
+		kfree(apoll);
+		return false;
+	}
+	spin_unlock_irq(&ctx->completion_lock);
+	trace_io_uring_poll_arm(ctx, req->opcode, req->user_data, mask,
+					apoll->poll.events);
+	return true;
+}
+
+static bool __io_poll_remove_one(struct io_kiocb *req,
+				 struct io_poll_iocb *poll)
 {
-	struct io_poll_iocb *poll = &req->poll;
 	bool do_complete = false;
 
 	spin_lock(&poll->head->lock);
@@ -3479,7 +3683,24 @@ static bool io_poll_remove_one(struct io_kiocb *req)
 		do_complete = true;
 	}
 	spin_unlock(&poll->head->lock);
+	return do_complete;
+}
+
+static bool io_poll_remove_one(struct io_kiocb *req)
+{
+	bool do_complete;
+
+	if (req->opcode == IORING_OP_POLL_ADD) {
+		do_complete = __io_poll_remove_one(req, &req->poll);
+	} else {
+		/* non-poll requests have submit ref still */
+		do_complete = __io_poll_remove_one(req, &req->apoll->poll);
+		if (do_complete)
+			io_put_req(req);
+	}
+
 	hash_del(&req->hash_node);
+
 	if (do_complete) {
 		io_cqring_fill_event(req, -ECANCELED);
 		io_commit_cqring(req->ctx);
@@ -3602,51 +3823,16 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 {
 	struct io_poll_iocb *poll = wait->private;
 	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
-	__poll_t mask = key_to_poll(key);
-	struct task_struct *tsk;
-
-	/* for instances that support it check for an event match first: */
-	if (mask && !(mask & poll->events))
-		return 0;
-
-	list_del_init(&poll->wait.entry);
 
-	tsk = req->task;
-	req->result = mask;
-	init_task_work(&req->sched_work, io_poll_task_func);
-	sched_work_add(tsk, &req->sched_work);
-	wake_up_process(tsk);
-	return 1;
+	return __io_async_wake(req, poll, key_to_poll(key), io_poll_task_func);
 }
 
-struct io_poll_table {
-	struct poll_table_struct pt;
-	struct io_kiocb *req;
-	int error;
-};
-
 static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
 			       struct poll_table_struct *p)
 {
 	struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
 
-	if (unlikely(pt->req->poll.head)) {
-		pt->error = -EINVAL;
-		return;
-	}
-
-	pt->error = 0;
-	pt->req->poll.head = head;
-	add_wait_queue(head, &pt->req->poll.wait);
-}
-
-static void io_poll_req_insert(struct io_kiocb *req)
-{
-	struct io_ring_ctx *ctx = req->ctx;
-	struct hlist_head *list;
-
-	list = &ctx->cancel_hash[hash_long(req->user_data, ctx->cancel_hash_bits)];
-	hlist_add_head(&req->hash_node, list);
+	__io_queue_proc(&pt->req->poll, pt, head);
 }
 
 static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -3674,46 +3860,15 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
 	struct io_poll_iocb *poll = &req->poll;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_poll_table ipt;
-	bool cancel = false;
 	__poll_t mask;
 
 	INIT_HLIST_NODE(&req->hash_node);
-
-	poll->head = NULL;
-	poll->done = false;
-	poll->canceled = false;
-
-	ipt.pt._qproc = io_poll_queue_proc;
-	ipt.pt._key = poll->events;
-	ipt.req = req;
-	ipt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */
-
-	/* initialized the list so that we can do list_empty checks */
-	INIT_LIST_HEAD(&poll->wait.entry);
-	init_waitqueue_func_entry(&poll->wait, io_poll_wake);
-	poll->wait.private = poll;
-
 	INIT_LIST_HEAD(&req->list);
+	ipt.pt._qproc = io_poll_queue_proc;
 
-	mask = vfs_poll(poll->file, &ipt.pt) & poll->events;
+	mask = __io_arm_poll_handler(req, &req->poll, &ipt, poll->events,
+					io_poll_wake, &req->poll);
 
-	spin_lock_irq(&ctx->completion_lock);
-	if (likely(poll->head)) {
-		spin_lock(&poll->head->lock);
-		if (unlikely(list_empty(&poll->wait.entry))) {
-			if (ipt.error)
-				cancel = true;
-			ipt.error = 0;
-			mask = 0;
-		}
-		if (mask || ipt.error)
-			list_del_init(&poll->wait.entry);
-		else if (cancel)
-			WRITE_ONCE(poll->canceled, true);
-		else if (!poll->done) /* actually waiting for an event */
-			io_poll_req_insert(req);
-		spin_unlock(&poll->head->lock);
-	}
 	if (mask) { /* no async, we'd stolen it */
 		ipt.error = 0;
 		io_poll_complete(req, mask, 0);
@@ -4660,6 +4815,11 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	 */
 	if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
 	    (req->flags & REQ_F_MUST_PUNT))) {
+		if (io_arm_poll_handler(req)) {
+			if (linked_timeout)
+				io_put_req(linked_timeout);
+			goto done_req;
+		}
 punt:
 		if (io_op_defs[req->opcode].file_table) {
 			ret = io_grab_files(req);
@@ -5199,8 +5359,13 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	struct io_rings *rings = ctx->rings;
 	int ret = 0;
 
-	if (io_cqring_events(ctx, false) >= min_events)
-		return 0;
+	do {
+		if (io_cqring_events(ctx, false) >= min_events)
+			return 0;
+		if (!current->sched_work)
+			break;
+		sched_work_run();
+	} while (1);
 
 	if (sig) {
 #ifdef CONFIG_COMPAT
@@ -6435,6 +6600,62 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx,
 	finish_wait(&ctx->inflight_wait, &wait);
 }
 
+static void __io_uring_cancel_task(struct task_struct *tsk,
+				   task_work_func_t func,
+				   void (*cancel)(struct io_kiocb *))
+{
+	struct callback_head *head;
+
+	while ((head = sched_work_cancel(tsk, func)) != NULL) {
+		struct io_kiocb *req;
+
+		req = container_of(head, struct io_kiocb, sched_work);
+		cancel(req);
+	}
+}
+
+static void async_cancel(struct io_kiocb *req)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	void *to_free;
+
+	spin_lock_irq(&ctx->completion_lock);
+	hash_del(&req->hash_node);
+	io_cqring_fill_event(req, -ECANCELED);
+	io_commit_cqring(ctx);
+	req->flags |= REQ_F_COMP_LOCKED;
+	to_free = req->apoll;
+	io_double_put_req(req);
+	spin_unlock_irq(&ctx->completion_lock);
+
+	kfree(to_free);
+	io_cqring_ev_posted(ctx);
+}
+
+static void io_uring_cancel_task_async(struct task_struct *tsk)
+{
+	__io_uring_cancel_task(tsk, io_async_task_func, async_cancel);
+}
+
+static void poll_cancel(struct io_kiocb *req)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	spin_lock_irq(&ctx->completion_lock);
+	hash_del(&req->hash_node);
+	io_poll_complete(req, -ECANCELED, 0);
+	req->flags |= REQ_F_COMP_LOCKED;
+	io_put_req(req);
+	spin_unlock_irq(&ctx->completion_lock);
+
+	io_cqring_ev_posted(ctx);
+}
+
+static void io_uring_cancel_task_poll(struct task_struct *tsk)
+{
+	__io_uring_cancel_task(tsk, io_poll_task_func, poll_cancel);
+}
+
 static int io_uring_flush(struct file *file, void *data)
 {
 	struct io_ring_ctx *ctx = file->private_data;
@@ -6444,8 +6665,11 @@ static int io_uring_flush(struct file *file, void *data)
 	/*
 	 * If the task is going away, cancel work it may have pending
 	 */
-	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
+	if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) {
+		io_uring_cancel_task_poll(current);
+		io_uring_cancel_task_async(current);
 		io_wq_cancel_pid(ctx->io_wq, task_pid_vnr(current));
+	}
 
 	return 0;
 }
@@ -6650,6 +6874,16 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 		seq_printf(m, "Personalities:\n");
 		idr_for_each(&ctx->personality_idr, io_uring_show_cred, m);
 	}
+	seq_printf(m, "Inflight:\n");
+	spin_lock_irq(&ctx->completion_lock);
+	for (i = 0; i < (1U << ctx->cancel_hash_bits); i++) {
+		struct hlist_head *list = &ctx->cancel_hash[i];
+		struct io_kiocb *req;
+
+		hlist_for_each_entry(req, list, hash_node)
+			seq_printf(m, "  req=%lx, op=%d, tsk list=%d\n", (long) req, req->opcode, req->task->sched_work != NULL);
+	}
+	spin_unlock_irq(&ctx->completion_lock);
 	mutex_unlock(&ctx->uring_lock);
 }
 
@@ -6863,7 +7097,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p)
 
 	p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP |
 			IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS |
-			IORING_FEAT_CUR_PERSONALITY;
+			IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL;
 	trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags);
 	return ret;
 err:
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 27bd9e4f927b..433e02b3ffb7 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -357,6 +357,86 @@ TRACE_EVENT(io_uring_submit_sqe,
 			  __entry->force_nonblock, __entry->sq_thread)
 );
 
+TRACE_EVENT(io_uring_poll_arm,
+
+	TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask, int events),
+
+	TP_ARGS(ctx, opcode, user_data, mask, events),
+
+	TP_STRUCT__entry (
+		__field(  void *,	ctx		)
+		__field(  u8,		opcode		)
+		__field(  u64,		user_data	)
+		__field(  int,		mask		)
+		__field(  int,		events		)
+	),
+
+	TP_fast_assign(
+		__entry->ctx		= ctx;
+		__entry->opcode		= opcode;
+		__entry->user_data	= user_data;
+		__entry->mask		= mask;
+		__entry->events		= events;
+	),
+
+	TP_printk("ring %p, op %d, data 0x%llx, mask 0x%x, events 0x%x",
+			  __entry->ctx, __entry->opcode,
+			  (unsigned long long) __entry->user_data,
+			  __entry->mask, __entry->events)
+);
+
+TRACE_EVENT(io_uring_poll_wake,
+
+	TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask),
+
+	TP_ARGS(ctx, opcode, user_data, mask),
+
+	TP_STRUCT__entry (
+		__field(  void *,	ctx		)
+		__field(  u8,		opcode		)
+		__field(  u64,		user_data	)
+		__field(  int,		mask		)
+	),
+
+	TP_fast_assign(
+		__entry->ctx		= ctx;
+		__entry->opcode		= opcode;
+		__entry->user_data	= user_data;
+		__entry->mask		= mask;
+	),
+
+	TP_printk("ring %p, op %d, data 0x%llx, mask 0x%x",
+			  __entry->ctx, __entry->opcode,
+			  (unsigned long long) __entry->user_data,
+			  __entry->mask)
+);
+
+TRACE_EVENT(io_uring_task_add,
+
+	TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask),
+
+	TP_ARGS(ctx, opcode, user_data, mask),
+
+	TP_STRUCT__entry (
+		__field(  void *,	ctx		)
+		__field(  u8,		opcode		)
+		__field(  u64,		user_data	)
+		__field(  int,		mask		)
+	),
+
+	TP_fast_assign(
+		__entry->ctx		= ctx;
+		__entry->opcode		= opcode;
+		__entry->user_data	= user_data;
+		__entry->mask		= mask;
+	),
+
+	TP_printk("ring %p, op %d, data 0x%llx, mask %x",
+			  __entry->ctx, __entry->opcode,
+			  (unsigned long long) __entry->user_data,
+			  __entry->mask)
+);
+
 #endif /* _TRACE_IO_URING_H */
 
 /* This part must be outside protection */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 3f7961c1c243..653865554691 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -204,6 +204,7 @@ struct io_uring_params {
 #define IORING_FEAT_SUBMIT_STABLE	(1U << 2)
 #define IORING_FEAT_RW_CUR_POS		(1U << 3)
 #define IORING_FEAT_CUR_PERSONALITY	(1U << 4)
+#define IORING_FEAT_FAST_POLL		(1U << 5)
 
 /*
  * io_uring_register(2) opcodes and arguments
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/9] kernel: abstract out task work helpers
  2020-02-20 20:31 ` [PATCH 5/9] kernel: abstract out task work helpers Jens Axboe
@ 2020-02-20 21:07   ` Peter Zijlstra
  2020-02-20 21:08     ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2020-02-20 21:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, glauber, asml.silence

On Thu, Feb 20, 2020 at 01:31:47PM -0700, Jens Axboe wrote:

> @@ -27,39 +43,25 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
>  int
>  task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
>  {
> -	struct callback_head *head;
> +	int ret;
>  
> -	do {
> -		head = READ_ONCE(task->task_works);
> -		if (unlikely(head == &work_exited))
> -			return -ESRCH;
> -		work->next = head;
> -	} while (cmpxchg(&task->task_works, head, work) != head);
> +	ret = __task_work_add(task, &task->task_works, work);
>  
>  	if (notify)

	if (!ret && notify)

>  		set_notify_resume(task);
> -	return 0;
> +
> +	return ret;
>  }

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/9] kernel: abstract out task work helpers
  2020-02-20 21:07   ` Peter Zijlstra
@ 2020-02-20 21:08     ` Jens Axboe
  0 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 21:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: io-uring, glauber, asml.silence

On 2/20/20 2:07 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2020 at 01:31:47PM -0700, Jens Axboe wrote:
> 
>> @@ -27,39 +43,25 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
>>  int
>>  task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
>>  {
>> -	struct callback_head *head;
>> +	int ret;
>>  
>> -	do {
>> -		head = READ_ONCE(task->task_works);
>> -		if (unlikely(head == &work_exited))
>> -			return -ESRCH;
>> -		work->next = head;
>> -	} while (cmpxchg(&task->task_works, head, work) != head);
>> +	ret = __task_work_add(task, &task->task_works, work);
>>  
>>  	if (notify)
> 
> 	if (!ret && notify)

Good catch, thanks! Fixed up.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 6/9] sched: add a sched_work list
  2020-02-20 20:31 ` [PATCH 6/9] sched: add a sched_work list Jens Axboe
@ 2020-02-20 21:17   ` Peter Zijlstra
  2020-02-20 21:53     ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2020-02-20 21:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, glauber, asml.silence

On Thu, Feb 20, 2020 at 01:31:48PM -0700, Jens Axboe wrote:
> This is similar to the task_works, and uses the same infrastructure, but
> the sched_work list is run when the task is being scheduled in or out.
> 
> The intended use case here is for core code to be able to add work
> that should be automatically run by the task, without the task needing
> to do anything. This is done outside of the task, one example would be
> from waitqueue handlers, or anything else that is invoked out-of-band
> from the task itself.
> 


> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 3445421266e7..ba62485d5b3d 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -3,7 +3,14 @@
>  #include <linux/task_work.h>
>  #include <linux/tracehook.h>
>  
> -static struct callback_head work_exited; /* all we need is ->next == NULL */
> +static void task_exit_func(struct callback_head *head)
> +{
> +}
> +
> +static struct callback_head work_exited = {
> +	.next	= NULL,
> +	.func	= task_exit_func,
> +};

Do we really need this? It seems to suggest we're trying to execute
work_exited, which would be an error.

Doing so would be the result of calling sched_work_run() after
exit_task_work(). I suppose that's actually possible.. the problem is
that that would reset sched_work to NULL and re-allow queueing works,
which would then leak.

I'll look at it in more detail tomorrow, I'm tired...


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 6/9] sched: add a sched_work list
  2020-02-20 21:17   ` Peter Zijlstra
@ 2020-02-20 21:53     ` Jens Axboe
  2020-02-20 22:02       ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 21:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: io-uring, glauber, asml.silence

On 2/20/20 2:17 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2020 at 01:31:48PM -0700, Jens Axboe wrote:
>> This is similar to the task_works, and uses the same infrastructure, but
>> the sched_work list is run when the task is being scheduled in or out.
>>
>> The intended use case here is for core code to be able to add work
>> that should be automatically run by the task, without the task needing
>> to do anything. This is done outside of the task, one example would be
>> from waitqueue handlers, or anything else that is invoked out-of-band
>> from the task itself.
>>
> 
> 
>> diff --git a/kernel/task_work.c b/kernel/task_work.c
>> index 3445421266e7..ba62485d5b3d 100644
>> --- a/kernel/task_work.c
>> +++ b/kernel/task_work.c
>> @@ -3,7 +3,14 @@
>>  #include <linux/task_work.h>
>>  #include <linux/tracehook.h>
>>  
>> -static struct callback_head work_exited; /* all we need is ->next == NULL */
>> +static void task_exit_func(struct callback_head *head)
>> +{
>> +}
>> +
>> +static struct callback_head work_exited = {
>> +	.next	= NULL,
>> +	.func	= task_exit_func,
>> +};
> 
> Do we really need this? It seems to suggest we're trying to execute
> work_exited, which would be an error.
> 
> Doing so would be the result of calling sched_work_run() after
> exit_task_work(). I suppose that's actually possible.. the problem is
> that that would reset sched_work to NULL and re-allow queueing works,
> which would then leak.
> 
> I'll look at it in more detail tomorrow, I'm tired...

Let me try and instrument it, I definitely hit it on task exit
but might have been induced by some earlier bugs.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 6/9] sched: add a sched_work list
  2020-02-20 21:53     ` Jens Axboe
@ 2020-02-20 22:02       ` Jens Axboe
  0 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 22:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: io-uring, glauber, asml.silence

On 2/20/20 2:53 PM, Jens Axboe wrote:
> On 2/20/20 2:17 PM, Peter Zijlstra wrote:
>> On Thu, Feb 20, 2020 at 01:31:48PM -0700, Jens Axboe wrote:
>>> This is similar to the task_works, and uses the same infrastructure, but
>>> the sched_work list is run when the task is being scheduled in or out.
>>>
>>> The intended use case here is for core code to be able to add work
>>> that should be automatically run by the task, without the task needing
>>> to do anything. This is done outside of the task, one example would be
>>> from waitqueue handlers, or anything else that is invoked out-of-band
>>> from the task itself.
>>>
>>
>>
>>> diff --git a/kernel/task_work.c b/kernel/task_work.c
>>> index 3445421266e7..ba62485d5b3d 100644
>>> --- a/kernel/task_work.c
>>> +++ b/kernel/task_work.c
>>> @@ -3,7 +3,14 @@
>>>  #include <linux/task_work.h>
>>>  #include <linux/tracehook.h>
>>>  
>>> -static struct callback_head work_exited; /* all we need is ->next == NULL */
>>> +static void task_exit_func(struct callback_head *head)
>>> +{
>>> +}
>>> +
>>> +static struct callback_head work_exited = {
>>> +	.next	= NULL,
>>> +	.func	= task_exit_func,
>>> +};
>>
>> Do we really need this? It seems to suggest we're trying to execute
>> work_exited, which would be an error.
>>
>> Doing so would be the result of calling sched_work_run() after
>> exit_task_work(). I suppose that's actually possible.. the problem is
>> that that would reset sched_work to NULL and re-allow queueing works,
>> which would then leak.
>>
>> I'll look at it in more detail tomorrow, I'm tired...
> 
> Let me try and instrument it, I definitely hit it on task exit
> but might have been induced by some earlier bugs.

I suspect I hit this before we added sched_work_run() to
exit_task_work(). I re-ran all the testing, and it definitely doesn't
trigger now.

So I'll remove it for now.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 20:31 ` [PATCH 7/9] io_uring: add per-task callback handler Jens Axboe
@ 2020-02-20 22:02   ` Jann Horn
  2020-02-20 22:14     ` Jens Axboe
                       ` (2 more replies)
  2020-02-21 13:51   ` Pavel Begunkov
  1 sibling, 3 replies; 53+ messages in thread
From: Jann Horn @ 2020-02-20 22:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> For poll requests, it's not uncommon to link a read (or write) after
> the poll to execute immediately after the file is marked as ready.
> Since the poll completion is called inside the waitqueue wake up handler,
> we have to punt that linked request to async context. This slows down
> the processing, and actually means it's faster to not use a link for this
> use case.
>
> We also run into problems if the completion_lock is contended, as we're
> doing a different lock ordering than the issue side is. Hence we have
> to do trylock for completion, and if that fails, go async. Poll removal
> needs to go async as well, for the same reason.
>
> eventfd notification needs special case as well, to avoid stack blowing
> recursion or deadlocks.
>
> These are all deficiencies that were inherited from the aio poll
> implementation, but I think we can do better. When a poll completes,
> simply queue it up in the task poll list. When the task completes the
> list, we can run dependent links inline as well. This means we never
> have to go async, and we can remove a bunch of code associated with
> that, and optimizations to try and make that run faster. The diffstat
> speaks for itself.
[...]
> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> +static void io_poll_task_func(struct callback_head *cb)
>  {
> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> +       struct io_kiocb *nxt = NULL;
>
[...]
> +       io_poll_task_handler(req, &nxt);
> +       if (nxt)
> +               __io_queue_sqe(nxt, NULL);

This can now get here from anywhere that calls schedule(), right?
Which means that this might almost double the required kernel stack
size, if one codepath exists that calls schedule() while near the
bottom of the stack and another codepath exists that goes from here
through the VFS and again uses a big amount of stack space? This is a
somewhat ugly suggestion, but I wonder whether it'd make sense to
check whether we've consumed over 25% of stack space, or something
like that, and if so, directly punt the request.

Also, can we recursively hit this point? Even if __io_queue_sqe()
doesn't *want* to block, the code it calls into might still block on a
mutex or something like that, at which point the mutex code would call
into schedule(), which would then again hit sched_out_update() and get
here, right? As far as I can tell, this could cause unbounded
recursion.

(On modern kernels with CONFIG_VMAP_STACK=y, running out of stack
space on a task stack is "just" a plain kernel oops instead of nasty
memory corruption, but we still should really try to avoid it.)

>  }
[...]
> @@ -3646,46 +3596,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>
>         list_del_init(&poll->wait.entry);
>
[...]
> +       tsk = req->task;
> +       req->result = mask;
> +       init_task_work(&req->sched_work, io_poll_task_func);
> +       sched_work_add(tsk, &req->sched_work);

Doesn't this have to check the return value?

> +       wake_up_process(tsk);
>         return 1;
>  }
>
> @@ -3733,6 +3648,9 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
>
>         events = READ_ONCE(sqe->poll_events);
>         poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
> +
> +       /* task will wait for requests on exit, don't need a ref */
> +       req->task = current;

Can we get here in SQPOLL mode?

>         return 0;
>  }

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:02   ` Jann Horn
@ 2020-02-20 22:14     ` Jens Axboe
  2020-02-20 22:18       ` Jens Axboe
                         ` (2 more replies)
  2020-02-20 22:56     ` Jann Horn
  2020-02-21 10:47     ` Peter Zijlstra
  2 siblings, 3 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 22:14 UTC (permalink / raw)
  To: Jann Horn; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On 2/20/20 3:02 PM, Jann Horn wrote:
> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> For poll requests, it's not uncommon to link a read (or write) after
>> the poll to execute immediately after the file is marked as ready.
>> Since the poll completion is called inside the waitqueue wake up handler,
>> we have to punt that linked request to async context. This slows down
>> the processing, and actually means it's faster to not use a link for this
>> use case.
>>
>> We also run into problems if the completion_lock is contended, as we're
>> doing a different lock ordering than the issue side is. Hence we have
>> to do trylock for completion, and if that fails, go async. Poll removal
>> needs to go async as well, for the same reason.
>>
>> eventfd notification needs special case as well, to avoid stack blowing
>> recursion or deadlocks.
>>
>> These are all deficiencies that were inherited from the aio poll
>> implementation, but I think we can do better. When a poll completes,
>> simply queue it up in the task poll list. When the task completes the
>> list, we can run dependent links inline as well. This means we never
>> have to go async, and we can remove a bunch of code associated with
>> that, and optimizations to try and make that run faster. The diffstat
>> speaks for itself.
> [...]
>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
>> +static void io_poll_task_func(struct callback_head *cb)
>>  {
>> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
>> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
>> +       struct io_kiocb *nxt = NULL;
>>
> [...]
>> +       io_poll_task_handler(req, &nxt);
>> +       if (nxt)
>> +               __io_queue_sqe(nxt, NULL);
> 
> This can now get here from anywhere that calls schedule(), right?
> Which means that this might almost double the required kernel stack
> size, if one codepath exists that calls schedule() while near the
> bottom of the stack and another codepath exists that goes from here
> through the VFS and again uses a big amount of stack space? This is a
> somewhat ugly suggestion, but I wonder whether it'd make sense to
> check whether we've consumed over 25% of stack space, or something
> like that, and if so, directly punt the request.

Right, it'll increase the stack usage. I'm not against adding some
safeguard that punts if we're too deep in, though I'd have to look at
how to even do that... Looks like stack_not_used(), though it's not
clear to me how efficient that is?

> Also, can we recursively hit this point? Even if __io_queue_sqe()
> doesn't *want* to block, the code it calls into might still block on a
> mutex or something like that, at which point the mutex code would call
> into schedule(), which would then again hit sched_out_update() and get
> here, right? As far as I can tell, this could cause unbounded
> recursion.

The sched_work items are pruned before being run, so that can't happen.
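
By pruned I mean the runner detaches the whole list before executing
anything. Roughly this pattern (sketch only; sched_work_run_list() is a
made-up name, not the real implementation):

	static void sched_work_run_list(struct task_struct *tsk)
	{
		struct callback_head *work, *next;

		/* sketch of the detach-then-run pattern, not the actual code */
		/* detach the entire list up front; anything queued after
		 * this point is handled by a later run, not re-entered here */
		work = xchg(&tsk->sched_work, NULL);
		while (work) {
			next = work->next;
			work->func(work);
			work = next;
		}
	}

so the items being run are already off the list by the time we call into
__io_queue_sqe().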

> (On modern kernels with CONFIG_VMAP_STACK=y, running out of stack
> space on a task stack is "just" a plain kernel oops instead of nasty
> memory corruption, but we still should really try to avoid it.)

Certainly!

>> @@ -3646,46 +3596,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>
>>         list_del_init(&poll->wait.entry);
>>
> [...]
>> +       tsk = req->task;
>> +       req->result = mask;
>> +       init_task_work(&req->sched_work, io_poll_task_func);
>> +       sched_work_add(tsk, &req->sched_work);
> 
> Doesn't this have to check the return value?

Trying to think if we can get here with TASK_EXITING, but probably safer
to just handle it in any case. I'll add that.

>> +       wake_up_process(tsk);
>>         return 1;
>>  }
>>
>> @@ -3733,6 +3648,9 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
>>
>>         events = READ_ONCE(sqe->poll_events);
>>         poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
>> +
>> +       /* task will wait for requests on exit, don't need a ref */
>> +       req->task = current;
> 
> Can we get here in SQPOLL mode?

We can, this and the async poll arm should just revert to the old
behavior for SQPOLL. I'll make that change.

Thanks for taking a look!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:14     ` Jens Axboe
@ 2020-02-20 22:18       ` Jens Axboe
  2020-02-20 22:25         ` Jann Horn
  2020-02-20 22:23       ` Jens Axboe
  2020-02-20 22:23       ` Jann Horn
  2 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 22:18 UTC (permalink / raw)
  To: Jann Horn; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On 2/20/20 3:14 PM, Jens Axboe wrote:
>>> @@ -3733,6 +3648,9 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
>>>
>>>         events = READ_ONCE(sqe->poll_events);
>>>         poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
>>> +
>>> +       /* task will wait for requests on exit, don't need a ref */
>>> +       req->task = current;
>>
>> Can we get here in SQPOLL mode?
> 
> We can, this and the async poll arm should just revert to the old
> behavior for SQPOLL. I'll make that change.

Actually, I think that should work fine, are you seeing a reason it
should not?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:14     ` Jens Axboe
  2020-02-20 22:18       ` Jens Axboe
@ 2020-02-20 22:23       ` Jens Axboe
  2020-02-20 22:38         ` Jann Horn
  2020-02-20 22:23       ` Jann Horn
  2 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 22:23 UTC (permalink / raw)
  To: Jann Horn; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On 2/20/20 3:14 PM, Jens Axboe wrote:
>>> @@ -3646,46 +3596,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>>
>>>         list_del_init(&poll->wait.entry);
>>>
>> [...]
>>> +       tsk = req->task;
>>> +       req->result = mask;
>>> +       init_task_work(&req->sched_work, io_poll_task_func);
>>> +       sched_work_add(tsk, &req->sched_work);
>>
>> Doesn't this have to check the return value?
> 
> Trying to think if we can get here with TASK_EXITING, but probably safer
> to just handle it in any case. I'll add that.

Double checked this one, and I think it's good as-is, but needs a
comment. If the sched_work_add() fails, then the work item is still in
the poll hash on the ctx. That work is canceled on exit.
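
IOW, just a comment above the call site, something like:

	/*
	 * If sched_work_add() fails because the task is exiting, the
	 * request is still hashed on the ctx and gets canceled along
	 * with the rest of the pending work when the task goes away.
	 */
	sched_work_add(tsk, &req->sched_work);
	wake_up_process(tsk);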

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:14     ` Jens Axboe
  2020-02-20 22:18       ` Jens Axboe
  2020-02-20 22:23       ` Jens Axboe
@ 2020-02-20 22:23       ` Jann Horn
  2020-02-20 23:00         ` Jens Axboe
  2 siblings, 1 reply; 53+ messages in thread
From: Jann Horn @ 2020-02-20 22:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 2/20/20 3:02 PM, Jann Horn wrote:
> > On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> For poll requests, it's not uncommon to link a read (or write) after
> >> the poll to execute immediately after the file is marked as ready.
> >> Since the poll completion is called inside the waitqueue wake up handler,
> >> we have to punt that linked request to async context. This slows down
> >> the processing, and actually means it's faster to not use a link for this
> >> use case.
> >>
> >> We also run into problems if the completion_lock is contended, as we're
> >> doing a different lock ordering than the issue side is. Hence we have
> >> to do trylock for completion, and if that fails, go async. Poll removal
> >> needs to go async as well, for the same reason.
> >>
> >> eventfd notification needs special case as well, to avoid stack blowing
> >> recursion or deadlocks.
> >>
> >> These are all deficiencies that were inherited from the aio poll
> >> implementation, but I think we can do better. When a poll completes,
> >> simply queue it up in the task poll list. When the task completes the
> >> list, we can run dependent links inline as well. This means we never
> >> have to go async, and we can remove a bunch of code associated with
> >> that, and optimizations to try and make that run faster. The diffstat
> >> speaks for itself.
> > [...]
> >> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> >> +static void io_poll_task_func(struct callback_head *cb)
> >>  {
> >> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> >> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> >> +       struct io_kiocb *nxt = NULL;
> >>
> > [...]
> >> +       io_poll_task_handler(req, &nxt);
> >> +       if (nxt)
> >> +               __io_queue_sqe(nxt, NULL);
> >
> > This can now get here from anywhere that calls schedule(), right?
> > Which means that this might almost double the required kernel stack
> > size, if one codepath exists that calls schedule() while near the
> > bottom of the stack and another codepath exists that goes from here
> > through the VFS and again uses a big amount of stack space? This is a
> > somewhat ugly suggestion, but I wonder whether it'd make sense to
> > check whether we've consumed over 25% of stack space, or something
> > like that, and if so, directly punt the request.
>
> Right, it'll increase the stack usage. I'm not against adding some
> safeguard that punts if we're too deep in, though I'd have to look at
> how to even do that... Looks like stack_not_used(), though it's not
> clear to me how efficient that is?

No, I don't think you want to do that... at least on X86-64, I think
something vaguely like this should do the job:

unsigned long cur_stack = (unsigned long)__builtin_frame_address(0);
unsigned long begin = (unsigned long)task_stack_page(task);
unsigned long end   = (unsigned long)task_stack_page(task) + THREAD_SIZE;
if (cur_stack < begin || cur_stack >= end ||
    cur_stack < begin + THREAD_SIZE*3/4)
  [bailout]

But since stacks grow in different directions depending on the
architecture and so on, it might have to be an arch-specific thing...
I'm not sure.
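
The bailout itself could presumably just reuse the normal async punt at
the spot that currently does __io_queue_sqe(nxt, NULL), something like
this (sketch; stack_too_deep() and io_queue_async_work() as the punt
path are stand-ins for whatever is actually available there):

	/* stack_too_deep() is a stand-in for whatever check we settle on */
	if (stack_too_deep(current)) {
		/* too deep to run this inline, hand it off to the workers */
		io_queue_async_work(nxt);
		return;
	}
	__io_queue_sqe(nxt, NULL);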

> > Also, can we recursively hit this point? Even if __io_queue_sqe()
> > doesn't *want* to block, the code it calls into might still block on a
> > mutex or something like that, at which point the mutex code would call
> > into schedule(), which would then again hit sched_out_update() and get
> > here, right? As far as I can tell, this could cause unbounded
> > recursion.
>
> The sched_work items are pruned before being run, so that can't happen.

And is it impossible for new ones to be added in the meantime if a
second poll operation completes in the background just when we're
entering __io_queue_sqe()?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:18       ` Jens Axboe
@ 2020-02-20 22:25         ` Jann Horn
  0 siblings, 0 replies; 53+ messages in thread
From: Jann Horn @ 2020-02-20 22:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On Thu, Feb 20, 2020 at 11:18 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 2/20/20 3:14 PM, Jens Axboe wrote:
> >>> @@ -3733,6 +3648,9 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
> >>>
> >>>         events = READ_ONCE(sqe->poll_events);
> >>>         poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
> >>> +
> >>> +       /* task will wait for requests on exit, don't need a ref */
> >>> +       req->task = current;
> >>
> >> Can we get here in SQPOLL mode?
> >
> > We can, this and the async poll arm should just revert to the old
> > behavior for SQPOLL. I'll make that change.
>
> Actually, I think that should work fine, are you seeing a reason it
> should not?

Hm, no, I guess that might work.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:23       ` Jens Axboe
@ 2020-02-20 22:38         ` Jann Horn
  2020-02-20 22:56           ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Jann Horn @ 2020-02-20 22:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On Thu, Feb 20, 2020 at 11:23 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 2/20/20 3:14 PM, Jens Axboe wrote:
> >>> @@ -3646,46 +3596,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
> >>>
> >>>         list_del_init(&poll->wait.entry);
> >>>
> >> [...]
> >>> +       tsk = req->task;
> >>> +       req->result = mask;
> >>> +       init_task_work(&req->sched_work, io_poll_task_func);
> >>> +       sched_work_add(tsk, &req->sched_work);
> >>
> >> Doesn't this have to check the return value?
> >
> > Trying to think if we can get here with TASK_EXITING, but probably safer
> > to just handle it in any case. I'll add that.
>
> Double checked this one, and I think it's good as-is, but needs a
> comment. If the sched_work_add() fails, then the work item is still in
> the poll hash on the ctx. That work is canceled on exit.

You mean via io_poll_remove_all()? That doesn't happen when a thread
dies, right?

As far as I can tell, the following might happen:

1. process with threads A and B set up uring
2. thread B submits chained requests poll->read
3. thread A waits for request completion
4. thread B dies
5. poll waitqueue is notified, data is ready

Even if there isn't a memory leak, you'd still want the read request
to execute at some point so that thread A can see the result, right?

And actually, in this scenario, wouldn't the req->task be a dangling
pointer, since you're not holding a reference? Or is there some magic
callback from do_exit() to io_uring that I missed? There is a comment
"/* task will wait for requests on exit, don't need a ref */", but I
don't see how that works...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:38         ` Jann Horn
@ 2020-02-20 22:56           ` Jens Axboe
  2020-02-20 22:58             ` Jann Horn
  0 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 22:56 UTC (permalink / raw)
  To: Jann Horn; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On 2/20/20 3:38 PM, Jann Horn wrote:
> On Thu, Feb 20, 2020 at 11:23 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 2/20/20 3:14 PM, Jens Axboe wrote:
>>>>> @@ -3646,46 +3596,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>>>>
>>>>>         list_del_init(&poll->wait.entry);
>>>>>
>>>> [...]
>>>>> +       tsk = req->task;
>>>>> +       req->result = mask;
>>>>> +       init_task_work(&req->sched_work, io_poll_task_func);
>>>>> +       sched_work_add(tsk, &req->sched_work);
>>>>
>>>> Doesn't this have to check the return value?
>>>
>>> Trying to think if we can get here with TASK_EXITING, but probably safer
>>> to just handle it in any case. I'll add that.
>>
>> Double checked this one, and I think it's good as-is, but needs a
>> comment. If the sched_work_add() fails, then the work item is still in
>> the poll hash on the ctx. That work is canceled on exit.
> 
> You mean via io_poll_remove_all()? That doesn't happen when a thread
> dies, right?

Off of io_uring_flush, we do:

if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) {
	io_uring_cancel_task_poll(current);
	io_uring_cancel_task_async(current);
	io_wq_cancel_pid(ctx->io_wq, task_pid_vnr(current));
}

to cancel _anything_ that the task has pending.

> As far as I can tell, the following might happen:
> 
> 1. process with threads A and B set up uring
> 2. thread B submits chained requests poll->read
> 3. thread A waits for request completion
> 4. thread B dies
> 5. poll waitqueue is notified, data is ready

Unless I'm mistaken, when B dies, the requests from #2 will be canceled.

> Even if there isn't a memory leak, you'd still want the read request
> to execute at some point so that thread A can see the result, right?

It just needs to complete, if the task is going away, then a cancelation
is fine too.

> And actually, in this scenario, wouldn't the req->task be a dangling
> pointer, since you're not holding a reference? Or is there some magic
> callback from do_exit() to io_uring that I missed? There is a comment
> "/* task will wait for requests on exit, don't need a ref */", but I
> don't see how that works...

That'd only be the case if we didn't cancel requests when it dies. I'll
double check if that's 100% the case.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:02   ` Jann Horn
  2020-02-20 22:14     ` Jens Axboe
@ 2020-02-20 22:56     ` Jann Horn
  2020-02-21 10:47     ` Peter Zijlstra
  2 siblings, 0 replies; 53+ messages in thread
From: Jann Horn @ 2020-02-20 22:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On Thu, Feb 20, 2020 at 11:02 PM Jann Horn <jannh@google.com> wrote:
> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > For poll requests, it's not uncommon to link a read (or write) after
> > the poll to execute immediately after the file is marked as ready.
> > Since the poll completion is called inside the waitqueue wake up handler,
> > we have to punt that linked request to async context. This slows down
> > the processing, and actually means it's faster to not use a link for this
> > use case.
> >
> > We also run into problems if the completion_lock is contended, as we're
> > doing a different lock ordering than the issue side is. Hence we have
> > to do trylock for completion, and if that fails, go async. Poll removal
> > needs to go async as well, for the same reason.
> >
> > eventfd notification needs special case as well, to avoid stack blowing
> > recursion or deadlocks.
> >
> > These are all deficiencies that were inherited from the aio poll
> > implementation, but I think we can do better. When a poll completes,
> > simply queue it up in the task poll list. When the task completes the
> > list, we can run dependent links inline as well. This means we never
> > have to go async, and we can remove a bunch of code associated with
> > that, and optimizations to try and make that run faster. The diffstat
> > speaks for itself.
> [...]
> > -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> > +static void io_poll_task_func(struct callback_head *cb)
> >  {
> > -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> > +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> > +       struct io_kiocb *nxt = NULL;
> >
> [...]
> > +       io_poll_task_handler(req, &nxt);
> > +       if (nxt)
> > +               __io_queue_sqe(nxt, NULL);
>
> This can now get here from anywhere that calls schedule(), right?
> Which means that this might almost double the required kernel stack
> size, if one codepath exists that calls schedule() while near the
> bottom of the stack and another codepath exists that goes from here
> through the VFS and again uses a big amount of stack space?

Oh, I think this also implies that any mutex reachable via any of the
nonblocking uring ops nests inside any mutex under which we happen to
schedule(), right? I wonder whether that's going to cause deadlocks...

For example, FUSE's ->read_iter() can call fuse_direct_io(), which can
call inode_lock() and then call fuse_sync_writes() under the inode
lock, which can wait_event(), which can schedule(); and if uring then
from schedule() calls ->read_iter() again, you could reach
inode_lock() on the same inode again, causing a deadlock, I think?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:56           ` Jens Axboe
@ 2020-02-20 22:58             ` Jann Horn
  2020-02-20 23:02               ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Jann Horn @ 2020-02-20 22:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On Thu, Feb 20, 2020 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 2/20/20 3:38 PM, Jann Horn wrote:
> > On Thu, Feb 20, 2020 at 11:23 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 2/20/20 3:14 PM, Jens Axboe wrote:
> >>>>> @@ -3646,46 +3596,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
> >>>>>
> >>>>>         list_del_init(&poll->wait.entry);
> >>>>>
> >>>> [...]
> >>>>> +       tsk = req->task;
> >>>>> +       req->result = mask;
> >>>>> +       init_task_work(&req->sched_work, io_poll_task_func);
> >>>>> +       sched_work_add(tsk, &req->sched_work);
> >>>>
> >>>> Doesn't this have to check the return value?
> >>>
> >>> Trying to think if we can get here with TASK_EXITING, but probably safer
> >>> to just handle it in any case. I'll add that.
> >>
> >> Double checked this one, and I think it's good as-is, but needs a
> >> comment. If the sched_work_add() fails, then the work item is still in
> >> the poll hash on the ctx. That work is canceled on exit.
> >
> > You mean via io_poll_remove_all()? That doesn't happen when a thread
> > dies, right?
>
> Off of io_uring_flush, we do:
>
> if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) {
>         io_uring_cancel_task_poll(current);
>         io_uring_cancel_task_async(current);
>         io_wq_cancel_pid(ctx->io_wq, task_pid_vnr(current));
> }
>
> to cancel _anything_ that the task has pending.

->flush() is only for when the uring instance is dropped from a file
descriptor table; threads typically share their file descriptor
tables, and therefore won't ->flush() until the last one dies.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:23       ` Jann Horn
@ 2020-02-20 23:00         ` Jens Axboe
  2020-02-20 23:12           ` Jann Horn
  0 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 23:00 UTC (permalink / raw)
  To: Jann Horn; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On 2/20/20 3:23 PM, Jann Horn wrote:
> On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 2/20/20 3:02 PM, Jann Horn wrote:
>>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> For poll requests, it's not uncommon to link a read (or write) after
>>>> the poll to execute immediately after the file is marked as ready.
>>>> Since the poll completion is called inside the waitqueue wake up handler,
>>>> we have to punt that linked request to async context. This slows down
>>>> the processing, and actually means it's faster to not use a link for this
>>>> use case.
>>>>
>>>> We also run into problems if the completion_lock is contended, as we're
>>>> doing a different lock ordering than the issue side is. Hence we have
>>>> to do trylock for completion, and if that fails, go async. Poll removal
>>>> needs to go async as well, for the same reason.
>>>>
>>>> eventfd notification needs special case as well, to avoid stack blowing
>>>> recursion or deadlocks.
>>>>
>>>> These are all deficiencies that were inherited from the aio poll
>>>> implementation, but I think we can do better. When a poll completes,
>>>> simply queue it up in the task poll list. When the task completes the
>>>> list, we can run dependent links inline as well. This means we never
>>>> have to go async, and we can remove a bunch of code associated with
>>>> that, and optimizations to try and make that run faster. The diffstat
>>>> speaks for itself.
>>> [...]
>>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
>>>> +static void io_poll_task_func(struct callback_head *cb)
>>>>  {
>>>> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
>>>> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
>>>> +       struct io_kiocb *nxt = NULL;
>>>>
>>> [...]
>>>> +       io_poll_task_handler(req, &nxt);
>>>> +       if (nxt)
>>>> +               __io_queue_sqe(nxt, NULL);
>>>
>>> This can now get here from anywhere that calls schedule(), right?
>>> Which means that this might almost double the required kernel stack
>>> size, if one codepath exists that calls schedule() while near the
>>> bottom of the stack and another codepath exists that goes from here
>>> through the VFS and again uses a big amount of stack space? This is a
>>> somewhat ugly suggestion, but I wonder whether it'd make sense to
>>> check whether we've consumed over 25% of stack space, or something
>>> like that, and if so, directly punt the request.
>>
>> Right, it'll increase the stack usage. I'm not against adding some
>> safeguard that punts if we're too deep in, though I'd have to look at
>> how to even do that... Looks like stack_not_used(), though it's not
>> clear to me how efficient that is?
> 
> No, I don't think you want to do that... at least on X86-64, I think
> something vaguely like this should do the job:
> 
> unsigned long cur_stack = (unsigned long)__builtin_frame_address(0);
> unsigned long begin = (unsigned long)task_stack_page(task);
> unsigned long end   = (unsigned long)task_stack_page(task) + THREAD_SIZE;
> if (cur_stack < begin || cur_stack >= end ||
>     cur_stack < begin + THREAD_SIZE*3/4)
>   [bailout]
> 
> But since stacks grow in different directions depending on the
> architecture and so on, it might have to be an arch-specific thing...
> I'm not sure.

Yeah, that's fun... Probably a good first attempt is to wire up
an x86-64 variant that works, and base a fallback on stack_not_used()
for archs that don't provide their own. Hopefully that'll get rectified
as time progresses.
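
Something along these lines is what I have in mind (untested sketch, and
io_stack_too_deep() is a made-up name):

	/* made-up helper, x86-64 only: true if more than 25% of the stack is used */
	static bool io_stack_too_deep(void)
	{
		unsigned long sp = (unsigned long)__builtin_frame_address(0);
		unsigned long start = (unsigned long)task_stack_page(current);

		/* the stack grows down from start + THREAD_SIZE towards start */
		return sp < start + 3 * THREAD_SIZE / 4;
	}

with other archs either providing their own variant, or something built
on top of what stack_not_used() does.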

>>> Also, can we recursively hit this point? Even if __io_queue_sqe()
>>> doesn't *want* to block, the code it calls into might still block on a
>>> mutex or something like that, at which point the mutex code would call
>>> into schedule(), which would then again hit sched_out_update() and get
>>> here, right? As far as I can tell, this could cause unbounded
>>> recursion.
>>
>> The sched_work items are pruned before being run, so that can't happen.
> 
> And is it impossible for new ones to be added in the meantime if a
> second poll operation completes in the background just when we're
> entering __io_queue_sqe()?

True, that can happen.

I wonder, if we just prevent the recursion, whether we can ignore most
of it. E.g. never process the sched_work list if we're not at the top
level, so to speak.
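
Roughly (untested; both ->in_sched_work and __sched_work_run() are made
up here, standing in for the real list processing):

	void sched_work_run(void)
	{
		struct task_struct *tsk = current;

		/* made-up flag: don't re-enter from a nested schedule() */
		if (tsk->in_sched_work)
			return;
		tsk->in_sched_work = true;
		__sched_work_run(tsk);	/* stand-in for the actual list run */
		tsk->in_sched_work = false;
	}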

This should also prevent the deadlock that you mentioned with FUSE
in the next email that just rolled in.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:58             ` Jann Horn
@ 2020-02-20 23:02               ` Jens Axboe
  0 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 23:02 UTC (permalink / raw)
  To: Jann Horn; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On 2/20/20 3:58 PM, Jann Horn wrote:
> On Thu, Feb 20, 2020 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 2/20/20 3:38 PM, Jann Horn wrote:
>>> On Thu, Feb 20, 2020 at 11:23 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 2/20/20 3:14 PM, Jens Axboe wrote:
>>>>>>> @@ -3646,46 +3596,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>>>>>>
>>>>>>>         list_del_init(&poll->wait.entry);
>>>>>>>
>>>>>> [...]
>>>>>>> +       tsk = req->task;
>>>>>>> +       req->result = mask;
>>>>>>> +       init_task_work(&req->sched_work, io_poll_task_func);
>>>>>>> +       sched_work_add(tsk, &req->sched_work);
>>>>>>
>>>>>> Doesn't this have to check the return value?
>>>>>
>>>>> Trying to think if we can get here with TASK_EXITING, but probably safer
>>>>> to just handle it in any case. I'll add that.
>>>>
>>>> Double checked this one, and I think it's good as-is, but needs a
>>>> comment. If the sched_work_add() fails, then the work item is still in
>>>> the poll hash on the ctx. That work is canceled on exit.
>>>
>>> You mean via io_poll_remove_all()? That doesn't happen when a thread
>>> dies, right?
>>
>> Off of io_uring_flush, we do:
>>
>> if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) {
>>         io_uring_cancel_task_poll(current);
>>         io_uring_cancel_task_async(current);
>>         io_wq_cancel_pid(ctx->io_wq, task_pid_vnr(current));
>> }
>>
>> to cancel _anything_ that the task has pending.
> 
> ->flush() is only for when the uring instance is dropped from a file
> descriptor table; threads typically share their file descriptor
> tables, and therefore won't ->flush() until the last one dies.

True, then I guess I'll need some other notifier for that particular
case. Might be able to use sched_work_run() for that, since we know
that'll definitely get called when the task exits.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 23:00         ` Jens Axboe
@ 2020-02-20 23:12           ` Jann Horn
  2020-02-20 23:22             ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Jann Horn @ 2020-02-20 23:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On Fri, Feb 21, 2020 at 12:00 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 2/20/20 3:23 PM, Jann Horn wrote:
> > On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 2/20/20 3:02 PM, Jann Horn wrote:
> >>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> For poll requests, it's not uncommon to link a read (or write) after
> >>>> the poll to execute immediately after the file is marked as ready.
> >>>> Since the poll completion is called inside the waitqueue wake up handler,
> >>>> we have to punt that linked request to async context. This slows down
> >>>> the processing, and actually means it's faster to not use a link for this
> >>>> use case.
[...]
> >>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> >>>> +static void io_poll_task_func(struct callback_head *cb)
> >>>>  {
> >>>> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> >>>> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> >>>> +       struct io_kiocb *nxt = NULL;
> >>>>
> >>> [...]
> >>>> +       io_poll_task_handler(req, &nxt);
> >>>> +       if (nxt)
> >>>> +               __io_queue_sqe(nxt, NULL);
> >>>
> >>> This can now get here from anywhere that calls schedule(), right?
> >>> Which means that this might almost double the required kernel stack
> >>> size, if one codepath exists that calls schedule() while near the
> >>> bottom of the stack and another codepath exists that goes from here
> >>> through the VFS and again uses a big amount of stack space? This is a
> >>> somewhat ugly suggestion, but I wonder whether it'd make sense to
> >>> check whether we've consumed over 25% of stack space, or something
> >>> like that, and if so, directly punt the request.
[...]
> >>> Also, can we recursively hit this point? Even if __io_queue_sqe()
> >>> doesn't *want* to block, the code it calls into might still block on a
> >>> mutex or something like that, at which point the mutex code would call
> >>> into schedule(), which would then again hit sched_out_update() and get
> >>> here, right? As far as I can tell, this could cause unbounded
> >>> recursion.
> >>
> >> The sched_work items are pruned before being run, so that can't happen.
> >
> > And is it impossible for new ones to be added in the meantime if a
> > second poll operation completes in the background just when we're
> > entering __io_queue_sqe()?
>
> True, that can happen.
>
> I wonder, if we just prevent the recursion, whether we can ignore most
> of it. E.g. never process the sched_work list if we're not at the top
> level, so to speak.
>
> This should also prevent the deadlock that you mentioned with FUSE
> in the next email that just rolled in.

But there the first ->read_iter could be from outside io_uring. So you
don't just have to worry about nesting inside an already-running uring
work; you also have to worry about nesting inside more or less
anything else that might be holding mutexes. So I think you'd pretty
much have to whitelist known-safe schedule() callers, or something
like that.

Taking a step back: Do you know why this whole approach brings the
kind of performance benefit you mentioned in the cover letter? 4x is a
lot... Is it that expensive to take a trip through the scheduler?
I wonder whether the performance numbers for the echo test would
change if you commented out io_worker_spin_for_work()...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 23:12           ` Jann Horn
@ 2020-02-20 23:22             ` Jens Axboe
  2020-02-21  1:29               ` Jann Horn
  0 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-20 23:22 UTC (permalink / raw)
  To: Jann Horn; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On 2/20/20 4:12 PM, Jann Horn wrote:
> On Fri, Feb 21, 2020 at 12:00 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 2/20/20 3:23 PM, Jann Horn wrote:
>>> On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 2/20/20 3:02 PM, Jann Horn wrote:
>>>>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>> For poll requests, it's not uncommon to link a read (or write) after
>>>>>> the poll to execute immediately after the file is marked as ready.
>>>>>> Since the poll completion is called inside the waitqueue wake up handler,
>>>>>> we have to punt that linked request to async context. This slows down
>>>>>> the processing, and actually means it's faster to not use a link for this
>>>>>> use case.
> [...]
>>>>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
>>>>>> +static void io_poll_task_func(struct callback_head *cb)
>>>>>>  {
>>>>>> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
>>>>>> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
>>>>>> +       struct io_kiocb *nxt = NULL;
>>>>>>
>>>>> [...]
>>>>>> +       io_poll_task_handler(req, &nxt);
>>>>>> +       if (nxt)
>>>>>> +               __io_queue_sqe(nxt, NULL);
>>>>>
>>>>> This can now get here from anywhere that calls schedule(), right?
>>>>> Which means that this might almost double the required kernel stack
>>>>> size, if one codepath exists that calls schedule() while near the
>>>>> bottom of the stack and another codepath exists that goes from here
>>>>> through the VFS and again uses a big amount of stack space? This is a
>>>>> somewhat ugly suggestion, but I wonder whether it'd make sense to
>>>>> check whether we've consumed over 25% of stack space, or something
>>>>> like that, and if so, directly punt the request.
> [...]
>>>>> Also, can we recursively hit this point? Even if __io_queue_sqe()
>>>>> doesn't *want* to block, the code it calls into might still block on a
>>>>> mutex or something like that, at which point the mutex code would call
>>>>> into schedule(), which would then again hit sched_out_update() and get
>>>>> here, right? As far as I can tell, this could cause unbounded
>>>>> recursion.
>>>>
>>>> The sched_work items are pruned before being run, so that can't happen.
>>>
>>> And is it impossible for new ones to be added in the meantime if a
>>> second poll operation completes in the background just when we're
>>> entering __io_queue_sqe()?
>>
>> True, that can happen.
>>
>> I wonder, if we just prevent the recursion, whether we can ignore most
>> of it. E.g. never process the sched_work list if we're not at the top
>> level, so to speak.
>>
>> This should also prevent the deadlock that you mentioned with FUSE
>> in the next email that just rolled in.
> 
> But there the first ->read_iter could be from outside io_uring. So you
> don't just have to worry about nesting inside an already-running uring
> work; you also have to worry about nesting inside more or less
> anything else that might be holding mutexes. So I think you'd pretty
> much have to whitelist known-safe schedule() callers, or something
> like that.

I'll see if I can come up with something for that. Ideally any issue
with IOCB_NOWAIT set should be honored, and trylock etc should be used.
But I don't think we can fully rely on that, we need something a bit
more solid...

> Taking a step back: Do you know why this whole approach brings the
> kind of performance benefit you mentioned in the cover letter? 4x is a
> lot... Is it that expensive to take a trip through the scheduler?
> I wonder whether the performance numbers for the echo test would
> change if you commented out io_worker_spin_for_work()...

If anything, I expect the spin removal to make it worse. There's really
no magic there on why it's faster, if you offload work to a thread that
is essentially sync, then you're going to take a huge hit in
performance. It's the difference between:

1) Queue work with thread, wake up thread
2) Thread wakes, starts work, goes to sleep.
3) Data available, thread is woken, does work
4) Thread signals completion of work

versus just completing the work when it's ready and not having any
switches to a worker thread at all. As the cover letter mentions, the
single client case is a huge win, and that is of course the biggest win
because everything is idle. If the thread doing the offload can be kept
running, the gains become smaller as we're not paying those wake/sleep
penalties anymore.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 23:22             ` Jens Axboe
@ 2020-02-21  1:29               ` Jann Horn
  2020-02-21 17:32                 ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Jann Horn @ 2020-02-21  1:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On Fri, Feb 21, 2020 at 12:22 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 2/20/20 4:12 PM, Jann Horn wrote:
> > On Fri, Feb 21, 2020 at 12:00 AM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 2/20/20 3:23 PM, Jann Horn wrote:
> >>> On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> On 2/20/20 3:02 PM, Jann Horn wrote:
> >>>>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>> For poll requests, it's not uncommon to link a read (or write) after
> >>>>>> the poll to execute immediately after the file is marked as ready.
> >>>>>> Since the poll completion is called inside the waitqueue wake up handler,
> >>>>>> we have to punt that linked request to async context. This slows down
> >>>>>> the processing, and actually means it's faster to not use a link for this
> >>>>>> use case.
> > [...]
> >>>>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> >>>>>> +static void io_poll_task_func(struct callback_head *cb)
> >>>>>>  {
> >>>>>> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> >>>>>> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> >>>>>> +       struct io_kiocb *nxt = NULL;
> >>>>>>
> >>>>> [...]
> >>>>>> +       io_poll_task_handler(req, &nxt);
> >>>>>> +       if (nxt)
> >>>>>> +               __io_queue_sqe(nxt, NULL);
> >>>>>
> >>>>> This can now get here from anywhere that calls schedule(), right?
> >>>>> Which means that this might almost double the required kernel stack
> >>>>> size, if one codepath exists that calls schedule() while near the
> >>>>> bottom of the stack and another codepath exists that goes from here
> >>>>> through the VFS and again uses a big amount of stack space? This is a
> >>>>> somewhat ugly suggestion, but I wonder whether it'd make sense to
> >>>>> check whether we've consumed over 25% of stack space, or something
> >>>>> like that, and if so, directly punt the request.
> > [...]
> >>>>> Also, can we recursively hit this point? Even if __io_queue_sqe()
> >>>>> doesn't *want* to block, the code it calls into might still block on a
> >>>>> mutex or something like that, at which point the mutex code would call
> >>>>> into schedule(), which would then again hit sched_out_update() and get
> >>>>> here, right? As far as I can tell, this could cause unbounded
> >>>>> recursion.
> >>>>
> >>>> The sched_work items are pruned before being run, so that can't happen.
> >>>
> >>> And is it impossible for new ones to be added in the meantime if a
> >>> second poll operation completes in the background just when we're
> >>> entering __io_queue_sqe()?
> >>
> >> True, that can happen.
> >>
> >> I wonder, if we just prevent the recursion, whether we can ignore most
> >> of it. E.g. never process the sched_work list if we're not at the top
> >> level, so to speak.
> >>
> >> This should also prevent the deadlock that you mentioned with FUSE
> >> in the next email that just rolled in.
> >
> > But there the first ->read_iter could be from outside io_uring. So you
> > don't just have to worry about nesting inside an already-running uring
> > work; you also have to worry about nesting inside more or less
> > anything else that might be holding mutexes. So I think you'd pretty
> > much have to whitelist known-safe schedule() callers, or something
> > like that.
>
> I'll see if I can come up with something for that. Ideally any issue
> with IOCB_NOWAIT set should be honored, and trylock etc should be used.

Are you sure? For example, an IO operation typically copies data to
userspace, which can take pagefaults. And those should be handled
synchronously even with IOCB_NOWAIT set, right? And the page fault
code can block on mutexes (like the mmap_sem) or even wait for a
blocking filesystem operation (via file mappings) or for userspace
(via userfaultfd or FUSE mappings).

> But I don't think we can fully rely on that, we need something a bit
> more solid...
>
> > Taking a step back: Do you know why this whole approach brings the
> > kind of performance benefit you mentioned in the cover letter? 4x is a
> > lot... Is it that expensive to take a trip through the scheduler?
> > I wonder whether the performance numbers for the echo test would
> > change if you commented out io_worker_spin_for_work()...
>
> If anything, I expect the spin removal to make it worse. There's really
> no magic there on why it's faster, if you offload work to a thread that
> is essentially sync, then you're going to take a huge hit in
> performance. It's the difference between:
>
> 1) Queue work with thread, wake up thread
> 2) Thread wakes, starts work, goes to sleep.

If we go to sleep here, then the other side hasn't yet sent us
anything, so up to this point, it shouldn't have any impact on the
measured throughput, right?

> 3) Data available, thread is woken, does work

This is the same in the other case: Data is available, the
application's thread is woken and does the work.

> 4) Thread signals completion of work

And this is also basically the same, except that in the worker-thread
case, we have to go through the scheduler to reach userspace, while
with this patch series, we can signal "work is completed" and return
to userspace without an extra trip through the scheduler.

I could imagine this optimization having some performance benefit, but
I'm still sceptical about it buying a 4x benefit without some more
complicated reason behind it.

> versus just completing the work when it's ready and not having any
> switches to a worker thread at all. As the cover letter mentions, the
> single client case is a huge win, and that is of course the biggest win
> because everything is idle. If the thread doing the offload can be kept
> running, the gains become smaller as we're not paying those wake/sleep
> penalties anymore.

I'd really like to see what the scheduler behavior looks like here,
for this single-client echo test. I can imagine three cases (which I
guess are probably going to be mixed because the scheduler moves tasks
around; but I don't actually know much about how the scheduler works,
so my guesses are probably not very helpful):

Case 1: Both the worker and the userspace task are on the same CPU. In
this case, the worker will waste something on the order of 10000
cycles for every message while userspace is runnable, unless the
scheduler decides that the worker has spent so much time on the CPU
that it should relinquish it to the userspace task. (You test for
need_resched() in the busyloop, but AFAIK that just asks the scheduler
whether it wants you to get off the CPU right now, not whether there
are any other runnable tasks on the CPU at the moment.)
Case 2: The worker and the userspace task are on different *physical*
cores and don't share L1D and L2. This will cause a performance
penalty due to constant cacheline bouncing.
Case 3: The worker and the userspace task are on hyperthreads, so they
share L1D and L2 and can run concurrently. Compared to the other two
cases, this would probably work best, but I'm not sure whether the
scheduler is smart enough to specifically target this behavior? (And
if you're running inside a VM, or on a system without hyperthreading,
this isn't even an option.)

So I wonder what things look like if you force the worker and the
userspace task to run on the same CPU without any idle polling in the
worker; or how it looks when you pin the worker and the userspace task
on hyperthreads. And if that does make a difference, it might be worth
considering whether the interaction between io_uring and the scheduler
could be optimized.
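
Something like this on the userspace side would do for that experiment -
just a sketch, with the io-wq worker's pid hard-coded as a placeholder
(it would have to be looked up via /proc, and the kernel may or may not
allow changing the worker's affinity at all):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

static int pin_to_cpu(pid_t pid, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(pid, sizeof(set), &set);
}

int main(void)
{
	pid_t io_wq_worker = 1234;	/* placeholder: pid of the io-wq worker */

	/* put the echo server task and the worker on the same CPU */
	if (pin_to_cpu(0, 0) || pin_to_cpu(io_wq_worker, 0))
		perror("sched_setaffinity");

	/* ... run the echo benchmark from here ... */
	return 0;
}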

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 22:02   ` Jann Horn
  2020-02-20 22:14     ` Jens Axboe
  2020-02-20 22:56     ` Jann Horn
@ 2020-02-21 10:47     ` Peter Zijlstra
  2020-02-21 14:49       ` Jens Axboe
  2 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2020-02-21 10:47 UTC (permalink / raw)
  To: Jann Horn; +Cc: Jens Axboe, io-uring, Glauber Costa, Pavel Begunkov

On Thu, Feb 20, 2020 at 11:02:16PM +0100, Jann Horn wrote:
> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > For poll requests, it's not uncommon to link a read (or write) after
> > the poll to execute immediately after the file is marked as ready.
> > Since the poll completion is called inside the waitqueue wake up handler,
> > we have to punt that linked request to async context. This slows down
> > the processing, and actually means it's faster to not use a link for this
> > use case.
> >
> > We also run into problems if the completion_lock is contended, as we're
> > doing a different lock ordering than the issue side is. Hence we have
> > to do trylock for completion, and if that fails, go async. Poll removal
> > needs to go async as well, for the same reason.
> >
> > eventfd notification needs special case as well, to avoid stack blowing
> > recursion or deadlocks.
> >
> > These are all deficiencies that were inherited from the aio poll
> > implementation, but I think we can do better. When a poll completes,
> > simply queue it up in the task poll list. When the task completes the
> > list, we can run dependent links inline as well. This means we never
> > have to go async, and we can remove a bunch of code associated with
> > that, and optimizations to try and make that run faster. The diffstat
> > speaks for itself.
> [...]
> > -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> > +static void io_poll_task_func(struct callback_head *cb)
> >  {
> > -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> > +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> > +       struct io_kiocb *nxt = NULL;
> >
> [...]
> > +       io_poll_task_handler(req, &nxt);
> > +       if (nxt)
> > +               __io_queue_sqe(nxt, NULL);
> 
> This can now get here from anywhere that calls schedule(), right?
> Which means that this might almost double the required kernel stack
> size, if one codepath exists that calls schedule() while near the
> bottom of the stack and another codepath exists that goes from here
> through the VFS and again uses a big amount of stack space? This is a
> somewhat ugly suggestion, but I wonder whether it'd make sense to
> check whether we've consumed over 25% of stack space, or something
> like that, and if so, directly punt the request.

I'm still completely confused as to how io_uring works, and consequently
the ramifications of all this.

But I thought I understood that these sched_work things were only
queued on tasks that were stuck waiting on POLL (or its io_uring
equivalent). Earlier patches were explicitly running things from
io_cqring_wait(), which might have given me this impression.

The above seems to suggest this is not the case. Which then does indeed
lead to all the worries expressed by Jann. All sorts of nasty nesting is
possible with this.

Can someone please spell this out for me?

Afaict the req->tsk=current thing is set for whoever happens to run
io_poll_add_prep(), which is either a sys_io_uring_enter() or an io-wq
thread.

But I'm then unsure what happens to that thread afterwards.

Jens, what exactly is the benefit of running this on every random
schedule() vs in io_cqring_wait()? Or even, since io_cqring_wait() is
the very last thing the syscall does, task_work?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-20 20:31 ` [PATCH 7/9] io_uring: add per-task callback handler Jens Axboe
  2020-02-20 22:02   ` Jann Horn
@ 2020-02-21 13:51   ` Pavel Begunkov
  2020-02-21 14:50     ` Jens Axboe
  1 sibling, 1 reply; 53+ messages in thread
From: Pavel Begunkov @ 2020-02-21 13:51 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: glauber, peterz, Jann Horn

On 20/02/2020 23:31, Jens Axboe wrote:
> For poll requests, it's not uncommon to link a read (or write) after
> the poll to execute immediately after the file is marked as ready.
> Since the poll completion is called inside the waitqueue wake up handler,
> we have to punt that linked request to async context. This slows down
> the processing, and actually means it's faster to not use a link for this
> use case.
> 
> We also run into problems if the completion_lock is contended, as we're
> doing a different lock ordering than the issue side is. Hence we have
> to do trylock for completion, and if that fails, go async. Poll removal
> needs to go async as well, for the same reason.
> 
> eventfd notification needs special case as well, to avoid stack blowing
> recursion or deadlocks.
> 
> These are all deficiencies that were inherited from the aio poll
> implementation, but I think we can do better. When a poll completes,
> simply queue it up in the task poll list. When the task completes the
> list, we can run dependent links inline as well. This means we never
> have to go async, and we can remove a bunch of code associated with
> that, and optimizations to try and make that run faster. The diffstat
> speaks for itself.

So, it piggybacks request execution onto a random task that happens to complete
a poll. Did I get it right?

I can't find where it sets the right mm, creds, etc., or why it has them already.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 10:47     ` Peter Zijlstra
@ 2020-02-21 14:49       ` Jens Axboe
  2020-02-21 15:02         ` Jann Horn
  2020-02-21 16:23         ` Peter Zijlstra
  0 siblings, 2 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-21 14:49 UTC (permalink / raw)
  To: Peter Zijlstra, Jann Horn; +Cc: io-uring, Glauber Costa, Pavel Begunkov

On 2/21/20 3:47 AM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2020 at 11:02:16PM +0100, Jann Horn wrote:
>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> For poll requests, it's not uncommon to link a read (or write) after
>>> the poll to execute immediately after the file is marked as ready.
>>> Since the poll completion is called inside the waitqueue wake up handler,
>>> we have to punt that linked request to async context. This slows down
>>> the processing, and actually means it's faster to not use a link for this
>>> use case.
>>>
>>> We also run into problems if the completion_lock is contended, as we're
>>> doing a different lock ordering than the issue side is. Hence we have
>>> to do trylock for completion, and if that fails, go async. Poll removal
>>> needs to go async as well, for the same reason.
>>>
>>> eventfd notification needs special case as well, to avoid stack blowing
>>> recursion or deadlocks.
>>>
>>> These are all deficiencies that were inherited from the aio poll
>>> implementation, but I think we can do better. When a poll completes,
>>> simply queue it up in the task poll list. When the task completes the
>>> list, we can run dependent links inline as well. This means we never
>>> have to go async, and we can remove a bunch of code associated with
>>> that, and optimizations to try and make that run faster. The diffstat
>>> speaks for itself.
>> [...]
>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
>>> +static void io_poll_task_func(struct callback_head *cb)
>>>  {
>>> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
>>> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
>>> +       struct io_kiocb *nxt = NULL;
>>>
>> [...]
>>> +       io_poll_task_handler(req, &nxt);
>>> +       if (nxt)
>>> +               __io_queue_sqe(nxt, NULL);
>>
>> This can now get here from anywhere that calls schedule(), right?
>> Which means that this might almost double the required kernel stack
>> size, if one codepath exists that calls schedule() while near the
>> bottom of the stack and another codepath exists that goes from here
>> through the VFS and again uses a big amount of stack space? This is a
>> somewhat ugly suggestion, but I wonder whether it'd make sense to
>> check whether we've consumed over 25% of stack space, or something
>> like that, and if so, directly punt the request.
> 
> I'm still completely confused as to how io_uring works, and concequently
> the ramifications of all this.
> 
> But I thought to understand that these sched_work things were only
> queued on tasks that were stuck waiting on POLL (or it's io_uring
> equivalent). Earlier patches were explicitly running things from
> io_cqring_wait(), which might have given me this impression.

No, that is correct.

> The above seems to suggest this is not the case. Which then does indeed
> lead to all the worries expressed by Jann. All sorts of nasty nesting is
> possible with this.
> 
> Can someone please spell this out for me?

Let me try with an example - the tl;dr is that a task wants to e.g. read
from a socket, so it issues an io_uring recv(). We always do these
non-blocking; there's no data there, so the task gets -EAGAIN on the
attempt. What would happen in the previous code is the task would then
offload the recv() to a worker thread, and the worker thread would
block waiting on the receive. This is sub-optimal, in that it both
requires a thread offload and has a thread alive waiting for that data
to come in.

This, instead, arms a poll handler for the task. When we get notified of
data availability, we queue a work item that will then perform the
recv(). This is what is offloaded to the sched_work list currently.
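
Roughly, the flow looks like the sketch below. The helper names that
aren't visible elsewhere in this thread are made up for illustration,
and the argument lists are simplified:

static void io_recv_sketch(struct io_kiocb *req)
{
	/* 1) try the receive non-blocking first */
	int ret = io_do_recv_nonblock(req);		/* made-up helper */

	if (ret != -EAGAIN) {
		io_complete_req(req, ret);		/* made-up helper */
		return;
	}

	/*
	 * 2) No data yet: instead of punting to an io-wq thread, arm a
	 *    poll handler (__io_arm_poll_handler() in this series,
	 *    arguments simplified) whose waitqueue wake function
	 *    essentially does:
	 *
	 *		sched_work_add(req->task, &req->sched_work);
	 *		wake_up_process(req->task);
	 *
	 * 3) The originating task then runs the queued callback and
	 *    re-issues the recv() now that data is there.
	 */
	__io_arm_poll_handler(req);
}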

> Afaict the req->tsk=current thing is set for whomever happens to run
> io_poll_add_prep(), which is either a sys_io_uring_enter() or an io-wq
> thread afaict.
> 
> But I'm then unsure what happens to that thread afterwards.
> 
> Jens, what exactly is the benefit of running this on every random
> schedule() vs in io_cqring_wait() ? Or even, since io_cqring_wait() is
> the very last thing the syscall does, task_work.

I took a step back and I think we can just use the task work, which
makes this a lot less complicated in terms of locking and schedule
state. Ran some quick testing with the below and it works for me.

I'm going to re-spin based on this and just dump the sched_work
addition.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 81aa3959f326..413ac86d7882 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3529,7 +3529,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
 	 * the exit check will ultimately cancel these work items. Hence we
 	 * don't need to check here and handle it specifically.
 	 */
-	sched_work_add(tsk, &req->sched_work);
+	task_work_add(tsk, &req->sched_work, true);
 	wake_up_process(tsk);
 	return 1;
 }
@@ -5367,9 +5367,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	do {
 		if (io_cqring_events(ctx, false) >= min_events)
 			return 0;
-		if (!current->sched_work)
+		if (!current->task_works)
 			break;
-		sched_work_run();
+		task_work_run();
 	} while (1);
 
 	if (sig) {
@@ -5392,6 +5392,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 						TASK_INTERRUPTIBLE);
 		if (io_should_wake(&iowq, false))
 			break;
+		if (current->task_works) {
+			task_work_run();
+			if (io_should_wake(&iowq, false))
+				break;
+			continue;
+		}
 		schedule();
 		if (signal_pending(current)) {
 			ret = -EINTR;
@@ -6611,7 +6617,7 @@ static void __io_uring_cancel_task(struct task_struct *tsk,
 {
 	struct callback_head *head;
 
-	while ((head = sched_work_cancel(tsk, func)) != NULL) {
+	while ((head = task_work_cancel(tsk, func)) != NULL) {
 		struct io_kiocb *req;
 
 		req = container_of(head, struct io_kiocb, sched_work);
@@ -6886,7 +6892,7 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 		struct io_kiocb *req;
 
 		hlist_for_each_entry(req, list, hash_node)
-			seq_printf(m, "  req=%lx, op=%d, tsk list=%d\n", (long) req, req->opcode, req->task->sched_work != NULL);
+			seq_printf(m, "  req=%lx, op=%d, tsk list=%d\n", (long) req, req->opcode, req->task->task_works != NULL);
 	}
 	spin_unlock_irq(&ctx->completion_lock);
 	mutex_unlock(&ctx->uring_lock);
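
For reference, my reading of the task_work interface in this era's tree,
roughly (struct callback_head itself lives in linux/types.h; double-check
the details against the actual tree):

struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *head);
};

typedef void (*task_work_func_t)(struct callback_head *);

void init_task_work(struct callback_head *twork, task_work_func_t func);
int task_work_add(struct task_struct *task, struct callback_head *twork,
		  bool notify);
struct callback_head *task_work_cancel(struct task_struct *task,
				       task_work_func_t func);
void task_work_run(void);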

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 13:51   ` Pavel Begunkov
@ 2020-02-21 14:50     ` Jens Axboe
  2020-02-21 18:30       ` Pavel Begunkov
  0 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-21 14:50 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: glauber, peterz, Jann Horn

On 2/21/20 6:51 AM, Pavel Begunkov wrote:
> On 20/02/2020 23:31, Jens Axboe wrote:
>> For poll requests, it's not uncommon to link a read (or write) after
>> the poll to execute immediately after the file is marked as ready.
>> Since the poll completion is called inside the waitqueue wake up handler,
>> we have to punt that linked request to async context. This slows down
>> the processing, and actually means it's faster to not use a link for this
>> use case.
>>
>> We also run into problems if the completion_lock is contended, as we're
>> doing a different lock ordering than the issue side is. Hence we have
>> to do trylock for completion, and if that fails, go async. Poll removal
>> needs to go async as well, for the same reason.
>>
>> eventfd notification needs special case as well, to avoid stack blowing
>> recursion or deadlocks.
>>
>> These are all deficiencies that were inherited from the aio poll
>> implementation, but I think we can do better. When a poll completes,
>> simply queue it up in the task poll list. When the task completes the
>> list, we can run dependent links inline as well. This means we never
>> have to go async, and we can remove a bunch of code associated with
>> that, and optimizations to try and make that run faster. The diffstat
>> speaks for itself.
> 
> So, it piggybacks request execution onto a random task, that happens
> to complete a poll. Did I get it right?
> 
> I can't find where it setting right mm, creds, etc., or why it have
> them already.

Not a random task, the very task that initially tried to do the receive
(or whatever the operation may be). Hence there's no need to set
mm/creds/whatever, we're still running in the context of the original
task once we retry the operation after the poll signals readiness.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 14:49       ` Jens Axboe
@ 2020-02-21 15:02         ` Jann Horn
  2020-02-21 16:12           ` Peter Zijlstra
  2020-02-21 16:23         ` Peter Zijlstra
  1 sibling, 1 reply; 53+ messages in thread
From: Jann Horn @ 2020-02-21 15:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Peter Zijlstra, io-uring, Glauber Costa, Pavel Begunkov

On Fri, Feb 21, 2020 at 3:49 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 2/21/20 3:47 AM, Peter Zijlstra wrote:
> > On Thu, Feb 20, 2020 at 11:02:16PM +0100, Jann Horn wrote:
> >> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>> For poll requests, it's not uncommon to link a read (or write) after
> >>> the poll to execute immediately after the file is marked as ready.
> >>> Since the poll completion is called inside the waitqueue wake up handler,
> >>> we have to punt that linked request to async context. This slows down
> >>> the processing, and actually means it's faster to not use a link for this
> >>> use case.
> >>>
> >>> We also run into problems if the completion_lock is contended, as we're
> >>> doing a different lock ordering than the issue side is. Hence we have
> >>> to do trylock for completion, and if that fails, go async. Poll removal
> >>> needs to go async as well, for the same reason.
> >>>
> >>> eventfd notification needs special case as well, to avoid stack blowing
> >>> recursion or deadlocks.
> >>>
> >>> These are all deficiencies that were inherited from the aio poll
> >>> implementation, but I think we can do better. When a poll completes,
> >>> simply queue it up in the task poll list. When the task completes the
> >>> list, we can run dependent links inline as well. This means we never
> >>> have to go async, and we can remove a bunch of code associated with
> >>> that, and optimizations to try and make that run faster. The diffstat
> >>> speaks for itself.
> >> [...]
> >>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> >>> +static void io_poll_task_func(struct callback_head *cb)
> >>>  {
> >>> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> >>> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> >>> +       struct io_kiocb *nxt = NULL;
> >>>
> >> [...]
> >>> +       io_poll_task_handler(req, &nxt);
> >>> +       if (nxt)
> >>> +               __io_queue_sqe(nxt, NULL);
> >>
> >> This can now get here from anywhere that calls schedule(), right?
> >> Which means that this might almost double the required kernel stack
> >> size, if one codepath exists that calls schedule() while near the
> >> bottom of the stack and another codepath exists that goes from here
> >> through the VFS and again uses a big amount of stack space? This is a
> >> somewhat ugly suggestion, but I wonder whether it'd make sense to
> >> check whether we've consumed over 25% of stack space, or something
> >> like that, and if so, directly punt the request.
> >
> > I'm still completely confused as to how io_uring works, and concequently
> > the ramifications of all this.
> >
> > But I thought to understand that these sched_work things were only
> > queued on tasks that were stuck waiting on POLL (or it's io_uring
> > equivalent). Earlier patches were explicitly running things from
> > io_cqring_wait(), which might have given me this impression.
>
> No, that is correct.

Really? I was pretty sure that io_uring does not force the calling
thread to block in order for the io_uring operations to continue; and
isn't that the whole point?

I think that when Peter says "stuck waiting on POLL", he really means
"blocked in the context of sys_io_uring_enter() and can't go
anywhere"; while I think you interpret it as "has pending POLL work
queued up in the background and may decide to wait for it in
sys_io_uring_enter(), but might also be doing anything else".

> > The above seems to suggest this is not the case. Which then does indeed
> > lead to all the worries expressed by Jann. All sorts of nasty nesting is
> > possible with this.
> >
> > Can someone please spell this out for me?
>
> Let me try with an example - the tldr is that a task wants to eg read
> from a socket, it issues a io_uring recv() for example. We always do
> these non-blocking, there's no data there, the task gets -EAGAIN on the
> attempt. What would happen in the previous code is the task would then
> offload the recv() to a worker thread, and the worker thread would
> block waiting on the receive. This is sub-optimal, in that it both
> requires a thread offload and has a thread alive waiting for that data
> to come in.
>
> This, instead, arms a poll handler for the task.

And then returns to userspace, which can do whatever it wants, right?

> When we get notified of
> data availability, we queue a work item that will the perform the
> recv(). This is what is offloaded to the sched_work list currently.
>
> > Afaict the req->tsk=current thing is set for whomever happens to run
> > io_poll_add_prep(), which is either a sys_io_uring_enter() or an io-wq
> > thread afaict.
> >
> > But I'm then unsure what happens to that thread afterwards.
> >
> > Jens, what exactly is the benefit of running this on every random
> > schedule() vs in io_cqring_wait() ? Or even, since io_cqring_wait() is
> > the very last thing the syscall does, task_work.
>
> I took a step back and I think we can just use the task work, which
> makes this a lot less complicated in terms of locking and schedule
> state. Ran some quick testing with the below and it works for me.
>
> I'm going to re-spin based on this and just dump the sched_work
> addition.

Task work only runs on transitions between userspace and kernel, more
or less, right? I guess that means that you'd have to wake up the
ctx->cq_wait queue so that anyone who might e.g. be waiting for the
ring with select() or poll() or epoll or whatever gets woken up and
returns out of the polling syscall? Essentially an intentional
spurious notification to force the task to pick up work. And the
interaction with eventfds would then be a bit weird, I think, since
you may have to signal completion on an eventfd in order to get the
task to pick up the work; but then the work may block, and then stuff
is a bit weird.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 15:02         ` Jann Horn
@ 2020-02-21 16:12           ` Peter Zijlstra
  0 siblings, 0 replies; 53+ messages in thread
From: Peter Zijlstra @ 2020-02-21 16:12 UTC (permalink / raw)
  To: Jann Horn; +Cc: Jens Axboe, io-uring, Glauber Costa, Pavel Begunkov

On Fri, Feb 21, 2020 at 04:02:36PM +0100, Jann Horn wrote:
> On Fri, Feb 21, 2020 at 3:49 PM Jens Axboe <axboe@kernel.dk> wrote:
> > On 2/21/20 3:47 AM, Peter Zijlstra wrote:

> > > But I thought to understand that these sched_work things were only
> > > queued on tasks that were stuck waiting on POLL (or it's io_uring
> > > equivalent). Earlier patches were explicitly running things from
> > > io_cqring_wait(), which might have given me this impression.
> >
> > No, that is correct.
> 
> Really? I was pretty sure that io_uring does not force the calling
> thread to block on the io_uring operations to continue; and isn't that
> the whole point?
> 
> I think that when Peter says "stuck waiting on POLL", he really means
> "blocked in the context of sys_io_uring_enter() and can't go
> anywhere";

Exactly.

> while I think you interpret it as "has pending POLL work
> queued up in the background and may decide to wait for it in
> sys_io_uring_enter(), but might also be doing anything else".

In which case it can hit schedule() at some random point before it gets
to io_cqring_wait().

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 14:49       ` Jens Axboe
  2020-02-21 15:02         ` Jann Horn
@ 2020-02-21 16:23         ` Peter Zijlstra
  2020-02-21 20:13           ` Jens Axboe
  1 sibling, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2020-02-21 16:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jann Horn, io-uring, Glauber Costa, Pavel Begunkov

On Fri, Feb 21, 2020 at 06:49:16AM -0800, Jens Axboe wrote:

> > Jens, what exactly is the benefit of running this on every random
> > schedule() vs in io_cqring_wait() ? Or even, since io_cqring_wait() is
> > the very last thing the syscall does, task_work.
> 
> I took a step back and I think we can just use the task work, which
> makes this a lot less complicated in terms of locking and schedule
> state. Ran some quick testing with the below and it works for me.
> 
> I'm going to re-spin based on this and just dump the sched_work
> addition.

Awesome, simpler is better.

> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 81aa3959f326..413ac86d7882 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -3529,7 +3529,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
>  	 * the exit check will ultimately cancel these work items. Hence we
>  	 * don't need to check here and handle it specifically.
>  	 */
> -	sched_work_add(tsk, &req->sched_work);
> +	task_work_add(tsk, &req->sched_work, true);
>  	wake_up_process(tsk);
>  	return 1;
>  }
> @@ -5367,9 +5367,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>  	do {
>  		if (io_cqring_events(ctx, false) >= min_events)
>  			return 0;
> -		if (!current->sched_work)
> +		if (!current->task_works)
>  			break;
> -		sched_work_run();
> +		task_work_run();
>  	} while (1);
>  
>  	if (sig) {
> @@ -5392,6 +5392,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>  						TASK_INTERRUPTIBLE);
>  		if (io_should_wake(&iowq, false))
>  			break;
> +		if (current->task_works) {
> +			task_work_run();
> +			if (io_should_wake(&iowq, false))
> +				break;
> +			continue;
> +		}

		if (current->task_works)
			task_work_run();
		if (io_should_wake(&iowq, false))
			break;

doesn't work?

>  		schedule();
>  		if (signal_pending(current)) {
>  			ret = -EINTR;


Anyway, we need to be careful about the context where we call
task_work_run(), but afaict doing it here should be fine.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21  1:29               ` Jann Horn
@ 2020-02-21 17:32                 ` Jens Axboe
  2020-02-21 19:24                   ` Jann Horn
  0 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-21 17:32 UTC (permalink / raw)
  To: Jann Horn; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On 2/20/20 6:29 PM, Jann Horn wrote:
> On Fri, Feb 21, 2020 at 12:22 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 2/20/20 4:12 PM, Jann Horn wrote:
>>> On Fri, Feb 21, 2020 at 12:00 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 2/20/20 3:23 PM, Jann Horn wrote:
>>>>> On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>> On 2/20/20 3:02 PM, Jann Horn wrote:
>>>>>>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>> For poll requests, it's not uncommon to link a read (or write) after
>>>>>>>> the poll to execute immediately after the file is marked as ready.
>>>>>>>> Since the poll completion is called inside the waitqueue wake up handler,
>>>>>>>> we have to punt that linked request to async context. This slows down
>>>>>>>> the processing, and actually means it's faster to not use a link for this
>>>>>>>> use case.
>>> [...]
>>>>>>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
>>>>>>>> +static void io_poll_task_func(struct callback_head *cb)
>>>>>>>>  {
>>>>>>>> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
>>>>>>>> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
>>>>>>>> +       struct io_kiocb *nxt = NULL;
>>>>>>>>
>>>>>>> [...]
>>>>>>>> +       io_poll_task_handler(req, &nxt);
>>>>>>>> +       if (nxt)
>>>>>>>> +               __io_queue_sqe(nxt, NULL);
>>>>>>>
>>>>>>> This can now get here from anywhere that calls schedule(), right?
>>>>>>> Which means that this might almost double the required kernel stack
>>>>>>> size, if one codepath exists that calls schedule() while near the
>>>>>>> bottom of the stack and another codepath exists that goes from here
>>>>>>> through the VFS and again uses a big amount of stack space? This is a
>>>>>>> somewhat ugly suggestion, but I wonder whether it'd make sense to
>>>>>>> check whether we've consumed over 25% of stack space, or something
>>>>>>> like that, and if so, directly punt the request.
>>> [...]
>>>>>>> Also, can we recursively hit this point? Even if __io_queue_sqe()
>>>>>>> doesn't *want* to block, the code it calls into might still block on a
>>>>>>> mutex or something like that, at which point the mutex code would call
>>>>>>> into schedule(), which would then again hit sched_out_update() and get
>>>>>>> here, right? As far as I can tell, this could cause unbounded
>>>>>>> recursion.
>>>>>>
>>>>>> The sched_work items are pruned before being run, so that can't happen.
>>>>>
>>>>> And is it impossible for new ones to be added in the meantime if a
>>>>> second poll operation completes in the background just when we're
>>>>> entering __io_queue_sqe()?
>>>>
>>>> True, that can happen.
>>>>
>>>> I wonder if we just prevent the recursion whether we can ignore most
>>>> of it. Eg never process the sched_work list if we're not at the top
>>>> level, so to speak.
>>>>
>>>> This should also prevent the deadlock that you mentioned with FUSE
>>>> in the next email that just rolled in.
>>>
>>> But there the first ->read_iter could be from outside io_uring. So you
>>> don't just have to worry about nesting inside an already-running uring
>>> work; you also have to worry about nesting inside more or less
>>> anything else that might be holding mutexes. So I think you'd pretty
>>> much have to whitelist known-safe schedule() callers, or something
>>> like that.
>>
>> I'll see if I can come up with something for that. Ideally any issue
>> with IOCB_NOWAIT set should be honored, and trylock etc should be used.
> 
> Are you sure? For example, an IO operation typically copies data to
> userspace, which can take pagefaults. And those should be handled
> synchronously even with IOCB_NOWAIT set, right? And the page fault
> code can block on mutexes (like the mmap_sem) or even wait for a
> blocking filesystem operation (via file mappings) or for userspace
> (via userfaultfd or FUSE mappings).

Yeah that's a good point. The more I think about it, the less I think
the scheduler-invoked callback is going to work. We need to be able to
manage the context in which we are called; see the later messages on the
task_work usage instead.

>> But I don't think we can fully rely on that, we need something a bit
>> more solid...
>>
>>> Taking a step back: Do you know why this whole approach brings the
>>> kind of performance benefit you mentioned in the cover letter? 4x is a
>>> lot... Is it that expensive to take a trip through the scheduler?
>>> I wonder whether the performance numbers for the echo test would
>>> change if you commented out io_worker_spin_for_work()...
>>
>> If anything, I expect the spin removal to make it worse. There's really
>> no magic there on why it's faster, if you offload work to a thread that
>> is essentially sync, then you're going to take a huge hit in
>> performance. It's the difference between:
>>
>> 1) Queue work with thread, wake up thread
>> 2) Thread wakes, starts work, goes to sleep.
> 
> If we go to sleep here, then the other side hasn't yet sent us
> anything, so up to this point, it shouldn't have any impact on the
> measured throughput, right?
> 
>> 3) Data available, thread is woken, does work
> 
> This is the same in the other case: Data is available, the
> application's thread is woken and does the work.
> 
>> 4) Thread signals completion of work
> 
> And this is also basically the same, except that in the worker-thread
> case, we have to go through the scheduler to reach userspace, while
> with this patch series, we can signal "work is completed" and return
> to userspace without an extra trip through the scheduler.

There's a big difference between:

- Task needs to do work, task goes to sleep on it, task is woken

and

- Task needs to do work, task passes work to thread. Task goes to sleep.
  Thread wakes up, tries to do work, goes to sleep. Thread is woken,
  does work, notifies task. Task is woken up.

If you've ever worked with any sort of thread pool (userspace or otherwise),
this is painful, and particularly so when you're only keeping one
work item in flight. That kind of pipeline is rife with bubbles. If we
can have multiple items in flight, then we start to gain ground due to
the parallelism.

> I could imagine this optimization having some performance benefit, but
> I'm still sceptical about it buying a 4x benefit without some more
> complicated reason behind it.

I just re-ran the testing, this time on top of the current tree, where
instead of doing the task/sched_work_add() we simply queue for async.
This should be an even better case than before, since hopefully the
thread will not need to go to sleep to process the work; it'll complete
without blocking. For an echo test setup over a socket, this approach
yields about 45-48K requests per second. This, btw, is with the io-wq
spin removed. Using the callback method where the task itself does the
work, I get 175K-180K requests per second.
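
In per-request terms that's roughly:

	45-48K reqs/sec   ->  ~21-22 usec per request   (offload to io-wq)
	175-180K reqs/sec ->  ~5.6-5.7 usec per request (task runs the retry itself)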

>> versus just completing the work when it's ready and not having any
>> switches to a worker thread at all. As the cover letter mentions, the
>> single client case is a huge win, and that is of course the biggest win
>> because everything is idle. If the thread doing the offload can be kept
>> running, the gains become smaller as we're not paying those wake/sleep
>> penalties anymore.
> 
> I'd really like to see what the scheduler behavior looks like here,
> for this single-client echo test. I can imagine three cases (which I
> guess are probably going to be mixed because the scheduler moves tasks
> around; but I don't actually know much about how the scheduler works,
> so my guesses are probably not very helpful):
> 
> Case 1: Both the worker and the userspace task are on the same CPU. In
> this case, the worker will waste something on the order of 10000
> cycles for every message while userspace is runnable, unless the
> scheduler decides that the worker has spent so much time on the CPU
> that it should relinquish it to the userspace task. (You test for
> need_resched() in the busyloop, but AFAIK that just asks the scheduler
> whether it wants you to get off the CPU right now, not whether there
> are any other runnable tasks on the CPU at the moment.)
> Case 2: The worker and the userspace task are on different *physical*
> cores and don't share L1D and L2. This will cause a performance
> penalty due to constant cacheline bouncing.
> Case 3: The worker and the userspace task are on hyperthreads, so they
> share L1D and L2 and can run concurrently. Compared to the other two
> cases, this would probably work best, but I'm not sure whether the
> scheduler is smart enough to specifically target this behavior? (And
> if you're running inside a VM, or on a system without hyperthreading,
> this isn't even an option.)
> 
> So I wonder what things look like if you force the worker and the
> userspace task to run on the same CPU without any idle polling in the
> worker; or how it looks when you pin the worker and the userspace task
> on hyperthreads. And if that does make a difference, it might be worth
> considering whether the interaction between io_uring and the scheduler
> could be optimized.

We can probably get closer with smarter placement, but it's never going
to be anywhere near as fast as when the task itself does the work. For
the echo example, ideally you want the server and client on the same
CPU. And since it's a benchmark, both of them soak up the core; in my
testing I see about 55-60% backend CPU and 40-40% from the client. I
could affinitize the async thread to the same CPU, but we're just going
to be stealing cycles at that point.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 14:50     ` Jens Axboe
@ 2020-02-21 18:30       ` Pavel Begunkov
  2020-02-21 19:10         ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Pavel Begunkov @ 2020-02-21 18:30 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: glauber, peterz, Jann Horn

On 21/02/2020 17:50, Jens Axboe wrote:
> On 2/21/20 6:51 AM, Pavel Begunkov wrote:
>> On 20/02/2020 23:31, Jens Axboe wrote:
>>> For poll requests, it's not uncommon to link a read (or write) after
>>> the poll to execute immediately after the file is marked as ready.
>>> Since the poll completion is called inside the waitqueue wake up handler,
>>> we have to punt that linked request to async context. This slows down
>>> the processing, and actually means it's faster to not use a link for this
>>> use case.
>>>
>>> We also run into problems if the completion_lock is contended, as we're
>>> doing a different lock ordering than the issue side is. Hence we have
>>> to do trylock for completion, and if that fails, go async. Poll removal
>>> needs to go async as well, for the same reason.
>>>
>>> eventfd notification needs special case as well, to avoid stack blowing
>>> recursion or deadlocks.
>>>
>>> These are all deficiencies that were inherited from the aio poll
>>> implementation, but I think we can do better. When a poll completes,
>>> simply queue it up in the task poll list. When the task completes the
>>> list, we can run dependent links inline as well. This means we never
>>> have to go async, and we can remove a bunch of code associated with
>>> that, and optimizations to try and make that run faster. The diffstat
>>> speaks for itself.
>>
>> So, it piggybacks request execution onto a random task, that happens
>> to complete a poll. Did I get it right?
>>
>> I can't find where it setting right mm, creds, etc., or why it have
>> them already.
> 
> Not a random task, the very task that initially tried to do the receive
> (or whatever the operation may be). Hence there's no need to set
> mm/creds/whatever, we're still running in the context of the original
> task once we retry the operation after the poll signals readiness.

Got it. Then the retry may happen later, after we've returned from
__io_arm_poll_handler() and io_uring_enter(). By that time io_submit_sqes()
will already have restored the creds (i.e. the personality stuff) on the way back.
This might be a problem.

BTW, is it by design that all requests of a link use the personality creds
specified in the head's sqe?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 18:30       ` Pavel Begunkov
@ 2020-02-21 19:10         ` Jens Axboe
  2020-02-21 19:22           ` Pavel Begunkov
  2020-02-23  6:00           ` Jens Axboe
  0 siblings, 2 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-21 19:10 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: glauber, peterz, Jann Horn

On 2/21/20 11:30 AM, Pavel Begunkov wrote:
> On 21/02/2020 17:50, Jens Axboe wrote:
>> On 2/21/20 6:51 AM, Pavel Begunkov wrote:
>>> On 20/02/2020 23:31, Jens Axboe wrote:
>>>> For poll requests, it's not uncommon to link a read (or write) after
>>>> the poll to execute immediately after the file is marked as ready.
>>>> Since the poll completion is called inside the waitqueue wake up handler,
>>>> we have to punt that linked request to async context. This slows down
>>>> the processing, and actually means it's faster to not use a link for this
>>>> use case.
>>>>
>>>> We also run into problems if the completion_lock is contended, as we're
>>>> doing a different lock ordering than the issue side is. Hence we have
>>>> to do trylock for completion, and if that fails, go async. Poll removal
>>>> needs to go async as well, for the same reason.
>>>>
>>>> eventfd notification needs special case as well, to avoid stack blowing
>>>> recursion or deadlocks.
>>>>
>>>> These are all deficiencies that were inherited from the aio poll
>>>> implementation, but I think we can do better. When a poll completes,
>>>> simply queue it up in the task poll list. When the task completes the
>>>> list, we can run dependent links inline as well. This means we never
>>>> have to go async, and we can remove a bunch of code associated with
>>>> that, and optimizations to try and make that run faster. The diffstat
>>>> speaks for itself.
>>>
>>> So, it piggybacks request execution onto a random task, that happens
>>> to complete a poll. Did I get it right?
>>>
>>> I can't find where it setting right mm, creds, etc., or why it have
>>> them already.
>>
>> Not a random task, the very task that initially tried to do the receive
>> (or whatever the operation may be). Hence there's no need to set
>> mm/creds/whatever, we're still running in the context of the original
>> task once we retry the operation after the poll signals readiness.
> 
> Got it. Then, it may happen in the future after returning from
> __io_arm_poll_handler() and io_uring_enter(). And by that time io_submit_sqes()
> should have already restored creds (i.e. personality stuff) on the way back.
> This might be a problem.

Not sure I follow, can you elaborate? Just to be sure, the requests that
go through the poll handler will go through __io_queue_sqe() again. Oh I
guess your point is that that is one level below where we normally
assign the creds.

> BTW, Is it by design, that all requests of a link use personality creds
> specified in the head's sqe?

No, I think that's more by accident. We should make sure they use the
specified creds, regardless of the issue time. Care to clean that up?
Would probably help get it right for the poll case, too.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 19:10         ` Jens Axboe
@ 2020-02-21 19:22           ` Pavel Begunkov
  2020-02-23  6:00           ` Jens Axboe
  1 sibling, 0 replies; 53+ messages in thread
From: Pavel Begunkov @ 2020-02-21 19:22 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: glauber, peterz, Jann Horn

On 21/02/2020 22:10, Jens Axboe wrote:
> On 2/21/20 11:30 AM, Pavel Begunkov wrote:
>> On 21/02/2020 17:50, Jens Axboe wrote:
>>> On 2/21/20 6:51 AM, Pavel Begunkov wrote:
>>>> On 20/02/2020 23:31, Jens Axboe wrote:
>>>>> For poll requests, it's not uncommon to link a read (or write) after
>>>>> the poll to execute immediately after the file is marked as ready.
>>>>> Since the poll completion is called inside the waitqueue wake up handler,
>>>>> we have to punt that linked request to async context. This slows down
>>>>> the processing, and actually means it's faster to not use a link for this
>>>>> use case.
>>>>>
>>>>> We also run into problems if the completion_lock is contended, as we're
>>>>> doing a different lock ordering than the issue side is. Hence we have
>>>>> to do trylock for completion, and if that fails, go async. Poll removal
>>>>> needs to go async as well, for the same reason.
>>>>>
>>>>> eventfd notification needs special case as well, to avoid stack blowing
>>>>> recursion or deadlocks.
>>>>>
>>>>> These are all deficiencies that were inherited from the aio poll
>>>>> implementation, but I think we can do better. When a poll completes,
>>>>> simply queue it up in the task poll list. When the task completes the
>>>>> list, we can run dependent links inline as well. This means we never
>>>>> have to go async, and we can remove a bunch of code associated with
>>>>> that, and optimizations to try and make that run faster. The diffstat
>>>>> speaks for itself.
>>>>
>>>> So, it piggybacks request execution onto a random task, that happens
>>>> to complete a poll. Did I get it right?
>>>>
>>>> I can't find where it setting right mm, creds, etc., or why it have
>>>> them already.
>>>
>>> Not a random task, the very task that initially tried to do the receive
>>> (or whatever the operation may be). Hence there's no need to set
>>> mm/creds/whatever, we're still running in the context of the original
>>> task once we retry the operation after the poll signals readiness.
>>
>> Got it. Then, it may happen in the future after returning from
>> __io_arm_poll_handler() and io_uring_enter(). And by that time io_submit_sqes()
>> should have already restored creds (i.e. personality stuff) on the way back.
>> This might be a problem.
> 
> Not sure I follow, can you elaborate? Just to be sure, the requests that
> go through the poll handler will go through __io_queue_sqe() again. Oh I
> guess your point is that that is one level below where we normally
> assign the creds.

Yeah, exactly. Poll handler won't do the personality dancing, as it doesn't go
through io_submit_sqes().
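
Presumably the retry path would have to save/restore them itself, e.g.
(just a sketch around the retry; the req->work.creds field name is my
assumption from the personality support, not taken from this series):

	const struct cred *old_creds = NULL;

	/* switch to the request's personality creds, if it has any */
	if (req->work.creds && req->work.creds != current_cred())
		old_creds = override_creds(req->work.creds);

	__io_queue_sqe(req, NULL);

	if (old_creds)
		revert_creds(old_creds);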

> 
>> BTW, Is it by design, that all requests of a link use personality creds
>> specified in the head's sqe?
> 
> No, I think that's more by accident. We should make sure they use the
> specified creds, regardless of the issue time. Care to clean that up?
> Would probably help get it right for the poll case, too.

Ok, I'll prepare

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 17:32                 ` Jens Axboe
@ 2020-02-21 19:24                   ` Jann Horn
  2020-02-21 20:18                     ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Jann Horn @ 2020-02-21 19:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On Fri, Feb 21, 2020 at 6:32 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 2/20/20 6:29 PM, Jann Horn wrote:
> > On Fri, Feb 21, 2020 at 12:22 AM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 2/20/20 4:12 PM, Jann Horn wrote:
> >>> On Fri, Feb 21, 2020 at 12:00 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> On 2/20/20 3:23 PM, Jann Horn wrote:
> >>>>> On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>> On 2/20/20 3:02 PM, Jann Horn wrote:
> >>>>>>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>>> For poll requests, it's not uncommon to link a read (or write) after
> >>>>>>>> the poll to execute immediately after the file is marked as ready.
> >>>>>>>> Since the poll completion is called inside the waitqueue wake up handler,
> >>>>>>>> we have to punt that linked request to async context. This slows down
> >>>>>>>> the processing, and actually means it's faster to not use a link for this
> >>>>>>>> use case.
> >>> [...]
> >>>>>>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> >>>>>>>> +static void io_poll_task_func(struct callback_head *cb)
> >>>>>>>>  {
> >>>>>>>> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> >>>>>>>> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> >>>>>>>> +       struct io_kiocb *nxt = NULL;
> >>>>>>>>
> >>>>>>> [...]
> >>>>>>>> +       io_poll_task_handler(req, &nxt);
> >>>>>>>> +       if (nxt)
> >>>>>>>> +               __io_queue_sqe(nxt, NULL);
> >>>>>>>
> >>>>>>> This can now get here from anywhere that calls schedule(), right?
> >>>>>>> Which means that this might almost double the required kernel stack
> >>>>>>> size, if one codepath exists that calls schedule() while near the
> >>>>>>> bottom of the stack and another codepath exists that goes from here
> >>>>>>> through the VFS and again uses a big amount of stack space? This is a
> >>>>>>> somewhat ugly suggestion, but I wonder whether it'd make sense to
> >>>>>>> check whether we've consumed over 25% of stack space, or something
> >>>>>>> like that, and if so, directly punt the request.
> >>> [...]
> >>>>>>> Also, can we recursively hit this point? Even if __io_queue_sqe()
> >>>>>>> doesn't *want* to block, the code it calls into might still block on a
> >>>>>>> mutex or something like that, at which point the mutex code would call
> >>>>>>> into schedule(), which would then again hit sched_out_update() and get
> >>>>>>> here, right? As far as I can tell, this could cause unbounded
> >>>>>>> recursion.
> >>>>>>
> >>>>>> The sched_work items are pruned before being run, so that can't happen.
> >>>>>
> >>>>> And is it impossible for new ones to be added in the meantime if a
> >>>>> second poll operation completes in the background just when we're
> >>>>> entering __io_queue_sqe()?
> >>>>
> >>>> True, that can happen.
> >>>>
> >>>> I wonder if we just prevent the recursion whether we can ignore most
> >>>> of it. Eg never process the sched_work list if we're not at the top
> >>>> level, so to speak.
> >>>>
> >>>> This should also prevent the deadlock that you mentioned with FUSE
> >>>> in the next email that just rolled in.
> >>>
> >>> But there the first ->read_iter could be from outside io_uring. So you
> >>> don't just have to worry about nesting inside an already-running uring
> >>> work; you also have to worry about nesting inside more or less
> >>> anything else that might be holding mutexes. So I think you'd pretty
> >>> much have to whitelist known-safe schedule() callers, or something
> >>> like that.
> >>
> >> I'll see if I can come up with something for that. Ideally any issue
> >> with IOCB_NOWAIT set should be honored, and trylock etc should be used.
> >
> > Are you sure? For example, an IO operation typically copies data to
> > userspace, which can take pagefaults. And those should be handled
> > synchronously even with IOCB_NOWAIT set, right? And the page fault
> > code can block on mutexes (like the mmap_sem) or even wait for a
> > blocking filesystem operation (via file mappings) or for userspace
> > (via userfaultfd or FUSE mappings).
>
> Yeah that's a good point. The more I think about it, the less I think
> the scheduler invoked callback is going to work. We need to be able to
> manage the context of when we are called, see later messages on the
> task_work usage instead.
>
> >> But I don't think we can fully rely on that, we need something a bit
> >> more solid...
> >>
> >>> Taking a step back: Do you know why this whole approach brings the
> >>> kind of performance benefit you mentioned in the cover letter? 4x is a
> >>> lot... Is it that expensive to take a trip through the scheduler?
> >>> I wonder whether the performance numbers for the echo test would
> >>> change if you commented out io_worker_spin_for_work()...
> >>
> >> If anything, I expect the spin removal to make it worse. There's really
> >> no magic there on why it's faster, if you offload work to a thread that
> >> is essentially sync, then you're going to take a huge hit in
> >> performance. It's the difference between:
> >>
> >> 1) Queue work with thread, wake up thread
> >> 2) Thread wakes, starts work, goes to sleep.
> >
> > If we go to sleep here, then the other side hasn't yet sent us
> > anything, so up to this point, it shouldn't have any impact on the
> > measured throughput, right?
> >
> >> 3) Data available, thread is woken, does work
> >
> > This is the same in the other case: Data is available, the
> > application's thread is woken and does the work.
> >
> >> 4) Thread signals completion of work
> >
> > And this is also basically the same, except that in the worker-thread
> > case, we have to go through the scheduler to reach userspace, while
> > with this patch series, we can signal "work is completed" and return
> > to userspace without an extra trip through the scheduler.
>
> There's a big difference between:
>
> - Task needs to do work, task goes to sleep on it, task is woken
>
> and
>
> - Task needs to do work, task passes work to thread. Task goes to sleep.
>   Thread wakes up, tries to do work, goes to sleep. Thread is woken,
>   does work, notifies task. Task is woken up.
>
> If you've ever done any sort of thread poll (userspace or otherwise),
> this is painful, and particularly so when you're only keeping one
> work item in flight. That kind of pipeline is rife with bubbles. If we
> can have multiple items in flight, then we start to gain ground due to
> the parallelism.
>
> > I could imagine this optimization having some performance benefit, but
> > I'm still sceptical about it buying a 4x benefit without some more
> > complicated reason behind it.
>
> I just re-ran the testing, this time on top of the current tree, where
> instead of doing the task/sched_work_add() we simply queue for async.
> This should be an even better case than before, since hopefully the
> thread will not need to go to sleep to process the work, it'll complete
> without blocking. For an echo test setup over a socket, this approach
> yields about 45-48K requests per second. This, btw, is with the io-wq
> spin removed. Using the callback method where the task itself does the
> work, 175K-180K requests per second.

Huh. So that's like, what, somewhere on the order of 7.6 microseconds
or somewhere around 15000 cycles overhead for shoving a request
completion event from worker context over to a task, assuming that
you're running at something around 2GHz? Well, I guess that's a little
more than twice as much time as it takes to switch from one blocked
thread to another via eventfd (including overhead from syscall and CPU
mitigations and stuff), so I guess it's not completely unreasonable...
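
(Rough arithmetic behind that estimate: (1/48K - 1/180K) / 2 is about
(20.8 - 5.6) / 2 ~= 7.6 usec per request, if each echo round trip counts
as a recv+send pair - that split is an assumption - and 7.6 usec * 2 GHz
~= 15,200 cycles.)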

Anyway, I'll stop nagging about this since it sounds like you're going
to implement this in a less unorthodox way now. ^^

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 16:23         ` Peter Zijlstra
@ 2020-02-21 20:13           ` Jens Axboe
  0 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-21 20:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jann Horn, io-uring, Glauber Costa, Pavel Begunkov

On 2/21/20 9:23 AM, Peter Zijlstra wrote:
> On Fri, Feb 21, 2020 at 06:49:16AM -0800, Jens Axboe wrote:
> 
>>> Jens, what exactly is the benefit of running this on every random
>>> schedule() vs in io_cqring_wait() ? Or even, since io_cqring_wait() is
>>> the very last thing the syscall does, task_work.
>>
>> I took a step back and I think we can just use the task work, which
>> makes this a lot less complicated in terms of locking and schedule
>> state. Ran some quick testing with the below and it works for me.
>>
>> I'm going to re-spin based on this and just dump the sched_work
>> addition.
> 
> Awesome, simpler is better.

Agree!

>> @@ -5392,6 +5392,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>>  						TASK_INTERRUPTIBLE);
>>  		if (io_should_wake(&iowq, false))
>>  			break;
>> +		if (current->task_works) {
>> +			task_work_run();
>> +			if (io_should_wake(&iowq, false))
>> +				break;
>> +			continue;
>> +		}
> 
> 		if (current->task_works)
> 			task_work_run();
> 		if (io_should_wake(&iowq, false))
> 			break;
> 
> doesn't work?

Yeah it totally does, I'll make that change. Not sure what I was
thinking there.
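
Roughly, the resulting wait loop body would then read something like
this (just a sketch against the hunk quoted above; the surrounding loop
shape and the prepare_to_wait_exclusive() call are assumed unchanged):

	do {
		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
						TASK_INTERRUPTIBLE);
		/* run pending task_work before deciding whether to sleep */
		if (current->task_works)
			task_work_run();
		if (io_should_wake(&iowq, false))
			break;
		schedule();
		if (signal_pending(current)) {
			ret = -EINTR;
			break;
		}
	} while (1);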

>>  		schedule();
>>  		if (signal_pending(current)) {
>>  			ret = -EINTR;
> 
> 
> Anyway, we need to be careful about the context where we call
> task_work_run(), but afaict doing it here should be fine.

Right, that's the main win over the sched in/out approach. On clean
entry to a system call we should be fine, I added it for
io_uring_enter() as well. We're not holding anything at that point.
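
Roughly, at the top of io_uring_enter() that amounts to the below
(sketch only, the placement is what matters, before we grab any ring
references):

	/* clean syscall entry, nothing held: drain pending task_work */
	if (current->task_works)
		task_work_run();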

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 19:24                   ` Jann Horn
@ 2020-02-21 20:18                     ` Jens Axboe
  0 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-21 20:18 UTC (permalink / raw)
  To: Jann Horn; +Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov

On 2/21/20 12:24 PM, Jann Horn wrote:
> On Fri, Feb 21, 2020 at 6:32 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 2/20/20 6:29 PM, Jann Horn wrote:
>>> On Fri, Feb 21, 2020 at 12:22 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 2/20/20 4:12 PM, Jann Horn wrote:
>>>>> On Fri, Feb 21, 2020 at 12:00 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>> On 2/20/20 3:23 PM, Jann Horn wrote:
>>>>>>> On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>> On 2/20/20 3:02 PM, Jann Horn wrote:
>>>>>>>>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>> For poll requests, it's not uncommon to link a read (or write) after
>>>>>>>>>> the poll to execute immediately after the file is marked as ready.
>>>>>>>>>> Since the poll completion is called inside the waitqueue wake up handler,
>>>>>>>>>> we have to punt that linked request to async context. This slows down
>>>>>>>>>> the processing, and actually means it's faster to not use a link for this
>>>>>>>>>> use case.
>>>>> [...]
>>>>>>>>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
>>>>>>>>>> +static void io_poll_task_func(struct callback_head *cb)
>>>>>>>>>>  {
>>>>>>>>>> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
>>>>>>>>>> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
>>>>>>>>>> +       struct io_kiocb *nxt = NULL;
>>>>>>>>>>
>>>>>>>>> [...]
>>>>>>>>>> +       io_poll_task_handler(req, &nxt);
>>>>>>>>>> +       if (nxt)
>>>>>>>>>> +               __io_queue_sqe(nxt, NULL);
>>>>>>>>>
>>>>>>>>> This can now get here from anywhere that calls schedule(), right?
>>>>>>>>> Which means that this might almost double the required kernel stack
>>>>>>>>> size, if one codepath exists that calls schedule() while near the
>>>>>>>>> bottom of the stack and another codepath exists that goes from here
>>>>>>>>> through the VFS and again uses a big amount of stack space? This is a
>>>>>>>>> somewhat ugly suggestion, but I wonder whether it'd make sense to
>>>>>>>>> check whether we've consumed over 25% of stack space, or something
>>>>>>>>> like that, and if so, directly punt the request.
>>>>> [...]
>>>>>>>>> Also, can we recursively hit this point? Even if __io_queue_sqe()
>>>>>>>>> doesn't *want* to block, the code it calls into might still block on a
>>>>>>>>> mutex or something like that, at which point the mutex code would call
>>>>>>>>> into schedule(), which would then again hit sched_out_update() and get
>>>>>>>>> here, right? As far as I can tell, this could cause unbounded
>>>>>>>>> recursion.
>>>>>>>>
>>>>>>>> The sched_work items are pruned before being run, so that can't happen.
>>>>>>>
>>>>>>> And is it impossible for new ones to be added in the meantime if a
>>>>>>> second poll operation completes in the background just when we're
>>>>>>> entering __io_queue_sqe()?
>>>>>>
>>>>>> True, that can happen.
>>>>>>
>>>>>> I wonder, if we just prevent the recursion, whether we can ignore most
>>>>>> of it. E.g. never process the sched_work list if we're not at the top
>>>>>> level, so to speak.
>>>>>>
>>>>>> This should also prevent the deadlock that you mentioned with FUSE
>>>>>> in the next email that just rolled in.
>>>>>
>>>>> But there the first ->read_iter could be from outside io_uring. So you
>>>>> don't just have to worry about nesting inside an already-running uring
>>>>> work; you also have to worry about nesting inside more or less
>>>>> anything else that might be holding mutexes. So I think you'd pretty
>>>>> much have to whitelist known-safe schedule() callers, or something
>>>>> like that.
>>>>
>>>> I'll see if I can come up with something for that. Ideally any issue
>>>> with IOCB_NOWAIT set should be honored, and trylock etc should be used.
>>>
>>> Are you sure? For example, an IO operation typically copies data to
>>> userspace, which can take pagefaults. And those should be handled
>>> synchronously even with IOCB_NOWAIT set, right? And the page fault
>>> code can block on mutexes (like the mmap_sem) or even wait for a
>>> blocking filesystem operation (via file mappings) or for userspace
>>> (via userfaultfd or FUSE mappings).
>>
>> Yeah that's a good point. The more I think about it, the less I think
>> the scheduler invoked callback is going to work. We need to be able to
>> manage the context of when we are called, see later messages on the
>> task_work usage instead.
>>
>>>> But I don't think we can fully rely on that, we need something a bit
>>>> more solid...
>>>>
>>>>> Taking a step back: Do you know why this whole approach brings the
>>>>> kind of performance benefit you mentioned in the cover letter? 4x is a
>>>>> lot... Is it that expensive to take a trip through the scheduler?
>>>>> I wonder whether the performance numbers for the echo test would
>>>>> change if you commented out io_worker_spin_for_work()...
>>>>
>>>> If anything, I expect the spin removal to make it worse. There's really
>>>> no magic there on why it's faster, if you offload work to a thread that
>>>> is essentially sync, then you're going to take a huge hit in
>>>> performance. It's the difference between:
>>>>
>>>> 1) Queue work with thread, wake up thread
>>>> 2) Thread wakes, starts work, goes to sleep.
>>>
>>> If we go to sleep here, then the other side hasn't yet sent us
>>> anything, so up to this point, it shouldn't have any impact on the
>>> measured throughput, right?
>>>
>>>> 3) Data available, thread is woken, does work
>>>
>>> This is the same in the other case: Data is available, the
>>> application's thread is woken and does the work.
>>>
>>>> 4) Thread signals completion of work
>>>
>>> And this is also basically the same, except that in the worker-thread
>>> case, we have to go through the scheduler to reach userspace, while
>>> with this patch series, we can signal "work is completed" and return
>>> to userspace without an extra trip through the scheduler.
>>
>> There's a big difference between:
>>
>> - Task needs to do work, task goes to sleep on it, task is woken
>>
>> and
>>
>> - Task needs to do work, task passes work to thread. Task goes to sleep.
>>   Thread wakes up, tries to do work, goes to sleep. Thread is woken,
>>   does work, notifies task. Task is woken up.
>>
>> If you've ever done any sort of thread pool (userspace or otherwise),
>> this is painful, and particularly so when you're only keeping one
>> work item in flight. That kind of pipeline is rife with bubbles. If we
>> can have multiple items in flight, then we start to gain ground due to
>> the parallelism.
>>
>>> I could imagine this optimization having some performance benefit, but
>>> I'm still sceptical about it buying a 4x benefit without some more
>>> complicated reason behind it.
>>
>> I just re-ran the testing, this time on top of the current tree, where
>> instead of doing the task/sched_work_add() we simply queue for async.
>> This should be an even better case than before, since hopefully the
>> thread will not need to go to sleep to process the work, it'll complete
>> without blocking. For an echo test setup over a socket, this approach
>> yields about 45-48K requests per second. This, btw, is with the io-wq
>> spin removed. Using the callback method where the task itself does the
>> work, 175K-180K requests per second.
> 
> Huh. So that's like, what, somewhere on the order of 7.6 microseconds
> or somewhere around 15000 cycles overhead for shoving a request
> completion event from worker context over to a task, assuming that
> you're running at something around 2GHz? Well, I guess that's a little
> more than twice as much time as it takes to switch from one blocked
> thread to another via eventfd (including overhead from syscall and CPU
> mitigations and stuff), so I guess it's not completely unreasonable...

This is on my laptop, running the kernel in kvm for testing. So it's not
a beefy setup:

Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz

> Anyway, I'll stop nagging about this since it sounds like you're going
> to implement this in a less unorthodox way now. ^^

I'll post the updated series later today, processing off ->task_works
instead. I do agree that this is much saner than trying to entangle task
state on schedule() entry/exit, and it seems to work just as well in my
testing.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-21 19:10         ` Jens Axboe
  2020-02-21 19:22           ` Pavel Begunkov
@ 2020-02-23  6:00           ` Jens Axboe
  2020-02-23  6:26             ` Jens Axboe
  1 sibling, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-23  6:00 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: glauber, peterz, Jann Horn

On 2/21/20 12:10 PM, Jens Axboe wrote:
>> Got it. Then, it may happen in the future after returning from
>> __io_arm_poll_handler() and io_uring_enter(). And by that time io_submit_sqes()
>> should have already restored creds (i.e. personality stuff) on the way back.
>> This might be a problem.
> 
> Not sure I follow, can you elaborate? Just to be sure, the requests that
> go through the poll handler will go through __io_queue_sqe() again. Oh I
> guess your point is that that is one level below where we normally
> assign the creds.

Fixed this one.

>> BTW, Is it by design, that all requests of a link use personality creds
>> specified in the head's sqe?
> 
> No, I think that's more by accident. We should make sure they use the
> specified creds, regardless of the issue time. Care to clean that up?
> Would probably help get it right for the poll case, too.

Took a look at this, and I think you're wrong. Every iteration of
io_submit_sqe() will lookup the right creds, and assign them to the
current task in case we're going to issue it. In the case of a link
where we already have the head, then we grab the current work
environment. This means assigning req->work.creds from
get_current_cred(), if not set, and these are the credentials we looked
up already.
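
In sketch form, the grab-env side of that is just the below (not the
full helper, and the helper name may differ):

	/* remember the submitting task's creds, taking a reference */
	if (!req->work.creds)
		req->work.creds = get_current_cred();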

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-23  6:00           ` Jens Axboe
@ 2020-02-23  6:26             ` Jens Axboe
  2020-02-23 11:02               ` Pavel Begunkov
  0 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-23  6:26 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: glauber, peterz, Jann Horn

On 2/22/20 11:00 PM, Jens Axboe wrote:
> On 2/21/20 12:10 PM, Jens Axboe wrote:
>>> Got it. Then, it may happen in the future after returning from
>>> __io_arm_poll_handler() and io_uring_enter(). And by that time io_submit_sqes()
>>> should have already restored creds (i.e. personality stuff) on the way back.
>>> This might be a problem.
>>
>> Not sure I follow, can you elaborate? Just to be sure, the requests that
>> go through the poll handler will go through __io_queue_sqe() again. Oh I
>> guess your point is that that is one level below where we normally
>> assign the creds.
> 
> Fixed this one.
> 
>>> BTW, Is it by design, that all requests of a link use personality creds
>>> specified in the head's sqe?
>>
>> No, I think that's more by accident. We should make sure they use the
>> specified creds, regardless of the issue time. Care to clean that up?
>> Would probably help get it right for the poll case, too.
> 
> Took a look at this, and I think you're wrong. Every iteration of
> io_submit_sqe() will lookup the right creds, and assign them to the
> current task in case we're going to issue it. In the case of a link
> where we already have the head, then we grab the current work
> environment. This means assigning req->work.creds from
> get_current_cred(), if not set, and these are the credentials we looked
> up already.

What does look wrong is that we don't restore the right credentials for
queuing the head, so basically the opposite problem. Something like the
below should fix that.


commit b94ddeebd4d068d9205b319179974e09da2591fd
Author: Jens Axboe <axboe@kernel.dk>
Date:   Sat Feb 22 23:22:19 2020 -0700

    io_uring: handle multiple personalities in link chains
    
    If we have a chain of requests and they don't all use the same
    credentials, then the head of the chain will be issued with the
    credentials of the tail of the chain.
    
    Ensure __io_queue_sqe() overrides the credentials, if they are different.
    
    Fixes: 75c6a03904e0 ("io_uring: support using a registered personality for commands")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index de650df9ac53..59024e4757d6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4705,11 +4705,18 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_kiocb *linked_timeout;
 	struct io_kiocb *nxt = NULL;
+	const struct cred *old_creds = NULL;
 	int ret;
 
 again:
 	linked_timeout = io_prep_linked_timeout(req);
 
+	if (req->work.creds && req->work.creds != get_current_cred()) {
+		if (old_creds)
+			revert_creds(old_creds);
+		old_creds = override_creds(req->work.creds);
+	}
+
 	ret = io_issue_sqe(req, sqe, &nxt, true);
 
 	/*
@@ -4759,6 +4766,8 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 			goto punt;
 		goto again;
 	}
+	if (old_creds)
+		revert_creds(old_creds);
 }
 
 static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-23  6:26             ` Jens Axboe
@ 2020-02-23 11:02               ` Pavel Begunkov
  2020-02-23 14:49                 ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Pavel Begunkov @ 2020-02-23 11:02 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: glauber, peterz, Jann Horn

On 23/02/2020 09:26, Jens Axboe wrote:
> On 2/22/20 11:00 PM, Jens Axboe wrote:
>> On 2/21/20 12:10 PM, Jens Axboe wrote:
>>>> Got it. Then, it may happen in the future after returning from
>>>> __io_arm_poll_handler() and io_uring_enter(). And by that time io_submit_sqes()
>>>> should have already restored creds (i.e. personality stuff) on the way back.
>>>> This might be a problem.
>>>
>>> Not sure I follow, can you elaborate? Just to be sure, the requests that
>>> go through the poll handler will go through __io_queue_sqe() again. Oh I
>>> guess your point is that that is one level below where we normally
>>> assign the creds.
>>
>> Fixed this one.

Looking at

io_async_task_func() {
	...
	/* ensure req->work.creds is valid for __io_queue_sqe() */
	req->work.creds = apoll->work.creds;
}

It copies creds, but doesn't touch the rest of the req->work fields. And if you
have
one, you most probably got all of them in *grab_env(). Are you sure it doesn't
leak, e.g. mmgrab()'ed mm?


>>
>>>> BTW, Is it by design, that all requests of a link use personality creds
>>>> specified in the head's sqe?
>>>
>>> No, I think that's more by accident. We should make sure they use the
>>> specified creds, regardless of the issue time. Care to clean that up?
>>> Would probably help get it right for the poll case, too.
>>
>> Took a look at this, and I think you're wrong. Every iteration of
>> io_submit_sqe() will lookup the right creds, and assign them to the
>> current task in case we're going to issue it. In the case of a link
>> where we already have the head, then we grab the current work
>> environment. This means assigning req->work.creds from
>> get_current_cred(), if not set, and these are the credentials we looked
>> up already.

Yeah, I've spotted that there was something wrong, but never looked into it properly.

> 
> What does look wrong is that we don't restore the right credentials for
> queuing the head, so basically the opposite problem. Something like the
> below should fix that.
> index de650df9ac53..59024e4757d6 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -4705,11 +4705,18 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  {
>  	struct io_kiocb *linked_timeout;
>  	struct io_kiocb *nxt = NULL;
> +	const struct cred *old_creds = NULL;
>  	int ret;
>  
>  again:
>  	linked_timeout = io_prep_linked_timeout(req);
>  
> +	if (req->work.creds && req->work.creds != get_current_cred()) {

get_current_cred() gets a ref.
See my attempt below, it fixes miscount, and should work better for cases
changing back to initial creds (i.e. personality 0)
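
To spell out the difference (sketch, per the cred API):

	const struct cred *a = current_cred();	    /* no reference taken */
	const struct cred *b = get_current_cred();  /* takes a reference... */

	put_cred(b);				    /* ...which must be dropped */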

Anyway, creds handling is too scattered across the code, and this does a lot of
useless refcounting and bouncing. It's better to find it a better place in the
near future.

> +		if (old_creds)
> +			revert_creds(old_creds);
> +		old_creds = override_creds(req->work.creds);
> +	}
> +
>  	ret = io_issue_sqe(req, sqe, &nxt, true);
>  
>  	/*
> @@ -4759,6 +4766,8 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  			goto punt;
>  		goto again;
>  	}
> +	if (old_creds)
> +		revert_creds(old_creds);
>  }
>  
>  static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> 

diff --git a/fs/io_uring.c b/fs/io_uring.c
index de650df9ac53..dc06298abb37 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4705,11 +4705,21 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_kiocb *linked_timeout;
 	struct io_kiocb *nxt = NULL;
+	const struct cred *old_creds = NULL;
 	int ret;

 again:
 	linked_timeout = io_prep_linked_timeout(req);

+	if (req->work.creds && req->work.creds != current_cred()) {
+		if (old_creds)
+			revert_creds(old_creds);
+		if (old_creds == req->work.creds)
+			old_creds = NULL; /* restored original creds */
+		else
+			old_creds = override_creds(req->work.creds);
+	}
+
 	ret = io_issue_sqe(req, sqe, &nxt, true);

 	/*
@@ -4759,6 +4769,8 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 			goto punt;
 		goto again;
 	}
+	if (old_creds)
+		revert_creds(old_creds);
 }

 static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)

-- 
Pavel Begunkov

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-23 11:02               ` Pavel Begunkov
@ 2020-02-23 14:49                 ` Jens Axboe
  2020-02-23 14:58                   ` Jens Axboe
  2020-02-23 17:55                   ` Pavel Begunkov
  0 siblings, 2 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-23 14:49 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: glauber, peterz, Jann Horn

On 2/23/20 4:02 AM, Pavel Begunkov wrote:
> On 23/02/2020 09:26, Jens Axboe wrote:
>> On 2/22/20 11:00 PM, Jens Axboe wrote:
>>> On 2/21/20 12:10 PM, Jens Axboe wrote:
>>>>> Got it. Then, it may happen in the future after returning from
>>>>> __io_arm_poll_handler() and io_uring_enter(). And by that time io_submit_sqes()
>>>>> should have already restored creds (i.e. personality stuff) on the way back.
>>>>> This might be a problem.
>>>>
>>>> Not sure I follow, can you elaborate? Just to be sure, the requests that
>>>> go through the poll handler will go through __io_queue_sqe() again. Oh I
>>>> guess your point is that that is one level below where we normally
>>>> assign the creds.
>>>
>>> Fixed this one.
> 
> Looking at
> 
> io_async_task_func() {
> 	...
> 	/* ensure req->work.creds is valid for __io_queue_sqe() */
> 	req->work.creds = apoll->work.creds;
> }
> 
> It copies creds, but doesn't touch the rest of the req->work fields. And if you
> have
> one, you most probably got all of them in *grab_env(). Are you sure it doesn't
> leak, e.g. mmgrab()'ed mm?

You're looking at a version that only existed for about 20 min, had to
check I pushed it out. But ce21471abe0fef is the current one, it does
a full memcpy() of it.
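
I.e. something along these lines (sketch, the actual commit may differ
in the details):

	/* carry over the whole prepared work item, not just ->creds, so
	 * any grabbed state (mm, files, creds) moves along with it */
	memcpy(&req->work, &apoll->work, sizeof(req->work));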

>>>>> BTW, Is it by design, that all requests of a link use personality creds
>>>>> specified in the head's sqe?
>>>>
>>>> No, I think that's more by accident. We should make sure they use the
>>>> specified creds, regardless of the issue time. Care to clean that up?
>>>> Would probably help get it right for the poll case, too.
>>>
>>> Took a look at this, and I think you're wrong. Every iteration of
>>> io_submit_sqe() will lookup the right creds, and assign them to the
>>> current task in case we're going to issue it. In the case of a link
>>> where we already have the head, then we grab the current work
>>> environment. This means assigning req->work.creds from
>>> get_current_cred(), if not set, and these are the credentials we looked
>>> up already.
> 
> Yeah, I've spotted that there was something wrong, but never looked into it properly.

And thanks for that!

>> What does look wrong is that we don't restore the right credentials for
>> queuing the head, so basically the opposite problem. Something like the
>> below should fix that.
>> index de650df9ac53..59024e4757d6 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -4705,11 +4705,18 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>  {
>>  	struct io_kiocb *linked_timeout;
>>  	struct io_kiocb *nxt = NULL;
>> +	const struct cred *old_creds = NULL;
>>  	int ret;
>>  
>>  again:
>>  	linked_timeout = io_prep_linked_timeout(req);
>>  
>> +	if (req->work.creds && req->work.creds != get_current_cred()) {
> 
> get_current_cred() gets a ref.

Oops yes

> See my attempt below, it fixes miscount, and should work better for
> cases changing back to initial creds (i.e. personality 0)

Thanks, I'll fold this in, if you don't mind.

> Anyway, creds handling is too scattered across the code, and this does a
> lot of useless refcounting and bouncing. It's better to find it a
> better place in the near future.

I think a good cleanup on top of this would be to move the personality
lookup to io_req_defer_prep(), and kill it from io_submit_sqe(). Now
__io_issue_sqe() does the right thing, and it'll just fall out nicely
with that as far as I can tell.

Care to send a patch for that?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-23 14:49                 ` Jens Axboe
@ 2020-02-23 14:58                   ` Jens Axboe
  2020-02-23 15:07                     ` Jens Axboe
  2020-02-23 17:55                   ` Pavel Begunkov
  1 sibling, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-23 14:58 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: glauber, peterz, Jann Horn

On 2/23/20 7:49 AM, Jens Axboe wrote:
>> Anyway, creds handling is too scattered across the code, and this does a
>> lot of useless refcounting and bouncing. It's better to find it a
>> better place in the near future.
> 
> I think a good cleanup on top of this would be to move the personality
> lookup to io_req_defer_prep(), and kill it from io_submit_sqe(). Now
> __io_issue_sqe() does the right thing, and it'll just fall out nicely
> with that as far as I can tell.
> 
> Care to send a patch for that?

Since we also need it for non-deferral, how about just leaving the
lookup in there and removing the assignment? That means we only do that
juggling in one spot, which makes more sense. I think this should just
be folded into the previous patch.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index cead1a0602b4..b5422613c7b1 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4923,7 +4923,6 @@ static inline void io_queue_link_head(struct io_kiocb *req)
 static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 			  struct io_submit_state *state, struct io_kiocb **link)
 {
-	const struct cred *old_creds = NULL;
 	struct io_ring_ctx *ctx = req->ctx;
 	unsigned int sqe_flags;
 	int ret, id;
@@ -4938,14 +4937,11 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	id = READ_ONCE(sqe->personality);
 	if (id) {
-		const struct cred *personality_creds;
-
-		personality_creds = idr_find(&ctx->personality_idr, id);
-		if (unlikely(!personality_creds)) {
+		req->work.creds = idr_find(&ctx->personality_idr, id);
+		if (unlikely(!req->work.creds)) {
 			ret = -EINVAL;
 			goto err_req;
 		}
-		old_creds = override_creds(personality_creds);
 	}
 
 	/* same numerical values with corresponding REQ_F_*, safe to copy */
@@ -4957,8 +4953,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 err_req:
 		io_cqring_add_event(req, ret);
 		io_double_put_req(req);
-		if (old_creds)
-			revert_creds(old_creds);
 		return false;
 	}
 
@@ -5019,8 +5013,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		}
 	}
 
-	if (old_creds)
-		revert_creds(old_creds);
 	return true;
 }
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-23 14:58                   ` Jens Axboe
@ 2020-02-23 15:07                     ` Jens Axboe
  2020-02-23 18:04                       ` Pavel Begunkov
  0 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2020-02-23 15:07 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: glauber, peterz, Jann Horn

On 2/23/20 7:58 AM, Jens Axboe wrote:
> On 2/23/20 7:49 AM, Jens Axboe wrote:
>>> Anyway, creds handling is too scattered across the code, and this does a
>>> lot of useless refcounting and bouncing. It's better to find it a
>>> better place in the near future.
>>
>> I think a good cleanup on top of this would be to move the personality
>> lookup to io_req_defer_prep(), and kill it from io_submit_sqe(). Now
>> __io_issue_sqe() does the right thing, and it'll just fall out nicely
>> with that as far as I can tell.
>>
>> Care to send a patch for that?
> 
> Since we also need it for non-deferral, how about just leaving the
> lookup in there and removing the assignment? That means we only do that
> juggling in one spot, which makes more sense. I think this should just
> be folded into the previous patch.

Tested, we need a ref grab on the creds when assigning since the ref is
dropped at the other end.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index cead1a0602b4..d83f113f22fd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4923,7 +4923,6 @@ static inline void io_queue_link_head(struct io_kiocb *req)
 static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 			  struct io_submit_state *state, struct io_kiocb **link)
 {
-	const struct cred *old_creds = NULL;
 	struct io_ring_ctx *ctx = req->ctx;
 	unsigned int sqe_flags;
 	int ret, id;
@@ -4938,14 +4937,12 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	id = READ_ONCE(sqe->personality);
 	if (id) {
-		const struct cred *personality_creds;
-
-		personality_creds = idr_find(&ctx->personality_idr, id);
-		if (unlikely(!personality_creds)) {
+		req->work.creds = idr_find(&ctx->personality_idr, id);
+		if (unlikely(!req->work.creds)) {
 			ret = -EINVAL;
 			goto err_req;
 		}
-		old_creds = override_creds(personality_creds);
+		get_cred(req->work.creds);
 	}
 
 	/* same numerical values with corresponding REQ_F_*, safe to copy */
@@ -4957,8 +4954,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 err_req:
 		io_cqring_add_event(req, ret);
 		io_double_put_req(req);
-		if (old_creds)
-			revert_creds(old_creds);
 		return false;
 	}
 
@@ -5019,8 +5014,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		}
 	}
 
-	if (old_creds)
-		revert_creds(old_creds);
 	return true;
 }
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-23 14:49                 ` Jens Axboe
  2020-02-23 14:58                   ` Jens Axboe
@ 2020-02-23 17:55                   ` Pavel Begunkov
  1 sibling, 0 replies; 53+ messages in thread
From: Pavel Begunkov @ 2020-02-23 17:55 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: glauber, peterz, Jann Horn


On 23/02/2020 17:49, Jens Axboe wrote:
> On 2/23/20 4:02 AM, Pavel Begunkov wrote:
>> Looking at
>>
>> io_async_task_func() {
>> 	...
>> 	/* ensure req->work.creds is valid for __io_queue_sqe() */
>> 	req->work.creds = apoll->work.creds;
>> }
>>
>> It copies creds, but doesn't touch the rest of the req->work fields. And if you
>> have
>> one, you most probably got all of them in *grab_env(). Are you sure it doesn't
>> leak, e.g. mmgrab()'ed mm?
> 
> You're looking at a version that only existed for about 20 min, had to
> check I pushed it out. But ce21471abe0fef is the current one, it does
> a full memcpy() of it.

Lucky me, great then

> 
> Thanks, I'll fold this in, if you don't mind.

Sure

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-23 15:07                     ` Jens Axboe
@ 2020-02-23 18:04                       ` Pavel Begunkov
  2020-02-23 18:06                         ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Pavel Begunkov @ 2020-02-23 18:04 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: glauber, peterz, Jann Horn

On 23/02/2020 18:07, Jens Axboe wrote:
> On 2/23/20 7:58 AM, Jens Axboe wrote:
>> On 2/23/20 7:49 AM, Jens Axboe wrote:
>>>> Anyway, creds handling is too scattered across the code, and this does a
>>>> lot of useless refcounting and bouncing. It's better to find it a
>>>> better place in the near future.
>>>
>>> I think a good cleanup on top of this would be to move the personality
>>> lookup to io_req_defer_prep(), and kill it from io_submit_sqe(). Now
>>> __io_issue_sqe() does the right thing, and it'll just fall out nicely
>>> with that as far as I can tell.
>>>
>>> Care to send a patch for that?
>>
>> Since we also need it for non-deferral, how about just leaving the
>> lookup in there and removing the assignment? That means we only do that
>> juggling in one spot, which makes more sense. I think this should just
>> be folded into the previous patch.
> 
> Tested, we need a ref grab on the creds when assigning since the ref is
> dropped at the other end.

Nice, this looks much better.

> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index cead1a0602b4..d83f113f22fd 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -4923,7 +4923,6 @@ static inline void io_queue_link_head(struct io_kiocb *req)
>  static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>  			  struct io_submit_state *state, struct io_kiocb **link)
>  {
> -	const struct cred *old_creds = NULL;
>  	struct io_ring_ctx *ctx = req->ctx;
>  	unsigned int sqe_flags;
>  	int ret, id;
> @@ -4938,14 +4937,12 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>  
>  	id = READ_ONCE(sqe->personality);
>  	if (id) {
> -		const struct cred *personality_creds;
> -
> -		personality_creds = idr_find(&ctx->personality_idr, id);
> -		if (unlikely(!personality_creds)) {
> +		req->work.creds = idr_find(&ctx->personality_idr, id);
> +		if (unlikely(!req->work.creds)) {
>  			ret = -EINVAL;
>  			goto err_req;
>  		}
> -		old_creds = override_creds(personality_creds);
> +		get_cred(req->work.creds);
>  	}
>  
>  	/* same numerical values with corresponding REQ_F_*, safe to copy */
> @@ -4957,8 +4954,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>  err_req:
>  		io_cqring_add_event(req, ret);
>  		io_double_put_req(req);
> -		if (old_creds)
> -			revert_creds(old_creds);
>  		return false;
>  	}
>  
> @@ -5019,8 +5014,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>  		}
>  	}
>  
> -	if (old_creds)
> -		revert_creds(old_creds);
>  	return true;
>  }
>  
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] io_uring: add per-task callback handler
  2020-02-23 18:04                       ` Pavel Begunkov
@ 2020-02-23 18:06                         ` Jens Axboe
  0 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2020-02-23 18:06 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: glauber, peterz, Jann Horn

On 2/23/20 11:04 AM, Pavel Begunkov wrote:
> On 23/02/2020 18:07, Jens Axboe wrote:
>> On 2/23/20 7:58 AM, Jens Axboe wrote:
>>> On 2/23/20 7:49 AM, Jens Axboe wrote:
>>>>> Anyway, creds handling is too scattered across the code, and this does a
>>>>> lot of useless refcounting and bouncing. It's better to find it a
>>>>> better place in the near future.
>>>>
>>>> I think a good cleanup on top of this would be to move the personality
>>>> lookup to io_req_defer_prep(), and kill it from io_submit_sqe(). Now
>>>> __io_issue_sqe() does the right thing, and it'll just fall out nicely
>>>> with that as far as I can tell.
>>>>
>>>> Care to send a patch for that?
>>>
>>> Since we also need it for non-deferral, how about just leaving the
>>> lookup in there and removing the assignment? That means we only do that
>>> juggling in one spot, which makes more sense. I think this should just
>>> be folded into the previous patch.
>>
>> Tested, we need a ref grab on the creds when assigning since the ref is
>> dropped at the other end.
> 
> Nice, this looks much better.

Glad you agree, here's the final folded in:


commit 6494e0bd77a5b339e0585c65792e1f829f2a4812
Author: Jens Axboe <axboe@kernel.dk>
Date:   Sat Feb 22 23:22:19 2020 -0700

    io_uring: handle multiple personalities in link chains
    
    If we have a chain of requests and they don't all use the same
    credentials, then the head of the chain will be issued with the
    credentials of the tail of the chain.
    
    Ensure __io_queue_sqe() overrides the credentials, if they are different.
    
    Once we do that, we can clean up the creds handling as well, by only
    having io_submit_sqe() do the lookup of a personality. It doesn't need
    to assign it, since __io_queue_sqe() now always does the right thing.
    
    Fixes: 75c6a03904e0 ("io_uring: support using a registered personality for commands")
    Reported-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index de650df9ac53..7d0be264527d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4705,11 +4705,21 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_kiocb *linked_timeout;
 	struct io_kiocb *nxt = NULL;
+	const struct cred *old_creds = NULL;
 	int ret;
 
 again:
 	linked_timeout = io_prep_linked_timeout(req);
 
+	if (req->work.creds && req->work.creds != current_cred()) {
+		if (old_creds)
+			revert_creds(old_creds);
+		if (old_creds == req->work.creds)
+			old_creds = NULL; /* restored original creds */
+		else
+			old_creds = override_creds(req->work.creds);
+	}
+
 	ret = io_issue_sqe(req, sqe, &nxt, true);
 
 	/*
@@ -4759,6 +4769,8 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 			goto punt;
 		goto again;
 	}
+	if (old_creds)
+		revert_creds(old_creds);
 }
 
 static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -4803,7 +4815,6 @@ static inline void io_queue_link_head(struct io_kiocb *req)
 static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 			  struct io_submit_state *state, struct io_kiocb **link)
 {
-	const struct cred *old_creds = NULL;
 	struct io_ring_ctx *ctx = req->ctx;
 	unsigned int sqe_flags;
 	int ret, id;
@@ -4818,14 +4829,12 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	id = READ_ONCE(sqe->personality);
 	if (id) {
-		const struct cred *personality_creds;
-
-		personality_creds = idr_find(&ctx->personality_idr, id);
-		if (unlikely(!personality_creds)) {
+		req->work.creds = idr_find(&ctx->personality_idr, id);
+		if (unlikely(!req->work.creds)) {
 			ret = -EINVAL;
 			goto err_req;
 		}
-		old_creds = override_creds(personality_creds);
+		get_cred(req->work.creds);
 	}
 
 	/* same numerical values with corresponding REQ_F_*, safe to copy */
@@ -4837,8 +4846,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 err_req:
 		io_cqring_add_event(req, ret);
 		io_double_put_req(req);
-		if (old_creds)
-			revert_creds(old_creds);
 		return false;
 	}
 
@@ -4899,8 +4906,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		}
 	}
 
-	if (old_creds)
-		revert_creds(old_creds);
 	return true;
 }
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2020-02-23 18:06 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-20 20:31 [PATCHSET 0/9] io_uring: use polled async retry Jens Axboe
2020-02-20 20:31 ` [PATCH 1/9] io_uring: consider any io_read/write -EAGAIN as final Jens Axboe
2020-02-20 20:31 ` [PATCH 2/9] io_uring: io_accept() should hold on to submit reference on retry Jens Axboe
2020-02-20 20:31 ` [PATCH 3/9] sched: move io-wq/workqueue worker sched in/out into helpers Jens Axboe
2020-02-20 20:31 ` [PATCH 4/9] task_work_run: don't take ->pi_lock unconditionally Jens Axboe
2020-02-20 20:31 ` [PATCH 5/9] kernel: abstract out task work helpers Jens Axboe
2020-02-20 21:07   ` Peter Zijlstra
2020-02-20 21:08     ` Jens Axboe
2020-02-20 20:31 ` [PATCH 6/9] sched: add a sched_work list Jens Axboe
2020-02-20 21:17   ` Peter Zijlstra
2020-02-20 21:53     ` Jens Axboe
2020-02-20 22:02       ` Jens Axboe
2020-02-20 20:31 ` [PATCH 7/9] io_uring: add per-task callback handler Jens Axboe
2020-02-20 22:02   ` Jann Horn
2020-02-20 22:14     ` Jens Axboe
2020-02-20 22:18       ` Jens Axboe
2020-02-20 22:25         ` Jann Horn
2020-02-20 22:23       ` Jens Axboe
2020-02-20 22:38         ` Jann Horn
2020-02-20 22:56           ` Jens Axboe
2020-02-20 22:58             ` Jann Horn
2020-02-20 23:02               ` Jens Axboe
2020-02-20 22:23       ` Jann Horn
2020-02-20 23:00         ` Jens Axboe
2020-02-20 23:12           ` Jann Horn
2020-02-20 23:22             ` Jens Axboe
2020-02-21  1:29               ` Jann Horn
2020-02-21 17:32                 ` Jens Axboe
2020-02-21 19:24                   ` Jann Horn
2020-02-21 20:18                     ` Jens Axboe
2020-02-20 22:56     ` Jann Horn
2020-02-21 10:47     ` Peter Zijlstra
2020-02-21 14:49       ` Jens Axboe
2020-02-21 15:02         ` Jann Horn
2020-02-21 16:12           ` Peter Zijlstra
2020-02-21 16:23         ` Peter Zijlstra
2020-02-21 20:13           ` Jens Axboe
2020-02-21 13:51   ` Pavel Begunkov
2020-02-21 14:50     ` Jens Axboe
2020-02-21 18:30       ` Pavel Begunkov
2020-02-21 19:10         ` Jens Axboe
2020-02-21 19:22           ` Pavel Begunkov
2020-02-23  6:00           ` Jens Axboe
2020-02-23  6:26             ` Jens Axboe
2020-02-23 11:02               ` Pavel Begunkov
2020-02-23 14:49                 ` Jens Axboe
2020-02-23 14:58                   ` Jens Axboe
2020-02-23 15:07                     ` Jens Axboe
2020-02-23 18:04                       ` Pavel Begunkov
2020-02-23 18:06                         ` Jens Axboe
2020-02-23 17:55                   ` Pavel Begunkov
2020-02-20 20:31 ` [PATCH 8/9] io_uring: mark requests that we can do poll async in io_op_defs Jens Axboe
2020-02-20 20:31 ` [PATCH 9/9] io_uring: use poll driven retry for files that support it Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).