All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] io_uring: use TWA_SIGNAL for task_work if the task isn't running
@ 2020-08-07 16:55 Jens Axboe
  2020-08-07 18:00 ` Jann Horn
  0 siblings, 1 reply; 4+ messages in thread
From: Jens Axboe @ 2020-08-07 16:55 UTC (permalink / raw)
  To: io-uring

An earlier commit:

b7db41c9e03b ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")

ensured that we didn't get stuck waiting for eventfd reads when it's
registered with the io_uring ring for event notification, but we still
have a gap where the task can be waiting on other events in the kernel
and need a bigger nudge to make forward progress.

Ensure that we use signaled notifications for a task that isn't currently
running, to be certain the work is seen and processed immediately.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: Josef <josef.grieb@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

This isn't perfect, as it'll use TWA_SIGNAL even for cases where we
don't absolutely need it (like task waiting for completions in
io_cqring_wait()), but we don't have a good way to tell right now. We
can probably improve on this in the future, for now I think this is the
best solution.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e9b27cdaa735..b4300a61f231 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1720,7 +1720,7 @@ static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb)
 	 */
 	if (ctx->flags & IORING_SETUP_SQPOLL)
 		notify = 0;
-	else if (ctx->cq_ev_fd)
+	else if (ctx->cq_ev_fd || (tsk->state != TASK_RUNNING))
 		notify = TWA_SIGNAL;
 
 	ret = task_work_add(tsk, cb, notify);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] io_uring: use TWA_SIGNAL for task_work if the task isn't running
  2020-08-07 16:55 [PATCH v2] io_uring: use TWA_SIGNAL for task_work if the task isn't running Jens Axboe
@ 2020-08-07 18:00 ` Jann Horn
  2020-08-07 21:50   ` Jens Axboe
  0 siblings, 1 reply; 4+ messages in thread
From: Jann Horn @ 2020-08-07 18:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Fri, Aug 7, 2020 at 6:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> An earlier commit:
>
> b7db41c9e03b ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")
>
> ensured that we didn't get stuck waiting for eventfd reads when it's
> registered with the io_uring ring for event notification, but we still
> have a gap where the task can be waiting on other events in the kernel
> and need a bigger nudge to make forward progress.
>
> Ensure that we use signaled notifications for a task that isn't currently
> running, to be certain the work is seen and processed immediately.
>
> Cc: stable@vger.kernel.org # v5.7+
> Reported-by: Josef <josef.grieb@gmail.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> ---
>
> This isn't perfect, as it'll use TWA_SIGNAL even for cases where we
> don't absolutely need it (like task waiting for completions in
> io_cqring_wait()), but we don't have a good way to tell right now. We
> can probably improve on this in the future, for now I think this is the
> best solution.
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index e9b27cdaa735..b4300a61f231 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1720,7 +1720,7 @@ static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb)
>          */
>         if (ctx->flags & IORING_SETUP_SQPOLL)
>                 notify = 0;
> -       else if (ctx->cq_ev_fd)
> +       else if (ctx->cq_ev_fd || (tsk->state != TASK_RUNNING))
>                 notify = TWA_SIGNAL;
>
>         ret = task_work_add(tsk, cb, notify);

I don't get it. Apart from still not understanding the big picture:

What guarantees that the lockless read of tsk->state is in any way
related to the state of the remote process by the time we reach
task_work_add()? And why do we not need to signal in TASK_RUNNING
state (e.g. directly before the remote process switches to
TASK_INTERRUPTIBLE or something like that)?

Even if this is correct, it would still be nice if you could add a big
comment that explains the precise semantics this is attempting to
provide. As far as I understand so far, the goal is to trigger -EINTR
returns from certain syscalls, or something like that? But I don't
understand whether that's indeed what's going on, or which syscalls
precisely this is attempting to make return -EINTR.

(Also, lockless reads of concurrently changing variables should be
written with READ_ONCE().)

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] io_uring: use TWA_SIGNAL for task_work if the task isn't running
  2020-08-07 18:00 ` Jann Horn
@ 2020-08-07 21:50   ` Jens Axboe
  2020-08-07 22:11     ` Jens Axboe
  0 siblings, 1 reply; 4+ messages in thread
From: Jens Axboe @ 2020-08-07 21:50 UTC (permalink / raw)
  To: Jann Horn; +Cc: io-uring

On 8/7/20 12:00 PM, Jann Horn wrote:
> On Fri, Aug 7, 2020 at 6:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> An earlier commit:
>>
>> b7db41c9e03b ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")
>>
>> ensured that we didn't get stuck waiting for eventfd reads when it's
>> registered with the io_uring ring for event notification, but we still
>> have a gap where the task can be waiting on other events in the kernel
>> and need a bigger nudge to make forward progress.
>>
>> Ensure that we use signaled notifications for a task that isn't currently
>> running, to be certain the work is seen and processed immediately.
>>
>> Cc: stable@vger.kernel.org # v5.7+
>> Reported-by: Josef <josef.grieb@gmail.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> This isn't perfect, as it'll use TWA_SIGNAL even for cases where we
>> don't absolutely need it (like task waiting for completions in
>> io_cqring_wait()), but we don't have a good way to tell right now. We
>> can probably improve on this in the future, for now I think this is the
>> best solution.
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index e9b27cdaa735..b4300a61f231 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -1720,7 +1720,7 @@ static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb)
>>          */
>>         if (ctx->flags & IORING_SETUP_SQPOLL)
>>                 notify = 0;
>> -       else if (ctx->cq_ev_fd)
>> +       else if (ctx->cq_ev_fd || (tsk->state != TASK_RUNNING))
>>                 notify = TWA_SIGNAL;
>>
>>         ret = task_work_add(tsk, cb, notify);
> 
> I don't get it. Apart from still not understanding the big picture:
> 
> What guarantees that the lockless read of tsk->state is in any way
> related to the state of the remote process by the time we reach
> task_work_add()? And why do we not need to signal in TASK_RUNNING
> state (e.g. directly before the remote process switches to
> TASK_INTERRUPTIBLE or something like that)?

Yeah it doesn't, the patch doesn't cover the racy case. As far as I can
tell, we've got two ways to do it:

1) We split the task_work_add() into two parts, one adding the work and
   one doing the signaling. Then we could do:

int notify = TWA_RESUME;

__task_work_add(tsk, cb);

if (ctx->flags & IORING_SETUP_SQPOLL)
	notify = 0;
else if (ctx->cq_ev_fd || (tsk->state != TASK_RUNNING))
	notify = TWA_SIGNAL;

__task_work_signal(tsk, notify);

2) We imply that behavior in task_work_add() itself, if TWA_SIGNAL is
used, making TWA_SIGNAL imply "use signal wakeup IFF task is not
running". Or add a TWA_SIGNAL_NOT_RUNNING for that behavior.

I kind of like the first approach.

> Even if this is correct, it would still be nice if you could add a big
> comment that explains the precise semantics this is attempting to
> provide. As far as I understand so far, the goal is to trigger -EINTR
> returns from certain syscalls, or something like that? But I don't
> understand whether that's indeed what's going on, or which syscalls
> precisely this is attempting to make return -EINTR.

The point is if the original task is currently looping (or just waiting)
in the kernel on another event, it still gets a chance to process the
task work. The completion it's waiting for may be dependent on getting
that task work run.

The test case for this one is kicking off a thread that waits on the
completion event, while the main task is waiting for the thread to exit.

Agree it needs a comment, I'll add one.

> (Also, lockless reads of concurrently changing variables should be
> written with READ_ONCE().)

Good point.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] io_uring: use TWA_SIGNAL for task_work if the task isn't running
  2020-08-07 21:50   ` Jens Axboe
@ 2020-08-07 22:11     ` Jens Axboe
  0 siblings, 0 replies; 4+ messages in thread
From: Jens Axboe @ 2020-08-07 22:11 UTC (permalink / raw)
  To: Jann Horn; +Cc: io-uring

[-- Attachment #1: Type: text/plain, Size: 2686 bytes --]

On 8/7/20 3:50 PM, Jens Axboe wrote:
> On 8/7/20 12:00 PM, Jann Horn wrote:
>> On Fri, Aug 7, 2020 at 6:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> An earlier commit:
>>>
>>> b7db41c9e03b ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")
>>>
>>> ensured that we didn't get stuck waiting for eventfd reads when it's
>>> registered with the io_uring ring for event notification, but we still
>>> have a gap where the task can be waiting on other events in the kernel
>>> and need a bigger nudge to make forward progress.
>>>
>>> Ensure that we use signaled notifications for a task that isn't currently
>>> running, to be certain the work is seen and processed immediately.
>>>
>>> Cc: stable@vger.kernel.org # v5.7+
>>> Reported-by: Josef <josef.grieb@gmail.com>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>
>>> ---
>>>
>>> This isn't perfect, as it'll use TWA_SIGNAL even for cases where we
>>> don't absolutely need it (like task waiting for completions in
>>> io_cqring_wait()), but we don't have a good way to tell right now. We
>>> can probably improve on this in the future, for now I think this is the
>>> best solution.
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index e9b27cdaa735..b4300a61f231 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -1720,7 +1720,7 @@ static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb)
>>>          */
>>>         if (ctx->flags & IORING_SETUP_SQPOLL)
>>>                 notify = 0;
>>> -       else if (ctx->cq_ev_fd)
>>> +       else if (ctx->cq_ev_fd || (tsk->state != TASK_RUNNING))
>>>                 notify = TWA_SIGNAL;
>>>
>>>         ret = task_work_add(tsk, cb, notify);
>>
>> I don't get it. Apart from still not understanding the big picture:
>>
>> What guarantees that the lockless read of tsk->state is in any way
>> related to the state of the remote process by the time we reach
>> task_work_add()? And why do we not need to signal in TASK_RUNNING
>> state (e.g. directly before the remote process switches to
>> TASK_INTERRUPTIBLE or something like that)?
> 
> Yeah it doesn't, the patch doesn't cover the racy case. As far as I can
> tell, we've got two ways to do it:
> 
> 1) We split the task_work_add() into two parts, one adding the work and
>    one doing the signaling. Then we could do:
> 
> int notify = TWA_RESUME;
> 
> __task_work_add(tsk, cb);
> 
> if (ctx->flags & IORING_SETUP_SQPOLL)
> 	notify = 0;
> else if (ctx->cq_ev_fd || (tsk->state != TASK_RUNNING))
> 	notify = TWA_SIGNAL;
> 
> __task_work_signal(tsk, notify);

Something like the attached - totally untested so far, but it implements
that idea.

-- 
Jens Axboe


[-- Attachment #2: 0002-io_uring-use-TWA_SIGNAL-for-task_work-if-the-task-is.patch --]
[-- Type: text/x-patch, Size: 2351 bytes --]

From ec67af3b1e0c9325a04d5d1c12311086349d3a79 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Thu, 6 Aug 2020 19:41:50 -0600
Subject: [PATCH 2/2] io_uring: use TWA_SIGNAL for task_work if the task isn't
 running

An earlier commit:

b7db41c9e03b ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")

ensured that we didn't get stuck waiting for eventfd reads when it's
registered with the io_uring ring for event notification, but we still
have a gap where the task can be waiting on other events in the kernel
and need a bigger nudge to make forward progress.

Ensure that we use signaled notifications for a task that isn't currently
running, to be certain the work is seen and processed immediately.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: Josef <josef.grieb@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e9b27cdaa735..443eecdfeda9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1712,21 +1712,27 @@ static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb)
 	struct io_ring_ctx *ctx = req->ctx;
 	int ret, notify = TWA_RESUME;
 
+	ret = __task_work_add(tsk, cb);
+	if (unlikely(ret))
+		return ret;
+
 	/*
 	 * SQPOLL kernel thread doesn't need notification, just a wakeup.
-	 * If we're not using an eventfd, then TWA_RESUME is always fine,
-	 * as we won't have dependencies between request completions for
-	 * other kernel wait conditions.
+	 * For any other work, use signaled wakeups if the task isn't
+	 * running to avoid dependencies between tasks or threads. If
+	 * the issuing task is currently waiting in the kernel on a thread,
+	 * and same thread is waiting for a completion event, then we need
+	 * to ensure that the issuing task processes task_work. TWA_SIGNAL
+	 * is needed for that.
 	 */
 	if (ctx->flags & IORING_SETUP_SQPOLL)
 		notify = 0;
-	else if (ctx->cq_ev_fd)
+	else if (READ_ONCE(tsk->state) != TASK_RUNNING)
 		notify = TWA_SIGNAL;
 
-	ret = task_work_add(tsk, cb, notify);
-	if (!ret)
-		wake_up_process(tsk);
-	return ret;
+	__task_work_notify(tsk, notify);
+	wake_up_process(tsk);
+	return 0;
 }
 
 static void __io_req_task_cancel(struct io_kiocb *req, int error)
-- 
2.28.0


[-- Attachment #3: 0001-kernel-split-task_work_add-into-two-separate-helpers.patch --]
[-- Type: text/x-patch, Size: 3714 bytes --]

From 802b09d10bdd2f90e510049b1c52b81edc2ae0a3 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Fri, 7 Aug 2020 16:00:53 -0600
Subject: [PATCH 1/2] kernel: split task_work_add() into two separate helpers

Some callers may need to make signaling decisions based on the state
of the targeted task, and that can only safely be done post adding
the task_work to the task. Split task_work_add() into:

__task_work_add()	- adds the work item
__task_work_notify()	- sends the notification

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/task_work.h | 19 ++++++++++++++++
 kernel/task_work.c        | 48 +++++++++++++++++++++------------------
 2 files changed, 45 insertions(+), 22 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 0fb93aafa478..7abbd8df5e13 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -5,6 +5,8 @@
 #include <linux/list.h>
 #include <linux/sched.h>
 
+extern struct callback_head work_exited;
+
 typedef void (*task_work_func_t)(struct callback_head *);
 
 static inline void
@@ -13,6 +15,21 @@ init_task_work(struct callback_head *twork, task_work_func_t func)
 	twork->func = func;
 }
 
+static inline int __task_work_add(struct task_struct *task,
+				  struct callback_head *work)
+{
+	struct callback_head *head;
+
+	do {
+		head = READ_ONCE(task->task_works);
+		if (unlikely(head == &work_exited))
+			return -ESRCH;
+		work->next = head;
+	} while (cmpxchg(&task->task_works, head, work) != head);
+
+	return 0;
+}
+
 #define TWA_RESUME	1
 #define TWA_SIGNAL	2
 int task_work_add(struct task_struct *task, struct callback_head *twork, int);
@@ -20,6 +37,8 @@ int task_work_add(struct task_struct *task, struct callback_head *twork, int);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
 void task_work_run(void);
 
+void __task_work_notify(struct task_struct *task, int notify);
+
 static inline void exit_task_work(struct task_struct *task)
 {
 	task_work_run();
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 5c0848ca1287..9bde81481984 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -3,7 +3,27 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
-static struct callback_head work_exited; /* all we need is ->next == NULL */
+struct callback_head work_exited = {
+	.next = NULL	/* all we need is ->next == NULL */
+};
+
+void __task_work_notify(struct task_struct *task, int notify)
+{
+	unsigned long flags;
+
+	switch (notify) {
+	case TWA_RESUME:
+		set_notify_resume(task);
+		break;
+	case TWA_SIGNAL:
+		if (lock_task_sighand(task, &flags)) {
+			task->jobctl |= JOBCTL_TASK_WORK;
+			signal_wake_up(task, 0);
+			unlock_task_sighand(task, &flags);
+		}
+		break;
+	}
+}
 
 /**
  * task_work_add - ask the @task to execute @work->func()
@@ -27,29 +47,13 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
 int
 task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 {
-	struct callback_head *head;
-	unsigned long flags;
+	int ret;
 
-	do {
-		head = READ_ONCE(task->task_works);
-		if (unlikely(head == &work_exited))
-			return -ESRCH;
-		work->next = head;
-	} while (cmpxchg(&task->task_works, head, work) != head);
-
-	switch (notify) {
-	case TWA_RESUME:
-		set_notify_resume(task);
-		break;
-	case TWA_SIGNAL:
-		if (lock_task_sighand(task, &flags)) {
-			task->jobctl |= JOBCTL_TASK_WORK;
-			signal_wake_up(task, 0);
-			unlock_task_sighand(task, &flags);
-		}
-		break;
-	}
+	ret = __task_work_add(task, work);
+	if (unlikely(ret))
+		return ret;
 
+	__task_work_notify(task, notify);
 	return 0;
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-08-07 22:11 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-07 16:55 [PATCH v2] io_uring: use TWA_SIGNAL for task_work if the task isn't running Jens Axboe
2020-08-07 18:00 ` Jann Horn
2020-08-07 21:50   ` Jens Axboe
2020-08-07 22:11     ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.