linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC v2] kernel: decouple TASK_WORK TWA_SIGNAL handling from signals
@ 2020-10-01 15:03 Jens Axboe
  2020-10-01 16:27 ` Oleg Nesterov
  0 siblings, 1 reply; 4+ messages in thread
From: Jens Axboe @ 2020-10-01 15:03 UTC (permalink / raw)
  To: io-uring, linux-kernel; +Cc: Peter Zijlstra, Oleg Nesterov, Thomas Gleixner

Users of TWA_SIGNAL need to have system calls interrupted and go through
a kernel/user transition to ensure they are run. This currently works
well from a functional standpoint, but it is heavy handed on a
multithreaded application where sighand is shared between the threads and
main process. Adding TWA_SIGNAL task_work on such setups need to grab
the sighand->lock, which creates a hot spot for otherwise unrelated
task_work.

This adds TIF_TASKWORK for x86, which if set, will return true on
checking for pending signals. That in turn causes tasks to restart the
system call, which will run the added task_work. If TIF_TASKWORK is
available, we'll use that for notification when TWA_SIGNAL is specified.
If it isn't available, the existing TIF_SIGPENDING path is used.

Once all archs have added support for TIF_TASKWORK, we can kill the
old code completely. That will also allow removal of JOBCTL_TASK_WORK
and related code.

On my test box, even just using 16 threads shows a nice improvement
running an io_uring based echo server.

stock kernel:
0.01% <= 0.1 milliseconds
95.86% <= 0.2 milliseconds
98.27% <= 0.3 milliseconds
99.71% <= 0.4 milliseconds
100.00% <= 0.5 milliseconds
100.00% <= 0.6 milliseconds
100.00% <= 0.7 milliseconds
100.00% <= 0.8 milliseconds
100.00% <= 0.9 milliseconds
100.00% <= 1.0 milliseconds
100.00% <= 1.1 milliseconds
100.00% <= 2 milliseconds
100.00% <= 3 milliseconds
100.00% <= 3 milliseconds
1378930.00 requests per second
~1600% CPU

1.38M requests/second, and all 16 CPUs are maxed out.

patched kernel:
0.01% <= 0.1 milliseconds
98.24% <= 0.2 milliseconds
99.47% <= 0.3 milliseconds
99.99% <= 0.4 milliseconds
100.00% <= 0.5 milliseconds
100.00% <= 0.6 milliseconds
100.00% <= 0.7 milliseconds
100.00% <= 0.8 milliseconds
100.00% <= 0.9 milliseconds
100.00% <= 1.2 milliseconds
1666111.38 requests per second
~1450% CPU

1.67M requests/second, and we're no longer just hammering on the sighand
lock. The original reporter states:

"For 5.7.15 my benchmark achieves 1.6M qps and system cpu is at ~80%.
 for 5.7.16 or later it achieves only 1M qps and the system cpu is is
 at ~100%"

with the only difference there being that TWA_SIGNAL is used
unconditionally in 5.7.16, since we need it to be able to solve an
inability to run task_work if the application is waiting in the kernel
already on an event that needs task_work run to be satisfied. Also
see commit 0ba9c9edcd15.

Reported-by: Roman Gershman <romger@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

V2: Ensure we don't restart syscall if we did deliver a signal

This likely needs some splitting into two parts, but figured I'd toss
this one out there for comments. It's fixing a performance regression
for me, while still fixing the original issue that introduced the
regression.

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 267701ae3d86..79fe7db3208c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_SLD			18	/* Restore split lock detection on context switch */
+#define TIF_TASKWORK		19	/* task_work pending */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_SLD		(1 << TIF_SLD)
+#define _TIF_TASKWORK		(1 << TIF_TASKWORK)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index be0d7d4152ec..5dc1eeaf0866 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -799,21 +799,8 @@ static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
 #endif
 }
 
-/*
- * Note that 'init' is a special process: it doesn't get signals it doesn't
- * want to handle. Thus you cannot kill init even with a SIGKILL even by
- * mistake.
- */
-void arch_do_signal(struct pt_regs *regs)
+void arch_restart_syscall(struct pt_regs *regs)
 {
-	struct ksignal ksig;
-
-	if (get_signal(&ksig)) {
-		/* Whee! Actually deliver the signal.  */
-		handle_signal(&ksig, regs);
-		return;
-	}
-
 	/* Did we come from a system call? */
 	if (syscall_get_nr(current, regs) >= 0) {
 		/* Restart the system call - no handlers present */
@@ -831,12 +818,29 @@ void arch_do_signal(struct pt_regs *regs)
 			break;
 		}
 	}
+}
+
+/*
+ * Note that 'init' is a special process: it doesn't get signals it doesn't
+ * want to handle. Thus you cannot kill init even with a SIGKILL even by
+ * mistake.
+ */
+bool arch_do_signal(struct pt_regs *regs)
+{
+	struct ksignal ksig;
+
+	if (get_signal(&ksig)) {
+		/* Whee! Actually deliver the signal.  */
+		handle_signal(&ksig, regs);
+		return true;
+	}
 
 	/*
 	 * If there's no signal to deliver, we just put the saved sigmask
 	 * back.
 	 */
 	restore_saved_sigmask();
+	return false;
 }
 
 void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index eed0d068904c..e0fec19b5faa 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1767,7 +1767,7 @@ static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb,
 		notify = TWA_SIGNAL;
 
 	ret = task_work_add(tsk, cb, notify);
-	if (!ret)
+	if (!ret && !notify)
 		wake_up_process(tsk);
 
 	return ret;
@@ -6744,27 +6744,17 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
 	trace_io_uring_cqring_wait(ctx, min_events);
-	do {
+	while (!io_should_wake(&iowq, false)) {
 		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
 						TASK_INTERRUPTIBLE);
-		/* make sure we run task_work before checking for signals */
 		if (io_run_task_work())
 			continue;
 		if (signal_pending(current)) {
-			if (current->jobctl & JOBCTL_TASK_WORK) {
-				spin_lock_irq(&current->sighand->siglock);
-				current->jobctl &= ~JOBCTL_TASK_WORK;
-				recalc_sigpending();
-				spin_unlock_irq(&current->sighand->siglock);
-				continue;
-			}
 			ret = -EINTR;
 			break;
 		}
-		if (io_should_wake(&iowq, false))
-			break;
 		schedule();
-	} while (1);
+	}
 	finish_wait(&ctx->wait, &iowq.wq);
 
 	restore_saved_sigmask_unless(ret == -EINTR);
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 159c7476b11b..03cab8b9ddab 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -37,6 +37,10 @@
 # define _TIF_UPROBE			(0)
 #endif
 
+#ifndef _TIF_TASKWORK
+# define _TIF_TASKWORK			(0)
+#endif
+
 /*
  * TIF flags handled in syscall_enter_from_usermode()
  */
@@ -69,7 +73,7 @@
 
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
-	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_TASKWORK|	\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
@@ -262,9 +266,19 @@ static __always_inline void arch_exit_to_user_mode(void) { }
  * arch_do_signal -  Architecture specific signal delivery function
  * @regs:	Pointer to currents pt_regs
  *
- * Invoked from exit_to_user_mode_loop().
+ * Invoked from exit_to_user_mode_loop(). Returns true if a signal was
+ * handled.
+ */
+bool arch_do_signal(struct pt_regs *regs);
+
+/**
+ * arch_restart_syscall -  Architecture specific syscall restarting
+ * @regs:	Pointer to currents pt_regs
+ *
+ * Invoked from exit_to_user_mode_loop(), if we need to restart the current
+ * system call.
  */
-void arch_do_signal(struct pt_regs *regs);
+void arch_restart_syscall(struct pt_regs *regs);
 
 /**
  * arch_syscall_exit_tracehook - Wrapper around tracehook_report_syscall_exit()
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..d1a02749df74 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -355,7 +355,17 @@ static inline int restart_syscall(void)
 
 static inline int signal_pending(struct task_struct *p)
 {
-	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
+#ifdef TIF_TASKWORK
+	/*
+	 * TIF_TASKWORK isn't really a signal, but it requires the same
+	 * behavior of restarting the system call to force a kernel/user
+	 * transition.
+	 */
+	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING) ||
+			test_tsk_thread_flag(p, TIF_TASKWORK));
+#else
+	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
+#endif
 }
 
 static inline int __fatal_signal_pending(struct task_struct *p)
@@ -501,10 +511,16 @@ extern int set_user_sigmask(const sigset_t __user *umask, size_t sigsetsize);
 
 static inline void restore_saved_sigmask_unless(bool interrupted)
 {
-	if (interrupted)
+	if (interrupted) {
+#ifdef TIF_TASKWORK
+		WARN_ON(!test_thread_flag(TIF_SIGPENDING) &&
+			!test_thread_flag(TIF_TASKWORK));
+#else
 		WARN_ON(!test_thread_flag(TIF_SIGPENDING));
-	else
+#endif
+	} else {
 		restore_saved_sigmask();
+	}
 }
 
 static inline sigset_t *sigmask_to_save(void)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 6fdb6105e6d6..d25ee8f7f071 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -135,11 +135,13 @@ static __always_inline void exit_to_user_mode(void)
 }
 
 /* Workaround to allow gradual conversion of architecture code */
-void __weak arch_do_signal(struct pt_regs *regs) { }
+bool __weak arch_do_signal(struct pt_regs *regs) { return true; }
 
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 					    unsigned long ti_work)
 {
+	bool restart_sys = (ti_work & (_TIF_SIGPENDING|_TIF_TASKWORK)) != 0;
+
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
@@ -157,8 +159,11 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		if (ti_work & _TIF_PATCH_PENDING)
 			klp_update_patch_state(current);
 
-		if (ti_work & _TIF_SIGPENDING)
-			arch_do_signal(regs);
+		if (ti_work & _TIF_TASKWORK)
+			task_work_run();
+
+		if ((ti_work & _TIF_SIGPENDING) && arch_do_signal(regs))
+			restart_sys = false;
 
 		if (ti_work & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
@@ -178,6 +183,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		ti_work = READ_ONCE(current_thread_info()->flags);
 	}
 
+	if (restart_sys)
+		arch_restart_syscall(regs);
+
 	/* Return the latest work state for arch_exit_to_user_mode() */
 	return ti_work;
 }
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 613b2d634af8..3345a12df643 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -28,7 +28,6 @@ int
 task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 {
 	struct callback_head *head;
-	unsigned long flags;
 
 	do {
 		head = READ_ONCE(task->task_works);
@@ -41,7 +40,10 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 	case TWA_RESUME:
 		set_notify_resume(task);
 		break;
-	case TWA_SIGNAL:
+	case TWA_SIGNAL: {
+#ifndef TIF_TASKWORK
+		unsigned long flags;
+
 		/*
 		 * Only grab the sighand lock if we don't already have some
 		 * task_work pending. This pairs with the smp_store_mb()
@@ -53,7 +55,12 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 			signal_wake_up(task, 0);
 			unlock_task_sighand(task, &flags);
 		}
+#else
+		set_tsk_thread_flag(task, TIF_TASKWORK);
+		wake_up_process(task);
+#endif
 		break;
+		}
 	}
 
 	return 0;
@@ -110,6 +117,11 @@ void task_work_run(void)
 	struct task_struct *task = current;
 	struct callback_head *work, *head, *next;
 
+#ifdef TIF_TASKWORK
+	if (test_tsk_thread_flag(task, TIF_TASKWORK))
+		clear_tsk_thread_flag(task, TIF_TASKWORK);
+#endif
+
 	for (;;) {
 		/*
 		 * work->func() can do task_work_add(), do not set

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC v2] kernel: decouple TASK_WORK TWA_SIGNAL handling from signals
  2020-10-01 15:03 [PATCH RFC v2] kernel: decouple TASK_WORK TWA_SIGNAL handling from signals Jens Axboe
@ 2020-10-01 16:27 ` Oleg Nesterov
  2020-10-01 17:27   ` Jens Axboe
       [not found]   ` <20201002133813.3180-1-hdanton@sina.com>
  0 siblings, 2 replies; 4+ messages in thread
From: Oleg Nesterov @ 2020-10-01 16:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-kernel, Peter Zijlstra, Thomas Gleixner

Jens,

I'll read this version tomorrow, but:

On 10/01, Jens Axboe wrote:
>
>  static inline int signal_pending(struct task_struct *p)
>  {
> -	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
> +#ifdef TIF_TASKWORK
> +	/*
> +	 * TIF_TASKWORK isn't really a signal, but it requires the same
> +	 * behavior of restarting the system call to force a kernel/user
> +	 * transition.
> +	 */
> +	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING) ||
> +			test_tsk_thread_flag(p, TIF_TASKWORK));
> +#else
> +	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
> +#endif

This change alone is already very wrong.

signal_pending(task) == T means that this task will do get_signal() as
soon as it can, and this basically means you can't "divorce" SIGPENDING
and TASKWORK.

Simple example. Suppose we have a single-threaded task T.

Someone does task_work_add(T, TWA_SIGNAL). This makes signal_pending()==T
and this is what we need.

Now suppose that another task sends a signal to T before T calls
task_work_run() and clears TIF_TASKWORK. In this case SIGPENDING won't
be set because signal_pending() is already set (see wants_signal), and
this means that T won't notice this signal.

Oleg.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC v2] kernel: decouple TASK_WORK TWA_SIGNAL handling from signals
  2020-10-01 16:27 ` Oleg Nesterov
@ 2020-10-01 17:27   ` Jens Axboe
       [not found]   ` <20201002133813.3180-1-hdanton@sina.com>
  1 sibling, 0 replies; 4+ messages in thread
From: Jens Axboe @ 2020-10-01 17:27 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: io-uring, linux-kernel, Peter Zijlstra, Thomas Gleixner

On 10/1/20 10:27 AM, Oleg Nesterov wrote:
> Jens,
> 
> I'll read this version tomorrow, but:
> 
> On 10/01, Jens Axboe wrote:
>>
>>  static inline int signal_pending(struct task_struct *p)
>>  {
>> -	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
>> +#ifdef TIF_TASKWORK
>> +	/*
>> +	 * TIF_TASKWORK isn't really a signal, but it requires the same
>> +	 * behavior of restarting the system call to force a kernel/user
>> +	 * transition.
>> +	 */
>> +	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING) ||
>> +			test_tsk_thread_flag(p, TIF_TASKWORK));
>> +#else
>> +	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
>> +#endif
> 
> This change alone is already very wrong.
> 
> signal_pending(task) == T means that this task will do get_signal() as
> soon as it can, and this basically means you can't "divorce" SIGPENDING
> and TASKWORK.
> 
> Simple example. Suppose we have a single-threaded task T.
> 
> Someone does task_work_add(T, TWA_SIGNAL). This makes signal_pending()==T
> and this is what we need.
> 
> Now suppose that another task sends a signal to T before T calls
> task_work_run() and clears TIF_TASKWORK. In this case SIGPENDING won't
> be set because signal_pending() is already set (see wants_signal), and
> this means that T won't notice this signal.

That's a good point, and I have been thinking along those lines. The
"problem" is the two different use cases:

1) The "should I return from schedule() or break out of schedule() loops
   kind of use cases".

2) Internal signal delivery use cases.

The former wants one that factors in TIF_TASKWORK, while the latter
should of course only look at TIF_SIGPENDING.

Now, my gut reaction would be to have __signal_pending() that purely
checks for TIF_SIGPENDING, and make sure we use that on the signal
delivery side of things. Or something with a better name than that, but
functionally the same. Ala:

static inline int __signal_pending(struct task_struct *p)
{
	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
}

static inline int signal_pending(struct task_struct *p)
{
#ifdef TIF_TASKWORK
	return unlikely(test_tsk_thread_flag(p, TIF_TASKWORK)||
			__signal_pending(p));
#else
	return __signal_pending(p));
#endif
}

and then use __signal_pending() on the signal delivery side.

It's still not great in the sense that renaming signal_pending() would
be a better choice, but that's a whole lot of churn...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC v2] kernel: decouple TASK_WORK TWA_SIGNAL handling from signals
       [not found]   ` <20201002133813.3180-1-hdanton@sina.com>
@ 2020-10-02 13:44     ` Jens Axboe
  0 siblings, 0 replies; 4+ messages in thread
From: Jens Axboe @ 2020-10-02 13:44 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Oleg Nesterov, io-uring, LKML, Peter Zijlstra, Thomas Gleixner

On 10/2/20 7:38 AM, Hillf Danton wrote:
> 
> On Thu, 1 Oct 2020 11:27:04 -0600 Jens Axboe wrote:
>> On 10/1/20 10:27 AM, Oleg Nesterov wrote:
>>> Jens,
>>>
>>> I'll read this version tomorrow, but:
>>>
>>> On 10/01, Jens Axboe wrote:
>>>>
>>>>  static inline int signal_pending(struct task_struct *p)
>>>>  {
>>>> -	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
>>>> +#ifdef TIF_TASKWORK
>>>> +	/*
>>>> +	 * TIF_TASKWORK isn't really a signal, but it requires the same
>>>> +	 * behavior of restarting the system call to force a kernel/user
>>>> +	 * transition.
>>>> +	 */
>>>> +	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING) ||
>>>> +			test_tsk_thread_flag(p, TIF_TASKWORK));
>>>> +#else
>>>> +	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
>>>> +#endif
>>>
>>> This change alone is already very wrong.
>>>
>>> signal_pending(task) == T means that this task will do get_signal() as
>>> soon as it can, and this basically means you can't "divorce" SIGPENDING
>>> and TASKWORK.
>>>
>>> Simple example. Suppose we have a single-threaded task T.
>>>
>>> Someone does task_work_add(T, TWA_SIGNAL). This makes signal_pending()==T
>>> and this is what we need.
>>>
>>> Now suppose that another task sends a signal to T before T calls
>>> task_work_run() and clears TIF_TASKWORK. In this case SIGPENDING won't
>>> be set because signal_pending() is already set (see wants_signal), and
>>> this means that T won't notice this signal.
>>
>> That's a good point, and I have been thinking along those lines. The
>> "problem" is the two different use cases:
>>
>> 1) The "should I return from schedule() or break out of schedule() loops
>>    kind of use cases".
>>
>> 2) Internal signal delivery use cases.
>>
>> The former wants one that factors in TIF_TASKWORK, while the latter
>> should of course only look at TIF_SIGPENDING.
>>
>> Now, my gut reaction would be to have __signal_pending() that purely
>> checks for TIF_SIGPENDING, and make sure we use that on the signal
>> delivery side of things. Or something with a better name than that, but
>> functionally the same. Ala:
>>
>> static inline int __signal_pending(struct task_struct *p)
>> {
>> 	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
>> }
>>
>> static inline int signal_pending(struct task_struct *p)
>> {
>> #ifdef TIF_TASKWORK
>> 	return unlikely(test_tsk_thread_flag(p, TIF_TASKWORK)||
>> 			__signal_pending(p));
>> #else
>> 	return __signal_pending(p));
>> #endif
>> }
>>
>> and then use __signal_pending() on the signal delivery side.
>>
>> It's still not great in the sense that renaming signal_pending() would
>> be a better choice, but that's a whole lot of churn...
> 
> To avoid that churn, IIUC replace TWA_SIGNAL with TWA_RESUME on
> adding task work, which is compensated by adding a counter of
> event source in IO ctx and waiting for event to arrive instead
> of signal.

That doesn't work. If the task is waiting in cqring_wait(), then
there's no issue already. The problem is if it's waiting somewhere
else.

Imagine three threads, call them T1-3. T1 creates a pipe, and creates
a ring. T1 queues a poll request for the read end of the pipe, and now
does a wait for T2. T2 is a completer thread, so it ends up waiting
for events on the ring. T2 is now in cqring_wait(). T3 is created,
and it writes to the pipe. This write triggers the original poll
request from T1, and task_work is now queued for T1. This task work
needs to be processed for T2 to wakeup and complete, but it can't
since T1 is in pthread_join() for T2.

This is why TWA_SIGNAL is needed, we need it to break the T1 wait
loop and process this work. No amount of changes in io_uring can
fix this dependency, and if you look at the last series posted,
it does in fact not even have any io_uring changes.

Hence the goal is to have TWA_SIGNAL have the same kind of semantics
it does now, but decoupled from ->sighand since that is problematic
on particularly threaded setups.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-10-02 13:44 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-01 15:03 [PATCH RFC v2] kernel: decouple TASK_WORK TWA_SIGNAL handling from signals Jens Axboe
2020-10-01 16:27 ` Oleg Nesterov
2020-10-01 17:27   ` Jens Axboe
     [not found]   ` <20201002133813.3180-1-hdanton@sina.com>
2020-10-02 13:44     ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).